Text files are data files
Overview
- Why you have to worry about text
- Text editor options
- Atom
- RStudio
- NotePad (WinPC)
- Find hidden characters with Notepad
- TextEdit (macOS)
- Find spaces with TextEdit
- Google Sheets not Google Docs
- Online text editor
- Browser text editor options
Why you have to worry about text
In bioinformatics, you work with all sorts of data, from DNA & RNA sequence reads to protein sequences. These data are supplied to any number of bioinformatics programs. Because these programs serve different purposes, they can have different requirements for how the data are presented. The file formats have evolved over the years, and include
FASTA, CLUSTAL, PHYLO, NEWICK
NEXUS, GENBANK, Protein databank
EMBL, UniProt, and others
By convention, these file formats are noted with a file extension. For example, FASTA files may end with .fa or .fas, or even .fasta, while our familiar Newick files may have the extension .nwk or even .newick.
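To make the extension convention concrete, here is a minimal Python sketch that guesses a format from a file name. The extensions listed are only the ones mentioned above; the function name and the example file names are my own, made up for illustration. Keep in mind that an extension is only a label and says nothing about what is actually inside the file.

```python
from pathlib import Path

# Only the extensions named in the text; a real table would be longer.
EXTENSION_TO_FORMAT = {
    ".fa": "FASTA",
    ".fas": "FASTA",
    ".fasta": "FASTA",
    ".nwk": "Newick",
    ".newick": "Newick",
}


def guess_format(filename: str) -> str:
    """Return the format implied by the file extension, or 'unknown'."""
    suffix = Path(filename).suffix.lower()  # ".NWK" and ".nwk" are treated alike
    return EXTENSION_TO_FORMAT.get(suffix, "unknown")


# Hypothetical file names, for illustration only.
for name in ["mystery_sequence.fasta", "tree.NWK", "notes.docx"]:
    print(name, "->", guess_format(name))
```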
UGENE, the genome workbench we use in BI308L, serves as a GUI front-end for dozens of bioinformatics software applications. One advantage of using UGENE is that it helps streamline the exchange of data from one application to another. In other words, working with UGENE allows you to ignore the format requirements of each algorithm. Except when it can’t help you… Thus, at some point, you’ll have to work directly with the data files. All of these files are just text files. No frills, no formatting, just text (characters) only. So, you’ll need an app to help you. Fortunately, all computers come with a default application that can work with these varieties of bioinformatics files. How can this be? Because most bioinformatics files are just text files (or can be quickly converted to text only files).
Your text file contains hidden, non-printing characters
There’s quite a bit of history to this topic, actually, but I’ll simply point you to Wikipedia, then tell you why you need to know about this to do bioinformatics work with data files. When you copy and paste, either from a webpage or from text in another app, you’re likely to pick up formatting code used by the web developer to display the text. For report writing, this is unlikely to cause you much grief. However, as used for data sources, it can lead to many frustrations. Most bioinformatics routines expect data presented in a particular way. For example, sequence data can be stored in FASTA format (see ).
An online tool at https://www.soscisurvey.de/tools/view-chars.php gives a nice display. Figure 1 contains a screenshot from the site; I copied in the FASTA formatted mystery sequence 1 from Bioinformatics I: BLAST.
Figure 1. Screenshot from https://www.soscisurvey.de/tools/view-chars.php, non-printable characters of a FASTA file.
Two control characters are visible: CR and LF. CR stands for “carriage return” and LF stands for “line feed.” Together, CRLF is read by the computer as “start a new line.” This will work fine in BLAST etc.
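If you would rather check for hidden characters on your own computer, a few lines of Python give a similar display. This is a minimal sketch, not a copy of how the online tool works: the function name, the visible tags, and the sample record are my own choices for illustration.

```python
def show_hidden(text: str) -> str:
    """Return text with common non-printing characters made visible."""
    # The tags are an arbitrary choice for illustration; "·" marks a space.
    visible = {"\r": "<CR>", "\n": "<LF>\n", "\t": "<TAB>", " ": "·"}
    return "".join(visible.get(ch, ch) for ch in text)


# Hypothetical two-line FASTA record with CRLF line endings.
sample = ">mystery_seq_1\r\nACGT ACGT\r\n"
print(show_hidden(sample))
```

To inspect a real file, open it with `open("myfile.fasta", newline="")` (the file name is hypothetical); the `newline=""` argument stops Python from silently translating CRLF to LF as it reads.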
Why use a text editor, why not use your word processor?
You can use word processor apps like Microsoft Word to work with text, of course. There are advantages: Word has great Find and Replace functions, which help when you’re looking for hidden control characters (see below). The problem is that these apps want to add formatting, which can add substantially to the bulk of the file. You have to turn off all of these features in order to strip the app down to perform just as a text editor. My advice: do not use your word processors, e.g., Apple Pages, Google Docs, Microsoft Word, WordPad, LibreOffice Writer, etc. While these programs do work with text files (although note that the online version of Microsoft Excel cannot import text files!), they are as much about formatting text as they are about writing and editing documents. They can be good text editors, provided you know how to work with them. For example, most come with an option to make hidden characters visible, which comes in handy when you are typing.
In the end, you’ll spend more time hunting down why your “text file” from your word processing app fails in some bioinformatics program than if you just used an app that specializes in plain text.
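When a “text file” fails and you suspect a word processor is to blame, the first few bytes of the file often give it away. The sketch below relies on two facts: rich text (RTF) files start with `{\rtf`, and .docx and .odt files are ZIP containers that start with the bytes `PK`. The function name and the labels it returns are my own, for illustration only; a file that passes this check can still have other problems.

```python
def sniff_file_kind(data: bytes) -> str:
    """Make a rough guess at what kind of file these bytes came from."""
    if data.startswith(b"{\\rtf"):
        return "rich text (RTF)"
    if data.startswith(b"PK\x03\x04"):
        return "zip container (e.g., .docx or .odt)"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "not UTF-8 text"
    return "plain text"


# Hypothetical file contents, for illustration only.
print(sniff_file_kind(b">seq1\nACGT\n"))
print(sniff_file_kind(b"{\\rtf1\\ansi ACGT}"))
```

For a file on disk, pass in its bytes, e.g., `sniff_file_kind(open("myfile.txt", "rb").read())`, where the file name is hypothetical.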
Text Editor options
Text only means just that — text only (see Glossary entry).
What you need is an app that just lets you type, adds nothing to the characters, and saves your work.
So, this means you need to choose and work with a text editor, an application devoted to working with text. An app that doesn’t try and add extra stuff. An app that allows you to see each character (including spaces!). You have many options.
Note: Many of the errors I help students with involve correcting text errors, including extra spaces.
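Extra spaces are easy to miss by eye, so here is a minimal Python sketch that reports them by line number. It only looks for two problems, trailing whitespace and runs of two or more spaces between words; the function name and the sample text are my own, for illustration.

```python
import re


def find_space_problems(text: str) -> list[tuple[int, str]]:
    """Return (line number, problem) pairs for suspicious whitespace."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Spaces or tabs left at the end of a line.
        if line != line.rstrip(" \t"):
            problems.append((number, "trailing whitespace"))
        # Two or more spaces between non-space characters.
        if re.search(r"\S {2,}\S", line):
            problems.append((number, "repeated spaces"))
    return problems


# Hypothetical sample: line 1 ends with a space, line 3 has a double space.
sample = "Cat \nDog\nHomo  sapiens\n"
for number, problem in find_space_problems(sample):
    print(f"line {number}: {problem}")
```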
Atom
The best option is to install an app that’s intended for the coding community, like Atom, free from the GitHub folks. (When you install Atom, you’ll be prompted to also download and install additional developer tools; decline, you don’t need those to use the editor.) Atom gives you extensive control over your text, including making hidden characters, e.g., spaces between words, visible. It takes a couple of minutes to set up Atom (settings), but once you have it done, it’s a good editor, although a bit overkill for our needs.
RStudio source editor
Another option, use RStudio’s source editor. To make hidden characters like spaces (called whitespace) visible,
- Select Tools → Options from the menu bar
- Select Editor from the context menu
- Display options for the Editor window are shown at right listed under Side-by-Side Editor
- To show whitespace characters as symbols, select the Show whitespace characters check box.
Of course, you’ll have to install R and RStudio, see How to install R.
There are options to access text editors in your browser. Or, go with the apps that are already installed on your computer! Let’s start there.
Microsoft Windows 10
I recommend beginners stick with Notepad, the default text editor available on all Windows PC operating systems. Notepad has shipped with Windows since its earliest versions and is still available on Windows 10, although it can be tricky to find. The simplest way to find and open Notepad is by search. Click in the search box at the lower left portion of your screen and start typing notepad… (Fig. 2). As you type, matches appear in the upper portion of the search box; click on Notepad when its icon appears.
Figure 2. Screenshot of search on Windows 10.
After Notepad starts, you have a simple program for working with text (Fig. 3).
Figure 3. Screenshot of Notepad open on Windows 10
Find hidden characters with Notepad
Point cursor to start of the document, then open Find (Ctrl + F), and enter a space (single tap on space bar) in the search box (Fig. 4).
Figure 4. Screenshot of Notepad Find, a whitespace found at end of the line.
Apple macOS
The default text editor for the macOS operating system, e.g., 10.13 High Sierra or 10.14 Mojave, is called TextEdit. You’ll find TextEdit.app in your Applications folder, or you can call for it by typing TextEdit… in Spotlight. After selecting the app, you’ll see the TextEdit window (Fig. 5).
Figure 5. Screenshot of TextEdit on macOS 10.11; note that the default settings are rich text formatting and not what we want for our bioinformatics data files!
But note something is wrong! TextEdit defaults to giving you FORMATTED text! Specifically, default settings for TextEdit are rtf or rich text format. Although rtf is simple formatting, it is formatting nonetheless, which will require you to take steps to strip the formatting before your text file can be used in your bioinformatics software.
You have two options, one temporary solution, the other makes the changes permanent (recommended).
Option 1. Save as text only, a temporary fix
To save your text file as text only, click on Format, then select Make Plain Text (Fig. 6).
Figure 6. Screenshot of menu option, Make Plain Text
After applying this option, your screen will look like the one in Figure 7; you’re now working with text only.
Figure 7. Screenshot of TextEdit, now in text only mode.
Option 2. Change TextEdit preferences so that the default is text only
This is by far the best choice. After all, you don’t typically use TextEdit to write your reports, do you? So change TextEdit so that it works easily with text only files. After opening TextEdit, open Preferences (Fig. 8).
Figure 8. Screenshot, shows selecting preferences for TextEdit
Within the Preferences setting, find Format and click on the Plain text option (Fig. 8, Fig. 9). Recommended but optional changes are also included in Figure 8 (e.g., change Plain text font to Courier New 12, which is more universal than Apple’s Menlo; cf. CSS Web safe fonts). I also recommend unchecking spell and grammar check (Fig. 9).
Figure 9. Screenshot of TextEdit Preferences, select Make Plain Text, uncheck spelling, grammar, and other options.
After leaving Preferences, your TextEdit window will look like the one in Fig. 7; plus, it will look like this every time you use TextEdit. It’s now as it should be, only a text editor.
Find hidden characters in TextEdit
Use Find (Cmd + F), and enter a space in the search box (Fig. 10).
Figure 10. Screenshot of find results for hidden space character.
Alternatively, click on the spyglass to bring up more search options. Select “White space” from the list (Fig. 11).
Figure 11. Screenshot TextEdit Find menu after clicking on the spyglass
Note: Additional hidden characters are available for search with this option, including typical “end of line” characters.
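Should a program turn out to be picky about end-of-line characters, they can be converted rather than hunted down one at a time. The following is a minimal Python sketch, assuming you want every line to end with a bare LF; the function name and the sample bytes are my own, for illustration. It works on bytes so that nothing is translated behind your back.

```python
def normalize_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so that a CRLF pair does not become two LFs.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical record with mixed line endings.
sample = b">seq1\r\nACGT\rACGT\n"
print(normalize_newlines(sample))
```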
Google Sheets not Google Docs
Many students use the wonderful Office apps by Google available for free with a Google account. Google Docs seems like a natural for creating text files, but my experience is poor: Google Docs adds extra returns and, more importantly, uses UTF-8 with BOM, not the strict UTF-8 format many bioinformatics apps require. However, Google Sheets will do the trick, at least if you just need a list of words one line at a time. For example, if I needed a text file with species common names
Cat
Chicken
Dog
Human
Mouse
Pig
Then enter the names one row at a time in the A column. Next, create the text file by selecting File > Download > Comma separated values (.csv). In the popup window, change to .txt not .csv extension by selecting All Files (*.*). This will give you a text file in UTF-8.
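The BOM (byte order mark) is three invisible bytes at the very start of a file, so you can check any download for it and strip it if needed. Here is a minimal Python sketch using the standard `codecs` module; the function name and the sample bytes are my own, for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes without a leading UTF-8 byte order mark, if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical file contents: a species list saved with a BOM.
with_bom = codecs.BOM_UTF8 + b"Cat\nChicken\nDog\n"
print(with_bom[:6])                  # the BOM bytes appear before "Cat"
print(strip_utf8_bom(with_bom)[:6])  # now the file starts with "Cat"
```

If you only need to read such a file in Python, opening it with `encoding="utf-8-sig"` skips the BOM for you.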
Online text editor
Use an online text editor like https://texteditor.co/. This is a good solution — you have access to settings, which allow you to make hidden characters (like spaces) visible.
Add text editor to your browser
Additionally, if you are using the Chrome or Firefox browsers (highly recommended!), then you can install plugins that allow your browser to work with text files. For Firefox, I use the Text Editor plugin.
Firefox: To get the plugin, type in the Firefox browser search window: Firefox text editor plugin
From the list, select TE, a text editor by Sevina (screenshot, Fig. 12)
Figure 12. Screenshot of Firefox extension option
Click on the Add to Firefox, click OK to the popup request, and the extension will be installed in your browser. Look to the upper right corner of your browser window for the <TE>, that’s your text editor (Fig. 13).
Figure 13. Screenshot of Firefox browser, shows the text editor plugin has been installed.
Now, anytime you want to work on a text file you simply click on the <TE> link in the browser and open the text editor window (Fig. 14).
Figure 14. Firefox text edit add-in up and running.
Update, April 2021: Chrome has discontinued Chrome Apps, which includes the text app described below; browser extensions are still available.
Chrome browser: To get the plugin, type in the Chrome browser search window: Chrome text editor plugin. The Chrome Web Store will open; there will be several options. I selected the add-in from text.app.
Figure 15. Screenshot of Chrome Web Store, text.app
Click on the blue button to install the add-in. After the app has installed, you’ll find it in your collection of Google apps; click on the rainbow grid in your browser, usually located next to your profile icon (Fig. 16). The rainbow grid is where you access Google Docs, Calendar, etc.
Figure 16. Screenshot of portion of Chrome browser, the rainbow app grid.
Find the icon for the text app (Fig. 17), double-click to start.
Figure 17. Screenshot, magnified, of Google apps menu. Select <txt> icon to start the text application.
And finally, we have our text editor (Fig. 18).
Figure 18. Screenshot, Chrome text app in use.
There are other plug-ins for Chrome, including Text Editor by Sevina, available via Chrome store.