Text files are data files


Overview

  • Why you have to worry about text
  • Text editor options
    • Atom
    • RStudio
    • NotePad (WinPC)
      • Find hidden characters with Notepad
    • TextEdit (macOS)
      • Find spaces with TextEdit
    • Google Sheets not Google Docs
  • Online text editor
  • Browser text editor options

Why you have to worry about text

In bioinformatics, you work with all sorts of data, from DNA & RNA sequence reads to protein sequences. These data are supplied to any number of bioinformatics software. Because the bioinformatics software serves different purposes, there can be different requirements for how the data are presented to the software. These software formats have evolved over the years, and include

FASTA, CLUSTAL, PHYLO, NEWICK

NEXUS, GENBANK, Protein databank

EMBL, UniProt, and others

By convention, these file formats are noted with a file extension. For example, FASTA files may end with .fa or .fas, or even .fasta, while our familiar Newick files may have the extension .nwk or even .newick.

UGENE, the genome workbench we use in BI308L, serves as a GUI front-end for dozens of bioinformatics software applications. One advantage of using UGENE is that it helps streamline the exchange of data from one application to another. In other words, working with UGENE allows you to ignore the format requirements of each algorithm. Except when it can’t help you…  Thus, at some point, you’ll have to work directly with the data files. All of these files are just text files.  No frills, no formatting, just text (characters) only. So, you’ll need an app to help you. Fortunately, all computers come with a default application that can work with these varieties of bioinformatics files. How can this be? Because most bioinformatics files are just text files (or can be quickly converted to text only files).

Your text file contains hidden, non-printing characters

There’s quite a bit of history to this topic, actually, but I’ll simply point you to Wikipedia, then tell you why you need to know about this to do bioinformatics work with data files. When you copy and paste, either from a webpage or from text in another app, you’re likely to pick up formatting code used by the web developer to display the text. For report writing, this is unlikely to cause you much grief. However, as used for data sources, it can lead to many frustrations. Most bioinformatics routines expect data presented in a particular way. For example, sequence data can be stored in FASTA format (see ).
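If you want to check pasted text yourself, the idea can be sketched in a few lines of Python. This is only an illustration, not part of any bioinformatics package: the function name `find_non_ascii` and the sample record are made up for the example, and it assumes your sequence data should contain plain ASCII characters only.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, character, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything outside the plain ASCII range
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

# Hypothetical pasted FASTA record: the header hides a no-break space (U+00A0),
# which looks like an ordinary space on screen.
pasted = ">mystery\u00a0sequence 1\nATGCATGC\n"

for hit in find_non_ascii(pasted):
    print(hit)
```

An empty result means the text contains nothing but ASCII characters; anything reported is worth replacing before you hand the file to your software.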

An online tool at https://www.soscisurvey.de/tools/view-chars.php gives a nice display. Figure 1 contains a screenshot from the site; I copied the FASTA formatted mystery sequence 1 from Bioinformatics I: BLAST.


Figure 1. Screenshot from https://www.soscisurvey.de/tools/view-chars.php, non-printable characters of a FASTA file. 

Two control characters are visible: CR and LF. CR stands for “carriage return” and LF stands for “line feed.” Together, CRLF is read by the computer as “start a new line.” This will work fine in BLAST etc.
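The same kind of display can be approximated offline. The sketch below is an assumption-laden illustration, not how the online tool works: the function names and the sample bytes are hypothetical, and it only looks at CR and LF.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF pairs, lone LF, and lone CR characters in raw file bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,   # LF not preceded by CR
        "CR": data.count(b"\r") - crlf,   # CR not followed by LF
    }

def show_control_chars(data: bytes) -> str:
    """Replace CR and LF with visible labels, keeping the line breaks."""
    text = data.decode("ascii", errors="replace")
    return text.replace("\r", "<CR>").replace("\n", "<LF>\n")

# Hypothetical FASTA record saved with CRLF line endings
sample = b">mystery sequence 1\r\nATGCATGC\r\n"

print(count_line_endings(sample))
print(show_control_chars(sample))
```

To check a real file, read it in binary mode first, for example `data = open("myfile.fasta", "rb").read()` (the file name here is a placeholder), so that Python does not translate the line endings for you.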

Why use a text editor, why not use your word processor?

You can use word processor apps like Microsoft Word to work with text, of course. There are advantages: Word has great Find and Replace functions, which help when you’re looking for hidden control characters (see below). The problem is that these apps want to add formatting, which can add substantially to the bulk of the file. You have to turn off all of these features in order to strip the app down to perform just as a text editor. So, for data files, do not use your word processors, e.g., Apple Pages, Google Docs, Microsoft Word, WordPad, LibreOffice Writer, etc. While these programs do work with text files (although note that the online version of Microsoft Excel cannot import text files!), they are as much about formatting text as they are about writing and editing documents. They can be good text editors, provided you know how to work with them. For example, most come with an option to make hidden characters visible, which comes in handy when you are typing.

In the end, you’ll spend more time hunting down why your “text file” from your word processing app fails in some bioinformatics program than if you had just used an app that specializes in plain text.

Text Editor options

Text only means just that — text only (see Glossary entry).

What you need is an app that just lets you type, adds nothing to the characters, and saves your work. 

So, this means you need to choose and work with a text editor, an application devoted to working with text. An app that doesn’t try and add extra stuff. An app that allows you to see each character (including spaces!). You have many options.

Note: Many of the errors I help students with involve correcting text errors, including extra spaces.
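As a rough illustration of what “seeing the spaces” means, here is a small Python sketch that reports extra spaces or tabs at the ends of lines and prints the text with spaces made visible. The function names and the sample text are invented for this example; a text editor with visible whitespace does the same job interactively.

```python
def find_trailing_whitespace(text):
    """Return (line number, count) for every line ending in spaces or tabs."""
    problems = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        line = line.rstrip("\r")          # ignore a CR left over from CRLF endings
        stripped = line.rstrip(" \t")
        if stripped != line:
            problems.append((line_no, len(line) - len(stripped)))
    return problems

def show_spaces(text):
    """Make spaces and tabs visible as symbols."""
    return text.replace(" ", "·").replace("\t", "→")

# Hypothetical data: line 1 ends with one space, line 3 with a tab and a space
sample = ">seq1 \nATGC\nGGCC\t \n"

print(find_trailing_whitespace(sample))
print(show_spaces(sample))
```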

Atom

The best option is to install an app that’s intended for the coding community, like Atom, free from the GitHub folks. (When you install Atom, you’ll be prompted to also download and install additional developer tools; decline, you don’t need those to use the editor.) Atom gives you extensive control over your text, including making hidden characters, e.g., spaces between words, visible. It takes a couple of minutes to set up Atom (settings), but once you have it done, it’s a good editor, although a bit overkill for our needs.

RStudio source editor

Another option, use RStudio’s source editor. To make hidden characters like spaces (called whitespace) visible,

  • Select Tools → Options from the menu bar
  • Select Editor from the context menu
  • Display options for the Editor window are shown at right listed under Side-by-Side Editor
  • To show whitespace characters as symbols, select the Show whitespace characters check box.

Of course, you’ll have to install R and RStudio, see How to install R.

There are options to access text editors in your browser. Or, go with the apps that are already installed on your computer! Let’s start there.

Microsoft Windows 10

I recommend beginners stick with Notepad, the simple text editor included by default on all Microsoft Windows operating systems. Notepad has been around since the first version of Windows, and is still available on Windows 10, although it can be tricky to find. The simplest way to find and open Notepad is by search. Click in the search box at the lower left portion of your screen and start typing notepad… (Fig. 2). As you type, matches appear in the upper portion of the search box; click on Notepad when its icon appears.

Windows search for Notepad

Figure 2. Screenshot of search on Windows 10.

After Notepad starts, you have a simple program for working with text (Fig. 3).

Notepad open on Windows 10

Figure 3. Screenshot of Notepad open on Windows 10

Find hidden characters with Notepad

Place the cursor at the start of the document, then open Find (Ctrl + F), and enter a space (single tap on the space bar) in the search box (Fig. 4).

Screenshot Notepad Find hidden character = space

Figure 4. Screenshot of Notepad Find, a whitespace found at end of the line.

Apple macOS

The default text editor for the macOS operating system, e.g., 10.13 High Sierra or 10.14 Mojave, is called TextEdit. You’ll find TextEdit.app in your Applications folder, or you can call for it by typing TextEdit… in Spotlight. After selecting the app, you’ll see the TextEdit window (Fig. 5).

default TextEdit

Figure 5. Screenshot of TextEdit on macOS 10.11; note that the default settings are rich text formatting and not what we want for our bioinformatics data files!

But note something is wrong! TextEdit defaults to giving you FORMATTED text! Specifically, default settings for TextEdit are rtf or rich text format. Although rtf is simple formatting, it is formatting nonetheless, which will require you to take steps to strip the formatting before your text file can be used in your bioinformatics software.
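If you are not sure whether a file was saved as rich text, you can peek at its first few bytes: a file in rich text format starts with the characters `{\rtf`. The check below is a minimal sketch; the function name and the two sample byte strings are hypothetical.

```python
def looks_like_rtf(data: bytes) -> bool:
    """True if the raw bytes begin with the RTF signature '{\\rtf'."""
    return data.lstrip().startswith(b"{\\rtf")

# Hypothetical examples: the same sequence saved as rich text and as plain text
rtf_sample = b"{\\rtf1\\ansi\\deff0 {\\fonttbl{\\f0 Helvetica;}}\\f0 >seq1\\par ATGC}"
plain_sample = b">seq1\nATGC\n"

print(looks_like_rtf(rtf_sample))
print(looks_like_rtf(plain_sample))
```

For a file on disk, read the bytes with something like `open("myfile.txt", "rb").read()` (placeholder file name) and pass them to the function.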

You have two options: one is a temporary solution, the other makes the change permanent (recommended).

Option 1. Save as text only, a temporary fix

To save your text file as text only, click on Format, then select Make Plain Text (Fig. 6).

change textedit format

Figure 6. Screenshot of menu option, Make Plain Text

After applying this option, your screen will look like the one in Figure 7; you’re now working with text only.

textedit, now text only

Figure 7. Screenshot of textedit, now in text only mode.

Option 2. Change TextEdit preferences so that the default is text only

This is by far the best choice. After all, you don’t typically use TextEdit to write your reports, do you? So change TextEdit so that it works easily with text only files. After opening TextEdit, open preferences (Fig. 8).

open preferences

Figure 8. Screenshot, shows selecting preferences for TextEdit

Within the Preferences setting, find Format and click on the Plain text option (Fig. 8, Fig. 9). Recommended, but optional, changes are also included in Figure 9 (e.g., change the Plain text font to Courier New 12, which is more universal than Apple’s Menlo; cf. CSS Web safe fonts). I also recommend unchecking spell and grammar check (Fig. 9).

Screenshot of Apple TextEdit preferences

Figure 9. Screenshot of TextEdit Preferences, select Make Plain Text, uncheck spelling, grammar, and other options.

After leaving Preferences, your TextEdit window will look like the one in Fig. 7; plus, it will look like this every time you use TextEdit. It’s now as it should be: only a text editor.

Find hidden characters in TextEdit

Use Find (Cmd + F), and enter a space in the search box (Fig. 10).


Figure 10. Screenshot of find results for hidden space character. 

Alternatively, click on the spyglass to bring up more search options. Select “White space” from the list (Fig. 11).

Screenshot TextEdit Find submenu, available after clicking spyglass

Figure 11. Screenshot TextEdit Find menu after clicking on the spyglass

Note: Additional hidden characters are available for search with this option, including typical “end of line” characters.

Google Sheets not Google Docs

Many students use the wonderful Office apps by Google, available for free with a Google account. Google Docs seems like a natural for creating text files, but my experience is poor: Google Docs adds extra returns and, more importantly, uses UTF-8 with BOM, not the strict UTF-8 format many bioinformatics apps require. However, Google Sheets will do the trick, at least if you just need a list of words one line at a time. For example, if I needed a text file with species common names

Cat
Chicken
Dog
Human
Mouse
Pig

then I would enter the names one row at a time in the A column. Next, create the text file by selecting File > Download > Comma separated values (.csv). In the popup window, select All Files (*.*) and change the extension from .csv to .txt. This will give you a text file in UTF-8.
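If you happen to have Python available, the same kind of file can be written and checked directly. This is a sketch under stated assumptions, not a required step: the function names and the file name `species.txt` are made up, and the file is written to a temporary folder so the example cleans up after itself. Python’s "utf-8" encoding writes no BOM, and the BOM, when present, is the three bytes stored in `codecs.BOM_UTF8`.

```python
import codecs
import tempfile
from pathlib import Path

def write_plain_list(path, items):
    """Write one item per line as UTF-8 without a BOM, using LF line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for item in items:
            handle.write(item + "\n")

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM from raw bytes, if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

species = ["Cat", "Chicken", "Dog", "Human", "Mouse", "Pig"]

with tempfile.TemporaryDirectory() as folder:
    out_file = Path(folder) / "species.txt"   # hypothetical file name
    write_plain_list(out_file, species)
    raw = out_file.read_bytes()               # read back the exact bytes written

print(raw.startswith(codecs.BOM_UTF8))        # does the file start with a BOM?
print(raw)
```

The `strip_utf8_bom` helper is there for files that come from elsewhere, such as a download that turns out to start with a BOM.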

Online text editor

Use an online text editor like https://texteditor.co/. This is a good solution — you have access to settings, which allow you to make hidden characters (like spaces) visible.

Add text editor to your browser

Additionally, if you are using the Chrome or Firefox browsers (highly recommended!), then you can install plugins that allow your browser to work with text files. For Firefox, I use the Text Editor plugin.

Firefox: To get the plugin, type in the Firefox browser search window: Firefox text editor plugin

From the list, select TE, a text editor by Sevina (screenshot, Fig. 12)

get firefox plugin

Figure 12. Screenshot of Firefox extension option

Click on Add to Firefox, click OK in the popup request, and the extension will be installed in your browser. Look to the upper right corner of your browser window for the <TE>; that’s your text editor (Fig. 13).

TE plugin installed

Figure 13. Screenshot of Firefox browser, shows the text editor plugin has been installed.

Now, anytime you want to work on a text file, you simply click on the <TE> link in the browser and open the text editor window (Fig. 14).

text editor open in FireFox

Figure 14. Firefox text edit add-in up and running.

Update, April 2021: Chrome is discontinuing Chrome Apps, which include the Text app described below.

Chrome browser: To get the plugin, type in the Chrome browser search window: Chrome text editor plugin. The Chrome Web Store will open; there will be several options. I selected the add-in from text.app.

chrome webstore

Figure 15. Screenshot of chrome webstore, text.app

Click on the blue button to install the add-in. After the app has installed, you’ll find it in your collection of Google apps: click on the rainbow grid in your browser, usually located next to your profile icon (Fig. 16). The rainbow grid is where you access Google Docs, Calendar, etc.

look for your apps

Figure 16. Screenshot of portion of Chrome browser, the rainbow app grid.

Find the icon for the text app (Fig. 17), double-click to start.

select the chrome app

Figure 17. Screenshot, magnified, of Google apps menu. Select <txt> icon to start the text application.

And finally, we have our text editor (Fig. 18).

Chrome text app

Figure 18. Screenshot, Chrome text app in use.

There are other plug-ins for Chrome, including Text Editor by Sevina, available via Chrome store.

 
