Introduction to Python for Bioinformatics – 2018 page

The environment we will use for the Python course is JupyterLab.

The notebooks for the first week of classes are Day1 (on the basics of Python), followed by two short notebooks, math1 and math2, showing how importing works. Next came an illustration of the power of pandas, a library for data analysis in Python, and then an illustration of plotting using pandas and matplotlib. We ended the week with lists and loops. For the first week’s content, see the Software Carpentry lesson on plotting and programming.

As an aside, JupyterLab notebooks allow cells to be written in Markdown, a simple markup language for writing rich text.

Starting on 26 February, I showed how you can use “magic” to combine multiple programming languages in a Jupyter notebook. We then did some exercises on loops. The code for these was tested using the code simulator from Python Tutor.

The next topic after loops was functions. For those interested, there are also some notes on “aliasing” and how it affects functions in the notebook on update in place vs transform. Wednesday’s topic was conditionals: Booleans, Boolean algebra and the if statement. On Thursday we started on string methods. Finishing off the week, on Friday we started looking at Biopython’s SeqIO and some use of string splitting. The notebook is here. To use it you need to install the biopython package in your Conda environment with:

conda install -y biopython

This needs to be run in Anaconda Prompt on Windows or in your Terminal on Mac or Linux. You also need the misnamed python.fasta.

Monday of week 3 started with string joining, at the end of the previous Friday’s notebook. We then took a deep dive into regular expressions, a powerful tool for pattern matching in Python. Much has been written on this:

  1. The Python Regular Expression HOWTO.
  2. The Regular Expressions Tutorial on TutorialsPoint.
  3. Another tutorial on RegExOne.
  4. The awesome regexp debugger from Debuggex (switch it to Python style regular expressions).
  5. The Python Regex Cheatsheet.

The notebook on regular expressions has many examples!
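To give a flavour of what pattern matching looks like, here is a minimal sketch using Python's built-in re module. The DNA fragment is made up for illustration and is not taken from the course notebook; GAATTC is the recognition site of the restriction enzyme EcoRI.

```python
import re

# Hypothetical DNA fragment, invented for this illustration.
sequence = "ATGAATTCGGCTAGAATTCTTAA"

# A literal pattern: the EcoRI recognition site.
site = re.compile(r"GAATTC")

# finditer() yields one match object per non-overlapping occurrence;
# start() gives the 0-based position where the match begins.
positions = [match.start() for match in site.finditer(sequence)]
print(positions)

# A character class with a repeat count: runs of three or more A/T bases.
at_runs = re.findall(r"[AT]{3,}", sequence)
print(at_runs)
```

The same patterns can be pasted into one of the debuggers listed above to see how each part matches.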

Tuesday’s lesson was on retrieving data from the web using requests. Many biological databases, such as NCBI and EBI’s ENA, allow us to retrieve data using web requests. To automate our work it can be useful to access these resources from programs we write. There are Python modules to access some of them (e.g. NCBI using the Biopython Entrez module), but an increasing number of databases are making their data accessible via RESTful APIs. What this means in practice is that every resource can be accessed using a correctly crafted URL (Uniform Resource Locator, also known as a link). Tuesday’s lesson used a file downloaded from EBI’s European Nucleotide Archive (ENA), truncated and stored on GitHub. This file was read from the web using requests. The .text attribute of the Response object was used to get the contents of the page, which were in turn converted into a “file-like object” using Python’s StringIO, which allowed the text to be fed to SeqIO.parse().
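The “file-like object” step can be sketched without a network connection. In the snippet below a short, made-up FASTA string stands in for the text that requests would return, so the example runs offline; the sequence names and contents are assumptions for illustration only.

```python
from io import StringIO

# Stand-in for the downloaded page text: a tiny, invented FASTA file.
fasta_text = (
    ">seq1 Homo sapiens example\n"
    "ACGT\n"
    "GGCC\n"
    ">seq2 Mus musculus example\n"
    "TTGA\n"
)

# StringIO wraps the string so it can be read like an open file.
handle = StringIO(fasta_text)

# Iterating over the handle gives one line at a time, like a real file.
header_count = 0
for line in handle:
    if line.startswith(">"):
        header_count += 1
print(header_count)

# The handle is now at the end of the text; rewind before reading again.
handle.seek(0)
```

With Biopython installed, a handle like this one can be passed to SeqIO.parse(handle, "fasta") in place of an open file.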

The rest of the lesson considered extracting the descriptions of the Biopython sequence records and storing the species name from each description in a Python set. The set type in Python is a mutable collection that stores each distinct value only once. For example, it will never store "apple" twice. It is useful for producing a collection of unique items.
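As a rough sketch of the idea, the snippet below collects species names into a set. The descriptions are invented, and the assumption that the species is the second and third word of each description is mine; the real ENA descriptions used in class may be laid out differently.

```python
# Hypothetical record descriptions: "<id> <genus> <species> <rest>".
descriptions = [
    "ABC001 Homo sapiens mitochondrion",
    "ABC002 Mus musculus mitochondrion",
    "ABC003 Homo sapiens chromosome 1",
]

species = set()
for description in descriptions:
    words = description.split()
    # Assumed layout: words 1 and 2 are the genus and species.
    species.add(" ".join(words[1:3]))

# "Homo sapiens" was added twice but is stored only once.
print(len(species))
print(sorted(species))
```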

As part of a multi-lesson practical we focused on the text of Charles Darwin’s book on earthworms, which we downloaded as plain text from Project Gutenberg. To keep load off the Project Gutenberg servers, a copy was stored on GitHub. The notebook based on this book is called earthworms.ipynb. We started our dive into the book by loading it from the Internet using requests and using StringIO to turn the text of the book into lines. This led to an exploration of line counting and of the different iterable types in Python. To clarify, a for loop takes an iterable and puts its elements in the loop variable, one at a time. Thus:

for loop_var in iterable:
    pass  # do something with loop_var here

The variable loop_var is the loop variable, and iterable is the thing that is being iterated over. Here are some examples of iterables and the values that are assigned to the loop variable in each case: range(10) -> numbers in the range, string -> characters in the string, SeqIO.parse(...) -> sequence records, StringIO(...) -> lines, list -> elements of the list, set -> elements of the set (but in no guaranteed order).
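A few of these cases can be tried out directly. The helper collect() below is a hypothetical name introduced only for this sketch; it simply gathers whatever the for loop assigns to the loop variable. SeqIO.parse() is left out so that the example needs nothing beyond the standard library.

```python
from io import StringIO


def collect(iterable):
    """Return the values a for loop assigns to its loop variable."""
    values = []
    for loop_var in iterable:
        values.append(loop_var)
    return values


print(collect(range(3)))                # numbers in the range
print(collect("ACG"))                   # characters in the string
print(collect(StringIO("one\ntwo\n")))  # lines, each ending in a newline
print(collect(["x", "y"]))              # list elements, in order
print(sorted(collect({"x", "y"})))      # set elements; sorted because a set has no order
```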

After discussing iterables, we examined words, and what makes up a word in Python (including non-Western letters). Finally we started work on a function, is_word(), that returns whether a string is a word or not. E.g. “apple” is a word, “{2}” is not.
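One possible version of such a function is sketched below. It assumes that a word is any non-empty string made up only of letters, which is what the string method isalpha() checks; the function written in class may have used a different definition.

```python
def is_word(text):
    """Return True if text is non-empty and contains only letters.

    This is an assumed definition for illustration: str.isalpha()
    accepts letters from any alphabet, not just A to Z.
    """
    return text.isalpha()


print(is_word("apple"))
print(is_word("{2}"))
print(is_word("Ελληνικά"))  # Greek letters count as letters too
```

Note that this simple definition rejects strings such as "don't" or "well-known", because the apostrophe and hyphen are not letters.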

This work was continued in earthworms_count.ipynb with an examination of word counting in the earthworms book, first using the .split() method and later filtering the list returned from .split() to include only real words. Combined with the set type, this allowed for a count of the unique words in the book. At this stage a little block of code (the first cell in earthworms_count.ipynb) was used to read in the latest version of the code from the Internet, so as not to keep cutting and pasting from the Etherpad. This was combined with the “magic” %load earthworm.py to incorporate the code into a notebook cell. The notebook cell was then run. After counting the unique words, the focus shifted to Python dictionaries.
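The three steps (split, filter, put into a set) can be sketched on a single made-up line of text rather than the whole book. Here str.isalpha() is used as a stand-in for the course's word test, which is an assumption on my part.

```python
# Invented sample text; the numbers and braces are there to be filtered out.
text = "the worms burrow and the worms cast 1881 { } soil"

# Step 1: split on whitespace.
tokens = text.split()

# Step 2: keep only tokens that consist entirely of letters.
real_words = [token for token in tokens if token.isalpha()]

# Step 3: a set keeps each distinct word once.
unique_words = set(real_words)

print(len(tokens), len(real_words), len(unique_words))
```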

The final code for the course was written in earthworm_word_count_dict.ipynb, where dictionaries were combined with all the previous code to count how often each unique word occurred in Darwin’s book. A little bit of sorting revealed how even this simple word counting is useful for textual analysis, as words such as castings, burrows and mould showed up prominently among the most frequently used words. With that we concluded the Python part of our course.
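The counting and sorting step might look something like the sketch below. The word list is invented and far shorter than the book, so the counts are illustrative only and say nothing about the real frequencies in Darwin's text.

```python
# Invented word list standing in for the filtered words of the book.
words = ["castings", "burrows", "mould", "castings", "worms", "castings", "burrows"]

# Map each word to the number of times it has been seen so far.
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs by count, most frequent first.
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)

for word, count in ranked:
    print(word, count)
```

The standard library also offers collections.Counter, which does the same bookkeeping in one step; the plain dictionary is used here because dictionaries were the topic of the lesson.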

P.S. If you are interested in using Python for text analysis, there is an entire course online.
P.P.S. The introduction to git was based on the Software Carpentry git lesson. There are other tutorials online, like try git and the ones from Atlassian.