Introduction to Python for Bioinformatics

Before we begin...

Python is a very commonly used programming language in the world of bioinformatics. Before diving into Python, let us examine some context: what tools are typically found in the "bioinformatics survival kit"?

  1. The Terminal, Shell and Command Line Interface. For a refresher on this topic consult the Software Carpentry lessons on the Unix shell.
  2. Working with remote computers: transferring files and logging in to computers over the network.
  3. Text transformation: filtering and transforming files in FASTA, VCF, TSV and other formats. Python and Shell tools are good for this, as are some of the tools available via Galaxy (see the Text Manipulation section on most Galaxy servers).
  4. Plotting and doing statistics. R and Python both have options for these tasks: ggplot2 for R, and matplotlib and Altair for Python, for example.
  5. Dependency management: dependencies are tools and software modules that you need to use. This is a huge topic and there are two main approaches to know about:
    1. Package management with conda (see these tutorials 1, 2)
    2. Software containers with Singularity (see 1, 2)
  6. Scientific workflow languages and workflow systems that organise your work into reusable units. Some examples are:
    1. Galaxy: there is a wealth of material from the Galaxy Training Network
    2. Nextflow: the core Nextflow language and nf-core
    3. CWL: the Common Workflow Language (https://www.commonwl.org/), a workflow language that emphasises standards and portability
  7. A software development environment:
    1. Notebook type interfaces like the one provided by RStudio or Jupyter Lab
    2. A programmer's text editor such as VS Code or Atom (you might also have encountered simpler editors like Nano)
  8. Familiarity with software version control using Git and GitHub (or GitLab).
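
To make item 3 (text transformation) a little more concrete, here is a minimal sketch of filtering FASTA records by sequence length in plain Python. The function names read_fasta and filter_by_length, and the example sequences, are made up for illustration; for real work an established library is likely a better choice than a hand-written parser.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect them all.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep only records whose sequence has at least min_length characters."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Made-up example data standing in for a real FASTA file.
example = StringIO(">seq1 short\nACGT\n>seq2 longer\nACGTACGT\nACGT\n")
kept = filter_by_length(read_fasta(example), min_length=8)
for header, sequence in kept:
    print(f">{header}\n{sequence}")
```

With a real file you would pass an open file handle (for example from open("sequences.fasta"), a hypothetical filename) instead of the StringIO object.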

This list can seem overwhelming. Remember that it is here to offer context, not with an expectation that you will master all of these topics at once. Finally, three more collections of training material are:

  1. An introduction to skills for microbial bioinformatics
  2. The Swiss Institute of Bioinformatics (SIB) list of training material
  3. The ELIXIR TeSS training search engine

P.S. I am often asked: R or Python? My answer is: both! As you can see above, they are both part of a well-rounded approach to bioinformatics (as are other languages like Rust, C / C++, JavaScript and even Java, all more specialised languages used in areas of bioinformatics that range from algorithm implementation to user interface building). While the order in which you focus on languages like Python and R depends on which problem you are approaching first, I would urge everyone to become familiar with both.

Python for Bioinformatics: Outline

We will be working with the book Python for Biologists by Dr Martin Jones. An early edition of Dr Jones' book is available as a PDF under a permissive license. Hard copies of the follow-on books, "Advanced Python for Biologists" and "Effective Python Development for Biologists", are available in the UWC Library. The Jupyter Lab interface will be used for working with the examples from this book, except where the subject matter focuses on writing standalone scripts.

Setting up your Python environment

The best way to install Python is using conda. There are two options for setting up your conda install: Anaconda, an all-in-one installer that installs Python and many other packages (including Jupyter Lab and a lot more), and Miniconda, a more compact installer that installs Python and the conda package manager and then leaves you free to install further packages yourself as you need them.

Please see the instructions on the page about setting up your Python environment.
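
Once your environment is set up, a quick check like the sketch below can confirm which Python interpreter is active and whether a few packages can be found. The helper name check_environment and the package list are assumptions chosen for illustration; adjust the list to whatever you have installed.

```python
import importlib.util
import sys


def check_environment(packages):
    """Return a dict mapping each top-level package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


print("Python version:", sys.version.split()[0])
print("Interpreter:", sys.executable)

# The package list is only an example; adjust it to what you need.
for name, found in check_environment(["jupyterlab", "matplotlib"]).items():
    print(f"{name}: {'installed' if found else 'missing'}")
```

If the interpreter path printed does not point inside your conda installation, the conda environment is probably not activated in the current session.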

Introduction to Python lessons