Introduction to Python for Bioinformatics – 2019 page

The environment we will use for the Python course is JupyterLab, running as a service at Ilifu: https://bitc-jupyterlab.sanbi.ac.za/hub/login. We will also use an Etherpad shared notepad.

Source material includes the Software Carpentry Programming with Python lesson, the Programming for Biologists course, and the Python for Biologists book.

An aside: Why Python uses 0-based indexing: a blog post.
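As a small illustration of what 0-based indexing means in practice, here is a sketch using a made-up DNA string (the sequence is an assumption for illustration, not course data). Note how half-open slices line up neatly when counting starts at 0.

```python
# A made-up sequence, used only to illustrate indexing.
seq = "ATGCGT"

first_base = seq[0]             # the first character sits at index 0
last_base = seq[len(seq) - 1]   # the last character sits at index len - 1

# Slices include the start index and exclude the end index,
# so adjacent slices share a boundary without overlapping.
first_codon = seq[0:3]
second_codon = seq[3:6]

print(first_base, last_base, first_codon, second_codon)
```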

Software Carpentry section

Day 1 notebooks: session1, loadingdata and session3. These relate to this lesson.

Day 2 notebooks: session4_loops from this lesson, session5_lists from this lesson, session6_multiple_files from this lesson (what is a glob?), session7_making_choices from this lesson.
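On the question of what a glob is: it is a filename pattern containing wildcards such as `*`, and Python's `glob` module returns the paths that match it. The sketch below creates a few throwaway files in a temporary directory so that it runs anywhere; the file names are assumptions chosen for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as data_dir:
    # Create three small files; only two of them match the pattern below.
    for name in ["inflammation-01.csv", "inflammation-02.csv", "notes.txt"]:
        with open(os.path.join(data_dir, name), "w") as handle:
            handle.write("0,1,2\n")

    # '*' matches any run of characters within a file name.
    pattern = os.path.join(data_dir, "inflammation-*.csv")

    # glob.glob returns paths in arbitrary order, so sort them.
    matched = sorted(os.path.basename(path) for path in glob.glob(pattern))

print(matched)
```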

Day 3 (Monday – actually Tuesday) notebooks started with functions, based on this lesson. The lessons on errors and exceptions, defensive programming and debugging are left for self-study.

Day 4 (Wednesday): we moved on to command line programs, so here is the code for code/version.py and code/readings_01.py based on this lesson, and we continued that study on Thursday. For more sophisticated command line argument processing, look at argparse.
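For a feel of what argparse offers, here is a minimal sketch. It is not the course's code/readings_01.py; the option name `--action` and the file names are assumptions made for illustration. The argument list is passed explicitly so the example also runs inside a notebook.

```python
import argparse


def build_parser():
    """Build a parser that takes an optional action and any number of file names."""
    parser = argparse.ArgumentParser(description="Summarise readings from data files.")
    parser.add_argument("filenames", nargs="*", help="input data files")
    parser.add_argument(
        "--action",
        choices=["min", "mean", "max"],
        default="mean",
        help="statistic to report (default: mean)",
    )
    return parser


# In a real script you would call parse_args() with no arguments,
# which reads from sys.argv; here the list is given explicitly.
args = build_parser().parse_args(["--action", "max", "a.csv", "b.csv"])
print(args.action, args.filenames)
```

argparse also generates a `--help` message and rejects values outside `choices`, which is work you would otherwise do by hand with `sys.argv`.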

Exercise for the command line: Arithmetic on the Command Line, from the Software Carpentry lesson on the command line. A solution is in code/my_calculator.py and another solution is code/myarith.py. A demo on using myarith as a module is here. Also have a look at this multi-number version: multi_arg_arithmetic.py.
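One possible shape for such a solution is sketched below. It is not the code in code/my_calculator.py or code/myarith.py; the function name `calculate` and the set of operation names are assumptions. Putting the logic in a function is what makes it usable both from the command line and as an imported module.

```python
import operator

# Map operation names (hypothetical choices) to functions.
OPERATIONS = {
    "add": operator.add,
    "subtract": operator.sub,
    "multiply": operator.mul,
    "divide": operator.truediv,
}


def calculate(arguments):
    """Evaluate a list such as ['add', '1', '2'] and return the result."""
    if len(arguments) != 3:
        raise ValueError("expected: OPERATION NUMBER NUMBER")
    name, left, right = arguments
    if name not in OPERATIONS:
        raise ValueError(f"unknown operation: {name}")
    return OPERATIONS[name](float(left), float(right))


# In a script this list would come from sys.argv[1:].
print(calculate(["add", "1", "2"]))
print(calculate(["subtract", "3", "4"]))
```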

An aside: A notebook to download the necessary data is download_code_and_data.ipynb.

Python for Biologists section

Friday 1st March: Python for Biologists strings with answers. Monday 4th March: opening, reading and writing files.

Programming for Biologists has an exercise on processing bird count data which is rendered in this notebook: bird_problem.

From page 65 in Python for Biologists there are a couple of exercises on writing FASTA files: exercise_writing_fasta.
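As a reminder of the format, a FASTA record is a header line starting with `>` followed by the sequence, often wrapped to a fixed width. The sketch below is not the book's solution; the function name `format_fasta`, the header and the sequence are assumptions for illustration.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA record as a string, wrapping the sequence at `width`."""
    lines = [">" + header]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


record = format_fasta("seq1", "ATGCGTACGTTAG", width=5)
print(record, end="")

# To write records to a file instead of printing them:
# with open("sequences.fasta", "w") as handle:
#     handle.write(record)
```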

Moving on to page 84 in Python for Biologists: splitting strings and data from files: splitting_file_data. And another notebook on splitting_files.
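The core idea in these notebooks is turning one line of a delimited file into separate values with `str.split`. A minimal sketch follows; the record and its field layout are made up for illustration and are not taken from the book's data files.

```python
# One line as it might be read from a comma-separated file
# (a made-up record; the trailing newline is typical of lines read from a file).
line = "Drosophila melanogaster,atatatatcgcg,kdy647,264\n"

# Strip the newline first, then split on the delimiter.
fields = line.rstrip("\n").split(",")

# Unpack the list into named variables; split always returns strings,
# so numeric fields need an explicit conversion.
species, sequence, gene_name, expression = fields
expression = int(expression)

print(species, gene_name, len(sequence), expression)
```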

Some exercises on loops and using the modulo (%) operator: loops_and_module (with a backup copy here: loops_and_modulo). Then dictionaries.
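The `%` operator gives the remainder after division, which makes it handy inside loops for picking out every n-th item. The sketch below uses a made-up sequence and is not taken from the notebooks.

```python
# A made-up sequence of three codons.
seq = "ATGGCCTAA"

codon_starts = []
for position in range(len(seq)):
    # position % 3 is 0 at positions 0, 3, 6, ... (the first base of each codon).
    if position % 3 == 0:
        codon_starts.append(position)

# Use the start positions to cut the sequence into codons.
codons = [seq[start:start + 3] for start in codon_starts]

print(codon_starts, codons)
```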

Monday 11 March 2019 exercises: monday_exercises. If GitHub is playing up and not rendering the notebook, you can download it by clicking on the “Raw” button and then using right-click and “Save page as” to save the file locally as “monday_exercises.ipynb”. Save the notebook in the folder where your other notebooks are and open it using JupyterLab.

Please read chapters 5 and 6 of Python for Biologists on your own time. We will return to the book with Chapter 8 (dictionaries) on Tuesday before covering Chapter 7 (regular expressions) on Wednesday.

The notebook on dictionaries. The notebook on regular expressions. The Debuggex site and its regular expressions cheat sheet. A guide to f-strings in Python.
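The three topics above combine naturally. The sketch below, using a made-up sequence, counts bases in a dictionary, finds two restriction sites (EcoRI, GAATTC, and BamHI, GGATCC) with a regular expression, and reports results with f-strings. It is an illustration and not the content of the notebooks.

```python
import re

# A made-up sequence containing one EcoRI site and one BamHI site.
seq = "ATGCGAATTCGGATCC"

# Dictionary: count how often each base occurs.
base_counts = {}
for base in seq:
    base_counts[base] = base_counts.get(base, 0) + 1

# Regular expression: '|' means "either pattern".
site_starts = [match.start() for match in re.finditer(r"GAATTC|GGATCC", seq)]

# f-string: ':.1%' formats a fraction as a percentage with one decimal place.
gc_fraction = (base_counts["G"] + base_counts["C"]) / len(seq)
summary = f"GC content: {gc_fraction:.1%}, sites at {site_starts}"
print(summary)
```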

Finally k-mer counting.
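A k-mer is a substring of length k, and counting k-mers means sliding a window of that length along the sequence one base at a time. A minimal sketch using `collections.Counter` is shown below; the function name `count_kmers` and the test sequence are assumptions, and the linked notebook may take a different approach.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in `sequence`."""
    # A sequence of length n contains n - k + 1 windows of length k
    # (none at all if k is longer than the sequence).
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


kmer_counts = count_kmers("ATATA", 2)
for kmer, count in sorted(kmer_counts.items()):
    print(kmer, count)
```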

Additional exercises

An exercise for those who know Python: use ssh to log into il-slurmctl-ext.sanbi.ac.za (using your supplied username and password). There is a collection of data in /data/outbreak including a large number of .fastq.gz files containing short sequences and a .fasta.gz file containing a “reference genome” of a bacterial species. Use snippy to try and map the samples to the reference and observe the results. Do they all map equally well?

Project Rosalind

Project Rosalind, named after Rosalind Franklin, is a bioinformatics programming challenge site that grew out of the work of Philip Compeau and Pavel Pevzner, authors of Bioinformatics Algorithms, an Active Learning Approach.

Python Image Manipulation

Some playful image manipulation of the SANBI logo is in this image_manipulation notebook.

Plotting in Python

Python has many different options for plotting, and the plotting_gc notebook illustrates two: Matplotlib and Altair.

Statistics in Python

The probability_distributions notebook and the confidence_interval notebook.
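For orientation, a confidence interval for a mean can be sketched with the standard library alone. The example below uses a normal approximation and made-up GC percentages; both are assumptions for illustration, and the notebook may use a different method (for a sample this small, a t-distribution would be the more usual choice).

```python
import math
import statistics

# Made-up GC percentages from eight hypothetical samples.
gc_percentages = [41.2, 39.8, 40.5, 42.1, 40.9, 41.7, 39.5, 40.3]

sample_mean = statistics.mean(gc_percentages)

# Standard error of the mean: sample standard deviation / sqrt(n).
standard_error = statistics.stdev(gc_percentages) / math.sqrt(len(gc_percentages))

# Critical value for a two-sided 95% interval under a standard normal
# distribution (about 1.96). NormalDist needs Python 3.8 or later.
z = statistics.NormalDist().inv_cdf(0.975)

lower = sample_mean - z * standard_error
upper = sample_mean + z * standard_error
print(f"mean = {sample_mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```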

Next Generation Sequencing

The Galaxy training materials on transcriptomics that you can do on usegalaxy.eu.

Cute Pythons with Hats

On Pinterest

Strange but perhaps useful

  • bashplotlib – a command line utility for plotting graphs using text.
  • ASCIIGenome – a genome browser that runs in your terminal