Python – What a useful language

I began learning Python a couple of years ago in an effort to diversify my biological skill set. It’s been a difficult process, but with some tenacity and perseverance I’ve managed to learn the language and make great use of it in my research training. The advantages of learning a programming language are indeed great. The skills taken away from learning just a single language can empower one to complete daunting tasks in data management and processing, though the applications are probably beyond the sight of the entry-level learner.

My first attempt at programming was with Java, using an early edition of the ‘Java for Dummies’ series. The series seemed appropriate to me at the time. After completing a few introductory projects, I put the book down, and perhaps a year later decided to reinvest myself in programming, only this time following freely available tutorials online. Python had been heavily recommended to me as a starter language, and I later learned that it was not only an easier introduction to programming but also as versatile as it is powerful. Today many biological tools are written in Python, and we’ll be discussing some of those tools throughout the series.

This is the first in a series of Python-tagged (for organization!) blog entries where I’ll share my thoughts and maybe even some advice on learning Python and making use of it in research, and, importantly, when NOT to make use of it. For the computational biologist, a priority aside from optimizing system and program parameters should be optimizing the use of their time. Through these entries, I hope to convince you of the utility of learning Python while also sharing details of the language that I have learned during my self-guided education, and even some that I have yet to learn at the time of writing this.

Next time, we’ll talk about my path so far in learning Python, so that some of you might follow something similar.

How to convert between Genome Builds

As a computational biologist, I often use computational tools that rely on pre-made reference files, such as genome builds, blacklist files, and chromosome size files. However, computational biology is a growing and changing field, and our collective resources aren’t always up to date. Dealing with incompatibilities between these resources is sometimes a harsh reality.

Take this example: at the time of writing, there are 5 human genome builds available on the UCSC Genome Browser, and 4 mouse genome builds. Problems arise when you spend valuable computational time aligning, say, 15 or so ChIP-seq experiments to the mm10 mouse genome build, only to realize later that your blacklist region file was made using the mm9 build.

And just to be clear – interval annotations between genome builds are NOT the same!

Genome Differences

Notice how the start locations for Sash1 differ by nearly 200 kb!

With this in mind, I’ll introduce a nifty tool provided by the UCSC Genome Browser for those who use UCSC genome builds in their work: LiftOver.

LiftOver

This tool lets you upload an interval file and convert its intervals from one build to another. An interval file is how you should represent pretty much anything that needs converting between genome builds.

CASE STUDY

To illustrate how this works, I downloaded the mm9.blacklist.bed from ENCODE:

https://sites.google.com/site/anshulkundaje/projects/blacklists

These files describe locations in the genome that should be ignored during certain analyses (such as enrichment profiling during ChIP-seq analysis). The regions, for whatever reason, tend to produce a lot of noise. Perhaps due to chromatin configuration, they tend to precipitate during the IP step of ChIP and produce pileups that represent very strong false positive signals.

Preserving these regions can give false hope to poorly executed experiments. (I’ll discuss cross-strand correlation analyses in a later post.)
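To make “ignoring” blacklisted regions a bit more concrete, here is a minimal Python sketch of dropping any peak that overlaps a blacklist interval. The function names and the toy coordinates are my own inventions for illustration, not taken from the ENCODE files. It assumes plain tab-delimited BED input and does a simple pairwise comparison, which is fine for a few hundred blacklist regions. For real work a dedicated tool such as bedtools is the more usual choice.

```python
from collections import defaultdict


def parse_bed(lines):
    """Yield (chrom, start, end) from tab-delimited BED lines.

    Blank lines and header lines (#, track, browser) are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        yield fields[0], int(fields[1]), int(fields[2])


def remove_blacklisted(peaks, blacklist):
    """Return the peaks that do not overlap any blacklist interval."""
    # Group blacklist intervals by chromosome so each peak is only
    # compared against regions on its own chromosome.
    by_chrom = defaultdict(list)
    for chrom, start, end in blacklist:
        by_chrom[chrom].append((start, end))

    kept = []
    for chrom, start, end in peaks:
        # BED intervals are half-open [start, end), so two intervals
        # overlap only when each one starts before the other ends.
        overlaps = any(
            start < b_end and b_start < end
            for b_start, b_end in by_chrom.get(chrom, ())
        )
        if not overlaps:
            kept.append((chrom, start, end))
    return kept


# Toy data (hypothetical coordinates, for illustration only).
example_peaks = list(parse_bed([
    "chr1\t100\t200",
    "chr1\t500\t600",
    "chr2\t100\t200",
]))
example_blacklist = list(parse_bed(["chr1\t150\t300"]))
clean_peaks = remove_blacklisted(example_peaks, example_blacklist)
```

Note that this only gives a sensible answer when the peaks and the blacklist use the same genome build, which is exactly the problem this post is about.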

First, for reference (so as not to completely mislead you about what a proper cross-strand correlation result should look like), this is a CC analysis on some data downloaded from GEO:

Figure: good-cc (cross-strand correlation plot for the GEO data)

And these are the subpar results generated from a pilot study (using mouse tissue; ChIP notoriously does NOT work well on tissue, so the pilot was exploratory, and we DID manage to salvage SOME data from the low input that resulted in the following graphs).

CC analysis

To convert the downloaded mm9.blacklist.bed.gz file to an mm10 version, follow these steps:

  1. Select the original model organism (in this case ‘Mouse’).
  2. Select the original genome (in this case, the ‘mm9’ build).
  3. Select the NEW model (which should generally be the same as the original…).
  4. Select the NEW assembly (in this case, mm10).
  5. The remaining options are specific to your BED file. For details on what the different BED formats mean, you can visit the UCSC format page. The different BED versions simply indicate the number of columns present in the tab-delimited BED file: BED4 has the chr, start, stop, and name columns, whereas BED6 adds two more (score and strand).
  6. Finally, you may either upload your original file or paste in BED data for conversion directly.
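If you aren’t sure which BED version your file is (step 5), counting the tab-delimited columns will tell you. The snippet below is a small sketch of that check. The function name is my own, and it assumes every data line in the file has the same number of columns, raising an error if that is not the case.

```python
def bed_column_count(lines):
    """Return the number of tab-delimited columns shared by all data lines.

    Blank lines and header lines (#, track, browser) are ignored.
    Raises ValueError if there are no data lines or the counts differ.
    """
    counts = set()
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        counts.add(len(line.split("\t")))
    if len(counts) != 1:
        raise ValueError(f"expected one consistent column count, found {sorted(counts)}")
    return counts.pop()


# Made-up example lines: one BED4 record and one BED6 record.
bed4_columns = bed_column_count(["chr1\t100\t200\tregionA\n"])
bed6_columns = bed_column_count(["chr1\t100\t200\tregionA\t0\t+\n"])
```

For a gzipped download you could pass an open file handle instead of a list, for example `gzip.open("mm9.blacklist.bed.gz", "rt")` after importing `gzip` from the standard library.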

How to use liftOver

And voilà! You’ve got a converted file.

You may find this useful for handling blacklist files like in the example, or you may need to convert a summary peak file from a ChIP-seq experiment for use with a particular tool that requires a specific genome build.
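If you have more than a couple of files to convert, clicking through the web form gets old quickly. UCSC also distributes liftOver as a command-line program that takes four arguments: the input file, a chain file describing the mapping between the two builds, an output file, and a file for the intervals that could not be converted. Below is a rough Python sketch of wrapping that call. It assumes the liftOver binary is installed and on your PATH and that you have already downloaded the right chain file. All file names here are placeholders of my own, so treat this as a starting point and not a tested pipeline.

```python
import shutil
import subprocess


def build_liftover_command(bed_in, chain_file, bed_out, unmapped_out,
                           executable="liftOver"):
    """Build the argument list: liftOver oldFile map.chain newFile unMapped."""
    return [executable, bed_in, chain_file, bed_out, unmapped_out]


def run_liftover(bed_in, chain_file, bed_out, unmapped_out):
    """Run liftOver, raising an error if the binary is missing or the run fails."""
    cmd = build_liftover_command(bed_in, chain_file, bed_out, unmapped_out)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError("liftOver was not found on PATH")
    subprocess.run(cmd, check=True)
    return cmd


# Placeholder file names (assumptions for illustration only).
example_cmd = build_liftover_command(
    "mm9.blacklist.bed",
    "mm9ToMm10.over.chain.gz",
    "mm10.blacklist.bed",
    "mm9.blacklist.unmapped.bed",
)
```

Whichever route you take, have a look at the unmapped file afterwards. Some intervals may not convert cleanly, and it is better to know which ones were dropped.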

If you’ve got any useful ideas on how to use this tool, leave a comment below.