How to convert between Genome Builds

As a computational biologist, I often implement computational tools that make use of pre-made information files, such as genome builds, blacklist files, chromosome size files, etc. However, computational biology is a growing and changing field, and our collective resources aren’t always completely up to date. Dealing with incompatibilities between these resources is sometimes a harsh reality.

Take, for example, the UCSC genome browser: there are currently 5 human genome builds available on it, and 4 mouse genome builds. Problems arise when you spend valuable computational time aligning, say, 15 or so ChIP-seq experiments to the mm10 mouse genome build, only to realize later that your blacklist region file was made using the mm9 build.

And just to be clear – interval annotations between genome builds are NOT the same!

Genome Differences

Notice how the start locations for Sash1 differ by nearly 200 kb!

With this in mind, I’ll introduce a nifty tool that UCSC provides for those who use its genome builds in their work: LiftOver.

LiftOver

This tool allows you to upload an interval file (which is how you should handle pretty much any file that needs converting between genome builds) and convert its intervals from one build to another.

CASE STUDY

To illustrate how this works, I downloaded the mm9.blacklist.bed from ENCODE:

https://sites.google.com/site/anshulkundaje/projects/blacklists

These files describe locations in the genome that should be ignored during certain analyses (such as enrichment profiling during ChIP-seq analysis). The regions, for whatever reason, tend to produce a lot of noise. Perhaps due to chromatin configurations, they tend to precipitate during the IP step of ChIP and produce pileups that look like very strong signals but are false positives.
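To make the idea of "ignoring" blacklisted regions concrete, here is a minimal Python sketch of dropping peaks that overlap a blacklist interval. The function names (`parse_bed`, `overlaps_blacklist`, `filter_peaks`) and all coordinates are made up for illustration; in practice a dedicated tool is usually a better fit for files of real size.

```python
from collections import defaultdict


def parse_bed(lines):
    """Group (start, end) intervals by chromosome from BED-formatted lines."""
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the optional header lines a BED file may carry.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        intervals[chrom].append((int(start), int(end)))
    return intervals


def overlaps_blacklist(chrom, start, end, blacklist):
    """True if [start, end) overlaps any blacklisted interval on chrom."""
    # BED coordinates are 0-based and half-open, so intervals that merely
    # touch (one ends where the other starts) do not overlap.
    return any(start < b_end and b_start < end
               for b_start, b_end in blacklist.get(chrom, []))


def filter_peaks(peak_lines, blacklist):
    """Return the peak lines that do not overlap the blacklist."""
    kept = []
    for line in peak_lines:
        line = line.rstrip("\n")
        chrom, start, end = line.split("\t")[:3]
        if not overlaps_blacklist(chrom, int(start), int(end), blacklist):
            kept.append(line)
    return kept


# Hypothetical coordinates, not taken from any real blacklist file.
blacklist = parse_bed(["chr1\t100\t200", "chr2\t500\t900"])
peaks = [
    "chr1\t50\t120\tpeak_a",   # overlaps chr1:100-200, so it is dropped
    "chr1\t200\t260\tpeak_b",  # only touches the blacklist end, so it is kept
    "chr2\t950\t990\tpeak_c",  # outside chr2:500-900, so it is kept
]
kept = filter_peaks(peaks, blacklist)
```

Note that this only gives sensible results when the peaks and the blacklist are on the same genome build, which is exactly the problem this post is about.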

Keeping these regions in can give false hope about poorly executed experiments. (I’ll discuss cross-strand correlation analyses in a later post.)

First, for reference (so as to not completely mislead you on what a proper cross-strand correlation result should look like), this is a CC analysis on some data downloaded from GEO:

good-cc

 

And these are the subpar results generated from a pilot study using mouse tissue. ChIP notoriously does NOT work well on tissue, so the pilot was exploratory. We DID manage to salvage SOME data from the low input that resulted in the following graphs.

CC analysis

To convert the downloaded mm9.blacklist.bed.gz file to an mm10 version, follow these steps:

  1. Select the original model organism (in this case ‘Mouse’).
  2. Select the original genome (in this case, the ‘mm9’ build).
  3. Select the NEW model (which should generally be the same as the original…).
  4. Select the NEW assembly (in this case, mm10).
  5. The remaining options are specific to your BED files. For details on what the different BED formats mean, you can visit the UCSC format page. These different BED versions simply indicate the number of information columns present in the tab-delimited BED file. BED4 has the chr, start, stop, and name columns, whereas BED6 adds two more: score and strand.
  6. Finally, you may either upload your original file or paste in BED file data for conversion directly.
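If you are unsure which BED option applies to your file in step 5, counting the tab-delimited columns is usually enough. The snippet below is a small sketch of that check; `bed_column_count` is a hypothetical helper name and the example rows are invented.

```python
def bed_column_count(lines):
    """Return the number of tab-delimited columns shared by all data lines."""
    counts = set()
    for line in lines:
        line = line.rstrip("\n")
        # Ignore blank lines and optional header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        counts.add(len(line.split("\t")))
    if len(counts) != 1:
        raise ValueError(
            f"expected one consistent column count, found {sorted(counts)}"
        )
    return counts.pop()


# Invented example rows: a BED4 file and a BED6 file.
bed4 = ["chr1\t100\t200\tregion_1", "chr2\t500\t900\tregion_2"]
bed6 = ["chr1\t100\t200\tregion_1\t0\t+"]

n4 = bed_column_count(bed4)  # 4 columns: chr, start, stop, name
n6 = bed_column_count(bed6)  # 6 columns: the above plus score and strand
```

To run it on a real file, pass an open file handle instead of a list, for example `bed_column_count(open("my_regions.bed"))`, where the file name is a placeholder.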

How to use liftOver

And voilà! You’ve got a converted file.

You may find this useful for handling blacklist files like in the example, or you may need to convert a summary peak file from a ChIP-seq experiment for use with a particular tool that requires a different genome build.
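If you have many files to convert, the web form gets tedious. UCSC also distributes a command-line `liftOver` program, which is not what this post walks through, so treat the following as a rough sketch rather than a tested recipe. It assumes that the `liftOver` binary is installed and on your PATH, that you have downloaded the mm9-to-mm10 chain file from UCSC (named `mm9ToMm10.over.chain.gz` here), and that the blacklist has been decompressed first. All file names are placeholders.

```python
import shutil
import subprocess


def build_liftover_command(bed_in, chain, bed_out, unmapped):
    """Build the argument list for UCSC's command-line liftOver.

    The positional order follows the tool's usage message:
        liftOver oldFile map.chain newFile unMapped
    """
    return ["liftOver", bed_in, chain, bed_out, unmapped]


def run_liftover(bed_in, chain, bed_out, unmapped):
    """Run liftOver, failing early if the binary is not available."""
    if shutil.which("liftOver") is None:
        raise FileNotFoundError("liftOver was not found on PATH")
    cmd = build_liftover_command(bed_in, chain, bed_out, unmapped)
    subprocess.run(cmd, check=True)


# Placeholder file names (assumptions, adjust to your own paths).
cmd = build_liftover_command(
    "mm9.blacklist.bed",        # input intervals on the old build
    "mm9ToMm10.over.chain.gz",  # chain file describing the mm9 -> mm10 mapping
    "mm10.blacklist.bed",       # output intervals on the new build
    "unmapped.bed",             # intervals that could not be converted
)
print(" ".join(cmd))
```

Calling `run_liftover(...)` with the same four arguments would perform the conversion. Either way, it is worth looking at the unmapped output, since not every interval has an equivalent in the new build.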

If you’ve got any useful ideas on how to use this tool, leave a comment below.

 
