ggplot2 2.2.0 coming soon!

ggPlot update brings another round of useful improvements.

RStudio Blog

I’m planning to release ggplot2 2.2.0 in early November. In preparation, I’d like to announce that a release candidate is now available: version Please try it out, and file an issue on GitHub if you discover any problems. I hope we can find and fix any major issues before the official release.

Install the pre-release version with:

# install.packages("devtools")

If you discover a major bug that breaks your plots, please file a minimal reprex, and then roll back to the released version with:


ggplot2 2.2.0 will be a relatively major release including:

The majority of this work was carried out by Thomas Pederson, who I was lucky to have as my “ggplot2 intern” this summer. Make sure to check out other visualisation packages: 

View original post 679 more words


Direct evidence for evolution does exist, and its been staring us in the face

In a simple, yet profound experiment researchers at Harvard Medical School have produced a visual aid that should hopefully help increase the frequency with which people understand and accept the theory of Evolution.

E.Coli evolve quickly

Here we see a population of E.Coli evolve antibiotic resistance over about 11 days. As bacteria work their way towards the center of the massive culture plate, population members produce mutations that allow them to survive the ever increasing antibiotic concentrations.



The ability for bacteria to evolve resistance to increasing concentrations of antibiotics is a serious problem for the global health community. The Australian government has dedicated  a webpage well worth reading to explaining the problem.

And its worth pointing out that these bacteria had to have evolved. Evolution is the result of inevitable genetic changes between related individuals across a population, and the only way for successive generations to literally survive across these different environments is to accumulate a mutation somewhere in the population that causes stronger resistance to the molecular effects of the antibiotic molecule.

I’ll wrap up this short post with the following recommendation on how to prevent the spread of resistant bacteria:

Ways to prevent antibiotic resistance

The most important ways to prevent antibiotic resistance are:

  • Minimize unnecessary prescribing and over prescribing of antibiotics. This occurs when people expect doctors to prescribe antibiotics for a viral illness (antibiotics do not work against viruses) or when antibiotics are prescribed for conditions that do not require them.
  • Complete the entire course of any prescribed antibiotic so that it can be fully effective and not breed resistance.
  • Practice good hygiene and use appropriate infection control procedures.

Let people know – don’t take random antibiotics, and certainly don’t take them to fight a viral infection such as the common cold!



Chip-Seq Part 3: Pre-alignment quality filtering

Chip-Seq Part 3:  Pre-alignment quality filtering

If you’ve already read through Parts ONE & TWO of this ChIP-seq survival guide, then we’re finally ready to begin having a look at some of our raw data. This is a survival guide where I’ll be pointing you to some commonly used and relied upon tools for performing ChIP-Seq (among many other) analyses. While this isn’t a strictly ‘bare bones’ guide, the reader is expected to have some working knowledge of the terminal interface (preferably on a unix based operating system), and full explanations of the tools presented here are a little beyond the scope of the guide. I can, however, go in to further detail on any tool presented here or otherwise. Tool requests are welcome.

Part 3 of this guide focuses on quality assessment of your data and what to make note of as you move forward with your analysis. We’ll end this part with the trimming and filtering of read data, which will produce the data files that are ready for alignment – which will be covered in part 4. I’ll share some basic tools that are freely available that you should consider using, and we’ll go over some ways to implement them.

Quality assessment

Before you bother aligning any of your sequence reads to your favorite genome, it’s important to take the time to assess data quality. If you want to perform a good analysis AND be able to talk about it with a commanding knowledge, you’ve got to understand all facets of your dataset.

The program we’ll be using for this step is FastQC. Using this tool, it is easy to analyze all of your raw data via the command line.

You can download and install FastQC from the following URL:

I’m not going to go in to too much detail on all of the ins and outs of this program since it is pretty straight forward (unless specifically requested in the comments :D), so I’ll point you to the website where you can feed yourself. I will, however, go over some key points.

When you run FastQC, a folder will be output containing individual pictures, and an html file. The html file can be opened in any web browser and it contains all of the information that you’ll need to assess the input file.

To get an idea of what the plots in this html mean, begin by reading through the documentation.

Frankly, the docs don’t look that great, but try and make it through it all. You won’t find the command line options here, but you will find useful information for interpreting the plots. The most important bits will be all of the files under folder “3 Analysis Modules”. This explains what each of the report means and will give you a pass/fail metric as well (that you are free to disregard depending on your data).

Consume all of the knowledge on this page, and then view the example reports. You can click here for a link to an example report using really good illumine data. Do not be surprised if your data does not look this good. If it doesn’t, that’s okay.

Once you have a basic understanding of the FastQC program, let’s just run it. Once you’ve installed the program, you can type the following to bring up all of the program options:

fastqc --help


./directory/path/fastqc –help

Most options probably won’t be useful or relevant to you. The one’s I’ve found useful are:

-o, --outdir     Create all output files in the specified output directory.

Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is 
created in the same directory as the sequence file which was processed.

** Use this when you specify all of the files for processing in the same program run**

-f –format      Bypasses the normal sequence file format detection and forces the program to use the specified format.  
                    Valid formats are bam, sam, bam_mapped, sam_mapped, and fastq

**This is useful for checking post-alignment read quality scores. E.g.if you perform trimming during alignments, as we’ll see in the next post**

-t –threads     Specifies the number of files which can be processed simultaneously.  Each thread will be allocated 250MB 
                    of memory so you shouldn't run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine.

**Very useful for speeding up the analysis if you have a powerful machine**


For those of you who like to automate things, there is a simple way to automate the analysis of all of your raw sequence data. After navigating to a directory with read files, (don’t bother unzipping them), run a similar command to this: (# =comment that isn’t run  as code)

#create an iterable variable of all the file names that you’re interested in using wildcard #syntax.

readfiles=$(ls *.fastq.gz)                        [Replace the extension to match your reads.]

#run a loop to apply the program run to each file name in the list of files called ‘readfiles’

for read in ${readfile}; do fastqc –o ${read%.*} –t 4 ${read}; done

Easy enough. For fastqc, the –o is optional, and you should run as many threads as you can using the –t option. The $(ls *.fastq.gz) assembles an iterable list of file names, and ${read%.*} prints the current ${read} file name but removes the extension.

Once you’ve run fastQC on all of your data files, lets download them, open them up, and have a look. Opening up the html file will provide you with all of the plots in a single window so disregard the individual image files – those are for use in any presentations you need to put together to show case your work (group meeting, lab meeting, etc.)

Per Base Quality

While all of the information is useful, the most useful is the read per-base quality chart. This tells you where to cut your reads (approximately) during the trimming process.


I’m partially colorblind, so it was difficult for me to see this at first, but on this chart there are three colors across three rows. From top to bottom: Green, Orange(yellow?), and Red.

Basically, you’ll want to find the farthest (to the right) base position that has reads that are on average mostly in the green. When the yellow bars start to move mostly in to the orange row, you’ll know this is where the quality of reads in this sample begin to degrade.

The chart is a box plot and the yellow bar represents range of quality that a given position contains across 75% of all of the reads analyzed. If you’ve got short yellow bars mostly in the green across the majority of length of your read, then you’ve got decent quality reads. If most of that yellow box sits outside of the green range starting from the beginning of your reads, you’ve got some mediocre data on your hands.

The implications of this plot are different for short read data (37-75bp reads) vs longer read data (100bp-200bp reads). For example, if you used short reads and half your sequence falls out of the green row range, this is a significant problem. If you use longer reads, the problem may be mitigated in the following trimming and filtering steps.

Sequence Duplication Plot

The next plot to consider can be the “Sequence Duplication Levels” plot. This will be most important for those who performed sequencing using low input. These experiments will generally utilize a PCR amplification to increase the amount of available sample for sequencing. Unfortunately this sometimes leads to high levels of duplicated reads that won’t actually contribute to downstream peak calling, and instead leads to low complexity peaks that actually make peaks more difficult to distinguish since they don’t fit the expected Poisson distribution of reads across true binding locations. This is, in general, a problem for low input high throughput sequencing experiments1.

duplication.pngIf you have duplication bias present in your library, you’ll see it manifest as spikes along the x-axis followed by a list of over represented sequences. There are options to deal with this, and you should probably attempt to de-duplicate your raw data using any number of programs. You can also align your data, and then deduplicate your aligned data. Choose whichever method is easiest to accomplish considering your resources. You can find a list of available software here. This list isn’t exhaustive, and samtools and picard tools also have tools for de-duplication of data.

Over Represented K-mers

Since we’re working through ChIP-seq, some of you are likely attempting to perform an analysis with low input. Sometimes this requires random priming during PCR amplification during library prep. This step can lead to alignment biases or failures which may reduce the integrity of your raw data.kmers.png


According to FastQC:

Libraries which derive from random priming will nearly always show Kmer bias at the start of the library due to an incomplete sampling of the possible random primers.

If you find any significantly over represented k-mers in your read population, consider attempting to trim reads to eliminate these positions. Don’t trim to the point that you lose the ability to map (for example, if you are using short read sequencing).

The rest of the fastqc output is important to assess, but is more informative than directive in terms of making decisions on how to pre-process your data.

Things to take note of are:

  • Per sequence GC content
  • Per base N content
  • Sequence length distribution
  • Adapter content

Trim And Filter

Finally – it is time for the final step: Trimming and filtering. Yes, in this day and age we still have to trim and filter reads. We will always have to do this! Especially with ChIP-seq or similar sequencing experiments (until some revolutionary new sequencing tech comes along).

There are, again, many options for dealing with trimming and filtering of reads. My favorite is, however, Trimmomatic. This program has the ability to handle singe ended and paired end data, it is written in Java making it able to run in any environment without the need for complicated installation, and it uses a sliding averaging window to determine when to chop reads. The advantage of this is that we don’t accidentally clip reads that have a low quality base called in the middle of a read. Furthermore, for paired end data, you retain pairs across output files and you also retain reads for which pairs were lost. This is how to really make the most of your data!

Have a look:

Lets say we want to trim reads when base quality drops below 20. I choose this because it is not super high, and it is also not super low. (Feel free to change it, I usually set my parameters somewhere between 15-25.)


Trimmomatic averages X number of bases (default =4) to determine the average base quality of the sliding window. If the quality is above your cutoff, there is no chopping action. So if you have a rogue base in your high quality read that is sub-par – the algorithm won’t clip your read. Keep in mind that your aligner will still have to deal with it, but that isn’t generally a problem.

Sometimes all we have to work with is low quality data, so we have to be a little more judicial with our trimming. If you are working with low quality data, maybe sit on the more stringent side of the fence – say towards 20. Of course, if you are too strict, you will lose too much data for any amount of reliable downstream analysis. If you have very high quality data, you can set this option much lower because you only want to remove the worst of the worst. This something you will need to work out by repeating your alignments using different trimming parameters to see what gives you the high degree of alignments with the highest confidence.

You can download Trimmomatic here.

The documentation is really straight forward on this one. You have a short list of options to set and output file names to choose. Don’t go under the 36bp minimum read length because you will reduce your ability to uniquely align reads during mapping (we’ll be filtering those as well in a later step after mapping). Also, if you are using paired data, retain any reads for use during alignment (this is done automatically – just give the output a good name that is easy to understand).

Finally – keep in mind that trimmomatic executes its steps in the order specified via the command line! If you need to do any uniform clipping of reads (for example every read needs to have 3bp clipped from the 5’ end without exception – due to library prep or whatever), then make sure to write that option FIRST on the command line.

With this, you should have some decently prepped data ready for alignment. Next time, we’ll discuss how to align the data really quickly, and what to do with the output from those alignments to prepare for use with peak calling algorithms to detect DNA occupancy locations.


  1. Xu H, Luo X, Qian J, et al: FastUniq: A Fast De Novo Duplicates Removal Tool for Paired Short Reads. PLoS One 2012; 7: 1–6.
  2. Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]
  3. Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.

Chip-Seq Part 2: Gathering MetaData

Chip-Seq Part 2: Gathering MetaData

Last time we went over collecting reference data to use when working through a ChIP seq analysis. Using reference data is very important considering the vast number of options that may be tweaked when analyzing different sets of sequence data. If you haven’t considered doing this and you haven’t read that post, you can find it here.

For part 2, we’ll be discussing what steps you can take to pre-process your data. Many of these steps are part of any standard computational pipeline that involves processing large read sequence data sets. In this post, I’ll try to relieve you of some of the anxiety these analyses may induce when trying to figure out whether or not you are doing things correctly.

Gather all the metadata you can find

Let’s begin with our reference data that you’ve collected. My assumption here is that you’ve retrieved your data from GEO, or some other reputable repository. It doesn’t matter where you’ve found your data, but what does matter is whether or not you have all of the associated Meta Data out there related to it. What do I mean by this? Let’s take a minute to consider.

The meta data is the information describing the sequencing. This includes…

  • General Sample information and stats
    • Number of sequenced reads
    • Read length
  • Library Prep information
    • Paired vs. unpaired sequencing methodology
    • Strand-ed-ness of the sequencing
    • Special library prep information
  • Sequencing platform information
    • Sequencing technology (Illumina vs Roche vs PacBio vs Ion vs Oxford)
    • Platform version
    • Chemistry version

Technology and metadata change with the times

This is important information. Sequencing technologies over the past 10 years have greatly diversified as they have been developed, and this diversification has led to the implementation of various different methodologies and chemistry. For example, the Roche 454 ‘long read’ sequencer used pyrosequencing with a powerful luminosity detector to assemble reads:

The ion Torrent used pH detectors to discover base identities:

Different technologies tend also to use different standards with regards to the information contained within individual sequencing reads, e.g. bass quality scores, or phred scores.

Phred Score Calculation – we’ll see this again

We’ll actually be interested in the mathematics behind calculating phred scores since we’ll be seeing this again when we start to assess the strength of our ChIP-seq peak scores when looking at DNA occupancy statistics.

Phred quality scores are defined as a property, Q, which is logarithmically related to the base-calling error probabilities P.

Q = -10log10(P)


These are encoded in to the read data in different ways depending on the sequencing technology used. Keep this equation in mind as we go forward. And make sure you have that metadata.

What if you don’t have any sequencing metadata?

If you don’t have any of this metadata information for the reference data, frankly, you should probably consider ditching the data set. At least proceed with caution and at least know which sequencer with which chemistry version was used.  For example, I recently analyzed a data set that explicitly stated the need to uniformly trim 3, 5’ base pairs due to the library prep method. Although I would have picked that up during the quality assessment steps we’ll discuss next time, it could have been missed in the absence of clear metadata leading to dramatic problems during read alignment.Don’t be surprised if you run in to program run failures during the analysis – some programs require accurate metadata. You can try and guess at certain parameters using the information that you have, but good science is built on knowledge and facts, not guesses.

You should also have all of this information handy for the experimental data of course!

Its okay to concatenate Gzip’ed files

And while you’re finishing up data collection and organizing, you should also concatenate any split read pools. Be cautious when concatenating multiple paired read files. These files generally are organized in an order that matches an “R1” file to its “R2” mate. E.g. line 2543 of the R1 file is the mate of the read found in line 2543 of the R2 file. Don’t concatenate your read files out of order. And just so you know, running the ‘cat’ command on gzip files is perfectly acceptable – gzip compressed files may be directly concatenated.


Next time we’ll discuss quality assessement and filtering methods that should be undertaken prior to read alignment.

Chip-Seq Part 1: Reference Data

Chip-Seq Part 1: Reference Data

In this post I’m going to kick off a series on analyzing transcription factor ChIPseq, and talk about some recent experiences I had while learning how to analyze ChIP-seq data to.  Just a heads up, in this post I’ll be sharing some command line knowledge and even some of that useful python I mentioned before.

So you’re new to ChIP-seq and you’ve got some data on your hands that you’re interested in analyzing it yourself. Let me begin by saying that to get a full appreciation of doing this, you’ll need a lot of access time to a powerful computer (for aligning sequence data), and lots of personal time to basically run several analyses about a dozen times each. You do this so that you can become intimately familiar with which parameter changes affect down stream results, and how those changes manifest in the final output. For this reason, I’d recommend instead seeking someone with that skill set (me) to do the analysis for you. Collaboration is a great for everyone! 😀

The analysis pipeline for ChIP-seq is pretty straight forward, but like any analysis in the sciences, the first big question isn’t whether or not you CAN perform the analysis, but whether or not you’ve performed the analysis CORRECTLY. This is a difficult question to answer without some reference data, so lets make that our first step.

STEP 1: Find and Download high quality reference data.

This step alone is worth a fair bit of discussion. And we’ll start from the beginning. Not everyone who analyzes ChIPseq data is familiar with what they are working with. So ask yourself a couple of quick questions.

  • What is your transcription factor?
  • Do you have any reference literature studying this protein?

There are at least a couple of approaches you can take to finding the positive control data that you’re looking for, so I’ll give you the two that I will generally use, and then you can feed yourself more. I’ll be using an estrogen receptor alpha ChIP seq study as reference throughout these posts (1).

First off, head on over to PubMed and try to find some literature on your transcription factor. If you’re lucky, someone will have already performed ChIP on this protein and deposited their data in to GEO. Try using keywords in your search such as ‘ChIP’ or ‘High Throughput Sequencing’. Grab any papers that use ChIPseq on your protein.

This is the best case scenario. There is a good chance you won’t be this lucky, and we’ll discuss some ways to deal with that. But IF you are – grab the article and scan it for a GEO accession number associated with their ChIP data. It should look something like this:


If you don’t see any papers coming up on your search, don’t lose hope yet. You may already be aware of GEO, perhaps you’ve even analyzed a published data set in a class. If not, GEO is a repository of ALL KINDS of high throughput data. Have a look at what they offer:


We’ll be looking for Genome binding/occupancy profiling data.

**WARNING**If you happen to try looking for these types of studies using the NCBI website under the drop down option ‘GEO Datasets’ using the filters on the left side of the page, you may run in to trouble.



Filtering the search results may return unexpected results. When I don’t filter the search, I find what I’m looking for:


When I filter those results, my desired search results disappear:


This may not be a universal problem, but just keep in mind that before you give up a search, you should try multiple different search avenues (and also attempt to make good use of the search syntax available to us. For example:

(WT[All Fields] AND V[All Fields] AND ER[All Fields] AND alpha[All Fields] AND ChIP-seq[All Fields]) AND "Mus musculus"[Organism]

(Most people don’t bother with this, and its largely handled by the search engines, so consider using it only if you can’t find what you’re looking for with the specificity you need, e.g. you can’t seem to return fewer than 1000 search results.)


To work efficienty with GEO, it is essential that you read the overview information at This will give you an understanding of how GEO is organized, and how to interact with the database. I’ll cover interacting with GEO directly using R in a later blog, so keep tuned for that.

Using that link, you can also navigate directly to the GEO website and search using keywords or accession numbers.


If you can’t find any ChIP-seq data on GEO, I’d suggest broadening your search to include other databases, though if you can’t find it on GEO, you’re likely not going to find it. This is, I admit, a mouse/human/popular model organism-centric blog, so I don’t know much (yet) about other lesser studied organisms and where their data is stored.

If worse comes to worst and there really are no data sets specific to your protein, that’s Ok! Take pride in knowing that you are exploring new territory. And after all, the point of this series is to help guide you as you explore this new data (assuming of course that you’re relatively new to the ChIP analysis process). As you explore binding location (peak) calling software, you’ll come to find that there are two types of algorithms for calling peaks: those that are tailored towards narrow (transcription factor) peaks, and those for broad (chromatin modification) peaks.

My recommendation is to go find some transcription factor chip-seq data and try to at least match the conditions of your experiment. Often there will be a Treatment/No Treatment setup to determine how TF binding changes under certain conditions. Search for data sets that are appropriate. For example, if you are studying the binding landscape of your protein in a particular cell line or tissue, collect the data that represents the same untreated conditions.

Okay! By now you’ve hopefully managed to find some suitable data to use as reference as you begin your analysis. Next time we’ll discuss what to do with that data, and how to begin the analysis on your own data.



1. Hewitt SC, Li L, Grimm SA, et al: Research Resource: Whole-Genome Estrogen Receptor α Binding in Mouse Uterine Tissue Revealed by ChIP-Seq. Mol. Endocrinol. 2012; 26: 887–898.

Python – What a useful language

I began learning python a couple years ago in an effort to diverse my biological skill set. Its been a difficult process, but with some tenacity and perseverance I’ve managed to learn and make great use of the language in my research training. The advantages of learning a programming a language are indeed great. The skills taken away from learning just a single language can empower one to complete daunting tasks in data management and processing – though the applications are probably beyond the sight of the entry level learner.

I began learning Java with an early edition of the ‘Java for Dummies’ series. The series seemed appropriate to me at the time. After completing a few introductory projects, I put the book down and perhaps a year later decided to reinvest myself in to programming, only this time following freely available tutorials online. I had been heavily recommended python as a starter language and later learned that it was not only a more easily digested introductory language to programming but was also as versatile as it was powerful. Today many biological tools are coded in Python and we’ll be discussing some of those tools throughout the series.

This is the first in a series of Python tagged (for organization!) blog entries where I’ll share my thoughts and maybe even some advice on learning python and making use of it in research, and, importantly, when NOT to make use of it. For the computational biologist, a priority aside from optimizing system and program parameters should be optimizing the use of their time. Through these entries, I hope to convince you of the utility of learning python while also sharing details on the language that I have learned during my self guided education, and even some that I have yet to learn at the time of writing this.


Next time, we’ll talk about my path so far learning python so that some of you might follow something similar.



How to convert between Genome Builds

As a computational biologist, I often implement computational tools that make use of pre-made information files, such as genome builds, blacklist files, chromosome size files, etc. However, computational biology is a growing and changing field, and our collective resources aren’t always completely up to date and dealing with incompatibilities between these resources is sometimes a harsh reality.

Take for example: There are currently 5 Human genome builds available on the UCSC genome browser, and 4 Mouse genome builds. Problems arise when you spend valuable computational time aligning, say 15 or so, ChIP-seq experiments to the mm10 mouse genome build, only to realize later that your blacklist region file is made using the mm9 build.

And just to be clear – interval annotations between genome builds are NOT the same!

Genome Differences

Notice how the the start locations for Sash1 differ by nearly 200kb!

With this in mind, I’ll introduce a nifty tool provided by those who use the UCSC genome builds in their work: LiftOver.


This tool allows you to upload an interval file – which is how you should deal with pretty  much any file that would be in need of converting between genome builds – and convert intervals between builds.


To illustrate how this works, I downloaded the mm9.blacklist.bed from ENCODE:

These files describe locations in the genome that should be ignored during certain analysis (such as enrichment profiling during ChIP-seq analysis). The regions, for whatever reason, tend to produce a lot of noise. Perhaps due to chromatin configurations, the regions tend to precipitate during the IP of ChIP and produce pileups that represent very strong false positive signals.

Preserving these regions can give false hope to poorly executed experiments. (I’ll discuss cross strand correlation analyses in a later post.)

First, for reference (so as to not completely mislead you on what a proper cross strand correlation result should look like) – This is a CC analysis  on some data downloaded from GEO:



And these are the sub par results generated from a pilot study (using mouse  tissue – ChIP notoriously does NOT work well on tissue, so pilot was exploratory  and we DID manage to salvage SOME data from the low input the resulted in the following graphs).

CC analysis

To convert the downloaded mm9.blacklist.bed.gz file to a mm10 version, follow these steps:

  1. Select the original model organism (in this case ‘Mouse’).
  2. Select the original genome (in this case, the ‘mm9’ build).
  3. Select the NEW model (which should generally be the same as the original…).
  4. Select the NEW assembly (in this case, mm10).
  5. The following options below are specific to your BED files. For details on what the different BED formats mean, you can visit the UCSC format page. These different BED versions simply indicate the number of information columns present in the tab delimited BED file. BED4 has the chr, start, stop, name columns, whereas BED6 contains two additional data columns.
  6. Finally, you may either upload your original file or you can paste in bed file data for conversion directly.

How to use liftOver

And Wa-LA! You’ve got a converted file.

You may find this useful for handling blacklist files like in the example, or you may need to convert a summary peak file from a ChIP-seq experiment for use with a particular tool that requires a specifically different genome build.

If you’ve got any useful ideas on how to use this tool, leave a comment below.