ChIP-seq Part 3: Pre-alignment quality filtering

If you’ve already read through Parts ONE & TWO of this ChIP-seq survival guide, then we’re finally ready to have a look at some of our raw data. This is a survival guide where I’ll point you to some commonly used and relied-upon tools for performing ChIP-seq (among many other) analyses. While this isn’t a strictly ‘bare bones’ guide, the reader is expected to have some working knowledge of the terminal interface (preferably on a Unix-based operating system), and full explanations of the tools presented here are a little beyond the scope of the guide. I can, however, go into further detail on any tool presented here or otherwise. Tool requests are welcome.

Part 3 of this guide focuses on quality assessment of your data and what to make note of as you move forward with your analysis. We’ll end this part with the trimming and filtering of read data, which will produce the data files that are ready for alignment – which will be covered in part 4. I’ll share some basic tools that are freely available that you should consider using, and we’ll go over some ways to implement them.

Quality assessment

Before you bother aligning any of your sequence reads to your favorite genome, it’s important to take the time to assess data quality. If you want to perform a good analysis AND be able to talk about it with a commanding knowledge, you’ve got to understand all facets of your dataset.

The program we’ll be using for this step is FastQC. Using this tool, it is easy to analyze all of your raw data via the command line.

You can download and install FastQC from the following URL:

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

I’m not going to go into too much detail on all of the ins and outs of this program since it is pretty straightforward (unless specifically requested in the comments :D), so I’ll point you to the website where you can feed yourself. I will, however, go over some key points.

When you run FastQC, it outputs a folder containing individual plot images and an HTML file. The HTML file can be opened in any web browser, and it contains all of the information that you’ll need to assess the input file.

To get an idea of what the plots in this html mean, begin by reading through the documentation.

Frankly, the docs don’t look that great, but try to make it through them all. You won’t find the command line options here, but you will find useful information for interpreting the plots. The most important bits are the files under the “3 Analysis Modules” folder. These explain what each report section means and give you a pass/fail metric as well (which you are free to disregard depending on your data).

Consume all of the knowledge on this page, and then view the example reports. You can click here for a link to an example report using really good Illumina data. Do not be surprised if your data does not look this good. If it doesn’t, that’s okay.

Once you have a basic understanding of the FastQC program, let’s run it. After you’ve installed it, you can type the following to bring up all of the program options:

fastqc --help

(or)

./directory/path/fastqc --help

Most options probably won’t be useful or relevant to you. The ones I’ve found useful are:

-o, --outdir     Create all output files in the specified output directory.

Please note that this directory must exist as the program will not create it. If this option is not set then the output file for each sequence file is 
created in the same directory as the sequence file which was processed.

**Use this when you specify all of the files for processing in the same program run**

-f, --format     Bypasses the normal sequence file format detection and forces the program to use the specified format.
                    Valid formats are bam, sam, bam_mapped, sam_mapped, and fastq

**This is useful for checking post-alignment read quality scores, e.g. if you perform trimming during alignment, as we’ll see in the next post**

-t, --threads    Specifies the number of files which can be processed simultaneously.  Each thread will be allocated 250MB
                    of memory, so you shouldn't run more threads than your available memory can cope with, and no more than 6 threads on a 32-bit machine.

**Very useful for speeding up the analysis if you have a powerful machine**
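Putting these options together, a single run over every gzipped FASTQ in a directory might look like this (the `reads/` path and the `fastqc_reports` directory name are placeholders for your own layout):

```shell
# FastQC will not create the output directory for you, so make it first.
outdir="fastqc_reports"
mkdir -p "${outdir}"

# Analyze every gzipped FASTQ file, processing four files at a time.
fastqc -o "${outdir}" -t 4 reads/*.fastq.gz
```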

 

For those of you who like to automate things, there is a simple way to run the analysis on all of your raw sequence data. After navigating to a directory with read files (don’t bother unzipping them; FastQC reads gzipped files directly), run a command similar to this: (# marks a comment, not code to run)

#create an iterable variable of all the file names that you’re interested in, using wildcard syntax

readfiles=$(ls *.fastq.gz)                        #replace the extension to match your reads

#run a loop to apply the program to each file name in the list called ‘readfiles’

for read in ${readfiles}; do mkdir -p "${read%%.*}"; fastqc -o "${read%%.*}" -t 4 "${read}"; done

Easy enough. For FastQC, the -o option is optional, and you should run as many threads as you can using the -t option. The $(ls *.fastq.gz) assembles an iterable list of file names, mkdir -p creates the output directory (remember, FastQC won’t create it for you), and ${read%%.*} expands to the current ${read} file name with its extensions removed.
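This kind of parameter expansion is easy to sanity-check on its own. A quick sketch of the two trailing-pattern forms, using a hypothetical file name:

```shell
r="SRR000001.fastq.gz"
echo "${r%.*}"     # % removes the shortest trailing match: SRR000001.fastq
echo "${r%%.*}"    # %% removes the longest trailing match: SRR000001
```

`%` deletes the shortest match of the pattern from the end of the value, while `%%` deletes the longest.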

Once you’ve run FastQC on all of your data files, let’s download the reports, open them up, and have a look. Opening the HTML file will give you all of the plots in a single window, so disregard the individual image files; those are for use in any presentations you need to put together to showcase your work (group meeting, lab meeting, etc.).

Per Base Quality

While all of the information is useful, the most useful is the read per-base quality chart. This tells you where to cut your reads (approximately) during the trimming process.

[Image: per-base quality plot from a high-quality run]

I’m partially colorblind, so it was difficult for me to see this at first, but on this chart there are three colors across three rows. From top to bottom: green, orange (yellow?), and red.

Basically, you’ll want to find the farthest (rightmost) base position whose reads are, on average, mostly in the green. When the yellow boxes start to move mostly into the orange row, you’ll know this is where the quality of reads in this sample begins to degrade.

The chart is a box plot: at each position, the yellow box represents the inter-quartile range of quality scores, i.e. the middle 50% of all of the reads analyzed. If you’ve got short yellow boxes sitting mostly in the green across the majority of the length of your read, then you’ve got decent quality reads. If most of that yellow box sits outside of the green range starting from the beginning of your reads, you’ve got some mediocre data on your hands.

The implications of this plot are different for short read data (37-75bp reads) vs longer read data (100bp-200bp reads). For example, if you used short reads and half your sequence falls out of the green row range, this is a significant problem. If you use longer reads, the problem may be mitigated in the following trimming and filtering steps.

Sequence Duplication Plot

The next plot to consider is the “Sequence Duplication Levels” plot. This will be most important for those who performed sequencing using low input. These experiments generally use PCR amplification to increase the amount of available sample for sequencing. Unfortunately, this sometimes leads to high levels of duplicated reads that won’t actually contribute to downstream peak calling; instead, they produce low-complexity pile-ups that make true peaks harder to distinguish, since they don’t fit the expected Poisson distribution of reads across true binding locations. This is, in general, a problem for low-input high-throughput sequencing experiments [1].

[Image: sequence duplication levels plot]

If you have duplication bias present in your library, you’ll see it manifest as spikes along the x-axis, followed by a list of overrepresented sequences. There are options for dealing with this, and you should probably attempt to de-duplicate your raw data using any number of programs. You can also align your data first and then de-duplicate the aligned reads. Choose whichever method is easiest to accomplish considering your resources. You can find a list of available software here. This list isn’t exhaustive; SAMtools and Picard also have tools for de-duplication of data.
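For the post-alignment route with paired-end data, the SAMtools fixmate/markdup pipeline is one option. A sketch, assuming a reasonably recent samtools and placeholder file names:

```shell
in="aligned.bam"                 # placeholder: your aligned, unsorted BAM
out="${in%.bam}.dedup.bam"       # e.g. aligned.dedup.bam

samtools sort -n -o namesort.bam "${in}"        # fixmate requires name-sorted input
samtools fixmate -m namesort.bam fixmate.bam    # -m adds mate score tags that markdup uses
samtools sort -o possort.bam fixmate.bam        # markdup requires coordinate-sorted input
samtools markdup -r possort.bam "${out}"        # -r removes duplicates instead of just marking them
```

Picard’s MarkDuplicates is a common alternative that works directly from a single coordinate-sorted BAM.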

Over Represented K-mers

Since we’re working through ChIP-seq, some of you are likely attempting to perform an analysis with low input. This sometimes requires random priming during the PCR amplification step of library prep, which can lead to alignment biases or failures that may reduce the integrity of your raw data.

[Image: k-mer content plot]

 

According to FastQC:

Libraries which derive from random priming will nearly always show Kmer bias at the start of the library due to an incomplete sampling of the possible random primers.

If you find any significantly overrepresented k-mers in your read population, consider trimming reads to eliminate these positions. Don’t trim to the point that you lose the ability to map (for example, if you are using short-read sequencing).

The rest of the FastQC output is important to assess, but it is more informative than directive in terms of deciding how to pre-process your data.

Things to take note of are:

  • Per sequence GC content
  • Per base N content
  • Sequence length distribution
  • Adapter content

Trim And Filter

Finally, it is time for the last step: trimming and filtering. Yes, in this day and age we still have to trim and filter reads. We will always have to do this! Especially with ChIP-seq or similar sequencing experiments (until some revolutionary new sequencing tech comes along).

There are, again, many options for trimming and filtering reads. My favorite, however, is Trimmomatic. This program can handle single-end and paired-end data; it is written in Java, so it runs in any environment without complicated installation; and it uses a sliding-window average to determine when to chop reads. The advantage of this is that we don’t accidentally clip reads that have a single low-quality base call in the middle. Furthermore, for paired-end data, you retain pairs across output files, and you also retain reads whose mates were lost. This is how to really make the most of your data!

Have a look:

Let’s say we want to trim reads when base quality drops below 20. I choose this because it is neither very high nor very low. (Feel free to change it; I usually set my parameters somewhere between 15 and 25.)

[Image: Trimmomatic sliding-window trimming diagram]

Trimmomatic averages the quality over a window of bases (default: 4) to determine the average base quality of the sliding window. If the average quality is above your cutoff, there is no chopping action. So if you have one rogue sub-par base in an otherwise high-quality read, the algorithm won’t clip your read. Keep in mind that your aligner will still have to deal with it, but that isn’t generally a problem.
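For a single-end library, a minimal Trimmomatic invocation with that sliding window might look like this (the jar name/version and the file names are placeholders; SLIDINGWINDOW:4:20 means a 4-base window with an average-quality cutoff of 20):

```shell
in="sample.fastq.gz"                       # placeholder input file
out="${in%.fastq.gz}.trimmed.fastq.gz"     # sample.trimmed.fastq.gz

java -jar trimmomatic-0.39.jar SE -threads 4 -phred33 \
    "${in}" "${out}" \
    SLIDINGWINDOW:4:20 MINLEN:36
```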

Sometimes all we have to work with is low quality data, so we have to be a little more judicious with our trimming. If you are working with low quality data, maybe sit on the more stringent side of the fence, say towards 20. Of course, if you are too strict, you will lose too much data for any reliable downstream analysis. If you have very high quality data, you can set this option much lower, because you only want to remove the worst of the worst. This is something you will need to work out by repeating your alignments with different trimming parameters to see what gives you the highest number of high-confidence alignments.

You can download Trimmomatic here.

The documentation is really straightforward on this one. You have a short list of options to set and output file names to choose. Don’t go under a 36bp minimum read length, because you will reduce your ability to uniquely align reads during mapping (we’ll also be filtering non-unique alignments in a later step after mapping). Also, if you are using paired data, retain the unpaired reads for use during alignment (this is done automatically; just give the output a good name that is easy to understand).

Finally, keep in mind that Trimmomatic executes its steps in the order specified on the command line! If you need to do any uniform clipping of reads (for example, every read needs to have 3bp clipped from the 5’ end without exception, due to library prep or whatever), then make sure to write that option FIRST on the command line.
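As a sketch of that ordering rule for paired-end data (all file names and the jar version are placeholders), HEADCROP goes before the quality steps so every read loses its first 3bp before any window-based trimming happens:

```shell
# Steps run left to right: HEADCROP first, then sliding-window trimming, then length filtering.
steps="HEADCROP:3 SLIDINGWINDOW:4:20 MINLEN:36"

java -jar trimmomatic-0.39.jar PE -threads 4 \
    in_R1.fastq.gz in_R2.fastq.gz \
    out_R1.paired.fastq.gz out_R1.unpaired.fastq.gz \
    out_R2.paired.fastq.gz out_R2.unpaired.fastq.gz \
    ${steps}
```

The four paired-end outputs are where Trimmomatic keeps both surviving pairs and the reads whose mates were discarded.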

With this, you should have some decently prepped data ready for alignment. Next time, we’ll discuss how to align the data really quickly, and what to do with the output from those alignments to prepare for use with peak calling algorithms to detect DNA occupancy locations.


References

  1. Xu H, Luo X, Qian J, et al. (2012) FastUniq: a fast de novo duplicates removal tool for paired short reads. PLoS ONE 7: 1–6.
  2. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map (SAM) format and SAMtools. Bioinformatics 25: 2078–2079. [PMID: 19505943]
  3. Bolger AM, Lohse M, Usadel B (2014) Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, btu170.
