Last time we went over collecting reference data to use when working through a ChIP-seq analysis. Reference data is very important given the vast number of options that may be tweaked when analyzing different sets of sequence data. If you haven’t considered doing this and you haven’t read that post, you can find it here.
For part 2, we’ll be discussing what steps you can take to pre-process your data. Many of these steps are part of any standard computational pipeline that involves processing large read sequence data sets. In this post, I’ll try to relieve you of some of the anxiety these analyses may induce when trying to figure out whether or not you are doing things correctly.
Gather all the metadata you can find
Let’s begin with the reference data that you’ve collected. My assumption here is that you’ve retrieved your data from GEO, or some other reputable repository. It doesn’t matter where you’ve found your data; what does matter is whether or not you have all of the metadata associated with it. What do I mean by this? Let’s take a minute to consider.
The metadata is the information describing the sequencing. This includes:
- General sample information and stats
  - Number of sequenced reads
  - Read length
- Library prep information
  - Paired vs. unpaired sequencing methodology
  - Strandedness of the sequencing
  - Special library prep information
- Sequencing platform information
  - Sequencing technology (Illumina vs Roche vs PacBio vs Ion vs Oxford)
  - Platform version
  - Chemistry version
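One way to keep track of this checklist is a small record per data set that can report which items are still unknown. The sketch below is only an illustration: the class name, field names, and helper are hypothetical and are not taken from GEO or any other repository's schema.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class SequencingMetadata:
    """Hypothetical checklist record; the field names are illustrative only."""
    read_count: Optional[int] = None
    read_length: Optional[int] = None
    paired_end: Optional[bool] = None
    stranded: Optional[bool] = None
    library_prep_notes: Optional[str] = None
    technology: Optional[str] = None
    platform_version: Optional[str] = None
    chemistry_version: Optional[str] = None


def missing_fields(meta: SequencingMetadata) -> List[str]:
    """Return the names of the fields that have not been filled in yet."""
    return [f.name for f in fields(meta) if getattr(meta, f.name) is None]
```

Calling `missing_fields(SequencingMetadata(read_length=50, technology="Illumina"))` would list everything you still need to track down before starting the analysis.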
Technology and metadata change with the times
This is important information. Sequencing technologies have diversified greatly over the past 10 years, and that diversification has led to a variety of methodologies and chemistries. For example, the Roche 454 ‘long read’ sequencer used pyrosequencing with a powerful luminosity detector to assemble reads.
The Ion Torrent used pH detectors to discover base identities.
Different technologies also tend to use different standards for the information contained within individual sequencing reads, e.g. base quality scores, or Phred scores.
Phred Score Calculation – we’ll see this again
We’ll actually be interested in the mathematics behind calculating Phred scores, since we’ll see it again when we start to assess the strength of our ChIP-seq peak scores while looking at DNA occupancy statistics.
Phred quality scores are defined as a property, Q, which is logarithmically related to the base-calling error probability, P.
Q = -10 × log10(P)
These are encoded into the read data in different ways depending on the sequencing technology used. Keep this equation in mind as we go forward. And make sure you have that metadata.
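To make the equation concrete, here is a minimal Python sketch of the conversion in both directions, plus a decoder for a FASTQ quality string. The function names are my own. The offsets are background knowledge rather than something stated in this post: Sanger and Illumina 1.8+ files store each score as an ASCII character at Q + 33, while older Illumina pipelines (1.3 to 1.7) used Q + 64, which is exactly why the platform and version metadata matter.

```python
import math
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Invert Q = -10 * log10(P) to get the error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Apply Q = -10 * log10(P) to an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


def decode_quality_string(qual: str, offset: int = 33) -> List[int]:
    """Turn a FASTQ quality string into Phred scores.

    offset=33 assumes Sanger / Illumina 1.8+ encoding;
    pass offset=64 for older Illumina 1.3 to 1.7 files.
    """
    return [ord(ch) - offset for ch in qual]


print(decode_quality_string("I5!"))  # [40, 20, 0]
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1000. Decoding with the wrong offset shifts every score by 31, so a guess here can silently distort any quality filtering downstream.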
What if you don’t have any sequencing metadata?
If you don’t have any of this metadata for the reference data, frankly, you should probably consider ditching the data set. If you do keep it, proceed with caution, and at the very least know which sequencer and which chemistry version were used. For example, I recently analyzed a data set whose metadata explicitly stated the need to uniformly trim 3 base pairs from the 5’ end due to the library prep method. Although I would have picked that up during the quality assessment steps we’ll discuss next time, it could have been missed in the absence of clear metadata, leading to dramatic problems during read alignment. Don’t be surprised if you run into program failures during the analysis, since some programs require accurate metadata. You can try to guess at certain parameters using the information that you have, but good science is built on knowledge and facts, not guesses.
You should also have all of this information handy for the experimental data of course!
It’s okay to concatenate gzipped files
And while you’re finishing up data collection and organizing, you should also concatenate any split read pools. Be cautious when concatenating multiple paired read files. These files are generally organized so that each read in an “R1” file sits at the same position as its mate in the “R2” file. E.g. record 2543 of the R1 file is the mate of the read found at record 2543 of the R2 file. Don’t concatenate your read files out of order: whatever order you use for the R1 pieces, use the same order for the R2 pieces. And just so you know, running the ‘cat’ command on gzip files is perfectly acceptable, because gzip-compressed files may be directly concatenated.
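As a rough illustration of both points, the Python sketch below concatenates gzipped pieces byte for byte (the same thing ‘cat’ does) and then checks that the read names in the combined R1 and R2 files still line up. The function names are hypothetical, and the check assumes well-formed four-line FASTQ records whose mates share the same name up to the first space or a trailing /1 or /2.

```python
import gzip
import shutil
from itertools import zip_longest
from typing import Iterator, List


def concat_gzip(parts: List[str], out_path: str) -> None:
    """Byte-wise concatenation, equivalent to `cat a.gz b.gz > out.gz`."""
    with open(out_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def read_names(fastq_gz: str) -> Iterator[str]:
    """Yield the read name from the header line of each 4-line record."""
    with gzip.open(fastq_gz, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:
                tokens = line.split()
                name = tokens[0] if tokens else ""
                # Drop a trailing /1 or /2 mate suffix if present
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def mates_in_sync(r1_gz: str, r2_gz: str) -> bool:
    """True if both files hold the same read names in the same order."""
    pairs = zip_longest(read_names(r1_gz), read_names(r2_gz))
    return all(a == b for a, b in pairs)
```

The important design point is that `concat_gzip` is called with the pieces in the same order for R1 and for R2; `mates_in_sync` then acts as a cheap sanity check before alignment.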
Next time we’ll discuss the quality assessment and filtering methods that should be undertaken prior to read alignment.