Chip-Seq Part 1: Reference Data


In this post I’m going to kick off a series on analyzing transcription factor ChIP-seq data and talk about some recent experiences I had while learning how to do it. Just a heads up: in this post I’ll be sharing some command line knowledge and even some of that useful Python I mentioned before.

So you’re new to ChIP-seq and you’ve got some data on your hands that you’re interested in analyzing yourself. Let me begin by saying that to fully appreciate the process, you’ll need a lot of time on a powerful computer (for aligning sequence data) and lots of personal time to run several analyses about a dozen times each. You do this so that you become intimately familiar with which parameter changes affect downstream results, and how those changes manifest in the final output. For this reason, I’d recommend instead seeking someone with that skill set (me) to do the analysis for you. Collaboration is great for everyone! 😀

The analysis pipeline for ChIP-seq is pretty straightforward, but like any analysis in the sciences, the first big question isn’t whether or not you CAN perform the analysis, but whether or not you’ve performed the analysis CORRECTLY. This is a difficult question to answer without some reference data, so let’s make that our first step.

STEP 1: Find and Download high quality reference data.

This step alone is worth a fair bit of discussion, so we’ll start from the beginning. Not everyone who analyzes ChIP-seq data is familiar with what they are working with, so ask yourself a couple of quick questions.

  • What is your transcription factor?
  • Do you have any reference literature studying this protein?

There are at least a couple of approaches you can take to find the positive control data that you’re looking for, so I’ll give you the two that I generally use, and you can explore further from there. I’ll be using an estrogen receptor alpha ChIP-seq study as a reference throughout these posts (1).

First off, head on over to PubMed and try to find some literature on your transcription factor. If you’re lucky, someone will have already performed ChIP on this protein and deposited their data into GEO. Try using keywords in your search such as ‘ChIP’ or ‘High Throughput Sequencing’. Grab any papers that use ChIP-seq on your protein.

This is the best-case scenario. There is a good chance you won’t be this lucky, and we’ll discuss some ways to deal with that. But IF you are, grab the article and scan it for a GEO accession number associated with their ChIP data. It should look something like this (a GEO series accession is ‘GSE’ followed by a number):
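If you have the article as plain text, a few lines of Python can do the scanning for you. This is only a minimal sketch: the function name and the example sentence are made up for illustration, and the accession numbers in it are placeholders, not real records. It relies on the fact that GEO accessions are a short prefix (GSE for series, GSM for samples, GPL for platforms, GDS for curated datasets) followed by digits.

```python
import re

# GEO accessions are a short prefix followed by digits:
# GSE = series, GSM = sample, GPL = platform, GDS = curated dataset.
GEO_ACCESSION = re.compile(r"\b(?:GSE|GSM|GPL|GDS)\d+\b")


def find_geo_accessions(text):
    """Return the unique GEO accessions in text, in order of first appearance."""
    found = []
    for match in GEO_ACCESSION.finditer(text):
        accession = match.group(0)
        if accession not in found:
            found.append(accession)
    return found


# Made-up sentence with placeholder accession numbers, for illustration only.
example = (
    "Data were deposited in GEO under accession GSE00000 "
    "(samples GSM00001 and GSM00002)."
)
print(find_geo_accessions(example))
```

For most papers the series accession (GSE) is the one to note down, since it groups all of the samples from the study.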


If you don’t see any papers coming up in your search, don’t lose hope yet. You may already be aware of GEO; perhaps you’ve even analyzed a published data set in a class. If not, GEO is a repository of ALL KINDS of high throughput data. Have a look at what they offer:


We’ll be looking for Genome binding/occupancy profiling data.

**WARNING** If you look for these types of studies on the NCBI website by choosing ‘GEO Datasets’ from the drop-down and then using the filters on the left side of the page, you may run into trouble.



Filtering the search results may return unexpected results. When I don’t filter the search, I find what I’m looking for:


When I filter those results, my desired search results disappear:


This may not be a universal problem, but keep in mind that before you give up on a search, you should try several different search avenues and also make good use of the search syntax available to us. For example:

(WT[All Fields] AND V[All Fields] AND ER[All Fields] AND alpha[All Fields] AND ChIP-seq[All Fields]) AND "Mus musculus"[Organism]

(Most people don’t bother with this, and it’s largely handled by the search engine, so consider using it only if you can’t get the specificity you need, e.g. you can’t seem to get fewer than 1000 search results.)
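If you end up writing these queries often, you can assemble them in Python instead of typing the field tags by hand. The sketch below rebuilds the query above from a list of keywords and wraps it in a URL for NCBI’s E-utilities `esearch` endpoint, where GEO DataSets is the `gds` database. The helper names and the `retmax` default are my own choices for illustration, and the code only builds the URL; it doesn’t send any request.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint; GEO DataSets is the "gds" database.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_term(keywords, organism=None):
    """Join keywords with AND, each tagged [All Fields], plus an optional organism."""
    term = "(" + " AND ".join(f"{kw}[All Fields]" for kw in keywords) + ")"
    if organism:
        term += f' AND "{organism}"[Organism]'
    return term


def build_esearch_url(term, retmax=20):
    """Return an esearch URL for the given query term (no request is sent)."""
    params = {"db": "gds", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


term = build_geo_term(["WT", "V", "ER", "alpha", "ChIP-seq"], organism="Mus musculus")
print(term)
print(build_esearch_url(term))
```

The first line printed is the same query string shown above, so you can also paste it straight into the search box on the website.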


To work efficiently with GEO, it is essential that you read the GEO overview information. This will give you an understanding of how GEO is organized, and how to interact with the database. I’ll cover interacting with GEO directly using R in a later post, so stay tuned for that.

From the overview page, you can also navigate directly to the GEO website and search using keywords or accession numbers.


If you can’t find any ChIP-seq data on GEO, I’d suggest broadening your search to include other databases, though if it isn’t on GEO, you’re unlikely to find it elsewhere. This is, I admit, a mouse/human/popular-model-organism-centric blog, so I don’t know much (yet) about less-studied organisms and where their data is stored.

If worse comes to worst and there really are no data sets specific to your protein, that’s OK! Take pride in knowing that you are exploring new territory. After all, the point of this series is to help guide you as you explore this new data (assuming, of course, that you’re relatively new to the ChIP analysis process). As you explore binding location (peak) calling software, you’ll find that there are two types of algorithms for calling peaks: those tailored towards narrow (transcription factor) peaks, and those for broad (chromatin modification) peaks.

My recommendation is to find some transcription factor ChIP-seq data and try to at least match the conditions of your experiment. Often there will be a Treatment/No Treatment setup to determine how TF binding changes under certain conditions. Search for data sets that are appropriate. For example, if you are studying the binding landscape of your protein in a particular cell line or tissue, collect the data that represents the same untreated conditions.
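If you’ve collected a handful of candidate data sets, one simple way to compare them is to write down the key conditions for each and count how many match your own experiment. The sketch below is purely illustrative: the field names (`organism`, `tissue`, `treatment`) and the candidate records are hypothetical placeholders that you would fill in by hand from each record’s description.

```python
def match_score(candidate, experiment, fields=("organism", "tissue", "treatment")):
    """Count how many of the listed fields agree, ignoring case."""
    score = 0
    for field in fields:
        wanted = experiment.get(field, "").lower()
        if wanted and candidate.get(field, "").lower() == wanted:
            score += 1
    return score


def rank_candidates(candidates, experiment):
    """Sort candidates from best to worst match; ties keep their original order."""
    return sorted(candidates, key=lambda c: match_score(c, experiment), reverse=True)


# Hypothetical records, filled in by hand; these are not real GEO entries.
experiment = {"organism": "Mus musculus", "tissue": "uterus", "treatment": "untreated"}
candidates = [
    {"id": "candidate_A", "organism": "Homo sapiens", "tissue": "cell line", "treatment": "untreated"},
    {"id": "candidate_B", "organism": "Mus musculus", "tissue": "uterus", "treatment": "untreated"},
    {"id": "candidate_C", "organism": "Mus musculus", "tissue": "liver", "treatment": "treated"},
]

ranked = rank_candidates(candidates, experiment)
for c in ranked:
    print(c["id"], match_score(c, experiment))
```

A plain count like this treats every condition as equally important, which is an assumption; in practice you may care more about matching the tissue or cell line than anything else.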

Okay! By now you’ve hopefully managed to find some suitable data to use as a reference as you begin your analysis. Next time we’ll discuss what to do with that data, and how to begin the analysis on your own data.



1. Hewitt SC, Li L, Grimm SA, et al: Research Resource: Whole-Genome Estrogen Receptor α Binding in Mouse Uterine Tissue Revealed by ChIP-Seq. Mol. Endocrinol. 2012; 26: 887–898.
