A method for constructing gene lists

As developmental geneticists, we sometimes encounter the need to compile lists of genes associated with a particular condition or developmental process. An example of this need during my PhD was to construct genes lists against which we’d cross reference data from a time series RNA-seq study. In particular, these gene lists were useful for plotting using SeqPyPlot – an analysis program I developed for interrogating and visualizing pilot study data where there are no biological replicates available. The lack of replicates is strongly un-recommended when performing RNA-seq studies, however it is still done. Nonetheless, methods for constructing gene lists are still needed, so I’ll walk you through one possible way of assembling lists of genes using a bit of R and biomaRT. (I use RStudio, but this should work in whichever IDE or notepad you use.) The first thing to do is to download biomaRT if you don’t already have it. If you’ve never heard of or used biomaRT, prepare to have your mind blown. This is a database interface tool maintained by Steffen Durinck that provides access to all of the gene information you could ever need to acquire including access to sequence data (gene and flanking), annotation data, and every name that gene goes by. It is available in dozens of species with more coming available over time as well. Open up your R console and past this is:

source("https://bioconductor.org/biocLite.R")

biocLite("biomaRt")

Sometimes this fails due to incompatibility with https:// (secure version of http), so the fix is to delete the ‘s’ from the source call to give:

source("http://bioconductor.org/biocLite.R")

Once that has installed, you can load it up.

library(biomaRt)

With the library loaded, there are just two more steps to collecting gene data. First, set the species you want to work with and establish a connection with that database (remember, it is just a giant spread sheet with columns and rows!) In this case, we’ll use mouse, but you can easily load up a list of all available species. Generically, this command looks like this:

useMart("The biomart database you'd like to connect to", "The species you'd like to use")

In our case:

MyMart = useMart("ensembl", dataset="mmusculus_gene_ensembl")

**”ensembl” is generally going to be what you want to use for marts. To view all of the available marts (like ‘ensembl) species (i.e. marts), try:

listMarts()

With the mart loaded you can simply run a single command to save a data table extracted from the bioMart to a variable, which you can then either write out, or manually copy and paste to a spread sheet or text file. The command is getBM, and it takes the four characteristic arguments of bioMart: attributes, filters, values, and mart. Here is a generalization of what these mean:

gene.data = getBM(

attributes = 'The column names you want data from',

filters = 'The columns you're going to apply a filter to',

values = ‘The rows you want selected by the values that you list here’,

mart = ‘The species database you saved in step 1’

)

The way this will look when running a command will be:

gene.data = getBM(attributes=c('mgi_symbol', 'ensembl_transcript_id', 'go_id'), filters='go_id', values = 'GO:0008078', mart=MyMart)

In this example we are making gene lists based on GoTerm record numbers. As you can see, I’m filtering with the go_id column, so I need to pass a single GoTerm ID or a vector of GoTerm IDs. This will cause the database to return the rows that have those values in them. However, I’ve specified specific attributes (columns) that I’m interested in:

attributes = c('mgi_symbol', 'ensembl_transcript_id', 'go_id')

This means that instead of returning the entire row with every column, I’m only getting the columns I’ve listed in the vector. You can’t technically grab all attributes – the returned data set would include data linked to other pages (other organism’s data) which has to be collected separately. Instead, you can try this:

First list attributes usings listAttributes(MyMart). Next in RStudio, visualize the resulting dataframe and get the rows you’d you’d like to collect. Make sure that they all have the same value in the third ‘page’ column. Next you can create a list by indexing the data frame.

all_atts = listAttributes()my_atts = all_atts[1:47,1]

Finally, you can pass that in to he getBM call as attributes=my_atts. Note the 1:47 row index. The max number of attributes I’ve been able to retrieve is 47.

To collect the correct data, you may need to change the attributes and filters to match specifically what you’re looking for. If you already have the information used in each of the four arguments, great. Otherwise, you can simply look them up in the same way that you looked up available species datasets:

listAttributes()

listFilters()

listValues()

Using biomaRt, you can collect, convert, extract, or whatever else you need to do using the functions available with biomaRt. To get a full sense of the capabilities of biomaRt, have a look at the docs and read through each of the functions to see what they do.


https://bioconductor.org/packages/release/bioc/manuals/biomaRt/man/biomaRt.pdf

In the example above, we are getting genes that have a particular GoTerm ID in their row under the go_id column. So where do we get those go_id GoTerm IDs? There are couple different ways you can do this, but the easiest way that I currently know of is to load up to the basic gene ontology database and search for terms in the browser. To load up the onotology, simply go to:

http://geneontology.org/ontology/go-basic.obo


From here you can use Ctrl-F to search for terms such as such as ‘cell’, ‘migration’, or something more complex, such as ‘alpha-1,3-mannosyltransferase activity’. Let’s say I want a list of genes assocated with ‘Limb Development’. The go-basic.obo database has the go_id GO:0060173 associated with Limb Development, let lets try replacing the go_id in the above getBM() call:

gene.data=getBM(attributes=c('mgi_symbol', 'ensembl_transcript_id', 'go_id'), filters='go_id', values = 'GO:0060173', mart=ensembl)

 

This call returns a data frame that looks like this:

head(gene.data)               

mgi_symbol      ensembl_transcript_id           go_id

1    Intu            ENSMUST00000061590    GO:0060173

2 Intu               ENSMUST00000091186    GO:0060173

3   Ift122           ENSMUST00000112925    GO:0060173

4   Ift122           ENSMUST00000038234    GO:0060173

5   Pam             ENSMUST00000097625    GO:0060173

6   Bmpr2         ENSMUST00000087435    GO:0060173

I’m interested in retrieving mouse gene symbols in this case, so I’m selecting mgi_symbol as one of my attributes (columns) to return. If I also wanted entrezgene IDs (for example if I’m running goatools to perform enrichment analysis locally), then I could add in entrezgene IDs as an attribute. Agains, use listAttributes() to see all available columns. You’ll notice that there are already some duplicates in my gene symbol column, so I’ll need to get rid of those. We’ll do this using the unique() built-in command and our gene.data, but selecting only the first column (mgi_symbol):

uniq.symbols = as.data.frame(unique(gene.data[,1]))

Unique() – removes duplicates. gene.data[,1] – ignores rows and returns only the first column for processing. [row,col] as.data.frame() – makes the variable clickable if you are using Rstudio. With this you should be able to do:

head(uniq.symbols)

1 Intu

2 Ift122

3 Pam

4 Bmpr2

5 Gli3

6 Ift172

Excellent! So that’s it. You now know how to grab entire lists of genes associated with specific go term processes based on their go_id using bioconductor’s biomaRt.