Matrix binarization one-liner

In my previous post I talked a little bit about preparing text for a neural network. During my study of this problem, I came across a bit of Python that I didn't quite understand.

import pandas as pd
import numpy as np

labels = pd.read_csv('labels.txt', header=None)
Y = (labels=='positive').astype(np.int_)  # What does this do??

Given a text file of labels, each either 'positive' or 'negative', read into a pandas data frame with no header: binarize the labels and assign the result to a new object reference.

In the last post, I gave an incredibly simple solution to reading the words and converting them to 1s and 0s. This solution is a little more elegant. And it turns out that thanks to Numpy, you can extend it to matrices and N-d arrays as well.

Basically, the line above iterates through the pandas 'labels' data frame, which is only a single column (or Series), and performs a boolean comparison on each element (thanks to the == operator). If the element is 'positive', it gets assigned a value of True. If it's not, it is assigned False.

When the .astype() method is called, the True and False values are converted to a new type – in this case, np.int_.
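
To make that concrete, here is a tiny toy example (the two-row label frame is made up just for illustration; it stands in for labels.txt):

import pandas as pd
import numpy as np

# Toy stand-in for labels.txt: two rows, no header.
labels = pd.DataFrame(['positive', 'negative'])

print(labels == 'positive')
#        0
# 0   True
# 1  False

print((labels == 'positive').astype(np.int_))
#    0
# 0  1
# 1  0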

NumPy is THE Python package for performing linear algebra operations, and as such it treats every 1d array as a vector, every 2d array as a sequence of vectors (a matrix), and every 3d+ array as a generic tensor. This means that when we perform operations, we are performing vector math.

Let's see another example. The following is the case of a random matrix where, instead of performing an == comparison, we'll convert any number greater than 0.5 to a 1 and anything less to a 0.

>>> import numpy as np
>>> np.random.seed(0)
>>> np.set_printoptions(precision=3)

>>> a = np.random.rand(4, 4)
>>> a
array([[ 0.549,  0.715,  0.603,  0.545],
       [ 0.424,  0.646,  0.438,  0.892],
       [ 0.964,  0.383,  0.792,  0.529],
       [ 0.568,  0.926,  0.071,  0.087]])

>>> a = (a > 0.5).astype(np.int_)  # Where the numpy magic happens.
>>> a
array([[1, 1, 1, 1],
       [0, 1, 0, 1],
       [1, 0, 1, 1],
       [1, 1, 0, 0]])

What's going on here is that NumPy automatically iterates through every element of every row in the 4×4 matrix and applies the boolean comparison to each element.

If > 0.5 return True, else return False.

Again, by calling the .astype method and passing np.int_ as the argument, you’re telling numpy to replace all boolean values with their integer representation, in effect binarizing the matrix based on your comparison value.

I love it!

 

Prepping text for a neural network

I've been studying natural language processing using deep neural networks. Neural networks are great for discovering hidden correlations within high-dimensional data, or correlations within data that exhibits nested structure, such as images and text. But they don't deal in English. So you can't just feed a neural network a sentence and expect it to compute anything – after all, these networks are just computational algorithms.

These things are hella cool – and they look cool as well.

[Figure: neural_network]

The mathematical object (some n-dimensional array – 1d for the simple networks) is fed to the input layer, and then a series of linear transformations occurs by taking the dot product of the input values and the weight matrix (all the lines to the right). The result is passed through an activation function to get the hidden layer, which repeats the same process. Error is calculated using the training label, and the error is back-propagated using partial derivatives and the chain rule (from multivariable calculus – you know, the class that sounded a lot harder than it was).
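
Here's a minimal NumPy sketch of that forward pass (the layer sizes, weights, and label are made up purely to show the flow – this is not a full training loop):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(1)
x = np.random.rand(4)        # input vector (4 input nodes, made up)
W1 = np.random.rand(4, 3)    # input -> hidden weights
W2 = np.random.rand(3, 1)    # hidden -> output weights

hidden = sigmoid(np.dot(x, W1))      # linear transformation, then activation
output = sigmoid(np.dot(hidden, W2)) # same process again for the output layer

error = 1.0 - output         # assume the training label for this example is 1
# Backpropagation would push this error back through W2 and W1
# using partial derivatives and the chain rule.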

When dealing with natural language processing, you have to find a way to convert words to mathematical objects so that the network can incorporate them into its algorithm.

During my studies, I came across a training data set of movie reviews and their respective sentiment of 'positive' or 'negative' based on the star rating. In order to train a neural network to 'learn' which reviews were positive and negative, I had to solve the problem of converting both the review into a number or sequence of numbers and the sentiment into a sequence of numbers.

It turned out that the way to do this was to convert the sentiment labels into a list of boolean integers, i.e. 0s and 1s. The input file consisted of lines of text that read either "POSITIVE" or "NEGATIVE". So the question was framed as: "How do I convert 'positive' and 'negative' to 0s and 1s?"

Okay, so this is NOT complicated. The first and obvious answer is to write a function that will read the file, parse the lines, and append 0s and 1s to a list.

def binarize_input(file):
    """Read a label file and return a list of 1s (POSITIVE) and 0s (NEGATIVE)."""
    with open(file, 'r') as infile:
        binarized_list = []
        for row in infile.readlines():
            if row.strip() == "POSITIVE":
                binarized_list.append(1)
            else:
                binarized_list.append(0)

        return binarized_list

Okay, easy. When my neural network outputs its final classification, it will give a probability of the review being a 1 or a 0 – a Positive or a Negative.

Dealing with the input turned out to be not so easy. The weird thing about neural networks that I didn't quite expect when I began was that you have to sort of encapsulate the entire scope of the data within each input. This is the case for simple neural networks anyways. First you have to convert the input string of text to a mathematical object, but you can't just input any ol' string. If you did, it would totally screw up the computation occurring within the network – you'd have, say, 30 input nodes for one sentence and then 100 input nodes for the next, and so on. Having a dynamic input like this is not something I'm aware of working.

Instead, you have to take the entire set of words you'll be considering in your network, and train on those words. This means you'll have an input node for every word in your 'vocabulary', aka your 'bag of words'. So what do you do? You have to input a vector with an integer representing each of the words. If you have 20,000 different words across all of your reviews, you'll have 20,000 words in your vocab 'bag of words'! And if that's the case, then you'll have 20,000 input nodes! Wha!! Crazy!

So the solution, it turns out, is to vectorize all of the words and then either create a count representation or a binarization. In other words, we have a vector with 20,000 word positions that start at zero.

>>> bagowords = [0] * 20000
>>> len(bagowords)
20000

For each word in our sentence we go through and flip its position to a 1 (binarized) or we add 1.

from collections import Counter
import numpy as np

# Count how often each word appears across all reviews.
# review_file is the path to the text file of reviews.
total_counts = Counter()
with open(review_file, 'r') as infile:
    reviews = infile.readlines()
    for review in reviews:
        for word in review.strip().split(" "):
            total_counts[word] += 1

# Dictionary comprehension like a ninja: map each word to a unique index.
word2index = {word: i for i, word in enumerate(total_counts.keys())}

def get_input(review):
    vector = np.zeros(len(word2index))
    for word in review.split(" "):
        idx = word2index.get(word, None)  # None if the word isn't in our vocab
        if idx is not None:
            vector[idx] += 1  # or just = 1 to binarize
    return vector

With this we get a numpy array with 20,000 positions, and every time we see a word in our current review, we add 1 to that word's position. We could just as well binarize this by setting the position equal to one, effectively eliminating any frequency weighting from the neural network's correlation analysis.
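
For example (a toy review, assuming word2index was built as above and happens to contain these words):

review = "this movie was great great fun"
vector = get_input(review)

print(vector.shape)                 # (vocab_size,) -- one slot per word in the vocab
print(vector[word2index["great"]])  # 2.0, because "great" appears twice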

Pretty cool!

 

Philosophically and physically, why do proteins function?

Every so often, I cruise the forums for questions to answer. It's usually late at night, and I usually should be working on something else. Nonetheless, I seek out questions I feel like I can answer and I give them a crack. One of my favorite forums is ResearchGate, where people ask a lot of biology research questions. Today I'm posting my response to a fairly poorly worded question on why it is that proteins do what they do. I've linked the original and I've corrected the question's English for this blog.

Philosophically and physically, why do proteins function?

Question: I've been doing bio research for many years; however, in terms of physics and philosophy, I'm pretty confused about why proteins function. For example, how/why do endonucleases know to cut DNA? What is the force that makes them cut? How do transcription factors know to bind specific DNA sequences? Again, what force makes them bind? Is someone capable of explaining it in physical or logical terms? Since I studied biology, teachers always taught me the rules (A can do this, B can do that), but why?

Response

Hi Suipwksiow,

This is such a great question! And it is one of my favorites.

First – Ke-Wei is absolutely correct – philosophy is utterly irrelevant to this question. Philosophy refers to the synthesis of knowledge from information extracted from data. That data can come from literature, biological observations, careful observation of physical phenomena, etc. To understand your own question, it's key to understand how all of these things/fields are related. In other words, how is biology related to chemistry, and how is that related to physics? They are all just ways that we humans have categorized our study of the same thing – the natural universe – and they are essentially different levels of focus. Here's a completely made up story to illustrate how these fields are related…

The literature PhD studies the words created by other humans, and he is fascinated by how many different versions of the same story there could be. He thinks about it and concludes that all of these minds are similar in composition, but develop down different trajectories giving rise to the different versions of what appear to be relatively similar stories.

The neurobiologist reads a summary of this work and thinks, fascinating – but if all of these minds are truly similar in composition, then I should be able to observe similar patterns across the brain, which my predecessors showed to be comprised of neural cells. So they dissect the brain, image brain activity, and measure brain chemicals as best they can to draw conclusions about the similarities between brains. They note that certain regions of the brain exhibit frequent excitation of electrical activity, which leads to the release of a variety of chemicals – neurotransmitters.

The biochemist reads the neurobiologist's latest discoveries in a journal over lunch and begins to speak with his good friend the general chemist about the possible structures of these neurotransmitters. Together they use what they learned from the neurobiologist and propose a theory that these chemicals are comprised of a variety of different atoms and have a specific shape, which then bind to proteins on the surface of the neural cells. They are able to define with great precision the interaction between the proteins and the neurotransmitters and determine that their shape is key to the binding process.

The chemist and biochemist go on after this publication to work with a theoretical physicist to explain how these atoms bind to one another. They develop a theory, based on the observations of fellow scientists, that these proteins and molecules aren't static but vibrate with high frequency – causing the proteins to change conformation. They realize that 85% of the time the protein is in conformation A, while the other 15% of the time it is in conformation B. Astounding.

The physicist then reads this paper and thinks to himself – my goodness, these molecules behave in such an interesting way! Perhaps I can develop a theory to explain why these molecules vibrate to begin with! He starts to consider the very nature of the constituent parts of the molecules, the atoms themselves. He studies the behaviors of the electrons among all of the elements, and builds a giant magnetized ring to accelerate simple atoms close to the speed of light so that he can smash them together and see what they break apart into. This work gains traction, and before long thousands of experimental and theoretical physicists are collaborating to understand what the fabric of matter is, and they realize that it is nothing more than intersecting fields produced by subatomic particle waves rippling through the very fabric of space-time.

So all of these scientists elect a spokesperson to condense all of the findings from the neurobiologist, the chemist, the biochemist, and the physicists, both experimental and theoretical, into a review article. The spokesperson brings the review article to the literature PhD and discusses it all over a cup of coffee.

So this is the end of the completely made up story – I hope you’ve made the connection.

All of these fields are studying different aspects of the same thing. 

If you can learn a little bit from each field, then the reason that macromolecules behave the way they do (for example, why transcription factors bind to DNA) becomes apparent. I’ll give you a leading summary, and with this you can start your journey to gaining a broader understanding for yourself.

The physicists and chemists taught us that each atom has a different configuration of protons and electrons. When these atoms bind to one another, we call this a molecule, and we know that molecules have a huuuuge variety of charge characteristics. Some molecules are polar, some are neutral, some are negative, some are positive, etc. Some are stable, others are highly unstable – the list goes on. Proteins are just molecules. They are a combination of atoms.

But here is the clincher:

Most atoms, when bound together, form an inflexible lattice – a pattern that repeats over and over. Metals and rocks. There is no flexibility to a repeating lattice, and these molecules either stay in that conformation forever if they are stable, or break down if they are unstable (as is the case for lattices of high atomic weight atoms). And actually, most of the atoms in the universe are just single-proton atoms – hydrogen. However, when stars compress low atomic weight atoms such as hydrogen and helium in their cores (leading to the never-ending nuclear fusion explosions that make our sun what it is), they form a variety of different atoms, now with different numbers of protons and electrons. When these stars go nova, they undergo one final massive compression, fusing together LOTS of atoms, and explode in a ridiculous display called a supernova (if the star is big enough). These explosions inevitably lead to the creation of the most important element (atom) for life: the CARBON atom.

The difference between a protein and a piece of metal is that proteins, and all other molecules that form the basis for life, revolve around the all-important carbon atom – which provides the atomic property of flexibility. Proteins are structured around the carbon atom, and carbon atoms allow a large variety of different types of atoms to come together. There is an entire field dedicated to studying this flexibility and the endless combinations of atoms that revolve around carbon. This field is called 'Organic Chemistry'.

So when it comes to a transcription factor binding to DNA, what is REALLY happening? Well, DNA is an organic molecule, based around the carbon atom, rich in phosphorus, and therefore exhibiting an overall negative charge. But the individual nucleotides that are ordered along the phosphate backbone have their own charge domains. So you can think of a piece of DNA – a transcriptional start site, for example – as a unique sequence of charges that are either exposed (euchromatin) or not exposed (heterochromatin). Let's not forget that a transcription factor is a protein, which is also a molecule with its own unique pattern of charges across its surface. What happens when a positive charge and a negative charge come into contact? They bind. But the charge pattern across the sequence of DNA and the binding domain of the protein have to match in order for them to actually bind.

It's almost as if each molecule were encoding some kind of charge pattern allowing it to bind…

Indeed, this is the very high-level essence of the genetic code. This molecule encodes every protein that exists in the cell – transcription factors, cell membrane signaling molecules, hell, even the histone proteins that the DNA wraps itself around are encoded. The entire cell is just a crazy complex combination of encodings that determine which molecules bind with which, when, and where. And it's all based on the fundamental physical properties of atoms.

To be absolutely clear, the complexity of these systems arises through a natural evolutionary process. And there is plenty of evidence to show that complex systems absolutely CAN arise from simple beginnings using an evolutionary mechanism. We have demonstrated this both in living systems and computationally. Google for evidence if you are unsure whether or not to believe me.

So, in summary – why do molecules like transcription factors behave the way they do? Philosophically, that is an irrelevant question. Scientifically, that is THE question that theoretical and experimental physicists are currently trying to answer – guys like Sheldon from 'The Big Bang Theory', or Stephen Hawking in real life, guys like Albert Einstein, etc. How do molecules behave the way they do? Through encoded charges spread through the molecule by their fundamental physical properties. Even the behavior of proteins as complex as endonucleases can be deconstructed into sequences of behaviors that arise from their composition and fundamental physical properties.

Pretty neat!

 

A method for constructing gene lists

As developmental geneticists, we sometimes need to compile lists of genes associated with a particular condition or developmental process. An example of this need during my PhD was to construct gene lists against which we'd cross-reference data from a time series RNA-seq study. In particular, these gene lists were useful for plotting with SeqPyPlot – an analysis program I developed for interrogating and visualizing pilot study data where there are no biological replicates available. Running RNA-seq without replicates is strongly discouraged, but it is still done. Nonetheless, methods for constructing gene lists are still needed, so I'll walk you through one possible way of assembling lists of genes using a bit of R and biomaRt. (I use RStudio, but this should work in whichever IDE or editor you use.)

The first thing to do is to download biomaRt if you don't already have it. If you've never heard of or used biomaRt, prepare to have your mind blown. It is a database interface tool maintained by Steffen Durinck that provides access to all of the gene information you could ever need, including sequence data (gene and flanking), annotation data, and every name a gene goes by. It is available for dozens of species, with more coming available over time. Open up your R console and paste this in:

source("https://bioconductor.org/biocLite.R")

biocLite("biomaRt")

Sometimes this fails due to incompatibility with https:// (secure version of http), so the fix is to delete the ‘s’ from the source call to give:

source("http://bioconductor.org/biocLite.R")

Once that has installed, you can load it up.

library(biomaRt)

With the library loaded, there are just two more steps to collecting gene data. First, set the species you want to work with and establish a connection with that database (remember, it is just a giant spreadsheet with columns and rows!). In this case, we'll use mouse, but you can easily load up a list of all available species. Generically, this command looks like this:

useMart("The biomart database you'd like to connect to", "The species you'd like to use")

In our case:

MyMart = useMart("ensembl", dataset="mmusculus_gene_ensembl")

**"ensembl" is generally going to be what you want to use for the mart. To view all of the available marts, try:

listMarts()

With the mart loaded, you can run a single command to save a data table extracted from the mart to a variable, which you can then either write out or manually copy and paste to a spreadsheet or text file. The command is getBM, and it takes the four characteristic arguments of biomaRt: attributes, filters, values, and mart. Here is a generalization of what these mean:

gene.data = getBM(
    attributes = "The column names you want data from",
    filters = "The columns you are going to apply a filter to",
    values = "The rows you want selected by the values that you list here",
    mart = "The species database you saved in step 1"
)

The way this will look when running a command will be:

gene.data = getBM(attributes=c('mgi_symbol', 'ensembl_transcript_id', 'go_id'), filters='go_id', values = 'GO:0008078', mart=MyMart)

In this example we are making gene lists based on GoTerm record numbers. As you can see, I’m filtering with the go_id column, so I need to pass a single GoTerm ID or a vector of GoTerm IDs. This will cause the database to return the rows that have those values in them. However, I’ve specified specific attributes (columns) that I’m interested in:

attributes = c('mgi_symbol', 'ensembl_transcript_id', 'go_id')

This means that instead of returning the entire row with every column, I'm only getting the columns I've listed in the vector. You can't technically grab all attributes – the returned data set would include data linked to other pages (other organisms' data), which has to be collected separately. Instead, you can try this:

First, list the attributes using listAttributes(MyMart). Next, in RStudio, visualize the resulting data frame and find the rows you'd like to collect. Make sure that they all have the same value in the third 'page' column. Then you can create a vector by indexing the data frame.

all_atts = listAttributes(MyMart)
my_atts = all_atts[1:47, 1]

Finally, you can pass that into the getBM call as attributes=my_atts. Note the 1:47 row index – the maximum number of attributes I've been able to retrieve at once is 47.

To collect the correct data, you may need to change the attributes and filters to match specifically what you’re looking for. If you already have the information used in each of the four arguments, great. Otherwise, you can simply look them up in the same way that you looked up available species datasets:

listAttributes(MyMart)

listFilters(MyMart)

listValues()

Using the functions available in biomaRt, you can collect, convert, extract, or do whatever else you need to do. To get a full sense of its capabilities, have a look at the docs and read through each of the functions to see what they do.


https://bioconductor.org/packages/release/bioc/manuals/biomaRt/man/biomaRt.pdf

In the example above, we are getting genes that have a particular GoTerm ID in their row under the go_id column. So where do we get those go_id GoTerm IDs? There are a couple of different ways you can do this, but the easiest way that I currently know of is to load up the basic gene ontology database and search for terms in the browser. To load up the ontology, simply go to:

http://geneontology.org/ontology/go-basic.obo


From here you can use Ctrl-F to search for terms such as 'cell', 'migration', or something more complex, such as 'alpha-1,3-mannosyltransferase activity'. Let's say I want a list of genes associated with 'Limb Development'. The go-basic.obo database has the go_id GO:0060173 associated with Limb Development, so let's try replacing the go_id in the above getBM() call:

gene.data = getBM(attributes=c('mgi_symbol', 'ensembl_transcript_id', 'go_id'), filters='go_id', values = 'GO:0060173', mart=MyMart)

 

This call returns a data frame that looks like this:

head(gene.data)

  mgi_symbol ensembl_transcript_id      go_id
1       Intu    ENSMUST00000061590 GO:0060173
2       Intu    ENSMUST00000091186 GO:0060173
3     Ift122    ENSMUST00000112925 GO:0060173
4     Ift122    ENSMUST00000038234 GO:0060173
5        Pam    ENSMUST00000097625 GO:0060173
6      Bmpr2    ENSMUST00000087435 GO:0060173

I'm interested in retrieving mouse gene symbols in this case, so I'm selecting mgi_symbol as one of my attributes (columns) to return. If I also wanted entrezgene IDs (for example, if I'm running goatools to perform enrichment analysis locally), then I could add entrezgene as an attribute. Again, use listAttributes(MyMart) to see all available columns. You'll notice that there are already some duplicates in my gene symbol column, so I'll need to get rid of those. We'll do this using the unique() built-in command on our gene.data, selecting only the first column (mgi_symbol):

uniq.symbols = as.data.frame(unique(gene.data[,1]))

A quick breakdown: unique() removes the duplicates. gene.data[,1] returns only the first column (indexing is [row, col]; leaving the row index empty returns all rows). as.data.frame() makes the variable clickable if you are using RStudio. With this you should be able to do:

head(uniq.symbols)

1 Intu
2 Ift122
3 Pam
4 Bmpr2
5 Gli3
6 Ift172

Excellent! So that's it. You now know how to grab entire lists of genes associated with specific GO term processes based on their go_id using Bioconductor's biomaRt.

SeqPyPlot – Origin story and Data prep Guide


In this post, I'm going to provide a detailed walk-through of how to use SeqPyPlot, a module that I'm currently developing to aid with the analysis and visualization of multi-series high-throughput data.

A common problem that researchers seem to have these days is the inability to manipulate and create meaningful plots from next generation sequencing results. This is in part due to the rapid drop in sequencing costs. The Beijing Genomics Institute, for example, is offering a full sample processing pipeline for RNA-seq – including ribo-depletion, low-input library prep, and sequencing to a depth of 60 million reads – for a mere $640 AUD per sample. That is less than $500 USD for a transcriptome. This means that the realm of '-omics' is now available to anyone who can gain access to a heavyweight computer, and it has produced a massive increase in the amount of data generated.

Despite this, there are surprisingly few computationally savvy researchers out there. The necessary skills include the ability to handle and manipulate large datasets, command line knowledge and familiarity with a wide range of software, background information on the efficacy of specific statistical methods, and perhaps the ability to write short programs.

With high throughput sequence data, it hurts to not have these skills.

If you were to have a sudden need to compare expression across 150 genes, you'd be looking at hours of work and a really sore mouse-clicking hand. With the right skills, however, this becomes a painless task completed in the time it takes to drink a coffee. Unfortunately, the problem seems to be that learning how to program is considered a poor investment of time, giving returns neither great enough nor fast enough – i.e. five hours of mouse clicking is better than months of frustration to accomplish the same thing.

To help mitigate this, I am currently developing a pipeline to process normalized high-throughput sequence data. The pipeline takes in normalized expression data, performs differential expression testing, and then plots a list of genes selected either manually or from the output of the differential expression testing. The pipeline is written in object-oriented Python and sub-processes multiple R programs for differential expression testing.

The Original Pipeline

During my PhD we decided to survey the transcriptomes throughout the entirety of development of the mouse genital tubercle, and to do the same with a mutant model. We thought knowing the approximate values of gene expression would yield collections of related genes, which would then provide a good launching point for other experiments. More importantly, we thought that we could use a single pooled sample per time point and condition.

Unfortunately, we were wrong and were left with a data set that gave us a qualitative sense of the transcriptomes: a list of genes of arbitrary length that were possibly mis-regulated, nearly all of which were apparently unrelated. And to interject a bit of personal history – it was this experience, coupled with a difficult experience with the genome of the same mutant mouse, that led me to actively pursue computational biology as an area of expertise. I took up programming in my personal time and, after a great deal of self-motivated study, decided to write a program that could help me filter and plot data rapidly.

The original pipeline was designed to process output from Cufflinks and test a series of differential expression (DE) cutoff thresholds, thus allowing the user to rerun the pipeline with the best possible guess-timate for which DE cutoff to use, remove far-outliers, and return all of the genes that were DE along with their expression data. The data would then be passed into a plotting program that would take in a list of an arbitrary number of genes, retrieve the data, and give the user a fast survey of comparative time series expression across all of the genes at once.

Here is a diagram showing the pipeline:

[Figure: full-pipeline-chart]

 

The filtering program, GetCuffData, is useful for quickly filtering normalized datasets that don't include biological replicates. I should stress here that this program is not meant to perform any legitimate statistical tests. Its job is to filter data based on parameters that you manually set. The revised pipeline handles biological replicates and also performs proper differential expression testing.

The plotting program, SeqPyPlot.py, is a nice general-use tool as well as a functional part of this pipeline.

User Guide

The first thing to know is that the current version (v0.1) is only designed to handle data in the output format of the Cufflinks suite program ‘CuffNorm’. Future updates will handle output from more software, but for now, users of CuffNorm can take advantage of SeqPyPlot.

The key utility here is that we can visualize the expression of large numbers of genes across multiple conditions very quickly. This requires a data set that is actually internally comparable – meaning that the input for this program needs to come from a tool that normalizes the data together. Cuffdiff does this, Cuffnorm of course does this, as do many other tools for analyzing high-throughput data sets.

Data Prep

The output of CuffNorm is basically a table, or data frame. SeqPyPlot doesn't use pandas or any other 'data frame' library – it uses the csv module to parse a .txt file. So we're going to format our data to agree with this.

TL;DR – Format your data as shown below in Excel, or a similar program, and save it as a Tab-delimited {.txt} file.

Follow this link to download the sample data and test the program – SampleData.zip

If you run the main plotting program with the -h option, you’ll get the usage printed:

>python SeqPyPlot.py -h

SeqPlot - v0.1
usage: SeqPlot.py [-h] [-t] [-o] [-c] [-n] [gene_list] [plotter_data]

Plot time series expression data without replicates.

positional arguments:
 gene_list              Single Column gene list in txt file.
 plotter_data           Formatted plot-able data.

optional arguments:
 -h, --help             show this help message and exit
 -t, --time             A comma separated list of time points.
 -o, --out              Output Folder Name
 -c, --condition        A comma separated list of conditions (max 2)
 -n, --num              Number of samples per plot.

The option -n, --num defaults to 2 and indicates the number of series, or lines, that you intend to draw on the plots. It is assumed that you're using the program to plot multiple time series data. This option will affect how we set up the .txt file we'll be using to load data.

There are basically three things you need to know to format your data:

  1. First column = Gene name
  2. Columns 2 to n = First time series
  3. Columns n + 1 to end = Second time series

Here is what this will look like:

[Figure: dataformat1]
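
If it helps to see this in code form, here is a minimal sketch (with an assumed number of time points per series – this is not SeqPyPlot's actual parser) of how a file in this layout could be read using Python's csv module:

import csv

# Minimal sketch (not SeqPyPlot's actual parser): read a tab-delimited file
# where column 1 is the gene name, the next block of columns is time series 1,
# and the following block is time series 2.
def read_plot_data(path, points_per_series=4, num_series=2):
    data = {}
    with open(path, 'r') as infile:
        reader = csv.reader(infile, delimiter='\t')
        for row in reader:
            gene = row[0]
            values = row[1:]
            # Split the remaining columns into one list per series; empty
            # cells (missing time points) are kept as empty strings.
            series = [values[i * points_per_series:(i + 1) * points_per_series]
                      for i in range(num_series)]
            data[gene] = series
    return data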

You’ll notice that at the moment there is no space for any replicates or error calculations. This will be implemented in a future update with options to specify whether or not to account for that.

At the moment you need to make sure you don’t have any genes with identical names (which shouldn’t be the case anyways…).

T1 = Expression value for Time Series 1: Time point 1.

If you want to plot a single time series (because you like how the plots look…) you can specify [-n 1] on the command line to only read in Time Series 1.

A nice feature of this program is that you can plot uneven data sets. For example, you might have [T1, T2, T3, T4] for Time Series 1 but only [T1, T2, T4] for Time Series 2. The way to handle this is to simply leave the missing time point blank by inserting an empty cell. The program will create a mask layer and your series line will be drawn right past it.
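
As a rough illustration of the masking idea (this is just a sketch of the concept, not SeqPyPlot's actual implementation):

import numpy as np

# Time series 2 is missing its third time point, read in as an empty string.
raw = ['4.2', '5.1', '', '6.3']
values = np.array([float(v) if v != '' else np.nan for v in raw])

# Mask the missing point; matplotlib skips masked values, so the series
# line gets drawn straight past the gap.
masked = np.ma.masked_invalid(values)
print(masked)  # [4.2 5.1 -- 6.3]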

Once your data is organized properly, you can save it to a .txt file. Don’t worry about encoding, this is handled.

Prep Gene List

With your data formatted, we can select some genes and save them to another .txt file.

[Figure: genelist]

Save this also as a Tab-delimited {.txt} file. The rule here is one gene per row. Don't worry about the formatting of the gene names – everything is converted at run time to the same format.

With your data prepared, you can set the rest of the options for the plot labels. The defaults are listed below, and you should follow the same formatting as these defaults.

You can also review the ReadMe.

optional arguments:
 -h, --help             show this help message and exit
 -t, --time             Default: 1,2,3,4,5
 -o, --out              Default: Default_out
 -c, --condition        Default: Series1,Series2
 -n, --num              Default: 2 Number of samples per plot.
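
Putting it all together, a run might look something like this (the file names and condition labels here are just hypothetical examples):

>python SeqPyPlot.py -t 1,2,3,4 -o MyPlots -c Control,Mutant -n 2 genelist.txt plotter_data.txt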

Good Data Curation: The FAIR principles

A recent workshop held at the University of Melbourne brought up the topic of good data curation practices and this is something that I personally can’t stress enough. At the time of this writing I am in the process of completing an introductory survival guide to ChIP-seq analysis and one of the first topics that I cover is the importance of collecting your experimental meta-data. In fact, I dedicated an entire entry to the topic of meta-data.

While I gave a few practical reasons for keeping track of your metadata and what to do if you can't find everything you need, Dr. Wilkinson published an article on The FAIR Guiding Principles for scientific data management and stewardship which seemed pertinent to this and worth bringing up here. The point of this article also fits nicely with the need to obtain good quality, and more importantly, relevant control data.

To summarize the article, we basically need a set of common rules by which to operate when generating high-throughput data (or any kind, for that matter) and publishing that data in publicly available repositories. The argument is made that without good record keeping, deposited data may as well be tossed in the trash: if you can't describe how the data was generated, then you can't use or publish results based on it, since the analysis could potentially be invalidated.

I'm a little surprised that excellence in curation isn't already a standard put into practice, but it was made more or less clear that the repositories are willing to sacrifice good curation for participation. In other words, if we allow researchers to describe a bare minimum about the experiment, then more will participate in depositing their data. What alarms me is the apparent general sentiment that publicly or privately funded research is allowed to be handled this way without repercussions from the funding bodies. After all, the output from a lab is a reflection on the choice made to fund that lab.

So to remedy this problem and help spread the word on better data curation practices, let's have a look at the FAIR principles put forward by Dr. Wilkinson (1).

The FAIR Guiding Principles

To be Findable:

F1. (meta)data are assigned a globally unique and persistent identifier

F2. data are described with rich metadata (defined by R1 below)

F3. metadata clearly and explicitly include the identifier of the data it describes

F4. (meta)data are registered or indexed in a searchable resource

To be Accessible:

A1. (meta)data are retrievable by their identifier using a standardized communications protocol

A1.1 the protocol is open, free, and universally implementable

A1.2 the protocol allows for an authentication and authorization procedure, where necessary

A2. metadata are accessible, even when the data are no longer available

To be Interoperable:

I1. (meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (meta)data use vocabularies that follow FAIR principles

I3. (meta)data include qualified references to other (meta)data

To be Reusable:

R1. meta(data) are richly described with a plurality of accurate and relevant attributes

R1.1. (meta)data are released with a clear and accessible data usage license

R1.2. (meta)data are associated with detailed provenance

R1.3. (meta)data meet domain-relevant community standards

 

To find the original article and more reading on the topic of research funding, follow the links below.


References and more reading

  1. Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018. doi: 10.1038/sdata.2016.18 (2016).
  2. http://www.thenewatlantis.com/publications/the-sources-and-uses-of-us-science-funding
  3. http://undsci.berkeley.edu/article/who_pays