SeqPyPlot – Origin story and Data prep Guide

SeqPyPlot – Origin story and Data prep Guide

In this post, I’m going to provide a detailed walk-through for how to use SeqPyPlot, a module that I’m currently developing to aid with the analysis and visualization of multiple series high throughput data.

A common problem that researchers seem to have these days is the inability to manipulate and create meaninful plots using the results from next generation sequencing results. This is in part due to the rapid drop in sequencing costs. The Beijing Genomics Institute, for example, is offering a full sample processing pipeline for RNA-seq including ribo-depletion, Low-input Library Prep, and sequencing to a depth of 60-million reads for a mere $640 AUS per sample. That is less than $500US for a transcriptome. This means that the realm of ‘-omics’ is now available to anyone who can gain access to a heavyweight computer and a massive increase in the amount of data generated.

Despite this, there are surprisingly few numbers of computational savvy programmers out there. These skills include the ability to handle and manipulate large datasets, command line knowledge and familiarity with a large number of software, background information on the efficacy of specific statistical methods, and perhaps the ability to write short programs.

With high throughput sequence data, it hurts to not have these skills.

If you were to have a sudden need to compare expression across 150 genes, you’d be looking at hours of work and a really sore mouse clicking hand. With the right skills however, this becomes a painless task completed in the time it takes to drink a coffee. Unfortunately, the problem seems to be that learning how to program seems to be considered a poor investment of time, giving returns neither great enough nor fast enough. i.e. Five hours of mouse clicking is better than months of frustration to accomplish the same thing.

To help mitigate this, I am currently developing a pipeline to process normalized high through-put sequence data. The pipeline takes in normalized expression data, performs differential expression, and then plots a list of genes selected either manually, or from the output from differential expression testing. The pipeline is written in OOP python and sub-process multiple R programs for differential expression testing.

The Original Pipeline

During my PhD we decided to survey the transcriptomes throughout the entirety of development of the mouse genital tubercle and to the same with a mutant model. We thought knowing the approximate values of gene expression would yield collections of related genes which would then provide a good launching point for other experiments. More importantly, we thought that we could use a single pooled sample per time point and condition.

Unfortunately, we were wrong and were left with a data set that gave us a qualitative sense of the transcriptomes,  a list of genes of arbitrary length that were possibly mis-regulated, nearly all of which were apparently unrelated. And to interject a bit of personal history – it was this experience coupled with a difficult experience with the genome of the same mutant mouse that led me to actively pursue computational biology as and area of expertise. I took up programming in my personal time and after a great deal of self motivated study decided to write a program that could help me filter and plot data rapidly.

The original pipeline was designed to process output from cufflinks and test a series of differential expression (DE) cutoff thresholds, thus allowing the user to  rerun the pipeline with the best  possible guess-timate for which DE cutoff to use, remove far-outliers, and return all of the genes that were DE along with their expression data. The data would then be passed in to a plotting program that would  take in a list of an arbitrary number of genes,  retrieve the  data, and give the user a fast survey of comparative time series expression across all of the genes at once.

Here is a diagram showing the pipeline:

full-pipeline-chart

 

The filtering program, GetCuffData, is useful for quickly filtering normalized datasets that don’t include biological replicates. I should stress here that this program is not meant to perform any legitimate statistical tests. Its job is to filter data based on parameters that you manual set. The revised pipeline handles biological replicates, and also performs proper Differential Expression Testing.

The plotting program, SeqPyPlot.py, is a nice general use tool as well as a function part of this pipeline.

User Guide

The first thing to know is that the current version (v0.1) is only designed to handle data in the output format of the Cufflinks suite program ‘CuffNorm’. Future updates will handle output from more software, but for now, users of CuffNorm can take advantage of SeqPyPlot.

The key utility here is that we can visualize the expression of large numbers of genes across multiple conditions very quickly. This requires a data set that is actually internally comparable – this means that  the input  for  this program needs to come from a tool that normalizes data together. Cuffdiff does this, Cuffnorm of course does this, as well as many other tools for analyzing high throughput data sets.

Data Prep

The output of CuffNorm is basically a table, or data frame. The language of SeqPyPlot doesn’t implement Pandas, or any other software using ‘data frames’, it uses the csv module to parse a .txt file. So we’re going to format our data to  agree with this.

TL;DR – Format your data as shown below in excel, or similar program, and save as a Tab-deliminted {.txt} file.

Follow this link to download the sample data and test the program – SampleData.zip

If you run the main plotting program with the -h option, you’ll get the usage printed:

>python SeqPyPlot.py

SeqPlot - v0.1
usage: SeqPlot.py [-h] [-t] [-o] [-c] [-n] [gene_list] [plotter_data]

Plot time series expression data without replicates.

positional arguments:
 gene_list              Single Column gene list in txt file.
 plotter_data           Formatted plot-able data.

optional arguments:
 -h, --help             show this help message and exit
 -t, --time             A comma separated list of time points.
 -, --out               Output Folder Name
 -c, --condition        A comma separated list of conditions (max 2)
 -n, --num              Number of samples per plot.

The option -n, –num  defaults to (2) and indicates the number of series, or lines, that you  intend to  draw on the  plots. It is assumed that you’re using the program to plot multiple time series date. This option will affect how we set up the .txt file we’ll be using to load data.

There are basically three things you need to know to format your data:

  1. First Column = Gene Name
  2. Column 2 to n = First series follows
  3. Column n + 1 to end = Second time series

Here is what this will look like

dataformat1

You’ll notice that at the moment there is no space for any replicates or error calculations. This will be implemented in a future update with options to specify whether or not to account for that.

At the moment you need to make sure you don’t have any genes with identical names (which shouldn’t be the case anyways…).

T1 = Expression value for Time Series 1: Time point 1.

If you want to plot a single time series (because you like how the plots looks…) you can specify [-n 1] on the command line to only read in Time Series 1.

A nice feature this program is that you can plot uneven data sets. For example, If you have [T1, T2, T3, T4] for Time series 1, but  only [T1, T2, T4] for Time series 2. The way to handle this is to simple leave the missing time point blank by inserted an empty. The program will create a mask layer and your series line will be drawn right  past it.

Once your data is organized properly, you can save it to a .txt file. Don’t worry about encoding, this is handled.

Prep Gene List

With your  data formatted, we can select some genes and save them to another .txt file.

genelist

Save this also as a Tab-delimited {.txt} file. The  rule here is one gene per row. Don’t worry about formatting of the gene names, everything is converted at run time to the same format.

With your data prepared, you can set the rest of the options for the plot labels. The defaults are listed below, and you should follow the same formatting as these defaults.

You can also review the ReadMe.

optional arguments:
 -h, --help             show this help message and exit
 -t, --time             Default: 1,2,3,4,5
 -o, --out              Default: Default_out
 -c, --condition        Default: Series1,Series2
 -n, --num              Default: 2 Number of samples per plot.