Data Analysis

Next Generation Sequencing Analysis

While no two sequencing projects have exactly the same goals, most fall into one of the three standard categories below; projects outside them are handled individually:

  • DNA sequencing/SNP discovery – After the sequencing data set has passed preliminary QC, reads are mapped to the target genome using a variety of mapping tools, including STAR and Cufflinks. If the goal of the project is to identify variation, SAMtools is then used to identify SNPs or other polymorphisms within the user’s data set. At the conclusion of the analysis, the user receives a file listing the genomic location of every polymorphism. This file can also be used to visualize the data set as a custom track on the UCSC Genome Browser.
  • mRNA expression analysis – After QC, reads are mapped and normalized using multiple tools, including the Cufflinks software package. Normalized read counts are then compared across samples using the Cuffdiff package, and the resulting p-values are adjusted for multiple testing. Users can expect a list of genes that are differentially expressed within their defined sample populations, plus a pathway analysis identifying candidate targets for further study.
  • miRNA analysis – After QC, short reads are mapped to the miRBase database of miRNAs. After read counts are normalized, groups of samples can be compared. A user can expect a list of miRNAs that are differentially expressed between experimental groups.
  • Other analyses – The GenCore also performs projects that do not fit into these standard analysis pipelines. To date, these have included de novo assembly of the transcriptome of an unsequenced species, analysis of the microbiome of the rhesus macaque, and other projects.
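
Cufflinks reports expression as FPKM (fragments per kilobase of transcript per million mapped fragments). The sketch below only illustrates what that unit means; it does not reproduce Cufflinks' own abundance model, and the function name and example numbers are hypothetical.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Illustrative only: the function name and arguments are assumptions,
    not part of the Cufflinks interface.
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # Scale by 1e3 (bases -> kilobases) and 1e6 (fragments -> millions).
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Hypothetical transcript: 500 fragments on a 2,000 bp transcript
# in a library of 20 million mapped fragments.
example_fpkm = fpkm(500, 2000, 20_000_000)
print(example_fpkm)  # 12.5
```

Dividing by transcript length and library size is what makes values comparable between long and short transcripts and between deeply and shallowly sequenced samples.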

Please feel free to contact the core directors to learn more about how our core can help with your sequencing data analysis.

Microarray Analysis

To assist users with analyzing their microarray data, the GenCore offers analysis services.

Identifying Differentially Expressed Genes – Statistical Testing

The goal of most microarray experiments is to identify genes whose expression has changed during some biological process.

After a dataset has passed all preliminary QC and has been normalized, the next step is to apply a statistical model to identify genes whose expression is altered by the biological process under study. Depending on the design of the experiment, different statistical tests can be run to determine which genes have significantly altered expression. For multivariate experiments, ANOVA may be the most suitable method. The GenCore uses the Partek Genomics Suite software platform for most statistical analyses. In other cases, data may be in pair-wise format, and it may then be more suitable to employ pair-wise analyses using t-tests or Significance Analysis of Microarrays (SAM). Lastly, in cases where there is an inherently high degree of intersample variation, such as for rectal biopsies, we have employed empirical Bayes statistics using the Bioconductor package limma. An important part of identifying genes with truly differential expression is to control for errors of inference during statistical testing by adjusting the level of significance used. We commonly employ the Benjamini-Hochberg False Discovery Rate (FDR) method for adjusting p-values. After assigning a degree of confidence to each gene by way of its p-value, we typically apply a second filtering criterion to further reduce the number of false positives; the most commonly used is a minimum fold-change.
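
The adjustment and filtering steps described above can be sketched as follows. This is a minimal pure-Python illustration of the Benjamini-Hochberg procedure followed by a fold-change filter, not the Partek or limma implementation; the gene names, p-values, fold-changes and both thresholds are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_degs(genes, pvalues, log2_fold_changes, fdr=0.05, min_abs_log2fc=1.0):
    """Keep genes passing both the FDR and the fold-change threshold."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        gene
        for gene, q, lfc in zip(genes, adjusted, log2_fold_changes)
        if q <= fdr and abs(lfc) >= min_abs_log2fc
    ]


# Hypothetical results for five genes.
genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
pvalues = [0.001, 0.008, 0.039, 0.041, 0.60]
log2fc = [2.3, -1.4, 0.4, 1.2, 0.1]

adjusted = benjamini_hochberg(pvalues)
selected = select_degs(genes, pvalues, log2fc)
print(selected)  # ['geneA', 'geneB']
```

Note that geneD has a raw p-value below 0.05 and a large fold-change, yet is excluded because its adjusted p-value rises above the 5% FDR threshold.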

Data Visualization, Identification of Transcriptomic Signatures and Pathway Analysis

Producing a list of genes with significant differential expression is usually the first step of data analysis. To extract the meaningful biological pathways represented by the data, several analysis strategies can be employed. To gain an overview of the data and identify any families of genes that share similar patterns of expression across time-points, phenotypes or other variables, it is common to employ a clustering method and display the data as a heat map. An example of a heat map is shown in Figure 2A. In this example, genes with similar expression patterns over time after SIV infection in sooty mangabeys and rhesus macaques were organized using hierarchical clustering, with Pearson dissimilarity (a metric that prioritizes similarity of expression patterns over absolute changes) used to calculate distances between individual genes. Depending on the biological question, various clustering methods and metrics that focus on differing aspects of the data (e.g. focusing on genes with the largest absolute fold-change, or limiting the analysis to a preset number of clusters) can be employed.
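
To illustrate why Pearson dissimilarity favors pattern over magnitude, the sketch below compares it with Euclidean distance on three made-up expression profiles. It assumes the dissimilarity is defined as 1 minus the Pearson correlation (some packages scale it differently), and it computes only the pairwise distances that a hierarchical clustering routine would take as input, not the clustering itself.

```python
import math


def pearson_dissimilarity(x, y):
    """1 - Pearson r: near 0 for identical patterns, near 2 for opposite ones."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    """Straight-line distance, sensitive to absolute expression changes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical log-expression profiles over four time-points.
profiles = {
    "gene1": [0.0, 1.0, 2.0, 1.0],
    "gene2": [0.0, 3.0, 6.0, 3.0],  # same shape as gene1, three times the amplitude
    "gene3": [2.0, 1.0, 0.0, 1.0],  # mirror image of gene1
}

d_same_shape = pearson_dissimilarity(profiles["gene1"], profiles["gene2"])
d_mirror = pearson_dissimilarity(profiles["gene1"], profiles["gene3"])
e_same_shape = euclidean(profiles["gene1"], profiles["gene2"])
e_mirror = euclidean(profiles["gene1"], profiles["gene3"])

print("Pearson:  ", d_same_shape, d_mirror)
print("Euclidean:", e_same_shape, e_mirror)
```

Under Pearson dissimilarity gene1 and gene2 are essentially identical and gene3 is maximally distant, whereas Euclidean distance places gene3 closer to gene1 than gene2 is.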

A central goal of transcriptome-wide microarray analysis is to extract the underlying biological pathways represented by the list of differentially expressed genes (DEGs). This type of analysis is typically referred to as pathway analysis, ontology mapping or enrichment analysis. The Yerkes GenCore employs a number of different methodologies and tools to identify the major biological pathways represented in a dataset:

Pathway Analysis – Pathway analysis typically involves testing for statistically significant enrichment of families of genes within the DEG list using proprietary databases such as those included in the Ingenuity Knowledgebase or GeneGo platforms. These tools also typically assist in visualizing pathways, showing both genes from the input DEG list and genes that lie within the same biological pathway, have a direct protein interaction, or show related gene expression. Because they maintain a searchable database of intergene relationships drawn from the literature, these tools allow rapid data mining of DEG lists. The GeneGo platform is employed in the Yerkes GenCore, and an example of an Ingenuity analysis being used to identify putative negative regulators of interferon signaling in SIV-infected sooty mangabeys is shown in Figure 2B.

Ontology Mapping – describes the analysis of a user gene list to detect the enrichment of genes representing established groups based on function, subcellular location or other categories. This categorization, or ontology, of known genes into pre-defined categories is maintained by the Gene Ontology (GO) consortium. Tools such as DAVID (Database for Annotation, Visualization and Integrated Discovery) utilize the GO hierarchy and test for significant representation of families organized by molecular function, biological process or subcellular location. Whereas tools that perform ‘pathway analysis’ tend to focus on well-known biological systems (e.g. ‘TLR4 signalling’), ontology analysis organizes genes into broader biological processes (e.g. ‘host defense’). The GenCore has used DAVID in several publications to obtain an overview of the molecular processes underlying a dataset.
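
The idea of testing a gene list for over-representation of a category can be sketched with a plain one-sided hypergeometric test. This is an illustration of the general principle only: DAVID itself uses a modified Fisher exact test, and the counts below are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_category, n_list, n_overlap):
    """Probability of seeing at least n_overlap category genes in the list.

    n_universe: genes measured on the array (the background)
    n_category: background genes annotated to the category
    n_list:     genes in the user's DEG list
    n_overlap:  DEG-list genes annotated to the category
    """
    upper = min(n_category, n_list)
    total = comb(n_universe, n_list)
    # Sum the hypergeometric probabilities for overlaps >= the observed one.
    tail = sum(
        comb(n_category, k) * comb(n_universe - n_category, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 20,000 background genes, 200 annotated to a category,
# a 100-gene DEG list, 8 of which fall in that category (about 1 expected).
p_enriched = hypergeom_enrichment_p(20000, 200, 100, 8)
print(p_enriched)
```

In practice one such p-value is computed per category, so the results need the same multiple-testing adjustment that is applied to the gene-level p-values.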

Gene Set Enrichment Analysis (GSEA) – is an advanced bioinformatics technique developed by Subramanian and colleagues at the Broad Institute. This method has several unique characteristics that differentiate it from other pathway/ontology tools: (1) GSEA identifies transcriptomic signatures within datasets using experimentally derived gene families, rather than gene families defined by ‘prior knowledge’ databases (e.g. the GO database, or the Ingenuity or GeneGo databases). In this way, GSEA can identify novel patterns in a user’s dataset that may go undetected by other techniques. Several databases maintain experimentally derived gene sets to screen against a user’s dataset (e.g. the Molecular Signatures Database, MSigDB, or CancerSigDB). (2) GSEA uses a flexible ranking algorithm based on the Kolmogorov-Smirnov statistic to rank genes between phenotypes and calculate the cumulative enrichment of a gene set towards one phenotype or the other. Because it ranks the entire dataset, GSEA does not rely on cut-offs (i.e. p-value and fold-change cut-offs) and examines the dataset as a whole, making it more robust. (3) GSEA is extremely flexible and can be used to compare datasets from two different microarray platforms or species to determine whether they share transcriptomic signatures. As an example, Figure 2C shows an application of GSEA that was used to compare the gene expression responses of HIV-infected Viremic Non-Progressor patients to those of SIV-infected sooty mangabeys. This analysis showed that, in addition to sharing phenotypic characteristics after HIV/SIV infection (high viral load, stable CD4+ counts, low immune activation), the two groups also share a transcriptomic response: reduced expression of Interferon Stimulated Genes (ISGs).
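
The running-sum statistic behind point (2) can be sketched as below. This is a simplified illustration of a weighted Kolmogorov-Smirnov-like enrichment score, not the Broad Institute software; the gene names, scores and the `weight` parameter default are assumptions, and the permutation step used to assess significance is omitted.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Return (ES, running sum) for a gene set over a ranked gene list.

    ranked_genes must already be sorted by decreasing score, with scores
    in the same order. Hits step the sum up in proportion to |score|**weight;
    misses step it down by a constant amount.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    hit_weights = [abs(s) ** weight if hit else 0.0 for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, current = [], 0.0
    for w, hit in zip(hit_weights, in_set):
        current += w / total_hit if hit else -1.0 / n_miss
        running.append(current)
    # The enrichment score is the largest deviation from zero, keeping its sign.
    return max(running, key=abs), running


# Hypothetical ranking of ten genes, from most up- to most down-regulated.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
scores = [3.0, 2.5, 2.0, 1.0, 0.5, -0.5, -1.0, -2.0, -2.5, -3.0]

es_top, _ = enrichment_score(ranked, scores, {"g1", "g2", "g3"})
es_bottom, _ = enrichment_score(ranked, scores, {"g8", "g9", "g10"})
print(es_top, es_bottom)
```

A set concentrated at the top of the ranking gives a score near +1 and one concentrated at the bottom gives a score near -1, without any p-value or fold-change cut-off being applied to the individual genes.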

Figure 2 – Data visualization and pathway analysis techniques for microarray data. (A) Gene expression during pathogenic infection of rhesus macaques (RMs) and non-pathogenic infection of sooty mangabeys (SMs) reveals homeostatic regulation of the interferon stimulated gene (ISG) response. (B) Ingenuity Pathway Analysis identifying potential regulators of the ISG response that are expressed in non-pathogenic infection. (C) Gene Set Enrichment Analysis of the ISG signature from macaques demonstrates its enrichment in transcriptomes of HIV-infected rapid progressors (RPs) compared to Viremic Non-Progressors (VNPs), who show high viremia but no disease progression.

Please contact yerkes-nhpgenomics@emory.edu or email GenCore directors steven.bosinger@emory.edu or zpjohns@emory.edu for details on data analysis options and rates.