P. Universidad Católica de Chile

Facultad de Ciencias Biológicas
Departamento de Genética Molecular y Microbiología

Plant Systems Biology Lab

Step-by-step construction of gene co-expression networks from High-Throughput Arabidopsis RNA sequencing data.


The rapid increase in the availability of transcriptomics data generated by RNA sequencing represents both a challenge and an opportunity for biologists without bioinformatics training. The challenge is handling, integrating and interpreting these data sets. The opportunity is to use this information to generate testable hypotheses about the molecular mechanisms controlling gene expression and biological processes. A successful strategy to generate tractable hypotheses from transcriptomics data has been to build undirected network graphs based on patterns of gene co-expression. However, this is not always easily done by biologists without bioinformatics training. To make the process of constructing a gene co-expression network more accessible to biologists, here we provide step-by-step instructions using published RNA-seq experimental data obtained from a popular public database. This guide includes basic instructions for the operation of widely used open source platforms such as Bio-Linux, R and Cytoscape. Although the data used in this example were obtained from Arabidopsis thaliana, the workflow developed in this guide can be easily adapted to work with RNA-seq data from any organism.
README
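
The core idea of a co-expression network described above can be sketched in a few lines. The guide itself works with Bio-Linux, R and Cytoscape; the Python below is only an illustrative sketch, not the guide's code. The function name, the gene identifiers, the toy expression values and the correlation cutoff of 0.9 are all assumptions made for this example.

```python
import numpy as np

def coexpression_edges(expr, gene_ids, min_abs_r=0.9):
    """Return undirected edges (gene_a, gene_b, r) for every gene pair whose
    Pearson correlation across samples has absolute value >= min_abs_r.

    expr: 2-D array-like, rows = genes, columns = samples.
    min_abs_r: hypothetical cutoff chosen for illustration only.
    """
    expr = np.asarray(expr, dtype=float)
    r = np.corrcoef(expr)  # gene-by-gene Pearson correlation matrix
    edges = []
    n = len(gene_ids)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: the graph is undirected
            if abs(r[i, j]) >= min_abs_r:
                edges.append((gene_ids[i], gene_ids[j], float(r[i, j])))
    return edges

# Toy expression matrix (made-up values): 3 genes measured in 4 samples.
gene_ids = ["geneA", "geneB", "geneC"]
expr = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.1],
    [4.0, 1.0, 3.0, 2.0],
]

edges = coexpression_edges(expr, gene_ids, min_abs_r=0.9)

# Print a tab-delimited edge list: source, target, correlation.
for gene_a, gene_b, corr in edges:
    print(f"{gene_a}\t{gene_b}\t{corr:.3f}")
```

A tab-delimited edge list of this kind is a format that Cytoscape can import as a network table. With real RNA-seq data, the expression values would normally be normalized first and the cutoff chosen with more care than in this toy case.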




A tutorial on constructing simple biological networks for understanding complex high-throughput data in plants.


Technological advances in the last decade have enabled biologists to produce increasing amounts of information for the transcriptome, proteome, interactome and other -omics data sets in many model organisms. A major challenge is the integration and biological interpretation of these massive datasets in order to generate testable hypotheses about gene regulatory networks or molecular mechanisms that govern system behaviors. Constructing gene networks requires bioinformatics skills to adequately manage, integrate, analyze and productively use the data to generate biological insights. In this chapter, we provide detailed methods for users without prior knowledge of bioinformatics to construct gene networks and derive hypotheses that can be experimentally verified. Step-by-step instructions for acquiring, integrating, analyzing and visualizing genome-wide data are provided for two widely used open source platforms, R and Cytoscape. The examples provided are based on Arabidopsis data, but the protocols presented should be readily applicable to any organism for which similar data can be obtained.
TUTORIAL
README




Discriminative local subspaces in gene expression data for effective gene function prediction

Motivation: The massive amounts of genome-wide gene expression data available for many organisms have motivated the development of computational approaches that leverage this information to predict gene function. Among the successful approaches, maximum margin machine learning methods such as Support Vector Machines have shown superior classification accuracy. Biologists, however, often prefer simpler methods like coexpression networks or clustering, because they are based on biologically meaningful concepts that scientists can understand and interpret.


Results: In this work we present Discriminative Local Subspaces (DLS), a novel supervised machine learning method designed to analyze gene expression data and to predict discriminative functional networks for a biological process of interest. DLS uses the knowledge available in the Gene Ontology (GO) project to generate informative training sets that guide the discovery of expression signatures: discriminative expression patterns exhibited by genes associated with a biological process of interest in a particular subset of experimental conditions. These signatures provide key information to predict new functional connections. Furthermore, to deal with the still incomplete list of gene annotations in GO, DLS incorporates a novel scheme to discover false negatives. Our results using an Arabidopsis thaliana dataset show that DLS outperforms the predictions of previous works based on Coexpression Networks or Support Vector Machines. Beyond its power for function prediction, we argue that DLS stands out among similar functional prediction methods because it provides valuable insight that helps biologists understand the predictions and guide future experiments.


Availability and Implementation:
The MATLAB code to run DLS is freely available here (for academic use only).
The complete data set to run DLS is available here.
You can download a small data set to test the code here.



VirtualPlant

VirtualPlant integrates genomic data and provides visualization and analysis tools for rapid and efficient exploration of those data. The goal of VirtualPlant is to provide tools that help researchers generate biological hypotheses. Find out more here.



BioMaps

This program allows you to find gene ontology assignments (GO Consortium) or functional categories (MIPS) for one or more lists of genes. It reports the terms that are over-represented in the list(s) provided, compared with the frequency of the term in the whole genome. Run BioMaps here.
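
The comparison described above, a term's frequency in a gene list against its frequency in the whole genome, can be illustrated with a one-sided hypergeometric test. This is only a sketch under stated assumptions: the text does not say which statistical test BioMaps applies, and the function name and all counts below are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(genome_size, genome_with_term, list_size, list_with_term):
    """Probability of seeing at least `list_with_term` annotated genes when
    `list_size` genes are drawn without replacement from a genome of
    `genome_size` genes, `genome_with_term` of which carry the term.

    Assumption: a one-sided hypergeometric test, used here as a common
    choice for over-representation; not necessarily what BioMaps does.
    """
    max_hits = min(genome_with_term, list_size)
    total = comb(genome_size, list_size)  # all possible gene lists of this size
    tail = sum(
        comb(genome_with_term, k)
        * comb(genome_size - genome_with_term, list_size - k)
        for k in range(list_with_term, max_hits + 1)
    )
    return tail / total

# Made-up counts: 50 of 1000 genes carry the term genome-wide (5%),
# while 5 of the 20 genes in the submitted list carry it (25%).
p_value = hypergeom_enrichment_p(1000, 50, 20, 5)
print(f"p = {p_value:.4g}")
```

A small p-value suggests the term appears in the list more often than expected by chance. When many terms are tested at once, a multiple-testing correction would normally be applied as well.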



Sungear

Sungear is a software system that supports a rapid, visually interactive and biologist-driven comparison of large data sets. The data sets can come from microarray experiments (e.g. genes induced in each experiment), from comparative genomics (e.g. genes present in each genome), or even from non-biological applications (e.g. demographics or baseball statistics). Access Sungear from the VirtualPlant site or try our demo version here. Sungear was developed at New York University.



The Plant Specific Database

The main goal of PLASdb is to stimulate further research on some of the least-studied proteins of plants. To this end, we have compiled and integrated information from public data sources (e.g. The Institute for Genomic Research [TIGR], Munich Information Center for Protein Sequences [MIPS], The Arabidopsis Information Resource [TAIR]) with links to the original information, and provided links to external databases (e.g. Salk Institute Genomic Analysis Laboratory [SIGNAL]). In addition, we have performed predictions of subcellular localization and transmembrane helices, analyzed gene expression in organs based on expressed sequence tag (EST) frequencies and microarray data, and grouped protein families based on sequence similarity clustering (BLASTCLUST). Web-based interfaces to several search engines allow gene-driven or exploratory modes of data access. Information in PLASdb can be quickly downloaded in text or Excel format for further analysis. Explore the database and find out more about this project here.