Predicting gene expression from sequence


We describe a systematic genome-wide approach for learning the complex combinatorial code underlying gene expression. Our probabilistic approach identifies local DNA-sequence elements and the positional and combinatorial constraints that determine their context-dependent role in transcriptional regulation. The inferred regulatory rules correctly predict expression patterns for 73% of genes in Saccharomyces cerevisiae, utilizing microarray expression data and sequences in the 800 bp upstream of genes. Application to Caenorhabditis elegans identifies predictive regulatory elements and combinatorial rules that control the phased temporal expression of transcription factors, histones, and germline-specific genes. Successful prediction requires diverse and complex rules utilizing AND, OR, and NOT logic, with significant constraints on motif strength, orientation, and relative position. This system generates a large number of mechanistic hypotheses for focused experimental validation, and establishes a predictive dynamical framework for understanding cellular behavior from genomic sequence.

Prevalent Themes and Biological Insights

The examples detailed above highlight the key discoveries of our approach. First, we find a great deal of redundancy in the modes of transcriptional regulation (OR logic). Second, many factors require at least one partner to be functional (AND logic). Third, one mode of combinatorial regulation is the absence of a factor that would cause a different mode of regulation (NOT logic). Finally, we can now account for a large fraction of the information required for the proper expression of genes in response to relevant physiological perturbations and developmental dynamics in the two model organisms. The fact that this information resides within their 5′ upstream regions provides a global statistical proof for this important dogma in molecular biology.
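The three logic modes described above can be illustrated with a minimal sketch in Python. This is not the authors' probabilistic model; it only shows how AND, OR, and NOT constraints on motif presence compose into a single rule. The motif names (`motif_A` through `motif_D`) and the helper functions are hypothetical and chosen purely for illustration.

```python
from typing import Callable, Set

# A rule maps the set of motifs found in a gene's upstream region to True/False.
Rule = Callable[[Set[str]], bool]


def has(motif: str) -> Rule:
    """Rule that is satisfied when the named motif is present."""
    return lambda present: motif in present


def all_of(*rules: Rule) -> Rule:
    """AND logic: every sub-rule must hold (a factor needs its partners)."""
    return lambda present: all(rule(present) for rule in rules)


def any_of(*rules: Rule) -> Rule:
    """OR logic: any one sub-rule suffices (redundant modes of regulation)."""
    return lambda present: any(rule(present) for rule in rules)


def none_of(rule: Rule) -> Rule:
    """NOT logic: the sub-rule must fail (absence of a competing factor)."""
    return lambda present: not rule(present)


# Hypothetical rule: (motif_A OR motif_B) AND motif_C AND NOT motif_D.
example_rule = all_of(
    any_of(has("motif_A"), has("motif_B")),
    has("motif_C"),
    none_of(has("motif_D")),
)


def predicts_expression(upstream_motifs: Set[str]) -> bool:
    """Apply the hypothetical rule to the motifs detected upstream of a gene."""
    return example_rule(upstream_motifs)
```

In this sketch a gene carrying `motif_B` and `motif_C` satisfies the rule, while adding `motif_D` to the same region switches the prediction off. The approach described in the text additionally constrains motif strength, orientation, and relative position, which this Boolean simplification leaves out.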

However, whether all the requisite information resides in the local DNA is an open question. Because of the statistical nature of our approach, we cannot correctly predict all genes. Higher-order combinatorial interactions may be difficult to learn because they have few or unique instances in the genome. Also, we may not be finding all of the relevant sequence features. Some relevant features may be downstream or within coding regions, or may be undetectable by standard motif-finding algorithms. The proper description of some DNA regulatory elements may require nonadditive effects not included in our present position weight matrix description. Another potential limitation is that our heuristic learning algorithm may not be finding the optimal network. Finally, noise in the expression data may set a hard limit on our ability to learn the relevant sequence features and network structure.
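To make the additivity assumption concrete, the following sketch scores candidate sites with a generic log-odds position weight matrix. It is a textbook formulation, not the authors' implementation; the uniform background frequency, the pseudocount, and all function names are assumptions made for illustration. Because the site score is a plain sum over positions, any dependence between positions (a nonadditive effect) cannot be represented.

```python
import math
from typing import Dict, List, Tuple

BASES = "ACGT"


def log_odds_pwm(
    counts: List[Dict[str, float]],
    background: float = 0.25,   # assumed uniform background frequency
    pseudocount: float = 1.0,   # assumed smoothing to avoid log(0)
) -> List[Dict[str, float]]:
    """Convert per-position base counts into log2 odds against the background."""
    pwm = []
    for column in counts:
        total = sum(column.get(base, 0.0) for base in BASES) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.get(base, 0.0) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score_site(pwm: List[Dict[str, float]], site: str) -> float:
    """Additive score: each position contributes independently of the others."""
    return sum(column[base] for column, base in zip(pwm, site))


def best_hit(pwm: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Return (score, start offset) of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(
        (score_site(pwm, sequence[start:start + width]), start)
        for start in range(len(sequence) - width + 1)
    )
```

Scanning only one strand keeps the sketch short; a fuller version would also score the reverse complement, since the text notes that motif orientation matters.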

Other failures, however, may imply the existence of alternative regulatory mechanisms. For example, because we learn the regulatory programs from local sequence, our failures may indicate genes where longer-range interactions are important. A prominent cause of this type of failure may be silencing due to large-scale chromatin modification near telomeres (Gottschling et al., 1990) and mating loci (Aparicio et al., 1991), boundary elements which inhibit local DNA sequences from signaling nearby genes (Kellum and Schedl, 1991), or similar mechanisms which set up chromosomal domains of gene expression (Cohen et al., 2000). The fact that our failures are not spatially clustered more than would be randomly expected indicates that such chromatin domains, if responsible for our failures, appear to be of intermediate scale. What is the role of local chromatin modifications? Are all such modifications subservient to the local sequence features that recruit transcription factors, which in turn recruit chromatin-modifying machinery? These are important questions to address in future work, and we are currently exploring these possibilities.

Unlike the genetic code, the cis-regulatory code is not universal, and elucidating it for individual genes requires heroic experimental efforts (Davidson et al., 2003). We have developed a whole-genome computational framework for the systematic extraction of this combinatorial code and prediction of gene expression patterns from DNA sequence alone. The large number of combinatorial rules which pass our predictive validation criterion provides the community with a rich source of high-yield hypotheses for experimental analysis. Our success with C. elegans indicates that our general approach is applicable to multicellular eukaryotes, but the larger regulatory regions in these genomes still present a significant challenge. Also, combinatorial regulation in these organisms is likely to be much more elaborate. In this setting, successful motif detection and predictive modeling will undoubtedly benefit from cross-species comparisons of regulatory regions.

The results presented here clearly demonstrate that a sufficiently general and systematic whole-genome approach is able to infer predictive regulatory constraints from mRNA expression data and DNA sequence alone. Our ability to decipher more complex regulatory programs is currently limited by the availability of gene expression data. From physiological perturbations and temporal expression responses at the organismal level, we have identified the regulatory information in many previously uncharacterized genes in S. cerevisiae and C. elegans. With the increasing availability of high-quality tissue-specific expression data in model organisms (Kim et al., 2001) and humans, our method presents a framework for rapidly elucidating the transcriptional regulatory mechanisms that orchestrate diverse spatiotemporal processes in multicellular organisms.

1. Michael A. Beer and Saeed Tavazoie, Predicting gene expression from sequence, Cell, Vol. 117, 185–198, April 16, 2004, Copyright ©2004 by Cell Press