Functional genomics to identify candidates for gene regulatory elements in long genomic DNA sequences

Ross Hardison
Professor of Biochemistry
Department of Biochemistry and Molecular Biology
The Pennsylvania State University

This essay summarizes our functional genomics approach to identifying sequences that are likely to be involved in gene regulation. It is also a progress report covering our work in 1997-1998.

Consider the following postulate:

Highly conserved DNA sequences are invariably involved in an important function.

This statement seems simple and perhaps self-evident to many, and it is the basis for our application of functional genomics to find gene regulatory sequences. We use computational tools to find "highly conserved sequences" and use these to guide experimental tests of their function, especially in control of gene expression.

The apparent simplicity of this statement belies several important issues that need to be addressed. Most importantly, has it been experimentally verified? One can consider it almost a tautology, i.e. Darwinian selection acting on natural sequence variation will preclude changes in truly important sequences, so if a sequence changes little over evolutionary time, it must be important. But is selection the only way to keep sequences from changing? Are there zones of the genome that are protected from sequence alteration? Those are highly speculative questions. One can address the issue directly (but not exhaustively, yet) by testing hypothesized roles for highly conserved sequences. This approach has a long and productive history. Sequence conservation played a key role in discovery of many gene regulatory sequences, such as the -10 box in bacterial promoters (Pribnow in the 1970s), several of the major sequence elements of eukaryotic promoters (e.g. Hogness and many others in the late 1970s and 1980s), and the immunoglobulin gene enhancer (Leder and colleagues, early 1980s). Curiously, despite the success of using sequence conservation as a factor in finding gene regulatory elements, it is often not included in contemporary analysis. I see many manuscripts and even published papers in which the authors search a new sequence for features such as binding sites for transcription factors, but pay no attention to whether these motifs are present in the homologous gene in a related species. Somewhere, there is a disconnect that needs to be ligated.

Several recent studies that we initiated or participated in show that all the highly conserved sequences tested are involved in an important function. We focus on conservation in non-coding regions, and our tests are for effects on gene expression. The list includes E-boxes in an enhancer called HS2 in the beta-globin locus control region, or LCR (Elnitski et al., 1997), cell-specific positive and negative elements in HS2 (Elnitski et al., 1999), protein-binding sites in another part of the LCR called HS3 (Shelton et al., 1997), and regulatory elements in the 5' flanking region of the adult delta-globin gene (Tang et al., 1997) and the embryonic epsilon-globin gene (Li et al., 1998). These studies lend strong support to our central postulate -- that strong conservation is a reliable indicator of function. Indeed, I argue that the hard part of this process is hypothesizing an appropriate function to test, and if any strongly conserved sequence fails to generate a phenotype upon mutation, then one has likely tested for the wrong phenotype.

A second critical issue is what "strongly conserved" means. We seek an objective definition that lends itself to computational analysis. If a homologous character is shared between two species that are themselves related by descent from a common ancestor, then that character is conserved. However, this does not necessarily mean that the character is under selective pressure. For instance, the two species may not have been apart long enough for neutral evolution to change nonfunctional sequences. For hemoglobin gene clusters, we have access to DNA sequences from as many as 5 related species, e.g. eutherian mammals such as human, the prosimian primate galago, rabbit, goat, and mouse. Thus we can ask whether a character (a string of nucleotides in our studies) is in common among these species.

In order to do this, we first had to develop a new program, called yama, to compute an alignment of multiple, very long DNA sequences. [Optimal multiple alignments cannot be computed, but useful ones can be. ClustalW is good for protein sequences or coding sequences, but yama is better for genomic DNA sequences.] Results for mammalian globin gene clusters are maintained at our web site, the Globin Gene Server (https://globin.bx.psu.edu/).

We now provide 5 different tools for finding conserved sequences in the aligned sequences of globin gene clusters. The simplest are column-agreement measures. For instance, if a string of columns (i.e. aligned nucleotides) is exactly the same (invariant) in all mammals examined, everyone agrees that it is highly conserved. The calculations get more challenging when one allows mismatches, or even gaps, in the blocks of conserved sequences. Our tools include novel methods to allow a user-defined number of mismatches in each row of aligned blocks. We also provide computations of "information content" as well as minimal evolutionary distance to find the blocks. Nick Stojanovic devoted his Ph.D. thesis to these issues, and the resulting tools are on our web site. We are in the process of finishing some interesting calibration tests of each of these tools against known regulatory regions from beta-globin gene clusters (most recently enhanced by insightful analysis by Liliana Florea). A manuscript describing these tests should be submitted soon. The answer to the question of "What is highly conserved?" will be determined by the user, but we provide several independent ways to examine the issue, each with adjustable parameters which are being calibrated according to known regulatory elements. This will not produce unique answers, but it does provide objective answers.
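The column-agreement and information-content ideas described above can be illustrated with a minimal Python sketch. It is not the code behind the Globin Gene Server tools. The function names, the fixed-width window, the majority-vote consensus, and the uncorrected information measure (2 bits minus the column entropy, with no small-sample correction) are all assumptions made for illustration.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content (bits) of one alignment column, ignoring gaps.

    Assumption: plain log2(alphabet_size) minus Shannon entropy,
    without any small-sample correction.
    """
    bases = [b for b in column.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    n = len(bases)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(bases).values())
    return math.log2(alphabet_size) - entropy


def invariant_blocks(rows, min_length=6):
    """Half-open column ranges (start, end) where all rows share one base."""
    ncols = len(rows[0])
    blocks, start = [], None
    for i in range(ncols + 1):
        same = (i < ncols and rows[0][i] != "-"
                and all(r[i] == rows[0][i] for r in rows))
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start >= min_length:
                blocks.append((start, i))
            start = None
    return blocks


def window_is_conserved(rows, start, width, max_mismatches):
    """True if every row differs from the window's majority consensus
    in at most max_mismatches columns (a gap counts as a mismatch)."""
    consensus = [Counter(r[i] for r in rows).most_common(1)[0][0]
                 for i in range(start, start + width)]
    return all(
        sum(r[start + j] != c for j, c in enumerate(consensus)) <= max_mismatches
        for r in rows
    )


# Toy alignment of three rows; not real globin sequence.
rows = ["ACGTGACCTA",
        "ACGTGACGTA",
        "ACGTGATCTA"]
print(invariant_blocks(rows, min_length=4))
print(window_is_conserved(rows, 0, 10, max_mismatches=1))
```

Each function exposes the kind of adjustable parameter mentioned above (minimum block length, mismatches allowed per row), so the definition of "highly conserved" remains the user's choice.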

If you don't have the luxury of three or more genomic DNA sequences, can you use the homologous sequence from two species to find conserved sequences? Yes, you can, but it is not obvious a priori which two to use. Fortunately for those working in mammals, human and mouse sequences are highly informative at most loci. We developed a compact display of annotated aligned sequences, which we call percent identity plots, or pips (Hardison et al., 1997). Application of these to several loci in mouse and human showed they were very useful for finding regulatory elements, and these plots are being used by more and more investigators. We have argued that this success provides a strong rationale for determining the entire genomic sequence of the mouse (Hardison et al., 1997), and we are gratified that this is now a stated goal of the human genome project. However, some loci are changing too slowly (e.g. those encoding T-cell receptor genes) and some are changing too rapidly (e.g. the DNA-repair gene ERCC1) for human-mouse comparisons to be informative about regulation. Such loci need to be sequenced in other comparison organisms. Current efforts (from Scott Schwartz and Cathy Riemer) to improve this analysis include Java applets that provide a user interface to explore pips and the underlying alignments; prototypes are on the Globin Gene Server for several loci in addition to the hemoglobin genes. We anticipate that tools such as these will be available for all model organisms. This will require sequence determination not only of each model genetic organism, but also of its most-informative phylogenetic neighbor. These should be chosen carefully, but currently popular pairs include human and mouse for mammals, Drosophila melanogaster and Drosophila virilis for dipterans, and perhaps E. coli and Salmonella typhimurium for eubacteria. What are good yeasts and worms for comparison to sequenced species?
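To make the idea of a percent identity plot concrete, here is a minimal Python sketch of the numbers such a plot could display. It is not the published pip software: it assumes a single pairwise alignment given as two gapped strings, and it reports each gap-free segment by its coordinates in the first sequence together with its percent identity. The function name and the `min_length` parameter are hypothetical.

```python
def pip_segments(aligned_a, aligned_b, min_length=1):
    """Return (start, end, percent_identity) for each gap-free segment.

    start and end are 1-based, inclusive positions in sequence A.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    segments = []
    pos_a = 0            # bases of sequence A consumed so far
    seg_start = None     # position in A where the open segment began
    matches = length = 0
    # A trailing gap column acts as a sentinel that closes the last segment.
    for a, b in zip(aligned_a.upper() + "-", aligned_b.upper() + "-"):
        if a != "-" and b != "-":
            if seg_start is None:
                seg_start, matches, length = pos_a + 1, 0, 0
            length += 1
            matches += (a == b)
        elif seg_start is not None:
            if length >= min_length:
                segments.append(
                    (seg_start, seg_start + length - 1, 100.0 * matches / length)
                )
            seg_start = None
        if a != "-":
            pos_a += 1
    return segments


# Toy alignment; not real human or mouse sequence.
for start, end, pid in pip_segments("ACGTAC--GTTA", "ACGAACTTG-TA"):
    print(f"{start}-{end}: {pid:.1f}% identity")
```

Plotting each segment as a horizontal line from `start` to `end` at height `percent_identity` gives a compact picture in which well-conserved stretches stand out along the first sequence.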

Along with the advances in analysis of genomic DNA sequences for potential regulatory elements, it is equally important to keep track of all the functional information about regulatory elements. We have made progress in two areas to help meet this major challenge. First, we recently published a database of experimental results on globin gene expression, called dbERGE (Riemer et al., 1998). This unique database supports entry of individual data from published papers in sufficient detail to allow sophisticated queries. For instance, DNA constructs are defined to nucleotide resolution with the segments in a defined order and copy number. Results of DNA transfers are recorded quantitatively in a manner that can support the many different ways of reporting results. This detailed database is built from a home-grown database management system that supports extensive nesting, lists and other features. It demonstrates that such a database is indeed possible. We have a forms-based query interface on the Globin Gene Server, and it allows one to formulate rather simple queries. Our database management system allows much more interesting queries, but limitations in Java have made it difficult to deploy complex forms-based interfaces over the WWW. However, recent improvements in Java's GUI toolkit look promising, and we hope to provide better user interfaces for this database before too long.
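The kind of nested record described above (constructs made of ordered segments with nucleotide coordinates and copy numbers, linked to quantitative results) can be sketched as follows. This is not the dbERGE schema. Every class name, field name, and value below is a hypothetical placeholder chosen only to show the shape of the data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    source: str      # label of the reference sequence (placeholder)
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    copies: int = 1  # copy number of this segment in the construct

    @property
    def length(self) -> int:
        return (self.end - self.start + 1) * self.copies


@dataclass
class Construct:
    name: str
    segments: List[Segment]  # list order is the order within the construct

    @property
    def length(self) -> int:
        return sum(s.length for s in self.segments)


@dataclass
class Result:
    construct: Construct
    assay: str    # e.g. a description of the DNA-transfer experiment
    value: float  # quantitative measurement
    units: str    # how the paper reported it


def constructs_covering(results, source, position):
    """Names of constructs with a segment covering `position` on `source`."""
    return [
        r.construct.name
        for r in results
        if any(s.source == source and s.start <= position <= s.end
               for s in r.construct.segments)
    ]


# Placeholder data; coordinates and values are invented for illustration.
c1 = Construct("c1", [Segment("locusX", 100, 199, copies=2),
                      Segment("reporter", 1, 50)])
c2 = Construct("c2", [Segment("reporter", 1, 50)])
results = [Result(c1, "transient transfection", 3.5, "fold over c2"),
           Result(c2, "transient transfection", 1.0, "fold over c2")]
print(constructs_covering(results, "locusX", 150))
```

Keeping segment order and copy number explicit is what lets a query distinguish, say, one copy of an element from a tandem repeat of it.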

Second, we provide information about naturally occurring variants in the hemoglobin genes. Mutations in hemoglobin genes cause the most common genetic diseases in humans, including sickle cell disease and thalassemias. Decades of study have uncovered a large number of hemoglobin variants (not all of which have an abnormal phenotype) and a large number of mutations that affect the level of gene expression (leading to thalassemias and hereditary persistence of fetal hemoglobin). These have been compiled in books, but not in an electronic format. We joined efforts with Dr. Titus Huisman and Dr. David Chui to provide electronic versions of Dr. Huisman's syllabi of hemoglobin variants and thalassemia mutations on the World Wide Web (Chui et al., 1998; Hardison et al., 1998). These are now widely used around the world. Furthermore, we worked with Dr. Heikki Lehväslaiho to convert the variants from the free-text "syllabus" format into a rigorous database called HBVARS. The latter is available at the EBI SRS server (http://srs.ebi.ac.uk/), and supports much more sophisticated queries. We also have a prototype genotype-phenotype database that is being improved and expanded as part of an international collaboration.

The final topic under functional genomics is integration of information in these disparate formats to support more sophisticated queries. We consider this to be a major challenge in bioinformatics, and we expect that the hemoglobin gene system will provide an important model. For instance, we have information on (i) conserved sequences around fetal gamma-globin genes, (ii) experimental results that implicate some of these sequences in control of developmental timing of expression, and (iii) naturally occurring mutations around these genes that cause the genes to be expressed in adult life as well as fetal life. This information is recorded in different formats in different databases, and the information on conservation is not in a database. Can one develop a management system and query engine that would allow the computer to find all information relevant to a question such as "Find all DNA sequences conserved in simian primates (where gamma-globin genes are expressed fetally) but not in other mammals (where gamma-globin genes are expressed embryonically) that affect the level of expression of the gamma-globin genes when mutated and which have been implicated in hereditary persistence of fetal hemoglobin in humans"? It is true that one could do this by hand and eye now. But as we get more complete information, including data on loci on other chromosomes that are also implicated in controlling developmental timing of gamma-globin gene expression, automated searches will become increasingly important.
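As a rough illustration of what such an integrated query involves, the following Python sketch joins three toy data sets on a shared coordinate system: conserved blocks, experimental results, and naturally occurring variants. It is a sketch under strong assumptions, not an existing query engine: it supposes that all three sources have already been mapped to the same coordinates, and all field names, identifiers, and positions are hypothetical placeholders.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def integrated_query(conserved_blocks, experiments, variants, phenotype):
    """Conserved blocks that (i) overlap a tested region whose mutation
    changed expression and (ii) contain a variant with the given phenotype."""
    hits = []
    for start, end in conserved_blocks:
        tested = [e["id"] for e in experiments
                  if e["changes_expression"]
                  and overlaps(start, end, e["start"], e["end"])]
        linked = [v["id"] for v in variants
                  if v["phenotype"] == phenotype
                  and start <= v["position"] <= end]
        if tested and linked:
            hits.append({"block": (start, end),
                         "experiments": tested,
                         "variants": linked})
    return hits


# Placeholder records; none of these positions or identifiers are real.
conserved = [(100, 120), (300, 330), (500, 520)]
experiments = [
    {"id": "exp1", "start": 90, "end": 110, "changes_expression": True},
    {"id": "exp2", "start": 305, "end": 310, "changes_expression": False},
    {"id": "exp3", "start": 495, "end": 505, "changes_expression": True},
]
variants = [
    {"id": "var1", "position": 105, "phenotype": "HPFH"},
    {"id": "var2", "position": 320, "phenotype": "HPFH"},
    {"id": "var3", "position": 510, "phenotype": "thalassemia"},
]
print(integrated_query(conserved, experiments, variants, "HPFH"))
```

The join itself is simple. The hard part, as noted above, is getting conservation, experimental, and variant data into compatible formats and coordinates in the first place.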