Tips and Examples for Using the PipTools Programs   (all-Perl version)


TABLE OF CONTENTS


Introduction

This page provides examples, from a biologist's perspective, of how the PipTools programs can be used in real-life situations. It does not attempt to describe every tool exhaustively, but rather to give a flavor for what the tools can do, and especially how they can be combined to accomplish more complex tasks.

We have designed this toolkit in a modular fashion, where each program performs one small function, in order to provide more flexibility to meet your needs. Some tasks can be handled in one step by a single program (these should be fairly straightforward after consulting the Program Reference page), while others will require a multi-step procedure that applies several tools in succession. The examples shown here illustrate a few common situations that can be dealt with through "recipes" of this sort; additional combinations and variations are left to your creativity.

Keep in mind that when running the Perl versions of the tools on certain systems, you need to add  perl -S  to the beginning of each command; see Syntax Conventions for more information. (This does not apply to sim4, which is written in C.)

If you have any suggestions for additional tools or improvements that would make your work easier, please contact us at pipmaster@bio.cse.psu.edu.

Building an Exons File from a GenBank Entry

Suppose you just finished assembling some human contigs and want to show how they align with the corresponding mouse sequence, or vice-versa. You could submit the sequences to PipMaker, but the output would show only the alignments, with no annotation -- not even where the known genes are. This leaves the reader with no useful information about the location of other interesting features in the region, such as putative regulatory elements. How can you build some annotation files so PipMaker will generate more informative results, and allow you to produce a nice figure for your paper?

One place to start is with the GenBank entry for a given sequence, which often contains annotations for known features, including the locations of genes and exons. The genbank2exons program can build an exons file directly from the GenBank flat file. However, this output identifies features with respect to the reference sequence for that GenBank entry, so the coordinates may not correspond to the correct positions of these genes and exons in your sequence. What to do?

If the reference sequence is an exact subsequence of your larger one (or vice-versa), one possibility is to use the shift-pos program, which can adjust all of the coordinates in the exons file by an offset you specify. So for instance, if your output file from genbank2exons was called "human_exons", you could fix it using the following command:

     shift-pos human_exons -add 144562 > new_exons

However, sometimes the relationship between the GenBank reference sequence and your genomic sequence is not so simple. For example, suppose you've assembled several overlapping contigs that each have their own numbering system (i.e., positions in each contig start with 1). In cases like this, the sim4 program (available separately) provides a more flexible and robust way to translate the coordinates. sim4 is an alignment program; it has an option to produce output in exons-file format, but you need to feed it mRNA sequences to align with the genomic sequence. You can get these using the exons2mrna tool, which turns the output from genbank2exons into putative mRNA sequences for sim4. So in this case, the steps would be:

  1. Run genbank2exons on each GenBank flat file to get a list of its annotated genes and exons. (You can obtain these GenBank files from the Entrez Nucleotide database by specifying "GenBank" and "Plain Text" for the format.) If your file was called "genbank_file" and you saved the output as "exons_file", this would look like:
         genbank2exons genbank_file > exons_file
    

  2. Next, run exons2mrna to generate a file of putative mRNA sequences corresponding to the genes in each exons file. You also need to provide the reference sequence from the GenBank entry (which you can obtain from Entrez by specifying "FASTA" and "Plain Text"). If your reference sequence was called "ref_seq" and you saved the results as "mrna_file", this would look like:
         exons2mrna exons_file ref_seq > mrna_file
    
    Note that if the endpoints of the coding sequences are available in the GenBank entry, they will be added to the FASTA header lines in the output.

  3. Now the mRNA sequences can be located in your genomic sequence using sim4. The "A=5" option produces output that is formatted for use as a PipMaker exons file. Note that alignment of files containing multiple sequences (e.g., mRNAs) requires that the genomic sequence is listed before the mRNA file, otherwise only the first mRNA sequence will be aligned. Both of the input files must be in FASTA format.
         sim4 genomic_seq mrna_file A=5 > exons_file2
    

  4. The individual exons files that represent the overlapping GenBank contigs can now be concatenated into one large exons file, e.g., using your favorite text editor. (In general you need to be careful when concatenating exons files to make sure you don't end up with extra title lines buried in the middle of the resulting file, but in this case it's no problem because sim4 doesn't put title lines in its exons-style output.)

Building an Exons File from Genscan Predictions

Now suppose that when you submit your sequences and exons file to PipMaker, the resulting pip shows some unexpected regions of conservation. Are you sure they are not genes? Perhaps they just hadn't been identified when the GenBank sequence was annotated.

You might want to try a Genscan analysis of these regions, to see if it thinks they look like genes. And to get the most out of the results, it would be helpful to map the coordinates of Genscan's predictions onto the pip. One way to do this is to use the genscan2exons program to convert your Genscan output into a PipMaker exons file:

     genscan2exons genscan_output > genscan_exons
This is also handy for generating a putative exons file when there are no GenBank annotations for your sequence.

Building an Underlay File

PipMaker's color underlay feature allows you to highlight arbitrary regions of a pip with colors of your choosing. Among other things, it can be used to compare similar information from different sources. For example, if you already have an exons file (perhaps one that you built from GenBank entries) and want to show Genscan predictions on the pip simultaneously for comparison, you can display the predictions as color underlays to avoid conflict with the exons file. The genscan2underlays program converts Genscan output into a PipMaker underlay file, using the colors you specify to represent promoters, exons, and polyA sites on the forward and reverse strands.

     genscan2underlays genscan_output -fprom LightGreen -fexon LightBlue
                       -fplya LightRed -rprom LightGreen -rexon LightBlue
                       -rplya LightRed > genscan_colors
(Note that these example commands have been wrapped for easier reading, but in practice you would type them all on one line.)

You can also generate an underlay file from an exons file, using the exons2underlays program. In this case you specify colors to represent exons, UTRs, and introns on the forward and reverse strands (unlike genscan2underlays, this program will assume that colors you omit should be the same as the ones you specify).

     exons2underlays exons_file2 -fexon LightBlue -futr LightOrange
                     -intron LightYellow > exon_colors

You may have noticed that there are two ways to generate an underlay file from Genscan output: (1) run genscan2exons followed by exons2underlays, or (2) run genscan2underlays directly. Generally the latter is preferable, since it allows you to highlight promoters and polyA sites in addition to exons. The exons2underlays program is primarily intended for use with exons files from other sources, especially those containing CDS information so the UTRs can be highlighted (e.g., exons files built from GenBank entries).

Converting Annotations from One Aligned Sequence to the Other

PipMaker draws exons, underlays, and other annotations relative to the first (top) sequence. This means that if you want to flip the whole comparison over and look at it with the mouse as the primary sequence, you need to map the positions in your annotation files (exons, underlays, etc.) from the human sequence to their aligning positions in the mouse sequence. A tedious task by hand, but the transform-pos program makes it much easier.

  1. Make a preliminary PipMaker run (unflipped) to get an alignment file in either concise or lav format.

  2. Run transform-pos on each of your annotation files, e.g.:
         transform-pos exons_file2 lav_file > mouse_exons
    
    The result may not have all of the annotations if some coordinates do not align with the other sequence, but the program will warn you about these trouble spots, and you'll have a good start toward an annotated sequence that you can refine to your liking.

  3. Run RepeatMasker on the mouse sequence and generate a new pip.

Note that if the original pip was made with a fragmented second sequence, you will have to split it up into separate pips, because (1) transform-pos does not support multi-contig alignments, and (2) when you exchange the sequences, PipMaker does not currently allow the first sequence to be fragmented anyway. This means that each separate alignment file from step 1 will be smaller, and so in step 2 there will be more annotations that cannot be translated using that file (because they align with some other contig). You will need to sift through the output to determine which "failures" are real (a dotplot view of the original pip, provided by PipMaker and/or Laj, could be very helpful here).

In a related situation, you might have annotation files for an alignment's second sequence, but not for the first. In this case you could use transform-pos with the "inverse" option to map the annotation coordinates from the second sequence onto the first one. In our example, you would supply a file of, say, exons in the mouse sequence:

     transform-pos mouse_exons lav_file -inverse > human_exons

Identifying Highly Conserved Regions

To complete the analysis of sequence conservation in a genomic alignment, you might like to know the positions of all elements that meet specified conservation thresholds. We have developed the strong-hits tool for this purpose. It examines an lav or concise alignment file from PipMaker, and reports all regions that meet the requirements you stipulate for the minimum length of ungapped alignment and percent identity. Furthermore, you can provide a file containing a list of particular regions, and the output will indicate if the conserved elements it finds overlap with any of the intervals in your file. So for example, you could supply an exons file in order to find out which of the strong hits are exons, and (of more interest) which are not exons and may therefore represent putative regulatory elements.

     strong-hits lav_file -len 100 -pct 70
                 -overlaps exons_file2 -exons-only > conserved_elements

You could then use this information to create a color underlay file, using different colors to indicate the percent identity levels, e.g., 70-79% LightPink, 80-89% LightRed, and 90-100% DarkRed. When this information is added to an underlay file describing genes, you will have a description of the alignment that is quite complete, highlighting the positions of both genes and conserved non-coding elements.



Cathy Riemer, Laura Elnitski, and Matt Weirauch, November 2001