Input File Formats for Laj



Introduction

This page describes the input files supported by Laj and their required formats. Except where noted, all information applies to both the stand-alone and applet modes of Laj. Any of these files may be compressed with gzip, provided the file name ends with ".gz".

All files must consist solely of plain text characters.
(For example, no Word documents.)
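
As a rough illustration of the two rules above, the following Python sketch opens an input file the way a helper script might when preparing files for Laj. The function name open_laj_text is hypothetical, and treating "plain text" as 7-bit ASCII is an assumption made here, not something Laj documents.

```python
import gzip


def open_laj_text(path):
    """Open an input file as text, decompressing it when the name ends in .gz.

    Assumption: "plain text" is taken to mean ASCII, so reading a file with
    other bytes (for example a Word document) raises UnicodeDecodeError.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="ascii")
    return open(path, "r", encoding="ascii")
```

Reading the whole file once through this helper is a cheap way to catch a file that was saved in the wrong format before handing it to Laj.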

These formats are almost identical to those used by the PipMaker server, and you'll find that the utility programs in our PipTools collection can greatly facilitate the preparation of your files.  

Alignments

This is the main data file that contains the alignments you want to view, in lav format (e.g., output from blastz). The applet form of Laj is only intended to display one such file at a time, but the stand-alone form has the capability to display two of these simultaneously, and designates them as "primary" and "secondary" files. In this case, both files must be based on the same sequences and should also cover the same region.

You can obtain an alignment file in this format by submitting sequences to our PipMaker server and requesting "raw blastz output" on the Advanced PipMaker form. Note that PipMaker sends back your results as email attachments, in this case using a quoted-printable MIME format. Make sure this gets decoded into true plain text when saving the attachment, or Laj is likely to report errors due to the not-quite-right file format.  

Sequences

Laj also needs the original sequences that were aligned, in order to display the text view of the alignments in the bottom panel. By default these are "hidden" input files, in the sense that their names are specified within the alignment file rather than being supplied directly on the command line or as applet parameters. Laj will normally look for these files in the same directory as the alignment file. They should be in the same FASTA format expected by PipMaker, which looks like this:

     >Sequence name and arbitrary header text on one line
     ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
     GCGATCGATGTGCTAGATCAGATGACA
     ... etc.

At the present time, our software supports only the letters A, C, G, T, N, and X in the sequence (and their lowercase versions, if you are using Advanced PipMaker's user-controlled masking). For maximum interoperability, the sequence data should consist of short lines limited to about 70 characters, and it is generally best to keep the header line to a reasonable length as well. The first sequence file must contain a single contiguous sequence, but the second file can contain multiple unordered contigs (each with its own header line).
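
The restrictions above can be checked before running Laj. The Python sketch below is only an illustration: the name check_fasta is hypothetical, the 70-character limit is treated as a soft guideline, and skipping blank lines is an assumption rather than documented PipMaker behavior.

```python
ALLOWED = set("ACGTNXacgtnx")


def check_fasta(lines, max_line=70):
    """Return (problems, header_count) for an iterable of FASTA text lines.

    problems is a list of (line_number, message) tuples.
    """
    problems = []
    headers = 0
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            headers += 1
            continue
        if not line.strip():
            continue  # assumption: blank lines are harmless
        if headers == 0:
            problems.append((number, "sequence data before any header line"))
        bad = sorted(set(line) - ALLOWED)
        if bad:
            problems.append((number, "unsupported characters: " + "".join(bad)))
        if len(line) > max_line:
            problems.append((number, "line longer than %d characters" % max_line))
    if headers == 0:
        problems.append((0, "no header line found"))
    return problems, headers
```

For the first sequence file, a header count greater than one indicates multiple contigs, which is only acceptable in the second file.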

If you obtained your alignment file from PipMaker, be aware that it will have substituted its own names for your sequence files in the blastz output. In stand-alone mode Laj will ask you for the correct names as needed, but if you want to set it up as an applet you will need to supply them in advance. Previous versions of Laj required editing the alignment file to fix this, but now it supports optional parameters that allow you to override the names in the alignment file.  

Exons

This file lists the locations of genes, exons, and coding regions in the first aligned sequence, in the same format expected by PipMaker. The directionality of a gene (">", "<", or "|"), its start and end positions, and name should be on one line, followed by an optional line beginning with a "+" character that indicates the first and last nucleotides of the translated region (including the initiation codon, Met, and the stop codon). These are followed by lines specifying the start and end positions of each exon, which must be listed in order of increasing address even if the gene is on the reverse strand ("<"). By default PipMaker and Laj will supply exon numbers, but you can override this by specifying your own name or number for individual exons. Blank lines are ignored, and you can put an optional title line at the top. Thus, the file might begin as follows:

     My favorite genomic region

     < 100 800 XYZZY
     + 150 750
     100 200
     600 800

     > 1000 2000 Frobozz gene
     1000 1200 exon 1
     1400 1500 alt. spliced exon
     1800 2000 exon 2

     ... etc.

Several of the PipTools programs (including genbank2exons, genscan2exons, and blast2exons) can help you build an exons file.  
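
To make the exons format concrete, here is a minimal Python sketch of a reader for it. This is not Laj's own parser: the function name parse_exons and the dictionary layout are hypothetical, fields are assumed to be separated by whitespace, and any text before the first gene line is assumed to be the optional title.

```python
def parse_exons(lines):
    """Parse exons-file lines into a list of gene dictionaries."""
    genes = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue  # blank lines are ignored
        if fields[0] in (">", "<", "|"):
            genes.append({
                "strand": fields[0],
                "start": int(fields[1]),
                "end": int(fields[2]),
                "name": " ".join(fields[3:]),
                "cds": None,     # translated region, if a "+" line follows
                "exons": [],
            })
        elif not genes:
            continue  # assumption: text before the first gene is the title
        elif fields[0] == "+":
            genes[-1]["cds"] = (int(fields[1]), int(fields[2]))
        elif fields[0].isdigit():
            start, end = int(fields[0]), int(fields[1])
            exons = genes[-1]["exons"]
            # Exons must be in increasing order, even on the reverse strand.
            if exons and start < exons[-1][0]:
                raise ValueError("exons out of order: %r" % raw.strip())
            exons.append((start, end, " ".join(fields[2:]) or None))
        else:
            raise ValueError("unrecognized line: %r" % raw.strip())
    return genes
```

Applied to the sample file above, this yields two genes; the first has a translated region of (150, 750) and two unnamed exons, and the second has three exons carrying the names given in the file.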

Repeats

This file lists interspersed repeats and other features in the first aligned sequence. The first line tells PipMaker that this is a simplified repeats file (as opposed to RepeatMasker output); it is ignored by Laj, which only accepts this simplified format. Each subsequent line specifies the start, end, direction, and type of a particular feature.

     %:repeats

     1081 1364 Right Alu
     1365 1405 Simple
     ... etc.

The allowed types are: Alu, B1, B2, SINE, LINE1, LINE2, MIR, LTR, DNA, RNA, Simple, CpG60, CpG75, and Other. Of these, all except Simple, CpG60, and CpG75 require a direction (Right or Left).

One of the PipTools programs, called rmask2repeats, translates the first part of the output from RepeatMasker to this format automatically. The input you supply to this program is the same mask file you would submit to PipMaker. Laj does not automatically locate CpG islands, but it can display them in the features panel like PipMaker does, if they are included at the end of your repeats file. You can create these entries easily using another PipTools program, find-cpg.  
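
The rules for repeat lines can be expressed as a short check. The Python sketch below is an illustration under stated assumptions: the function name parse_repeat_line is hypothetical, fields are assumed to be whitespace-separated, and a direction on Simple, CpG60, or CpG75 is tolerated because the text above only says it is not required.

```python
REPEAT_TYPES = {
    "Alu", "B1", "B2", "SINE", "LINE1", "LINE2", "MIR",
    "LTR", "DNA", "RNA", "Simple", "CpG60", "CpG75", "Other",
}
UNDIRECTED = {"Simple", "CpG60", "CpG75"}


def parse_repeat_line(line):
    """Return (start, end, direction, type); direction is None when absent."""
    fields = line.split()
    if len(fields) == 3:
        start, end, kind = fields
        direction = None
    elif len(fields) == 4:
        start, end, direction, kind = fields
    else:
        raise ValueError("expected 3 or 4 fields: %r" % line)
    if kind not in REPEAT_TYPES:
        raise ValueError("unknown repeat type: %r" % kind)
    if direction is None and kind not in UNDIRECTED:
        raise ValueError("%s requires a direction (Right or Left)" % kind)
    if direction is not None and direction not in ("Right", "Left"):
        raise ValueError("bad direction: %r" % direction)
    return int(start), int(end), direction, kind
```

A caller would skip the "%:repeats" header and blank lines, then pass each remaining line to this function.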

Annotations

This file contains reference annotations, i.e., links to web sites providing information about particular regions in the first aligned sequence, which are drawn as colored bars. The applet form of Laj actually brings up these sites when the user clicks on the bars, but the stand-alone form does not, since there is no web browser involved. The format first defines various types of hyperlinks and associates a color with each of them, then specifies the type, position, description, and URL for each annotated feature. This is almost identical to the format accepted by PipMaker (see below for an exception), but it is a change from the format originally used by Laj, which is no longer supported.

     # annotations for part of the mouse MHC class II region

     %define type
     %name PubMed
     %color Blue

     %define type
     %name LocusLink
     %color Orange

     %define annotation
     %type PubMed
     %range 1 2000
     %label Yang et al. 1997.  Daxx, a novel Fas-binding protein...
     %summary Yang, X., Khosravi-Far, R. Chang, H., and Baltimore, D. (1997).
       Daxx, a novel Fas-binding protein that activates JNK and apoptosis.
       Cell 89(7):1067-76.
     %url http://www.ncbi.nlm.nih.gov:80/entrez/
     query.fcgi?cmd=Retrieve&db=PubMed&list_uids=9215629&dopt=Abstract

     ... etc.

Here, for example, the first stanza requests that each feature subsequently identified as a PubMed entry be colored blue. The name must be a single word, perhaps containing underscore characters (e.g., Entry_in_GenBank), and the color must come from Laj's color list.

The third stanza associates a PubMed annotation with positions 1-2000 in the first sequence. The label should be kept fairly short, as it will be displayed on Laj's position indicator line when the user points at this annotation. The summary is optional; it is used only by PipMaker and will be ignored by Laj. Note that summaries and URLs (but not labels) can be broken into several lines for convenience; the line breaks are removed when the file is read, but they are not replaced with spaces. Thus a continuation line for a summary typically begins with a space to separate it from the last word of the previous line, while a URL continuation does not.

Also note that stanzas should be separated by blank lines, and lines beginning with a "#" character are comments that will be ignored. The annotations can appear in the file in any order, and several can overlap at the same position with no problem, since Laj will display them in multiple rows if necessary.  

The difference between the annotation formats supported by PipMaker and Laj is that PipMaker allows several summary/URL pairs within a single annotation, while Laj expects each field to occur at most once. If Laj encounters extra URLs, it will just use the first one and display a warning message.  
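
The continuation and duplicate-field rules are easy to get wrong, so the following Python sketch shows one way to read the stanzas. It is an illustration, not Laj's implementation: the function name parse_annotations is hypothetical, directive lines are assumed to start with "%" in the first column, and the warning Laj displays for extra URLs is not reproduced.

```python
def parse_annotations(lines):
    """Split annotation-file lines into a list of {field: value} stanzas."""
    stanzas = []
    current = None   # stanza being filled, or None between stanzas
    last = None      # (dictionary, key) of the most recent field
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("#"):
            continue                      # comment line
        if not line.strip():
            current, last = None, None    # a blank line ends the stanza
            continue
        if line.startswith("%"):
            key, _, value = line[1:].partition(" ")
            if current is None:
                current = {}
                stanzas.append(current)
            # Keep only the first occurrence of a field, as Laj does for
            # URLs; later duplicates go into a throwaway dictionary.
            target = current if key not in current else {}
            target[key] = value
            last = (target, key)
        else:
            if last is None or last[1] not in ("summary", "url"):
                raise ValueError("unexpected continuation: %r" % line)
            target, key = last
            target[key] += line           # joined with no space inserted
    return stanzas
```

Because continuation lines are appended verbatim, a summary continuation that begins with a space keeps its word separation, while a URL continuation with no leading space rejoins into a single unbroken address.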

Underlays

This file specifies color underlays, i.e., colored bands to be painted on the percent identity plot. Currently there are two different formats for this information: the regular format accepted by both PipMaker and Laj, and an additional labeled one that is only used by Laj. The regular format looks like this:

     # sample underlays for the BTK region

     LightYellow Gene
     Green Exon
     Red Strongly_conserved

     35324 72009 Gene
     49781 49849 Exon
     51403 51484 Exon
     50350 50513 Strongly_conserved +
     52376 52603 Strongly_conserved
     ... etc.

The first group of lines describes the intended meaning of the colors, while the second group specifies the location of each band. Colors must come from Laj's color list, but the meaning of each color can be any single word chosen by you. A "+" or "-" character at the end of a location line will paint just the upper or lower half of the band, respectively. This allows you to differentiate between the two strands, or to plot potentially overlapping features like gene predictions and database matches. Note that if two bands overlap, the one that was specified last in the file appears "on top" and obscures the earlier one (except for the special Hatch color). Thus in this example, the green exons and red strongly conserved regions cover up parts of the long yellow band representing the gene. As in the annotations file, lines beginning with a "#" character are comments that will be ignored.

The second format is similar to the first one, but it allows you to specify a label for each color band which will be displayed on Laj's position indicator line when the user points the mouse at that band. The color definition lines are the same as for the regular format, but the location lines look like this:

     35324 72009 (Here is one label) Gene
     50350 50513 (Here is another one) Strongly_conserved +

An underlay file for Laj can contain a mixture of these two formats (i.e., the label is optional). The parentheses must be present if the label is, and the label itself cannot contain any additional parentheses. (Note that the dummy item formerly required by this format is no longer necessary; it is still supported for your old files, but its use is discouraged.)

One common use of the underlay feature is to mark predictions made by Genscan. To facilitate this, the PipTools collection includes a program called genscan2underlays that translates a Genscan output file to either of these underlay formats automatically. Another of the tools, exons2underlays, creates underlays that correspond to your exons file.  
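
If you generate underlay files with your own scripts, a quick check of both formats can save a round trip through Laj. The Python sketch below is an illustration under stated assumptions: the names parse_underlays, COLORS, and LOCATION are hypothetical, each meaning is assumed to be defined before the first band that uses it, and the legacy dummy item is not handled.

```python
import re

# Color names from Laj's color list, plus the special Hatch underlay color.
COLORS = {"Black", "White", "Hatch"} | {
    prefix + base
    for base in ("Gray", "Red", "Green", "Blue", "Yellow",
                 "Pink", "Cyan", "Purple", "Orange")
    for prefix in ("", "Light", "Dark")
}

# start end [(label)] meaning [+|-]
LOCATION = re.compile(
    r"^(\d+)\s+(\d+)\s+(?:\(([^()]*)\)\s+)?(\S+)(?:\s+([+-]))?$"
)


def parse_underlays(lines):
    """Return (colors, bands) parsed from underlay-file lines.

    colors maps each meaning to its color name; bands is a list of
    (start, end, label, meaning, half) tuples in file order.
    """
    colors, bands = {}, []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = LOCATION.match(line)
        if match:
            start, end, label, meaning, half = match.groups()
            if meaning not in colors:
                raise ValueError("undefined meaning: " + meaning)
            bands.append((int(start), int(end), label, meaning, half))
        else:
            fields = line.split()
            if len(fields) != 2 or fields[0] not in COLORS:
                raise ValueError("bad underlay line: " + line)
            colors[fields[1]] = fields[0]
    return colors, bands
```

Keeping the bands in file order matters, since a later band is painted over an earlier one wherever they overlap.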

Highlights

This file is analogous to the underlay file, but it specifies colored regions for the text view rather than for the pip. The format is nearly identical to the underlay file, except that instead of specifying "+" or "-", you give a number "1", "2", or "3" to indicate the row of text where the highlight should be painted. If you don't provide a number, all three rows will be highlighted. Just as with underlays, labels can be included which will be displayed when the user points at the highlight, and highlights that are listed later in the file will cover up those that appear earlier. However, the Hatch color is not supported for highlights.

If you do not specify a highlight file, Laj will automatically provide default highlights based on the exons file. These will be placed in the top row (since the exons were specified for the first sequence), and different colors will be used to indicate the forward or reverse strand. If the exons file specifies a gene's translated region, then the 5´ and 3´ UTRs will be shaded using lighter colors. These default highlights make it easy to examine the putative start/stop codons and splice junctions, as well as providing a visual connection between the graphical and text views. But if for some reason you do not want any text highlights, you can suppress them by specifying an empty highlight file.  

Color List

For Laj, the available colors are:

    Black   White
    Gray    LightGray    DarkGray
    Red     LightRed     DarkRed
    Green   LightGreen   DarkGreen
    Blue    LightBlue    DarkBlue
    Yellow  LightYellow  DarkYellow
    Pink    LightPink    DarkPink
    Cyan    LightCyan    DarkCyan
    Purple  LightPurple  DarkPurple
    Orange  LightOrange  DarkOrange

These names are case-sensitive (i.e., capitalization matters). As of this writing, these are the same colors supported by PipMaker, but be aware that the appearance of the colors may vary between PipMaker and Laj, and from one printer or monitor to the next.

Hatch

In addition to the regular colors listed above, Laj supports a special "color" for underlays called Hatch, which is drawn as a pattern of diagonal gray lines. Normally if two underlays overlap, the one that was specified last in the file appears "on top" and obscures the earlier one. However, Hatch underlays have the special property that they are always drawn after the other colors, and since the space between the diagonal lines is transparent, they allow the other colors to show through. Currently Hatch is only supported for underlays, not for highlights or hyperlink annotations.



Cathy Riemer, July 2002