Tffind Documentation (release version 1.1)

Matt Weirauch (weirauch@soe.ucsc.edu)
Nov 3, 2003


What is Tffind?

     Tffind is a computer program written in C that locates
patterns of transcription factor binding sites within a
MultiPipMaker multiple alignment file or a file in FASTA
format.  It is available for free download at
http://globin.cse.psu.edu, along with other programs for
manipulating MultiPipMaker input and output.

What Do I Need to Run Tffind?

     Tffind currently runs on Unix and Linux platforms.  To
run it, all you need is a Unix or Linux machine, a textual
alignment file from MultiPipMaker or a file in standard
FASTA format, and a transcription factor binding site matrix
file in either TransFac MATRIX or IMD format.  The formats
of these files are described below.

Patterns of Transcription Factor Binding Sites in Tffind

     Patterns in Tffind are built from two units: the motif
and the spacer.  A motif represents one transcription factor
binding site in the pattern.  Each motif is represented by a
position weight matrix as calculated from the given matrix
file.  From this PWM, a score is calculated for the current
position in the alignment or sequence.  If this score meets
a certain threshold, then the remainder of the pattern is
checked to see if the entire pattern matches at the current
position.  If the entire pattern matches, the position is
recorded as a "hit."
     The other part of a pattern is the spacer.  Spacers are
used to orient motifs in relation to one another along the
sequence or alignment.  Spacers can take on one of two
forms.  The spacer "<50>" means to look for the next motif
in the pattern exactly 50 bps away from the previous motif.
The spacer "<10,50>" means to look for the next motif
anywhere in the window of 10 to 50 bps away from the
previous motif.  By using motifs and spacers, patterns of
transcription factor binding sites can be constructed.  For
example, the pattern "MZF1 <0,100> AML-1a" searches for the
MZF1 binding site with the AML-1a binding site following
anywhere from 0 to 100 bps away.  The spacer "<0>" is the
same as having no spacer at all.  Patterns in Tffind can be
arbitrarily large or small.

Input of Tffind

     To run Tffind in interactive mode, type "tffind
AlignmentFile MatrixFile".  Provided that there are no
problems (e.g., the program has compiled successfully), you
should be greeted with the main menu screen.  Tffind can
also be run in batch mode (type "tffind" at the command line
to see the syntax).  If you wish to redirect the output from
the screen to a file, append "> outputfile" to the end of
the command.  The options for Tffind are discussed in more
detail later.

The Alignment File

     Before discussing the program itself, we will first
briefly discuss the format of Tffind's alignment or sequence
inputs.  Tffind accepts two formats of input:  MultiPipMaker
textual alignment format and FASTA format.
     A textual alignment file contains the results of a
MultiPipMaker alignment between 2 or more sequences.  The
format of the alignment file is the exact same format as the
textual output of MultiPipMaker.  To receive output in this
format, check the box labeled "Generate very verbose text
(ASCII, compressed)" on the MultiPipMaker page, which can
also be reached via the Penn State Globin Gene Server,
http://globin.cse.psu.edu.  A sample textual alignment file
is shown in Figure B.1.

____________________________________________________________

3 57

Human
Mouse
Rat

AGATGAGAGTGATAAGAGAGAAAGA---GATAGGGGATGATAACACTTCCGGAAGTG
AGGGAGAGATGATAAGAGAGAAAGAAAAGATAGGGGATGATAACACTTCCGGAAGTG
AGGGAGAGATTTTTTGAGAGAGAGAAAAGATAGGGGATGATAACACTTCCGGAAGTG

Figure B.1: The format of a textual multiple alignment file.
The first two numbers indicate the number of sequences in
the alignment and the length (in base pairs) of the
alignment, respectively.  Next is the name of each
sequence, one per line.  Finally, the alignment itself is
given.  Gaps in the alignment are represented by the
character '-'.
____________________________________________________________
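
     As an illustration only (this is not Tffind's own code,
which is written in C), the layout described in Figure B.1
could be read with a short Python sketch such as the one
below.  The function name parse_alignment and the small
two-sequence sample are hypothetical, and the sketch assumes
that blank lines serve only as separators.

```python
def parse_alignment(text):
    """Parse a textual alignment laid out as in Figure B.1.

    Returns (names, rows), where rows[i] is the aligned row for names[i].
    """
    # Blank lines are treated purely as separators (an assumption).
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Header line: number of sequences, then alignment length.
    n_seqs, length = (int(x) for x in lines[0].split())
    names = lines[1:1 + n_seqs]
    rows = lines[1 + n_seqs:1 + 2 * n_seqs]
    if len(rows) != n_seqs or any(len(r) != length for r in rows):
        raise ValueError("alignment rows do not match the header line")
    return names, rows


# Hypothetical toy input in the same layout as Figure B.1.
sample = """2 10

Human
Mouse

ACGT--ACGT
ACGTTTACGT
"""
names, rows = parse_alignment(sample)
```

     The length check is there because every row of an
alignment, gaps included, should span the same number of
columns as the header line declares.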


     The second format accepted by Tffind is standard FASTA
format.  Files in this format are accepted so that users can
search in single sequences in addition to alignments of
sequences.  FASTA format is essentially a single line
description, followed by the sequence itself.  See Figure
B.2 for a sample.
     For both types of input, only the characters A, C, G,
and T are recognized.  Other characters (a, c, g, t, N, n,
X, x, -, etc.) are given a score of negative infinity.
Therefore, it is impossible for a single binding site to be
reported as a hit if it contains any of these characters.
To force Tffind to treat lowercase nucleotides as if they
were uppercase, use option 9 (-lc for the command-line
version).

____________________________________________________________

>Description of Sequence
GAGAATCGATCGACTACGATCAGCTAGCACAGCGGCGGCCTAGCGATCAGCGACTACTAG
CGACTACGACGATCAGCATCAGCACTAGCAGCATCAGCTACAGCGACTACAGCGACTACT
AGCAGCTACGACTAGCACTACGACGGCCCTAGCATCGCA

Figure B.2: The format of a FASTA file.
A sequence in FASTA format begins with a single-line
description, followed by lines of sequence data. The
description line is distinguished from the sequence data by
a greater-than (">") symbol in the first column.
____________________________________________________________


The Matrix File

     A file containing the transcription factor binding site
position weight matrices is required to run Tffind.  Tffind
accepts files in both Transfac and IMD format.  The format
of IMD matrix files is shown in Figure C, and the Transfac
format is shown in Figure D.  When choosing between matrix
files, it is important to keep in mind that different files
may cause the program to report different results.  A
specific entry in one database may in fact have different
values in the other database, which in turn may alter the
results of the search.
____________________________________________________________

NF-kB 11.54 17.55  GGGANTTYCC  M00001
A |   3   0   0  60  25   4   0   0   5   0
C |   0   0   0   2  29   1  18  32  73  77
G |  71  78  76  16   6   0   0   0   0   0
T |   4   0   2   0  18  73  60  46   0   1
vHNF-1 11.02 11.04  GTTAAT  M00002
A |   0   0   0   3   3   0
C |   0   0   0   0   0   0
G |   3   0   0   0   0   0
T |   0   3   3   0   0   3

Figure C: The format of an IMD matrix database entry
Starting at the top line, the left-most entry is the name of
the T.F. binding site.  The next two values are not used by
Tffind.  Fourth is the binding site's consensus sequence,
and last is the IMD database accession number.  The next
four lines contain the count of each of the four nucleotides
at each position in the transcription factor binding site.
The first column gives the count of each nucleotide at the
first position in the binding site, and so on.
____________________________________________________________
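
     For readers who want to inspect or generate IMD-style
files, the layout in Figure C can be read with a sketch like
the following.  It is a hypothetical helper, not Tffind's
parser, and it assumes that every entry is exactly five
lines long (a header plus the A, C, G, and T rows) and that
binding site names contain no whitespace.

```python
def parse_imd(text):
    """Parse IMD-style entries (Figure C) into a dict keyed by accession."""
    entries = {}
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines), 5):
        # Header: name, two values unused by Tffind, consensus, accession.
        name, _unused1, _unused2, consensus, accession = lines[i].split()
        counts = {}
        for row in lines[i + 1:i + 5]:
            base, values = row.split("|")
            counts[base.strip()] = [int(v) for v in values.split()]
        entries[accession] = {
            "name": name,
            "consensus": consensus,
            "counts": counts,
        }
    return entries


# The vHNF-1 entry from Figure C.
sample = """vHNF-1 11.02 11.04  GTTAAT  M00002
A |   0   0   0   3   3   0
C |   0   0   0   0   0   0
G |   3   0   0   0   0   0
T |   0   3   3   0   0   3
"""
entries = parse_imd(sample)
```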


____________________________________________________________

AC   M00077
XX
ID   V$GATA3_01
XX
DT   24.04.1995 (created); hiwi.
DT   16.10.1995 (updated); ewi.
XX
NA   GATA-3
XX
DE   GATA-binding factor 3
XX
BF   T00311 GATA-3; Species: human, Homo sapiens.
XX
PO      A      C      G      T
01     16     14     24      9      N
02     26     13      9     15      N
03      0      0     63      0      G
04     62      0      0      1      A
05      1      0      3     59      T
06     35      2     10     16      A
07     17      5     33      8      R
08     10     11     26     16      N
09     15      7     34      7      G
XX
BA   63 selected binding sequences
XX
RN   [1]
RX   MEDLINE; 93309433.
RA   Merika M., Orkin S. H.
RT   DNA-binding specificity of GATA family t.f.s
RL   Mol. Cell. Biol. 13:3999-4010 (1993).
XX

Figure D: The format of a Transfac matrix database entry
The following fields are relevant to Tffind:
     AC   Accession number
     ID   Transfac identification number
     NA   Name of binding factor
     PO   Position in the T.F. binding site, followed by the
          frequency of A, C, G, and T, respectively

The last column in the "PO" table contains the deduced
consensus nucleotide at each position.
____________________________________________________________
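
     The fields listed in Figure D can be pulled out of a
Transfac-style entry with a sketch along these lines.  The
function name parse_transfac_entry is hypothetical, and the
sketch assumes a single entry whose count table starts at
the "PO" line, with the nucleotide order taken from that
header line.

```python
def parse_transfac_entry(text):
    """Extract the AC, ID, NA fields and the PO count table of one entry."""
    entry = {"AC": None, "ID": None, "NA": None, "counts": {}}
    order = None  # nucleotide order, read from the PO header line
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag in ("AC", "ID", "NA"):
            entry[tag] = " ".join(fields[1:])
        elif tag == "PO":
            order = fields[1:5]
            entry["counts"] = {base: [] for base in order}
        elif order is not None and tag.isdigit():
            # Table row: position, four counts, then the consensus letter.
            for base, value in zip(order, fields[1:5]):
                entry["counts"][base].append(int(value))
    return entry


# The first three table rows of the entry shown in Figure D.
sample = """AC   M00077
XX
ID   V$GATA3_01
XX
NA   GATA-3
XX
PO      A      C      G      T
01     16     14     24      9      N
02     26     13      9     15      N
03      0      0     63      0      G
XX
"""
entry = parse_transfac_entry(sample)
```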


Using Tffind: Features and Options

     Each of Tffind's parameters and features is explained
in this section.  Through these, the user builds the
transcription factor binding site pattern and sets all of
the parameters associated with it.  The pattern and
parameters are then used to search the alignment or sequence
file for matches.  Each parameter has a default value
associated with it, which can be changed at any time.
However, it is important to keep in mind that for some
parameters, it is not only important what you change them
to, but also when you change them (parameters for which this
is true are noted.)  Tffind can be run both interactively
and in batch mode.  In the more user-friendly interactive
mode, as you alter the parameters and build your pattern, an
updated summary of the current pattern and parameter
settings is displayed in the main menu.  For a view of the
main menu, see Figure E.  The following discussion focuses
on interactive mode, but most comments apply to batch mode
as well.  For a summary of the batch mode syntax,
type "tffind" at the command-line.  For a more in-depth
discussion of the batch mode parameters, see the text file
entitled "tffind_command_line_options.txt".

____________________________________________________________

=======================================================================

                          MAIN MENU
                          ---- ----

N...Add motif to current pattern by name
P...Add motif to current pattern by consensus pattern
A...Add motif to current pattern by access number
I...Add motif to current pattern by i.d. (TransFac only)
S...Add spacer to current pattern
D...Delete previous sub-pattern
M...Print main menu
Q...QUIT PROGRAM (without finding pattern)
F...FIND PATTERN

1...List available t.f. binding sites in current matrix database
2...Change current matrix database
3...Change minimum number of sequences that must match pattern
4...Change cutoff value for score function
5...Search for pattern only within a range of the alignment or sequence
6...Choose which sequences to search on in the alignment
7...Toggle exons option
8...Toggle which sequence positions are in relation to
9...Toggle between masking lowercase nucleotides
0...Toggle between output formats (TFFIND and GFF)


Current pattern : {gata-1[0.85]: IMD} <0,100> {gata-1[0.85]: IMD}
Current alignment file : hmr1
Current matrix database : IMD Database
Minimum number of sequences that must match : 3 out of 3
Cutoff value for score function : 0.85
Range of search : Entire Sequence
Search in sequences :   0  1  2
Filter out regions overlapped by exons file? : no
Mask out lowercase nucleotides? : yes
Report sequences in relation to the sequence the match occurs on.
Output format : TFFIND format

=======================================================================

Please enter your menu choice.

Figure E: The Tffind interface.
____________________________________________________________


Add a motif to the current pattern by name (option N, -n)

     Use this option to start building a pattern.  After you
add a binding site, its name, score cutoff value, and which
database it was taken from are added to the current pattern.
As the pattern is built, it is updated in the main menu.
One feature of Tffind is that it allows you to look for
binding sites on the opposite strand from the rest of the
pattern.  To do this, append a '(c)' to the end of the
binding site's name.  See Figure F for an example of a match
to a pattern such as this.  Transcription factor binding
site names are case-insensitive.  If you cannot remember the
exact name of a specific binding site, simply use option 1,
and a list of all available binding site names, consensus
patterns, accession numbers, and Transfac i.d.s in the
current matrix database will be written to a file.  Also,
before adding a binding site to the pattern, be sure that
the settings for the "current matrix database" and the
cutoff value are correct.  If they are not, change them
using options 2 and 4, respectively.  One other important
thing to note is that in the Transfac database, some binding
sites have more than one entry, but with the same name.  In
cases such as this, Tffind simply uses the first binding
site it finds that matches the given name.  To avoid such
ambiguities, it is best to use the access number or Transfac
i.d. whenever possible.
____________________________________________________________

ACTATTGGTGGCACGCCAGAACACCTG        - DNA strand
|||||||||||||||||||||||||||
CAGCTGCCACCGTGCGGTCGGGTGGAC        + DNA strand


CAGCTGCCACCGTGCGGTCGGGTGGAC        Actual sequence data in
                                   alignment file.


Figure F: An example of a match to a pattern that covers 
both strands
Say you want to search for the Tal-1 binding site, followed
by exactly 10 base-pairs and then the e47 binding site on
the opposite strand (using default settings for database and
cutoff value).  To do this, construct the pattern
"{Tal-1[0.85]: IMD} <10> {e47(c)[0.85]: IMD}".   See the 
strand above to see what a match to this pattern would look
like on an actual strand of DNA.  The Tal-1 and e47(c) sites
are underlined in the figure.  In an actual alignment file,
only one strand (the forward strand) is given per sequence.
Upon searching for the given pattern, Tffind simply searches
for Tal-1, then exactly 10 bps, and finally the reverse
complement of e47, which is the same as looking on the
reverse strand.  The above sequence matches this pattern,
and as such would generate a hit.
____________________________________________________________
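
     The equivalence described in Figure F (searching the
forward strand for the reverse complement of a site is the
same as searching for the site on the reverse strand) can be
sketched as follows.  Both function names are hypothetical,
and reversing the count matrix is only one plausible way to
handle a "(c)" motif; the source does not say how Tffind
does it internally.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def reverse_complement_counts(counts):
    """Reverse-complement a count matrix given as {base: [counts...]}.

    The column order is reversed and each row is relabeled with its
    complementary base.
    """
    return {COMPLEMENT[base]: list(reversed(row))
            for base, row in counts.items()}


# vHNF-1 counts from Figure C (consensus GTTAAT).
vhnf1 = {
    "A": [0, 0, 0, 3, 3, 0],
    "C": [0, 0, 0, 0, 0, 0],
    "G": [3, 0, 0, 0, 0, 0],
    "T": [0, 3, 3, 0, 0, 3],
}
vhnf1_c = reverse_complement_counts(vhnf1)
```

     With this transformation, the best-scoring sequence for
the complemented matrix is ATTAAC, which is the reverse
complement of GTTAAT.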

Add a motif to the current pattern by consensus pattern
(option P, -p)

     The consensus pattern of a transcription factor binding
site is simply the sequence of nucleotides that receives the
highest possible score. In other words, the consensus
pattern of a binding site is the pattern that best matches
the binding site.  In certain cases, this sequence is
obvious.  For example, in Figure C, the consensus sequence
for binding site "vHNF-1" is "GTTAAT" because, looking at
the PWM, it is obvious that the highest scoring nucleotide
at position 1 is G, at position 2 is T, and so on.
     However, the consensus sequence for "NF-kB" (see also
Figure C) is not quite as simple.  The whole need for the
PWM arises because nucleotides are not always the same at a
given position in a binding site.  In order to express this
fact in the form of a single character, biologists use the
set of "ambiguous nucleotides."  In this set, for each of
the 15 possible non-empty subsets of {A,C,G,T}, there is a
standard symbol.  For example, one of the more common
ambiguous nucleotides is N, which stands for any one of A,
C, G, or T.  As such, the pattern NNNN matches any string of
exactly 4 nucleotides.  Other ambiguous nucleotides include
R, which stands for {A,G} and Y, which stands for {C,T}.
Using this set, the consensus sequence for "NF-kB" is
accordingly "GGGANTTYCC".  A Google search for the phrase
"ambiguous nucleotides" should provide a link to a table
defining them all.
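
     For convenience, the 15 standard symbols can be written
as a small lookup table.  The sketch below checks whether a
plain sequence is compatible with a consensus string; the
function name matches_consensus is hypothetical, and this
exact symbol-by-symbol test is only an illustration of the
notation, not the PWM scoring that Tffind actually uses.

```python
# Standard ambiguous nucleotide symbols and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches_consensus(consensus, seq):
    """True if every base of seq is allowed by the symbol at its position."""
    return len(consensus) == len(seq) and all(
        base in IUPAC[symbol] for symbol, base in zip(consensus, seq)
    )


ok = matches_consensus("GGGANTTYCC", "GGGACTTTCC")
bad = matches_consensus("GGGANTTYCC", "GGGACTTACC")  # A is not in Y = {C,T}
```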
     For further information on adding a motif by consensus
sequence, see the preceding section, "Adding a Motif by
Name."  All that is discussed for adding a motif by name
also applies to adding a motif by consensus sequence.

Add a motif to the current pattern by access number (option
A, -a)

     Another way to add a transcription factor binding site
to the pattern is by specifying its access number.  For both
the Transfac and the IMD databases, a TFBS's access number
looks something like "M00001".  In IMD files, the access
number is the final entry on the header line for a binding
site.  In Transfac files, the access number can be found on
the line labeled "AC".  Specifying by access number is
useful because each binding site matrix has a unique access
number, which is not necessarily true about names or
consensus patterns.
     For further information on adding a motif by access
number, see the earlier section on adding a motif by name.
All that is discussed for adding a motif by name also
applies to adding a motif by access number.

Add a motif to the current pattern by Transfac i.d. (option
I, -i)

     Transfac entries also are indexed by a unique Transfac
i.d.  Such i.d.s can be found in a Transfac matrix file on
lines labeled "ID", and are of the form "V$MYOD_01", where
"MYOD" is an abbreviation of the transcription factor's
name, in this case the myoblast determination gene product.
     For further information on adding a motif by Transfac
i.d., see the earlier section on adding a motif by name.
All that is discussed for adding a motif by name also
applies to adding a motif by Transfac i.d.
     
Add a spacer to the current pattern (option S, -s)

     Spacers define the distance, in nucleotides, between
two motifs; in other words, a spacer defines the "window" in
which the program searches for the next motif.  (Recall that
in Tffind, each motif represents one transcription factor
binding site.)  A spacer is written as either one integer or
two integers separated by a comma, enclosed within angle
brackets ("<" and ">").  Some examples of spacers that could
be used by Tffind, and their interpretations:
       - <4,40> - look for the next motif within a window of 4
                  to 40 bps
       - <40>   - look for the next motif exactly 40 bps away
In Tffind version 1.1, you can also specify spacers as "4,40"
or "40", with the angle brackets omitted.
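
     The spacer forms above can be reduced to a (minimum,
maximum) distance pair.  The following sketch shows one way
to do that; parse_spacer is a hypothetical name, and the
validation rules (non-negative values, minimum no larger
than maximum) are assumptions rather than documented Tffind
behavior.

```python
def parse_spacer(text):
    """Turn '<4,40>', '<40>', '4,40' or '40' into a (low, high) pair."""
    body = text.strip()
    # The enclosing brackets are optional, as in Tffind version 1.1.
    if body.startswith("<") and body.endswith(">"):
        body = body[1:-1]
    parts = [int(p) for p in body.split(",")]
    if len(parts) == 1:
        low = high = parts[0]      # exact distance
    elif len(parts) == 2:
        low, high = parts          # window of distances
    else:
        raise ValueError("a spacer has one or two integers")
    if low < 0 or high < low:
        raise ValueError("expected 0 <= low <= high")
    return low, high


window = parse_spacer("<4,40>")
exact = parse_spacer("40")
```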
       
Delete the previously entered sub-pattern (option D)

     If a mistake is made upon entering the pattern in
interactive mode, this menu choice allows you to backtrack
without having to start over completely from the beginning.
It works on motifs, as well as on spacers.

Print Main Menu (option M)

     Use this option if you forget a command, or if you
would like to see a summary of the current pattern and
settings.

Quit the program without finding the pattern (option Q)

     This option is a convenience to the user.  If the
program was unintentionally run, the proper files are not
found in the working directory, or there is any other
problem, this option allows the user to immediately exit
Tffind.  Note that any pattern that has been built and any
parameters that have been changed up to this point will be
lost.  They must be re-entered the next time the program is
run.

Find the pattern (option F)

     Make this choice once the pattern has been properly
built, and all of the options are correctly set.  Upon
entering this choice, Tffind will immediately search through
the alignment file or sequence file for the current pattern
and current settings.  The program performs one search for
the pattern on the forward strand, immediately followed by a
search on the reverse strand.  If the alignment file is
large, it might take a little while to run.  However, for
most alignment files, execution completes quickly.  Note
that this choice is only permitted if a valid pattern
containing at least one motif has been entered.

List the available T.F. binding sites in the current matrix
database by name/consensus pattern/access number/Transfac
i.d. (option 1, -list)

     This option creates a handy file that lets you know the
available transcription factor binding sites in your matrix
file(s).  Upon choosing this option, Tffind outputs all of
the available transcription factor binding sites in the
current matrix database by name, consensus sequence, access
number, and Transfac i.d.  Tffind redirects this information
to a file in the current working directory.  The name of the
file is simply the name of the matrix database file with
"_list" appended to it.  To get a comprehensive list of all
of the available transcription factor binding site motifs
that you have access to, step through each database using
option 2, choosing option 1 between each database.  You will
then have several list files, one for each TFBS database
file.

Change the current matrix database (option 2, -db)

     As discussed previously, two matrix database formats
are supported by Tffind: the Transfac and IMD formats.  See
Figure C and Figure D for the formats of these two
databases.  Any file in these formats, including the
official Transfac and IMD database files themselves, can be
read by Tffind.  If you are creating your own matrix file,
it is probably easier to use the IMD format, since it
contains fewer fields and is more compact than the Transfac
format.
     Upon choosing this option, you will be prompted to
enter the name of the file of the new matrix database.  If
the file name that you enter can not be opened or the format
that it is written in is invalid, an error message will
appear prompting you to enter a new file name.
     Keep in mind that your choice of database can affect
the results of your search.  Different databases have
different definitions for certain transcription factor
binding sites.  For example, the GATA-1 binding site has a
consensus sequence of WGATAA in the IMD database and a
consensus sequence of SNNGATNNNN in the Transfac database.
Obviously, the PWMs associated with these consensus
sequences are different from each other.  Such discrepancies
will undoubtedly lead to different search results.  Be sure
to familiarize yourself with the difference in
representation between the two databases before doing any
search for a binding site.  Also, it is important to keep
track of which database you are currently working with.  If
you want to switch databases in the middle of a pattern, be
sure to change the database before entering the next motif.
     
Change minimum number of sequences that must match the
pattern (option 3, -min)

     By default, every sequence in the alignment must match
the entered pattern.  However, it is often useful to allow
one or more sequences in the alignment to not match.
Mutations (point mutations, deletions, insertions, etc.) can
cause certain sequences to not match the desired pattern,
especially in alignments of many sequences.  Allowing some
sequences to not match the pattern helps to alleviate this
problem.  This option is also useful in the special case
where only one sequence needs to match the pattern, which
will find every occurrence of the entered pattern within the
alignment.  The value for the minimum number of sequences
that must match the pattern must be between 1 and the number
of sequences in the alignment.

Change the cutoff value used by the scoring function (option
4, -cut)

     Every set of n sequential nucleotides is a potential
transcription factor binding site of length n.  To further
explain this, let us consider the simple case where you ask
Tffind to search for only one transcription factor binding
site.  For a pattern of binding sites, the following
discussion still holds true, although it is slightly more
complicated.  Let us assume that this binding site is a
6-mer (i.e., it is six nucleotides long).  Before it begins
searching for the binding site in the alignment, Tffind
first computes the maximum possible score that a "perfect"
hit to this binding site would achieve.  This maximum score
is obtained by going through the Position Weight Matrix one
position at a time, determining which nucleotide has the
highest frequency at the current position, and adding this
value to the cumulative score.  This maximum score is then
saved by Tffind for later use.  Similarly, Tffind also
calculates the minimum score.
     As Tffind sequentially searches through each position
of each sequence in the alignment, it looks at 6 nucleotides
at a time (or however long the binding site is).  For each of
these 6-mers, a score (call it the current score) is
computed.   This current score is directly taken from the
Position Weight Matrix computed from the matrix file.  This
score is then compared to the target threshold score, as
computed from the following formula:
     
             St = Smin + (Smax - Smin) * C / 100
                              
where     St   : the target threshold score
          Smin : the minimum possible score
          Smax : the maximum possible score
          C    : the cutoff value used by the scoring function,
                 expressed as a percentage (a cutoff of 0.85
                 corresponds to C = 85)

     By changing the cutoff value, the user can easily
adjust the number and quality of the matches.  Notice that
the above equation gives us (solving for C):
     
                    St   -  Smin
              C = ----------------- * 100
                    Smax -  Smin
                              
     In this form, it is apparent that C is simply used to
adjust the scale of the target threshold score.  The higher
C is, the higher the threshold, leading to fewer hits that
match the weight matrix more closely.  By default, the
cutoff value is 0.85.  However, this value can be easily
changed to any value between 0 and 1 via option 4.  This
cutoff value is listed in square brackets after the name of
the binding site in the "current pattern" section of the
main menu, and is also reported in the header of the output
file.  For all matches to the pattern, Tffind reports the
highest cutoff value at which the match would still be
detected.  For example, a score of 0.74 means that this
binding site would not score high enough with a cutoff value
of 0.75 or higher.
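
     The scoring procedure described in this section can be
sketched in a few lines.  This is an illustration under
stated assumptions, not Tffind's implementation: the raw
counts from the matrix file are used directly as the weights
(the source does not give the exact weighting), the cutoff
is passed as the fraction shown in the main menu (0.85
corresponds to C = 85 in the formula), and all function
names are hypothetical.

```python
def score_bounds(counts):
    """Return (Smin, Smax): sums of the per-position minima and maxima."""
    columns = list(zip(counts["A"], counts["C"], counts["G"], counts["T"]))
    return (sum(min(col) for col in columns),
            sum(max(col) for col in columns))


def site_score(counts, site):
    """Score a candidate site; unrecognized characters give -infinity."""
    total = 0.0
    for i, base in enumerate(site):
        if base not in counts:
            return float("-inf")
        total += counts[base][i]
    return total


def threshold(s_min, s_max, cutoff):
    """St = Smin + (Smax - Smin) * cutoff, with cutoff as a fraction."""
    return s_min + (s_max - s_min) * cutoff


def is_hit(counts, site, cutoff=0.85):
    s_min, s_max = score_bounds(counts)
    return site_score(counts, site) >= threshold(s_min, s_max, cutoff)


# vHNF-1 counts from Figure C.
vhnf1 = {
    "A": [0, 0, 0, 3, 3, 0],
    "C": [0, 0, 0, 0, 0, 0],
    "G": [3, 0, 0, 0, 0, 0],
    "T": [0, 3, 3, 0, 0, 3],
}
perfect = is_hit(vhnf1, "GTTAAT")       # scores 18 of a possible 18
one_off = is_hit(vhnf1, "GTTAAC")       # scores 15, threshold is 15.3
lowercase = is_hit(vhnf1, "GTTAAt")     # contains an unrecognized character
```

     Under these assumptions a single mismatch already drops
the 6-mer below the default cutoff, which shows why short
matrices are sensitive to the cutoff value.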
     
Search for the pattern only within a range of the alignment
(option 5, -range)

     When searching for transcription factor binding sites,
it is often useful to look only within a range of the
alignment, such as the 5' UTR.  Instead of needing to
manually reduce the size of the alignment or go through the
output by hand, Tffind can do this through option 5.  Upon
choosing this option, you will be prompted to enter the
beginning and end of the range.  Keep in mind that the range
is relative to the top-most sequence in the alignment.  The
range that you enter must be between 1 and the length of the
top-most sequence.  If you do not know the length of the
top-most sequence and wish to search all the way to its end,
simply type a '.' as the end of the range.  Notice that
there can be a subtle difference between searching, for
example, within the ranges '1 - 9' and '1 - .' in an
alignment whose topmost sequence is 9 bases long.  Say the
alignment looks something like this:
     
     123456789 (POSITION IN TOP SEQ)
     
     ATTCAAAGA---------
     ATTCGAATATTTAGATAA
     ATTCGAATTTTAAGATAA
     
     In this case, searching in the range '1 - 9' for a
match to GATA-1 (AGATAA) in at least 2 of the sequences in
the alignment would miss the hit at the end of the alignment
in the bottom two sequences (the final six columns) because
it occurs beyond position 9 in the topmost sequence.
However, searching in the range '1 - .' would in fact find
it, since in this case, Tffind would search across the
entire range of the alignment.
     Keep in mind also that the range that you specify is
the range for the beginning of the pattern to appear in.
Therefore, if the pattern begins towards the end of the
range and then continues beyond the boundary, it will still
be reported by Tffind.  Also, upon the automatic search of
the reverse strand, the range is still relative to the
forward direction.  In other words, if you specify the first
ten positions in the alignment (1 - 10) as the range, then
Tffind will search on the last ten positions on the reverse
strand.
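
     The relationship between a forward range and the
positions examined on the reverse strand can be written as a
small coordinate conversion.  The sketch below is an
illustration only; reverse_strand_range is a hypothetical
name, and it assumes 1-based positions counted from the
start of each strand.

```python
def reverse_strand_range(start, end, length):
    """Map a 1-based forward range onto reverse-strand coordinates."""
    if not 1 <= start <= end <= length:
        raise ValueError("expected 1 <= start <= end <= length")
    # Position p on the forward strand is position length - p + 1
    # when counted from the start of the reverse strand.
    return length - end + 1, length - start + 1


# The first ten positions of a 57-column alignment.
last_ten = reverse_strand_range(1, 10, 57)
```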
     
Choose which sequences to search on in the alignment (option
6, -lookin)

     In certain cases, it may be useful to only search
within specific sequences of the alignment.  This option
allows you to filter out any sequences in the alignment that
you are not interested in.  Notice that there is a
distinct difference between specifying one sequence to
search in via option 6 and changing the minimum number of
sequences that must match the pattern to 1 (option 3).  In
the former case, hits will be reported only for the sequence
that is specified, whereas in the latter case, hits will be
reported for all sequences with at least 1 hit.
     Upon choosing option 6, you will be prompted by the
program to enter the sequence numbers of the sequences that
you wish to confine the search to.  Keep in mind that the
numbering of the sequences is such that the top-most
sequence is sequence number 0, the second highest sequence
is sequence number 1, and so on.  If at any time you change
your mind and wish to search on all of the sequences of the
alignment, simply enter a -1 when prompted to enter the
sequence number.  The current setting for option 6 is
printed at the bottom of the main menu.

Toggle exons option (option 7, -exons)

     Another feature of Tffind is the ability to filter out
known exon and gene coding regions.  When looking for
transcription factor binding sites, it is often preferable
to skip over known coding regions such as exons, since
binding sites are usually located in upstream, non-coding
regions of genes.  As such, Tffind allows you to specify a
file that lists the positions of known genes and exons.  The
program then automatically masks out any hits contained
entirely within the genes or exons listed in this file.  The
format of the exons file is the same as the format of exons
files used by the PipMaker suite of programs.  See Figure G
for a sample exons file and a brief explanation of the
fields.  Use option 7 to toggle the exon mask option on and
off.
     If the mask is turned on when you enter "F" to find the
pattern, you will first be prompted to choose what type of
mask you wish to perform.  There are three possible options:
mask out entire genomic regions, mask out only the coding
region of genes, or mask out each individual exon region.
For all three options, only hits contained entirely within
the boundaries are masked out.  For instance, if your exons
file tells Tffind to mask out the region 200-300, and the
program finds a hit to your pattern that occurs at positions
195-205, it will not mask out this hit, and will report it
as it would any other hit.
     Each exons file contains one header line for each gene
that it lists.  Each of these lines starts with
either a ">" or a "<", indicating that the gene is on the
forward or reverse strand, respectively.  If you wish to
mask out the entire scope of these genes, both the exons and
introns, choose the first option, option G.  In Figure G,
this option would mask out any hits found in the range
100-800 on the reverse strand and 1000-2000 on the forward
strand.
     The second option, option C, allows you to choose to
mask out the coding region of the gene.  This information is
optional in an exons file.  If it is included for a gene in
an exons file, this line has a "+" sign, followed by the
coordinates of the translated region.  It appears on the
line directly below the line with the ">" or "<".  Since
this line is optional, if you choose this option and a
certain gene does not have a "+" line, Tffind will
automatically mask out the entire gene (the information
contained on the ">" or "<" line).  For example, if this
option is chosen with the exons file shown in Figure G, then
the ranges 150-750 on the reverse strand and 1000-2000 on
the forward strand would be automatically masked out.
     The final option is option E.  This option
automatically masks out any hits found by Tffind that fall
within the actual exons specified in the exons file.  For
example, in Figure G, any hits found in the regions 100-200
and 600-800 (on the reverse strand), and 1000-1200,
1400-1500, and 1800-2000 (on the forward strand) would not
be reported.
     After specifying what type of mask you want Tffind to
perform, you will then be prompted to enter the name of the
exons file that you wish to use.  Be sure that you know the
name of this file, that you know its directory, and that it
is in the proper exon file format.

____________________________________________________________

  My favorite genomic region

  < 100 800 XYZZY
  + 150 750
  100 200
  600 800

  > 1000 2000 Frobozz gene
  1000 1200 exon 1
  1400 1500 alt. spliced exon
  1800 2000 exon 2

  ... etc.

Figure G: The format of an exons file
This file lists the locations of genes, exons, and coding
regions in a sequence.  The directionality of a gene (">"
or "<"), its start and end positions, and name are on one
line, followed by an optional line beginning with a "+"
character that indicates the positions of the first and last
nucleotides of the translated region (including the
initiation codon, Met, and the stop codon).  These are
followed by lines specifying the start and end positions of
each exon, which must be listed in order of increasing
address even if the gene is on the reverse strand ("<").
Blank lines are ignored, and you can put an optional title
line at the top.
____________________________________________________________
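
     To make the file format and the three masking options
concrete, here is a small Python sketch that reads the text
of Figure G and lists the regions each option would mask.
It is not part of Tffind; the function names, the dictionary
layout, and the assumption that fields are separated by
whitespace are ours.

```python
FIGURE_G = """My favorite genomic region

< 100 800 XYZZY
+ 150 750
100 200
600 800

> 1000 2000 Frobozz gene
1000 1200 exon 1
1400 1500 alt. spliced exon
1800 2000 exon 2
"""


def parse_exons(text):
    """Parse exons-file text into a list of gene dictionaries."""
    genes = []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue  # blank lines are ignored
        if fields[0] in (">", "<") and len(fields) >= 3:
            genes.append({
                "strand": "forward" if fields[0] == ">" else "reverse",
                "start": int(fields[1]),
                "end": int(fields[2]),
                "name": " ".join(fields[3:]),
                "coding": None,   # filled in by an optional "+" line
                "exons": [],
            })
        elif fields[0] == "+" and genes and len(fields) >= 3:
            genes[-1]["coding"] = (int(fields[1]), int(fields[2]))
        elif (genes and len(fields) >= 2
              and fields[0].isdigit() and fields[1].isdigit()):
            genes[-1]["exons"].append((int(fields[0]), int(fields[1])))
        # any other line (such as the optional title) is skipped
    return genes


def mask_regions(genes, option):
    """Return the (start, end) regions masked by option 'G', 'C', or 'E'."""
    regions = []
    for gene in genes:
        whole = (gene["start"], gene["end"])
        if option == "G":
            regions.append(whole)
        elif option == "C":
            # fall back to the whole gene when there is no "+" line
            regions.append(gene["coding"] or whole)
        elif option == "E":
            regions.extend(gene["exons"])
    return regions


genes = parse_exons(FIGURE_G)
for option in "GCE":
    print(option, mask_regions(genes, option))
```

     Run on Figure G, option C yields 150-750 for the first
gene and falls back to 1000-2000 for the second, which has
no "+" line.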


Toggle which sequence positions are reported in relation to
(option 8, -reltop)

     Use this option to toggle between the two ways that
Tffind can report its positions: in relation to the sequence
that the hit occurs on, and in relation to the topmost
sequence in the alignment.  By default, all positions are
reported in relation to the sequence that the hit occurs on.
However, when a hit does not occur on the topmost sequence,
the default output gives no way to tell where the hit lies
in relation to the topmost sequence.  If that is of interest
to you, use this option to report positions in relation to
the topmost sequence instead.
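
     The idea behind reporting positions in relation to the
topmost sequence can be sketched as a coordinate conversion
through the alignment columns.  The Python below is a
simplified illustration, not Tffind's implementation.  It
assumes that alignment rows are equal-length strings with
'-' for gaps, and that positions are 1-based; neither detail
is confirmed here.  When the aligned column is a gap in the
topmost row, the sketch simply returns the last topmost
position before the gap together with a flag.

```python
def to_top_position(rows, seq_index, pos):
    """Map 1-based position `pos` in rows[seq_index] to the topmost row.

    Returns (top_pos, in_gap).  in_gap is True when the aligned column
    is a gap in the topmost row; top_pos is then the last topmost
    position before that gap (0 if there is none).
    """
    count = 0
    column = None
    for col, ch in enumerate(rows[seq_index]):
        if ch != "-":
            count += 1
            if count == pos:
                column = col
                break
    if column is None:
        raise ValueError("position is beyond the end of the sequence")
    top = rows[0]
    # count the non-gap characters of the topmost row up to this column
    top_pos = sum(1 for ch in top[:column + 1] if ch != "-")
    return top_pos, top[column] == "-"


rows = ["AC--GT",   # topmost sequence
        "ACTTGT"]   # sequence 1
print(to_top_position(rows, 1, 5))  # the G in sequence 1 aligns to top position 3
print(to_top_position(rows, 1, 3))  # the first T aligns to a gap after top position 2
```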

*** NEW FOR VERSION 1.1 ***
Toggle masking of lowercase nucleotides (option 9, -lc)

     Use this option to toggle between treating lowercase
nucleotides as if they were uppercase and treating them as
if they were masked out.  The default setting masks out all
lowercase nucleotides, treating them as if they were gaps.
In some applications, you may want to disregard lowercase
nucleotides because they are lowercase precisely because
they have been masked out (e.g. they lie in a repeat).  In
other cases, you may want to treat lowercase and uppercase
nucleotides equally.
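
     The two settings can be pictured as two ways of
preparing a sequence before it is scanned.  The Python
sketch below is only an illustration; the function and
parameter names are hypothetical and do not correspond to
anything in Tffind itself.

```python
def prepare_sequence(seq, mask_lowercase=True):
    """Return the sequence as it would be seen under each setting."""
    if mask_lowercase:
        # default behavior: lowercase nucleotides are treated as gaps
        return "".join("-" if ch.islower() else ch for ch in seq)
    # alternative: lowercase nucleotides count the same as uppercase
    return seq.upper()


print(prepare_sequence("AGatAA"))                        # AG--AA
print(prepare_sequence("AGatAA", mask_lowercase=False))  # AGATAA
```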

*** NEW FOR VERSION 1.1 ***
Toggle between output formats (option 0, -gff)

     Use this option to toggle between the two possible
output formats for Tffind: the original Tffind output format
and GFF format.  GFF format documentation can be found at
http://www.sanger.ac.uk/Software/formats/GFF/.
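
     For orientation, a GFF record is a single tab-separated
line with the fields seqname, source, feature, start, end,
score, strand, and frame, optionally followed by attributes.
The Python sketch below shows how one hit could be laid out
in that form.  The values chosen for the source, feature,
and attribute fields are assumptions made for illustration;
this document does not specify what Tffind itself writes in
those columns.

```python
def hit_to_gff(seqname, start, end, strand, score, motif):
    """Format one hit as a tab-separated GFF line (illustrative only)."""
    fields = [
        seqname,
        "tffind",            # source (assumed value)
        "TF_binding_site",   # feature (assumed value)
        str(start),
        str(end),
        f"{score:.2f}",
        strand,
        ".",                 # frame does not apply to binding sites
        f'motif "{motif}"',  # attribute (assumed tag name)
    ]
    return "\t".join(fields)


print(hit_to_gff("HumanDNA", 3, 8, "+", 1.00, "gata-1"))
```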

The Output of Tffind

     The output of Tffind is shown in Figure H.  The first
"paragraph" lists the alignment file, the range of the
search, which sequence hits are reported in relation to, the
search pattern itself, exon filtering information (if
applicable), the minimum number of sequences in the alignment
that must match the pattern, how lowercase nucleotides were
treated (masked out or the same as uppercase), and which
sequences were searched.  Next is a table of the hits that
were found for the pattern.  The columns of this table
contain the sequence that the hit occurred on, the range
that the hit occurs in, the strand ("+" for forward and "-"
for reverse), the score of the hit (reported as the highest
cutoff value at which this match would still be reported as
a hit), and the actual nucleotide pattern of the hit itself.
Hits are separated from one another by blank lines.  A
summary of the search
results follows the table.  This summary includes the total
number of hits found, the total number of hits filtered out
(if applicable), which sequence the positions of each hit
are reported in relation to, and a key giving the number
and name of each sequence in the alignment.
     An oddity can occur if the user chooses to report all
hits in relation to the topmost sequence in the alignment.
If, for example, a hit occurs in sequence 1 of the
alignment, but one of its endpoints aligns to a gap ('-') in
the topmost sequence, its position is prepended with a '!'
in the output.  Because that endpoint technically does not
exist in the topmost sequence, Tffind instead reports the
positions in the topmost sequence that immediately precede
and follow the gap that the hit aligns to.
     New for Tffind version 1.1 is the option to have output
in GFF format, as documented at the Sanger Institute webpage
(http://www.sanger.ac.uk/Software/formats/GFF/).  This option
is available via menu option 0, or the '-gff' command-line
option.
____________________________________________________________

Hits in alignment file 'hmr_alignment_file' within the range of the entire alignment
in relation to the sequence that the hit occurs on
for the pattern " {gata-1[0.85]: IMD} ",
filtering out exon regions provided in the file 'exons_file'
(in relation to sequence 1),
treating lowercase nucleotides the same as uppercase nucleotides,
where at least 1 of the 1 sequences matches in sequence 0.


Seq  From     To       Strand  Score  Pattern
---  ----     --       ------  -----  -------

0    3        8        +       1.00   AGATAA
1    3        8        +       0.94   AGATAG
2    7        12       +       1.00   AGATAA

2    1        6        +       1.00   AGATAA

0    54       59       -       0.93   TGATAG
1    58       63       -       0.94   AGATAG


Total matches found : 3
Total hits filtered out : 1
All positions relative to the sequence that the hit occurs on.

Sequence key:

  0.....HumanDNA
  1.....MouseDNA
  2.....RatDNA

Figure H: Sample Tffind output
This output was obtained with the (batch mode) command:
tffind hmr_alignment_file imd.txt -n gata-1 -min 1 -exons g 1 exons_file

This search will look through the file "hmr_alignment_file"
for all occurrences of the GATA-1 binding site as described
in the TFBS matrix file "imd.txt" in any sequence in the
alignment, filtering out those hits that occur within the
genic regions contained in the file "exons_file."
____________________________________________________________
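
     If you want to post-process output like that in Figure
H, the hit table can be read with a short script.  The
Python sketch below is based only on the layout shown in the
figure and is not an official parser.  It assumes that
columns are separated by whitespace, that a '!' may precede
a position, and that the nucleotide pattern contains no
spaces.

```python
import re

# one table row: Seq, From, To, Strand, Score, Pattern
HIT_ROW = re.compile(
    r"^\s*(\d+)\s+(!?\d+)\s+(!?\d+)\s+([+-])\s+(\d+(?:\.\d+)?)\s+(\S+)\s*$"
)


def parse_hits(output):
    """Group table rows into hits; rows of one hit are adjacent lines."""
    hits, current = [], []
    for line in output.splitlines():
        m = HIT_ROW.match(line)
        if m:
            seq, start, end, strand, score, pattern = m.groups()
            current.append({
                "seq": int(seq),
                "from": int(start.lstrip("!")),
                "to": int(end.lstrip("!")),
                "gap_flag": start.startswith("!") or end.startswith("!"),
                "strand": strand,
                "score": float(score),
                "pattern": pattern,
            })
        elif current:
            # a blank (or any non-row) line ends the current hit
            hits.append(current)
            current = []
    if current:
        hits.append(current)
    return hits


sample = """Seq  From     To       Strand  Score  Pattern
---  ----     --       ------  -----  -------

0    3        8        +       1.00   AGATAA
1    3        8        +       0.94   AGATAG
2    7        12       +       1.00   AGATAA

2    1        6        +       1.00   AGATAA

0    54       59       -       0.93   TGATAG
1    58       63       -       0.94   AGATAG
"""
for hit in parse_hits(sample):
    print([(row["seq"], row["from"], row["to"]) for row in hit])
```

     Applied to the table in Figure H, this groups the six
rows into three hits, in agreement with the "Total matches
found" line.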