Tffind Documentation (release version 1.1)
Matt Weirauch (weirauch@soe.ucsc.edu)
Nov 3, 2003

What is Tffind?

Tffind is a computer program written in C that locates patterns of transcription factor binding sites within a MultiPipMaker multiple alignment file or a file in FASTA format. It is available for free download at http://globin.cse.psu.edu, along with other programs for manipulating MultiPipMaker input and output.

What Do I Need to Run Tffind?

Tffind currently runs on Unix and Linux platforms. To run it, all you need is a Unix or Linux machine, a textual alignment file from MultiPipMaker or a file in standard FASTA format, and a transcription factor binding site matrix file in either TransFac MATRIX or IMD format. The formats of these files are described below.

Patterns of Transcription Factor Binding Sites in Tffind

Patterns in Tffind are built from two units: the motif and the spacer. A motif represents one transcription factor binding site in the pattern. Each motif is represented by a position weight matrix (PWM) as calculated from the given matrix file. From this PWM, a score is calculated for the current position in the alignment or sequence. If this score meets a certain threshold, then the remainder of the pattern is checked to see if the entire pattern matches at the current position. If the entire pattern matches, the position is recorded as a "hit."

The other part of a pattern is the spacer. Spacers are used to orient motifs in relation to one another along the sequence or alignment. Spacers can take one of two forms. The spacer "<50>" means to look for the next motif in the pattern exactly 50 bps away from the previous motif. The spacer "<10,50>" means to look for the next motif anywhere in the window of 10 to 50 bps away from the previous motif. By using motifs and spacers, patterns of transcription factor binding sites can be constructed.
For example, the pattern "MZF1 <0,100> AML-1a" searches for the MZF1 binding site with the AML-1a binding site following anywhere from 0 to 100 bps away. The spacer "<0>" is the same as having no spacer at all. Patterns in Tffind can be arbitrarily large or small.

Input of Tffind

To run Tffind, type "tffind AlignmentFile MatrixFile" and you will be in interactive mode. Provided that there are no problems (i.e., the program has compiled successfully), you should be greeted with the main menu screen. Tffind can also be run in batch mode (type "tffind" at the command line to see the syntax). If you wish to redirect the output from the screen to a file, append "> outputfile" to the end of the command. More about the options for Tffind is discussed later.

The Alignment File

Before discussing the program itself, we briefly describe the format of Tffind's alignment or sequence inputs. Tffind accepts two input formats: MultiPipMaker textual alignment format and FASTA format. A textual alignment file contains the results of a MultiPipMaker alignment between two or more sequences. The format of the alignment file is exactly the format of MultiPipMaker's textual output. To receive output in this format, check the box labeled "Generate very verbose text (ASCII, compressed)" on the MultiPipMaker page, which can also be reached via the Penn State Globin Gene Server, http://globin.cse.psu.edu. A sample textual alignment file is shown in Figure B.1.

____________________________________________________________
3 57
Human
Mouse
Rat
AGATGAGAGTGATAAGAGAGAAAGA---GATAGGGGATGATAACACTTCCGGAAGTG
AGGGAGAGATGATAAGAGAGAAAGAAAAGATAGGGGATGATAACACTTCCGGAAGTG
AGGGAGAGATTTTTTGAGAGAGAGAAAAGATAGGGGATGATAACACTTCCGGAAGTG

Figure B.1: The format of a textual multiple alignment file.
The first two numbers indicate the number of sequences in the alignment and the length (in base pairs) of the alignments, respectively. Next are the names of each sequence, one per line.
Finally, the alignment itself is given. Gaps in the alignment are represented by the character '-'.
____________________________________________________________

The second format accepted by Tffind is standard FASTA format. Files in this format are accepted so that users can search single sequences in addition to alignments of sequences. FASTA format is essentially a single-line description followed by the sequence itself. See Figure B.2 for a sample.

For both types of input, only the characters A, C, G, and T are recognized. All other characters (a, c, g, t, N, n, X, x, -, etc.) are given a score of negative infinity. Therefore, a binding site that contains any of these characters can never be reported as a hit. To force Tffind to treat lowercase nucleotides as if they were uppercase, use option 9 (-lc for the command-line version).

____________________________________________________________
>Description of Sequence
GAGAATCGATCGACTACGATCAGCTAGCACAGCGGCGGCCTAGCGATCAGCGACTACTAG
CGACTACGACGATCAGCATCAGCACTAGCAGCATCAGCTACAGCGACTACAGCGACTACT
AGCAGCTACGACTAGCACTACGACGGCCCTAGCATCGCA

Figure B.2: The format of a FASTA file.
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.
____________________________________________________________

The Matrix File

A file containing the transcription factor binding site position weight matrices is required to run Tffind. Tffind accepts files in both Transfac and IMD format. The format of IMD matrix files is shown in Figure C, and the Transfac format is shown in Figure D. When choosing between matrix files, keep in mind that different files may cause the program to report different results. A given binding site may have different values in the two databases, which in turn may alter the results of the search.
____________________________________________________________
NF-kB   11.54   17.55   GGGANTTYCC   M00001
A |   3   0   0  60  25   4   0   0   5   0
C |   0   0   0   2  29   1  18  32  73  77
G |  71  78  76  16   6   0   0   0   0   0
T |   4   0   2   0  18  73  60  46   0   1
vHNF-1  11.02   11.04   GTTAAT       M00002
A |   0   0   0   3   3   0
C |   0   0   0   0   0   0
G |   3   0   0   0   0   0
T |   0   3   3   0   0   3

Figure C: The format of an IMD matrix database entry
Starting at the top line, the left-most entry is the name of the T.F. binding site. The next two values are not used by Tffind. Fourth is the binding site's consensus sequence, and last is the IMD database accession number. The next four lines contain the count of each of the four nucleotides at each position in the transcription factor binding site. The first column gives the count of each nucleotide at the first position in the binding site, and so on.
____________________________________________________________

____________________________________________________________
AC  M00077
XX
ID  V$GATA3_01
XX
DT  24.04.1995 (created); hiwi.
DT  16.10.1995 (updated); ewi.
XX
NA  GATA-3
XX
DE  GATA-binding factor 3
XX
BF  T00311 GATA-3; Species: human, Homo sapiens.
XX
PO      A      C      G      T
01     16     14     24      9      N
02     26     13      9     15      N
03      0      0     63      0      G
04     62      0      0      1      A
05      1      0      3     59      T
06     35      2     10     16      A
07     17      5     33      8      R
08     10     11     26     16      N
09     15      7     34      7      G
XX
BA  63 selected binding sequences
XX
RN  [1]
RX  MEDLINE; 93309433.
RA  Merika M., Orkin S. H.
RT  DNA-binding specificity of GATA family t.f.s
RL  Mol. Cell. Biol. 13:3999-4010 (1993).
XX

Figure D: The format of a Transfac matrix database entry
The following fields are relevant to Tffind:
  AC  Accession number
  ID  Transfac identification number
  NA  Name of binding factor
  PO  Position in the T.F. binding site, frequency of A, C, G, T, respectively
The last column in the "PO" table contains the deduced consensus nucleotide at each position.
____________________________________________________________

Using Tffind: Features and Options

Each of Tffind's parameters and features is explained in this section.
Through these, the user builds the transcription factor binding site pattern and sets all of the parameters associated with it. The pattern and parameters are then used to search the alignment or sequence file for matches. Each parameter has a default value, which can be changed at any time. Keep in mind, however, that for some parameters it matters not only what you change them to but also when you change them (parameters for which this is true are noted).

Tffind can be run both interactively and in batch mode. In the more user-friendly interactive mode, as you alter the parameters and build your pattern, an updated summary of the current pattern and parameter settings is displayed in the main menu. For a view of the main menu, see Figure E. The following discussion focuses on interactive mode, but most comments apply to batch mode as well. For a summary of the batch mode syntax, type "tffind" at the command line. For a more in-depth discussion of the batch mode parameters, see the text file entitled "tffind_command_line_options.txt".

____________________________________________________________
=======================================================================
                              MAIN MENU
                              ---- ----
N...Add motif to current pattern by name
P...Add motif to current pattern by consensus pattern
A...Add motif to current pattern by access number
I...Add motif to current pattern by i.d. (TransFac only)
S...Add spacer to current pattern
D...Delete previous sub-pattern
M...Print main menu
Q...QUIT PROGRAM (without finding pattern)
F...FIND PATTERN
1...List available t.f. binding sites in current matrix database
2...Change current matrix database
3...Change minimum number of sequences that must match pattern
4...Change cutoff value for score function
5...Search for pattern only within a range of the alignment or sequence
6...Choose which sequences to search on in the alignment
7...Toggle exons option
8...Toggle which sequence positions are in relation to
9...Toggle between masking lowercase nucleotides
0...Toggle between output formats (TFFIND and GFF)

Current pattern : {gata-1[0.85]: IMD} <0,100> {gata-1[0.85]: IMD}
Current alignment file : hmr1
Current matrix database : IMD Database
Minimum number of sequences that must match : 3 out of 3
Cutoff value for score function : 0.85
Range of search : Entire Sequence
Search in sequences : 0 1 2
Filter out regions overlapped by exons file? : no
Mask out lowercase nucleotides? : yes
Report sequences in relation to the sequence the match occurs on.
Output format : TFFIND format
=======================================================================
Please enter your menu choice.

Figure E: The Tffind interface.
____________________________________________________________

Add a motif to the current pattern by name (option N, -n)

Use this option to start building a pattern. After you add a binding site, its name, its score cutoff value, and the database it was taken from are added to the current pattern. As the pattern is built, it is updated in the main menu.

Tffind also allows you to look for binding sites on the opposite strand from the rest of the pattern. To do this, append '(c)' to the end of the binding site's name. See Figure F for an example of a match to such a pattern. Transcription factor binding site names are case-insensitive.
If you cannot remember the exact name of a specific binding site, use option 1, and a list of all available binding site names, consensus patterns, accession numbers, and Transfac i.d.s in the current matrix database will be written to a file. Also, before adding a binding site to the pattern, be sure that the settings for the "current matrix database" and the cutoff value are correct. If they are not, change them using options 2 and 4, respectively.

Note also that in the Transfac database, some binding sites have more than one entry under the same name. In such cases, Tffind uses the first binding site it finds that matches the given name. To avoid such ambiguities, it is best to use the access number or Transfac i.d. whenever possible.

____________________________________________________________
ACTATTGGTGGCACGCCAGAACACCTG   - DNA strand
|||||||||||||||||||||||||||
CAGCTGCCACCGTGCGGTCGGGTGGAC   + DNA strand

CAGCTGCCACCGTGCGGTCGGGTGGAC   Actual sequence data in alignment file.

Figure F: An example of a match to a pattern that covers both strands
Say you want to search for the Tal-1 binding site, followed by exactly 10 base pairs and then the e47 binding site on the opposite strand (using default settings for database and minimum percent). To do this, construct the pattern "{Tal-1[0.85]: IMD} <10> {e47(c)[0.85]: IMD}". The strands above show what a match to this pattern would look like on an actual strand of DNA. The Tal-1 and e47(c) sites are underlined in the figure. In an actual alignment file, only one strand (the forward strand) is given per sequence. When searching for the given pattern, Tffind searches for Tal-1, then exactly 10 bps, and finally the reverse complement of e47, which is the same as looking on the reverse strand. The above sequence matches this pattern, and as such would generate a hit.
____________________________________________________________

Add a motif to the current pattern by consensus pattern (option P, -p)

The consensus pattern of a transcription factor binding site is the sequence of nucleotides that receives the highest possible score. In other words, it is the pattern that best matches the binding site. In certain cases, this sequence is obvious. For example, in Figure C, the consensus sequence for binding site "vHNF-1" is "GTTAAT" because, looking at the PWM, the highest scoring nucleotide at position 1 is G, at position 2 is T, and so on.

However, the consensus sequence for "NF-kB" (see also Figure C) is not quite as simple. The whole need for the PWM arises because nucleotides are not always the same at a given position in a binding site. To express this fact in a single character, biologists use the set of "ambiguous nucleotides." In this set, there is a standard symbol for each of the 15 possible non-empty subsets of {A,C,G,T}. For example, one of the more common ambiguous nucleotides is N, which stands for any one of A, C, G, or T. As such, the pattern NNNN matches any string of exactly 4 nucleotides. Other ambiguous nucleotides include R, which stands for {A,G}, and Y, which stands for {C,T}. Using this set, the consensus sequence for "NF-kB" is accordingly "GGGANTTYCC". A Google search for the phrase "ambiguous nucleotides" should provide a link to a table defining them all.

For further information on adding a motif by consensus sequence, see the preceding section, "Add a motif to the current pattern by name." All that is discussed for adding a motif by name also applies to adding a motif by consensus sequence.

Add a motif to the current pattern by access number (option A, -a)

Another way to add a transcription factor binding site to the pattern is by specifying its access number.
For both the Transfac and the IMD databases, a TFBS's access number looks something like "M00001". In IMD files, the access number is the final entry on the header line for a binding site. In Transfac files, the access number can be found on the line labeled "AC". Specifying by access number is useful because each binding site matrix has a unique access number, which is not necessarily true of names or consensus patterns.

For further information on adding a motif by access number, see the section "Add a motif to the current pattern by name." All that is discussed for adding a motif by name also applies to adding a motif by access number.

Add a motif to the current pattern by Transfac i.d. (option I, -i)

Transfac entries are also indexed by a unique Transfac i.d. Such i.d.s can be found in a Transfac matrix file on lines labeled "ID", and are of the form "V$MYOD_01", where "MYOD" is an abbreviation of the transcription factor's name, in this case the myoblast determination gene product.

For further information on adding a motif by Transfac i.d., see the section "Add a motif to the current pattern by name." All that is discussed for adding a motif by name also applies to adding a motif by Transfac i.d.

Add a spacer to the current pattern (option S, -s)

Spacers indicate the number of nucleotides that occur between motifs. The basic format of a spacer is either one integer or two integers separated by a comma, enclosed within "less than" and "greater than" brackets. In Tffind, each motif represents one transcription factor binding site, and the spacer between two motifs defines the distance between them. In other words, the job of a spacer is to define the "window" in which the program searches for the next motif.
Some examples of possible spacers that could be used by Tffind, and their interpretations:
- <4,40> - look for the next motif within a window of 4 to 40 bps
- <40> - look for the next motif exactly 40 bps away
In Tffind version 1.1, you can also specify spacers as "4,40" or "40", with the angle brackets omitted.

Delete the previously entered sub-pattern (option D)

If a mistake is made while entering the pattern in interactive mode, this menu choice allows you to backtrack without having to start over from the beginning. It works on motifs as well as on spacers.

Print Main Menu (option M)

Use this option if you forget a command, or if you would like to see a summary of the current pattern and settings.

Quit the program without finding the pattern (option Q)

This option is a convenience to the user. If the program was run unintentionally, the proper files are not found in the working directory, or there is any other problem, this option allows the user to exit Tffind immediately. Note that any pattern that has been built and any parameters that have been changed up to this point will be lost. They must be re-entered the next time the program is run.

Find the pattern (option F)

Make this choice once the pattern has been properly built and all of the options are correctly set. Upon entering this choice, Tffind immediately searches through the alignment file or sequence file for the current pattern using the current settings. The program performs one search for the pattern on the forward strand, immediately followed by a search on the reverse strand. If the alignment file is large, the search might take a little while. However, for most alignment files, execution completes quickly. Note that this choice is only permitted if a valid pattern containing at least one motif has been entered.

List the available T.F. binding sites in the current matrix database by name/consensus pattern/access number/Transfac i.d.
(option 1, -list)

This option creates a file listing the transcription factor binding sites available in your matrix file(s). Upon choosing this option, Tffind outputs all of the available transcription factor binding sites in the current matrix database by name, consensus sequence, access number, and Transfac i.d. Tffind writes this information to a file in the current working directory. The name of the file is the name of the matrix database file with "_list" appended to it. To get a comprehensive list of all of the transcription factor binding site motifs that you have access to, step through each database using option 2, choosing option 1 between each database. You will then have several list files, one for each TFBS database file.

Change the current matrix database (option 2, -db)

As discussed previously, two matrix database formats are supported by Tffind: the Transfac and IMD formats. See Figure C and Figure D for the formats of these two databases. Any file in these formats, including the official Transfac and IMD database files themselves, can be read by Tffind. If you are creating your own matrix file, it is probably easier to use the IMD format, since it contains fewer fields and is more compact than the Transfac format. Upon choosing this option, you will be prompted to enter the name of the file of the new matrix database. If the file name that you enter cannot be opened or the file's format is invalid, an error message will appear prompting you to enter a new file name.

Keep in mind that your choice of database can affect the results of your search. Different databases have different definitions for certain transcription factor binding sites. For example, the GATA-1 binding site has a consensus sequence of WGATAA in the IMD database and a consensus sequence of SNNGATNNNN in the Transfac database. The PWMs associated with these consensus sequences are therefore different from each other.
Such discrepancies will undoubtedly lead to different search results. Be sure to familiarize yourself with the difference in representation between the two databases before doing any search for a binding site. It is also important to keep track of which database you are currently working with. If you want to switch databases in the middle of a pattern, be sure to change the database before entering the next pattern.

Change minimum number of sequences that must match the pattern (option 3, -min)

By default, every sequence in the alignment must match the entered pattern. However, it is often useful to allow one or more sequences in the alignment to not match. Mutations (point mutations, deletions, insertions, etc.) can cause certain sequences to not match the desired pattern, especially in alignments of many sequences. Allowing some sequences to not match the pattern helps to alleviate this problem. This option is also useful in the special case where only one sequence needs to match the pattern, which will find every occurrence of the entered pattern within the alignment. The minimum number of sequences that must match the pattern must be between 1 and the number of sequences in the alignment.

Change the cutoff value used by the scoring function (option 4, -cut)

Every set of n sequential nucleotides is a potential transcription factor binding site of length n. To explain this further, consider the simple case where you ask Tffind to search for only one transcription factor binding site. For a pattern of binding sites, the following discussion still holds true, although it is slightly more complicated. Assume that this binding site is a 6-mer (i.e., it is six nucleotides long). Before it begins searching for the binding site in the alignment, Tffind first computes the maximum possible score that a "perfect" hit to this binding site would achieve.
This maximum score is obtained by going through the Position Weight Matrix one position at a time, determining which nucleotide has the highest frequency at the current position, and adding this value to the cumulative score. This maximum score is then saved by Tffind for later use. Similarly, Tffind also calculates the minimum score.

As Tffind sequentially searches through each position of each sequence in the alignment, it looks at 6 nucleotides at a time (or however long the binding site is). For each of these 6-mers, a score (call it the current score) is computed. This current score is taken directly from the Position Weight Matrix computed from the matrix file. This score is then compared to the target threshold score, as computed from the following formula:

    St = Smin + (Smax - Smin) * C / 100

where
    St   : the target threshold score
    Smin : the minimum possible score
    Smax : the maximum possible score
    C    : the cutoff value used by the scoring function, expressed as a percentage (a cutoff of 0.85 corresponds to C = 85)

By changing the cutoff value, the user can easily adjust the number and quality of the matches. Notice that the above equation gives us (solving for C):

          St - Smin
    C = ------------- * 100
         Smax - Smin

In this form, it is apparent that C is simply used to adjust the scale of the target threshold score. The higher C is, the higher the threshold, leading to fewer hits that match the weight matrix more closely.

By default, the cutoff value is 0.85. However, this value can easily be changed to any value between 0 and 1 via option 4. This cutoff value is listed in square brackets after the name of the binding site in the "current pattern" section of the main menu, and is also reported in the header of the output file. For all matches to the pattern, Tffind reports the highest cutoff value at which the match would still be detected. For example, a score of 0.74 means that this binding site would not score high enough with a cutoff value of 0.75 or higher.
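The threshold calculation described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tffind's actual C source: it scores a k-mer by summing the raw counts from the matrix (the PWM that Tffind computes from the matrix file may transform these counts), and the function names score_bounds, kmer_score, threshold, and hit_score are hypothetical. The cutoff is passed as a fraction between 0 and 1, as shown in the main menu, which corresponds to C / 100 in the formula above. The matrix is the vHNF-1 entry from Figure C.

```python
# Illustrative sketch only; not Tffind's actual source code.
# Assumption: a k-mer's score is the sum of the matrix entries it selects.

# vHNF-1 count matrix from Figure C (one list per nucleotide, one entry per position).
VHNF1 = {
    "A": [0, 0, 0, 3, 3, 0],
    "C": [0, 0, 0, 0, 0, 0],
    "G": [3, 0, 0, 0, 0, 0],
    "T": [0, 3, 3, 0, 0, 3],
}


def score_bounds(matrix):
    """Return (Smin, Smax): the lowest and highest score any k-mer can get."""
    columns = list(zip(*(matrix[n] for n in "ACGT")))
    return sum(min(c) for c in columns), sum(max(c) for c in columns)


def kmer_score(matrix, kmer):
    """Score a k-mer; any character other than A, C, G, T scores -infinity."""
    total = 0.0
    for pos, base in enumerate(kmer):
        if base not in matrix:
            return float("-inf")
        total += matrix[base][pos]
    return total


def threshold(matrix, cutoff):
    """St = Smin + (Smax - Smin) * cutoff, with cutoff given as a fraction (0 to 1)."""
    smin, smax = score_bounds(matrix)
    return smin + (smax - smin) * cutoff


def hit_score(matrix, kmer):
    """Where a k-mer's score falls between Smin and Smax, as a fraction (0 to 1)."""
    smin, smax = score_bounds(matrix)
    return (kmer_score(matrix, kmer) - smin) / (smax - smin)


if __name__ == "__main__":
    print(score_bounds(VHNF1))         # (0, 18)
    print(threshold(VHNF1, 0.85))      # roughly 15.3
    print(hit_score(VHNF1, "GTTAAT"))  # 1.0, the consensus sequence
    print(hit_score(VHNF1, "GTTAAC"))  # 15/18, below a 0.85 cutoff
```

Under these assumptions, a 6-mer is a hit when kmer_score is at least threshold, which is the same as hit_score being at least the cutoff. The sketch does not guard against a matrix in which Smax equals Smin.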
Search for the pattern only within a range of the alignment (option 5, -range)

When searching for transcription factor binding sites, it is often useful to look only within a range of the alignment, such as the 5' UTR. Option 5 does this for you, so that you do not need to manually reduce the size of the alignment or go through the output by hand. Upon choosing this option, you will be prompted to enter the beginning and end of the range. Keep in mind that the range is relative to the top-most sequence in the alignment. The range that you enter must be between 1 and the length of the top-most sequence. If you do not know the length of the top-most sequence and wish to search all the way to its end, type a '.' as the end of the range.

Notice that there can be a subtle difference between searching, for example, within the ranges '1 - 9' and '1 - .' in an alignment whose top-most sequence is 9 bases long. Say the alignment looks something like this:

    123456789            (POSITION IN TOP SEQ)
    ATTCAAAGA---------
    ATTCGAATATTTAGATAA
    ATTCGAATTTTAAGATAA

In this case, searching in the range '1 - 9' for a match to GATA-1 (AGATAA) in at least 2 of the sequences in the alignment would miss the hit at the end of the alignment in the bottom two sequences (the final AGATAA), because it occurs beyond position 9 in the top-most sequence. However, searching in the range '1 - .' would find it, since in this case Tffind searches across the entire range of the alignment.

Keep in mind also that the range that you specify is the range in which the beginning of the pattern must appear. Therefore, if the pattern begins towards the end of the range and then continues beyond the boundary, it will still be reported by Tffind. Also, during the automatic search of the reverse strand, the range is still relative to the forward direction.
In other words, if you specify the first ten positions in the alignment (1 - 10) as the range, then Tffind will search on the last ten positions on the reverse strand.

Choose which sequences to search on in the alignment (option 6, -lookin)

In certain cases, it may be useful to search only within specific sequences of the alignment. This option allows you to filter out any sequences in the alignment that you are not interested in. Notice that there is a distinct difference between specifying one sequence to search in via option 6 and changing the minimum number of sequences that must match the pattern to 1 (option 3). In the former case, hits will be reported only for the sequence that is specified, whereas in the latter case, hits will be reported for all sequences with at least 1 hit.

Upon choosing option 6, you will be prompted to enter the sequence numbers of the sequences that you wish to confine the search to. Keep in mind that the top-most sequence is sequence number 0, the second sequence from the top is sequence number 1, and so on. If at any time you change your mind and wish to search on all of the sequences of the alignment, enter -1 when prompted for the sequence number. The current setting for option 6 is printed at the bottom of the main menu.

Toggle exons option (option 7, -exons)

Another feature of Tffind is the ability to filter out known exon and gene coding regions. When looking for transcription factor binding sites, it is often preferable to skip over known coding regions such as exons, since binding sites are usually located in upstream, non-coding regions of genes. As such, Tffind allows you to specify a file that lists the positions of known genes and exons. The program then automatically masks out any hits that fall entirely within the genes or exons listed in this file.
The format of the exons file is the same as the format of exons files used by the PipMaker suite of programs. See Figure G for a sample exons file and a brief explanation of the fields. Use option 7 to toggle the exon mask option on and off. If the mask is turned on when you enter "F" to find the pattern, you will first be prompted to choose what type of mask you wish to perform. There are three possible options: mask out entire genomic regions, mask out only the coding region of genes, or mask out each individual exon region. For all three options, only hits contained entirely within the boundaries are masked out. For instance, if your exons file tells Tffind to mask out the region 200-300, and the program finds a hit to your pattern at positions 195-205, it will not mask out this hit, and will report it as it would any other hit.

Each exons file contains one line for each gene contained within the file. Each of these lines starts with either a ">" or a "<", indicating that the gene is on the forward or reverse strand, respectively. If you wish to mask out the entire scope of these genes, both the exons and introns, choose the first option, option G. In Figure G, this option would mask out any hits found in the range 100-800 on the reverse strand and 1000-2000 on the forward strand.

The second option, option C, allows you to mask out the coding region of the gene. This information is optional in an exons file. If it is included for a gene, it appears on the line directly below the line with the ">" or "<", and consists of a "+" sign followed by the coordinates of the translated region. Since this line is optional, if you choose this option and a certain gene does not have a "+" line, Tffind will automatically mask out the entire gene (the information contained on the ">" or "<" line).
For example, if this option is chosen with the exons file shown in Figure G, then the ranges 150-750 on the reverse strand and 1000-2000 on the forward strand would be automatically masked out.

The final option is option E. This option automatically masks out any hits found by Tffind that fall within the actual exons specified in the exons file. For example, in Figure G, any hits found in the regions 100-200 and 600-800 (on the reverse strand), and 1000-1200, 1400-1500, and 1800-2000 (on the forward strand) would not be reported.

After specifying what type of mask you want Tffind to perform, you will be prompted to enter the name of the exons file that you wish to use. Be sure that you know the name of this file and its directory, and that it is in the proper exons file format.

____________________________________________________________
My favorite genomic region

< 100 800 XYZZY
+ 150 750
100 200
600 800

> 1000 2000 Frobozz gene
1000 1200 exon 1
1400 1500 alt. spliced exon
1800 2000 exon 2

... etc.

Figure G: The format of an exons file
This file lists the locations of genes, exons, and coding regions in a sequence. The directionality of a gene (">" or "<"), its start and end positions, and its name are on one line, followed by an optional line beginning with a "+" character that indicates the positions of the first and last nucleotides of the translated region (including the initiation codon, Met, and the stop codon). These are followed by lines specifying the start and end positions of each exon, which must be listed in order of increasing address even if the gene is on the reverse strand ("<"). Blank lines are ignored, and you can put an optional title line at the top.
____________________________________________________________

Toggle which sequence positions are in relation to (option 8, -reltop)

Use this option to toggle between the two ways that Tffind can report its positions: in relation to the sequence that the hit occurs on, and in relation to the top-most sequence in the alignment. By default, all positions are reported in relation to the sequence that the hit occurs on. However, when a hit does not occur on the top-most sequence, it may be useful to know its position in relation to the top-most sequence. Without this option, there would be no way to know where such a hit lies in relation to the top-most sequence.

*** NEW FOR VERSION 1.1 ***
Toggle between masking lowercase nucleotides (option 9, -lc)

Use this option to toggle between treating lowercase nucleotides as if they were uppercase and treating them as if they were masked out. The default setting masks out all lowercase nucleotides, treating them as if they were gaps. In certain applications, you may want to disregard lowercase nucleotides because they are lowercase precisely because they have been masked out (e.g., they are in a repeat). In other cases, you may want to treat lowercase and uppercase nucleotides equally.

*** NEW FOR VERSION 1.1 ***
Toggle between output formats (option 0, -gff)

Use this option to toggle between the two possible output formats for Tffind: the original Tffind output format and GFF format. GFF format documentation can be found at http://www.sanger.ac.uk/Software/formats/GFF/.

The Output of Tffind

The output of Tffind is shown in Figure H.
The first "paragraph" lists the alignment file, the range of the search, which sequence hits are reported in relation to, the search pattern itself, exon filtering information (if applicable), the minimum number of sequences in the alignment that must match the pattern, how lowercase nucleotides were treated (masked out or the same as uppercase), and which sequences were searched.

Next is a table that contains the hits found for the pattern. The columns of this table contain the sequence that the hit occurred on, the range that the hit occurs in, the strand ("+" for forward and "-" for reverse), the score of the hit (reported as the cutoff value at which this match would just qualify as a hit), and the actual nucleotide pattern of the hit itself. Each hit is separated by a blank line.

A summary of the search results follows the table. This summary includes the total number of hits found, the total number of hits filtered out (if applicable), which sequence the positions of each hit are reported in relation to, and a key giving the number and name of each sequence in the alignment.

An oddity can occur if the user chooses to report all hits in relation to the topmost sequence in the alignment. If, for example, a hit occurs in sequence 1 of the alignment, but one of its endpoints aligns to a gap ('-') in the topmost sequence, its position is prepended with a '!' in the output. Because that endpoint technically does not exist in the topmost sequence, Tffind instead reports the positions in the topmost sequence immediately preceding and following the gap that the hit aligns to.

New for Tffind version 1.1 is the option to have output in GFF format, as documented at the Sanger Institute webpage (http://www.sanger.ac.uk/Software/formats/GFF/). This option is available via menu option 0, or the '-gff' command-line option.
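The gap case described above for topmost-sequence reporting can be illustrated with a small Python sketch. It is not Tffind's own code (Tffind is written in C). The function name `top_position`, the use of 0-based alignment columns, and the 1-based sequence positions are all assumptions made for illustration. The sketch maps an alignment column to a position in the topmost row and, when the column falls on a gap, returns the neighbouring positions on either side instead.

```python
def top_position(top_row, col):
    """Map a 0-based alignment column to a 1-based position in top_row.

    Returns (position, in_gap). If the column holds a nucleotide, position
    is a single integer. If it holds a gap ('-'), position is a pair
    (before, after): the nearest real positions on either side of the gap,
    with None where the gap runs to the edge of the alignment.
    (Hypothetical helper; the coordinate conventions are assumptions.)
    """
    # Number of real (non-gap) characters strictly left of this column.
    before = sum(1 for ch in top_row[:col] if ch != "-")
    if top_row[col] != "-":
        return before + 1, False
    # The column is a gap: check whether any nucleotide follows it.
    has_following = any(ch != "-" for ch in top_row[col + 1:])
    return (before or None, before + 1 if has_following else None), True


if __name__ == "__main__":
    top = "AC--GT"  # toy topmost row, not taken from a real alignment
    for col in (1, 2, 4):
        print(col, top_position(top, col))
```

A caller could use the `in_gap` flag to decide when to mark a reported position with '!'. How Tffind itself chooses between the two neighbouring positions is not specified here, so the sketch simply returns both.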
____________________________________________________________

Hits in alignment file 'hmr_alignment_file' within the range of the entire alignment in relation to the sequence that the hit occurs on for the pattern " {gata-1[0.85]: IMD} ", filtering out exon regions provided in the file 'exons_file' (in relation to sequence 1), treating lowercase nucleotides the same as uppercase nucleotides, where at least 1 of the 1 sequences matches in sequence 0.

Seq  From  To  Strand  Score  Pattern
---  ----  --  ------  -----  -------
0    3     8   +       1.00   AGATAA
1    3     8   +       0.94   AGATAG
2    7     12  +       1.00   AGATAA

2    1     6   +       1.00   AGATAA

0    54    59  -       0.93   TGATAG
1    58    63  -       0.94   AGATAG

Total matches found : 3
Total hits filtered out : 1
All positions relative to the sequence that the hit occurs on.

Sequence key:
0.....HumanDNA
1.....MouseDNA
2.....RatDNA

Figure H: Sample Tffind output. This output was obtained with the (batch mode) command:

tffind hmr_alignment_file imd.txt -n gata-1 -min 1 -exons g 1 exons_file

This search will look through the file "hmr_alignment_file" for all occurrences of the GATA-1 binding site as described in the TFBS matrix file "imd.txt" in any sequence in the alignment, filtering out those hits that occur within the genic regions contained in the file "exons_file".
____________________________________________________________
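For downstream processing, the hit table in Figure H can be read with a short script. The following Python sketch is an illustration only and is not part of Tffind. The names `Hit` and `parse_hits` are hypothetical. The sketch assumes that each hit row has exactly six whitespace-separated columns, that the nucleotide pattern contains no spaces, and that a '!' marker, when present, is attached directly to the front of a position.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    seq: int        # sequence number, as listed in the sequence key
    start: int      # "From" column
    end: int        # "To" column
    strand: str     # "+" or "-"
    score: float
    pattern: str    # nucleotides of the hit


def parse_hits(text: str) -> List[Hit]:
    """Collect hit rows from original-format Tffind output (a sketch)."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Hit rows have six columns and a strand symbol in the fourth;
        # this skips the header row, the dashed rule, and summary lines.
        if len(fields) != 6 or fields[3] not in ("+", "-"):
            continue
        try:
            hits.append(Hit(
                int(fields[0]),
                int(fields[1].lstrip("!")),  # assumed form of the gap marker
                int(fields[2].lstrip("!")),
                fields[3],
                float(fields[4]),
                fields[5],
            ))
        except ValueError:
            continue  # not a hit row after all
    return hits


if __name__ == "__main__":
    sample = """Seq From To Strand Score Pattern
--- ---- -- ------ ----- -------
0 3 8 + 1.00 AGATAA
1 3 8 + 0.94 AGATAG

0 54 59 - 0.93 TGATAG
"""
    for hit in parse_hits(sample):
        print(hit)
```

The parser drops the blank lines that separate hits, so it returns one record per table row rather than one per hit. If the grouping matters, the rows could instead be split on blank lines before parsing.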