TopHat

A spliced read mapper for RNA-Seq

  

Related Tools

  • Cufflinks: Isoform assembly and quantitation for RNA-Seq
  • Bowtie: Ultrafast short read alignment

Pre-built indexes

H. sapiens, UCSC hg18 2.7 GB
H. sapiens, UCSC hg19 2.7 GB
M. musculus, UCSC mm9 2.4 GB

All indexes are for assemblies, not contigs. Unplaced or unlocalized sequences and alternate haplotype assemblies are excluded.

Some unzip programs cannot handle archives >2 GB. If you have problems downloading or unzipping a >2 GB index, try downloading in two parts.

Check .zip file integrity with MD5s.

Pre-built indexes are compatible with Bowtie versions 0.9.8 and later. For older indexes, please contact us.

Manual


Please Note: If you have questions about how to use TopHat or would like more information about different parts of the software, please email Cole Trapnell.

What is TopHat?


TopHat is a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie. TopHat runs on Linux and OS X.


What types of reads can I use TopHat with?


TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies. The software has been extended to support longer reads and paired-end reads from the latest Illumina machines, and is optimized for reads 75bp or longer.

Currently, TopHat does not allow short (fewer than a few nucleotides) insertions and deletions in the alignments it reports. Support for insertions and deletions will eventually be added. TopHat also does not natively support Applied Biosystems' Colorspace format.

Finally, current versions of TopHat expect the reads to be the same length, and mixing paired-end and single-end reads in the same run is not supported. If you have applied your own trimming procedure to Illumina reads, or if you are using TopHat with a sequencing technology that produces variable-length reads, please ensure that the reads input to TopHat are the same length. This limitation is an engineering rather than an algorithmic one and will be addressed in a future release.


How does TopHat find junctions?


TopHat finds splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions, and then maps the reads against these junctions to confirm them.

Short read sequencing machines can currently produce reads 100bp or longer, but many exons are shorter than this, and so would be missed in the initial mapping. TopHat solves this problem by splitting all input reads into smaller segments, and then mapping them independently. The segment alignments are "glued" back together in a final step of the program to produce the end-to-end read alignments.
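The splitting step can be pictured with a short Python sketch. This is an illustration, not TopHat's own code: the function name split_read is hypothetical, the 25bp default is taken from the --segment-length option described below, and the rule that leftover bases are appended to the final segment is an assumption consistent with segments being "at least this long".

```python
def split_read(seq, segment_length=25):
    """Cut a read into consecutive (offset, segment) pieces.

    Assumption: bases left over after the last full segment are appended
    to the final segment. TopHat's actual rule may differ.
    """
    # A read shorter than segment_length is kept as a single segment.
    n_segments = max(1, len(seq) // segment_length)
    segments = []
    for i in range(n_segments):
        start = i * segment_length
        # The last segment runs to the end of the read.
        end = len(seq) if i == n_segments - 1 else start + segment_length
        segments.append((start, seq[start:end]))
    return segments


read = "ACGT" * 20  # hypothetical 80bp read
for offset, segment in split_read(read):
    print(offset, len(segment))
```

Each segment keeps its offset within the read, which is what allows independently mapped segments to be joined back into one end-to-end alignment.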

TopHat generates its database of possible splice junctions from three sources of evidence. The first source is pairings of "coverage islands", which are distinct regions of piled-up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. The second source is only used when TopHat is run with paired-end reads. When reads in a pair come from different exons of a transcript, they will generally be mapped far apart in the genome coordinate space. When this happens, TopHat tries to "close" the gap between them by looking for subsequences of the genomic interval between mates with a total length about equal to the expected distance between mates. The "introns" in this subsequence are added to the database. The third, and strongest, source of evidence for a splice junction is when two segments from the same read are mapped far apart, or when an internal segment fails to map. With long (>=75bp) reads, "GT-AG", "GC-AG" and "AT-AC" introns can be found ab initio. With shorter reads, TopHat only reports alignments across "GT-AG" introns.
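The intron motif rule at the end of the paragraph above can be expressed as a small check. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, coordinates are zero-based half-open, and only the forward strand is considered (a minus-strand intron would need its sequence reverse-complemented first).

```python
# Donor/acceptor dinucleotide pairs named in the text.
SHORT_READ_MOTIFS = {("GT", "AG")}
LONG_READ_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def allowed_motifs(read_length):
    """Motifs searched ab initio: all three for reads of 75bp or more."""
    return LONG_READ_MOTIFS if read_length >= 75 else SHORT_READ_MOTIFS


def intron_motif(genome, intron_start, intron_end):
    """Return the first and last two bases of genome[intron_start:intron_end]."""
    intron = genome[intron_start:intron_end]
    return intron[:2], intron[-2:]


def is_candidate_intron(genome, intron_start, intron_end, read_length):
    """True if the intron's motif is one that would be searched for this read length."""
    return intron_motif(genome, intron_start, intron_end) in allowed_motifs(read_length)


# Hypothetical toy sequence: exon, then a GC...AG intron, then exon.
genome = "AAAC" + "GC" + "TTTTTT" + "AG" + "CCCC"
print(intron_motif(genome, 4, 14))
print(is_candidate_intron(genome, 4, 14, read_length=100))
print(is_candidate_intron(genome, 4, 14, read_length=50))
```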


Prerequisites


To use TopHat, you will need the following Bowtie programs in your PATH:

  • bowtie
  • bowtie-inspect
  • bowtie-build

You will also need Python version 2.4 or higher.


Obtaining and installing TopHat


You can download the source release here.

To install TopHat, unpack the tarball and change to the package directory as follows:

tar zxvf tophat-1.0.7.tar.gz
cd tophat-1.0.7/

Now build the package:

./configure --prefix=/path/to/install/directory/
make

Finally, install TopHat:

make install

Below, you will find a detailed list of command-line options you can use to control TopHat. Beginning users should take a look at the Getting started guide for a tutorial on running TopHat.

Please Note: TopHat has a number of parameters and options, and their default values are tuned for processing mammalian RNA-Seq reads. If you would like to use TopHat for another class of organism, we recommend setting some of the parameters to stricter, more conservative values than their defaults. Usually, setting the maximum intron size to 4 or 5 Kb is sufficient to discover most junctions while keeping the number of false positives low.

Using the tophat junction mapper


The following is a detailed description of the options used to control the tophat script:


Usage: tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
When running TopHat with paired ends, it is critical that the *_1 files and the *_2 files appear in separate comma-separated lists, and that the order of the files in the two lists is the same.

NOTE: TopHat can align reads that are up to 1024 bp, and it handles paired-end reads, but we do not recommend mixing several "types" of reads in the same TopHat run. For example, mixing 100bp single-end reads and 2x27bp paired-end reads in the same TopHat run will give bad results. If you'd like to combine results from several "flavors" of RNA-Seq reads, you can run first with one of your sets, and feed the junctions produced by that run into future TopHat runs as externally supplied junctions with the -j option (see below).
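Because the two paired-end file lists must line up, it can help to assemble the command line programmatically. The Python sketch below builds the argument list following the usage line above and refuses lists of different lengths. The function name and the file names are hypothetical, and the only options used are -o and -r, both described under Options below. The command is printed rather than executed.

```python
def build_tophat_command(index_base, left_files, right_files=None,
                         mate_inner_dist=None, output_dir=None):
    """Return an argument list: tophat [options] <index_base> <left> [right]."""
    args = ["tophat"]
    if output_dir is not None:
        args += ["-o", output_dir]
    if right_files:
        if len(left_files) != len(right_files):
            raise ValueError("the *_1 and *_2 lists must have the same length")
        if mate_inner_dist is None:
            raise ValueError("-r is required for paired-end runs")
        args += ["-r", str(mate_inner_dist)]
    args.append(index_base)
    # Files are joined in the order given, so pairs stay aligned by position.
    args.append(",".join(left_files))
    if right_files:
        args.append(",".join(right_files))
    return args


command = build_tophat_command(
    "hg19",                                  # hypothetical index basename
    ["lane1_1.fq", "lane2_1.fq"],            # hypothetical "left" files
    ["lane1_2.fq", "lane2_2.fq"],            # matching "right" files, same order
    mate_inner_dist=200,
)
print(" ".join(command))
```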

Arguments:
<index_base> The basename of the index to be searched. The basename is the name of any of the five index files up to but not including the first period. bowtie first looks in the current directory for the index files, then looks in the indexes subdirectory under the directory where the currently-running bowtie executable is located, then looks in the directory specified in the BOWTIE_INDEXES environment variable.
<reads1_1[,...,readsN_1]> A comma-separated list of files containing reads in FASTQ or FASTA format. When running TopHat with paired-end reads, this should be the *_1 ("left") set of files.
<[reads1_2,...readsN_2]> A comma-separated list of files containing reads in FASTQ or FASTA format. Only used when running TopHat with paired-end reads, and contains the *_2 ("right") set of files. The *_2 files MUST appear in the same order as the *_1 files.
Options:
-h/--help Prints the help message and exits
-v/--version Prints the TopHat version number and exits
-o/--output-dir <string> Sets the name of the directory in which TopHat will write all of its output. The default is "./tophat_out".
-r/--mate-inner-dist <int> This is the expected (mean) inner distance between mate pairs. For example, for paired-end runs with fragments selected at 300bp, where each end is 50bp, you should set -r to be 200. There is no default, and this parameter is required for paired-end runs.
--mate-std-dev <int> The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.
-a/--min-anchor-length <int> The "anchor length". TopHat will report junctions spanned by reads with at least this many bases on each side of the junction. Note that individual spliced alignments may span a junction with fewer than this many bases on one side. However, every junction involved in spliced alignments is supported by at least one read with this many bases on each side. This must be at least 3 and the default is 8.
-m/--splice-mismatches <int> The maximum number of mismatches that may appear in the "anchor" region of a spliced alignment. The default is 0.
-i/--min-intron-length <int> The minimum intron length. TopHat will ignore donor/acceptor pairs closer than this many bases apart. The default is 70.
-I/--max-intron-length <int> The maximum intron length. When searching for junctions ab initio, TopHat will ignore donor/acceptor pairs farther than this many bases apart, except when such a pair is supported by a split segment alignment of a long read. The default is 500000.
--solexa-quals Use the Solexa scale for quality values in FASTQ files.
--solexa1.3-quals As of the Illumina GA pipeline version 1.3, quality scores are Phred-scaled and encoded with an ASCII offset of 64 (Phred+64). Use this option for FASTQ files from pipeline 1.3 or later.
-F/--min-isoform-fraction <0.0-1.0> TopHat filters out junctions supported by too few alignments. Suppose a junction spanning two exons, A and B, is supported by S reads. Let the average depth of coverage of exon A be D, and assume that it is higher than that of exon B. If S / D is less than the minimum isoform fraction, the junction is not reported. A value of zero disables the filter. The default is 0.15.
-p/--num-threads <int> Use this many threads to align reads. The default is 1.
-g/--max-multihits <int> Instructs TopHat to allow up to this many alignments to the reference for a given read, and suppresses all alignments for reads with more than this many alignments. The default is 40.
--no-closure-search Disables the mate pair closure-based search for junctions. Currently has no effect, because closure search is off by default.
--closure-search Enables the mate pair closure-based search for junctions. Closure-based search should only be used when the expected inner distance between mates is small (<= 50bp).
--no-coverage-search Disables the coverage based search for junctions.
--coverage-search Enables the coverage based search for junctions. Use when coverage search is disabled by default (such as for reads 75bp or longer), for maximum sensitivity.
--fill-gaps Long reads may contain highly repetitive segments, which may have too many hits when mapped in the initial pass. This option tells the pipeline to map reads end to end, capturing reads that are unmappable as segments but that are mappable as whole reads.
--microexon-search With this option, the pipeline will attempt to find alignments incident to microexons. Works only for reads 50bp or longer.
--butterfly-search TopHat will use a slower but potentially more sensitive algorithm to find junctions in addition to its standard search. Consider using this if you expect that your experiment produced a lot of reads from pre-mRNA, that fall within the introns of your transcripts.
Advanced Options:
--segment-mismatches Read segments are mapped independently, allowing up to this many mismatches in each segment alignment. The default is 2.
--segment-length Each read is cut up into segments, each at least this long. These segments are mapped independently. The default is 25.
--min-closure-exon During closure search for paired end reads, exonic hops in the potential splice graph must be at least this long. The default is 50.
--min-closure-intron The minimum intron length that may be found during closure search. The default is 50.
--max-closure-intron The maximum intron length that may be found during closure search. The default is 5000.
--min-coverage-intron The minimum intron length that may be found during coverage search. The default is 50.
--max-coverage-intron The maximum intron length that may be found during coverage search. The default is 20000.
--min-segment-intron The minimum intron length that may be found during split-segment search. The default is 50.
--max-segment-intron The maximum intron length that may be found during split-segment search. The default is 500000.
--keep-tmp Causes TopHat to preserve its intermediate files produced during the run. By default, they are deleted upon exit.
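
Two of the options listed above involve a small calculation: the value to pass to -r and the test applied by -F. The Python sketch below restates both as described in their entries. The function names are hypothetical, and the handling of zero coverage is an assumption, since the text does not cover that case.

```python
def mate_inner_distance(fragment_length, read_length):
    """Value for -r: the fragment length minus the two sequenced ends."""
    return fragment_length - 2 * read_length


def junction_is_reported(supporting_reads, depth_exon_a, depth_exon_b,
                         min_isoform_fraction=0.15):
    """Apply the -F rule: drop the junction if S / D is below the threshold.

    D is the average depth of the more deeply covered of the two exons.
    """
    if min_isoform_fraction == 0:
        return True  # a value of zero disables the filter
    depth = max(depth_exon_a, depth_exon_b)
    if depth == 0:
        return True  # assumption: with no coverage there is nothing to compare
    return supporting_reads / depth >= min_isoform_fraction


# The -r example from the option list: 300bp fragments, 50bp ends.
print(mate_inner_distance(300, 50))
# 10 supporting reads against a depth of 100 gives 0.10, below 0.15.
print(junction_is_reported(10, 100, 40))
# 20 supporting reads gives 0.20, so the junction is kept.
print(junction_is_reported(20, 100, 40))
```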

Supplying your own junctions:

The options below allow you to validate your own junctions with your RNA-Seq data. Note that the chromosome names in the files provided with the options below must match the names in the Bowtie index. These names are case-sensitive.

-G/--GFF <GFF3 file>

Supply TopHat with a list of gene model annotations. TopHat will use the gene, mRNA and exon records in this file to build a set of known splice junctions for each gene, and will attempt to align reads to these junctions even if they would not normally be covered by the initial mapping.

-j/--raw-juncs <.juncs file>

Supply TopHat with a list of raw junctions. Junctions are specified one per line, in a tab-delimited format. Records look like:


<chrom> <left> <right> <+/->

left and right are zero-based coordinates, and specify the last character of the left sequence to be spliced to the first character of the right sequence, inclusive.
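
As an illustration of the coordinate convention, the Python sketch below writes one .juncs record from a pair of exon boundaries given in one-based inclusive coordinates, the convention used by GFF3. The function name and the example coordinates are hypothetical.

```python
def junction_record(chrom, left_exon_end, right_exon_start, strand):
    """Format one tab-delimited .juncs line.

    Inputs are one-based inclusive positions: the last base of the left exon
    and the first base of the right exon. The output uses zero-based positions.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    left = left_exon_end - 1        # last base of the left sequence, zero-based
    right = right_exon_start - 1    # first base of the right sequence, zero-based
    if left >= right:
        raise ValueError("left must come before right")
    return "\t".join([chrom, str(left), str(right), strand])


# Left exon ends at base 1000, right exon starts at base 2001 (one-based).
print(junction_record("chr1", 1000, 2001, "+"))
```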

--no-novel-juncs Only look for junctions indicated in the supplied GFF file. (ignored without -G)

Providing TopHat with an annotation file

If you choose to supply TopHat with a GFF3 file of gene annotation, the program will look for the junctions between exons in the annotated transcripts. Note that all elements must have ID tags, and all elements except genes must have Parent tags. Transcript records are called "mRNA", not "transcript". Most importantly, the values in the first column, which indicates the chromosome or contig on which the feature is located, must match a reference sequence record in the Bowtie index you are using with TopHat. You can get a list of the records in a Bowtie index by typing:


bowtie-inspect --names your_index
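
Once the names are listed, they can be compared against the first column of the annotation file. The Python sketch below does this for in-memory lines. The function name and the sample data are hypothetical, and in practice index_names would hold the lines printed by the command above. The comparison is exact, so a name differing only in case is reported as unmatched.

```python
def unmatched_seqids(gff_lines, index_names):
    """Return sorted first-column values of the GFF3 lines absent from the index."""
    index_names = set(index_names)
    missing = set()
    for line in gff_lines:
        # Skip blank lines and GFF3 comment/directive lines.
        if not line.strip() or line.startswith("#"):
            continue
        seqid = line.split("\t", 1)[0]
        if seqid not in index_names:
            missing.add(seqid)
    return sorted(missing)


index_names = ["chr1", "chr2"]  # hypothetical names from the index
gff_lines = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1",
    "Chr2\tsrc\tgene\t100\t900\t.\t+\t.\tID=g2",  # differs from "chr2" in case
]
print(unmatched_seqids(gff_lines, index_names))
```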


TopHat Output


The tophat script produces a number of files in its output directory ("./tophat_out" by default; see the -o option above). Most of these files are internal, intermediate files that are generated for use within the pipeline. The output files you will likely want to look at are:


  1. accepted_hits.sam. A list of read alignments in SAM format. SAM is a compact short read alignment format that is increasingly being adopted. The formal specification is here.
  2. coverage.wig. A UCSC BedGraph wigglegram track, showing the depth of coverage at each position, including the spliced read alignments.
  3. junctions.bed. A UCSC BED track of junctions reported by TopHat. Each junction consists of two connected BED blocks, where each block is as long as the maximal overhang of any read spanning the junction. The score is the number of alignments spanning the junction.
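
The junction track can be read back with a few lines of Python. The sketch below assumes that junctions.bed follows the standard 12-column UCSC BED layout, with the score in column 5 and the block sizes and block starts in columns 11 and 12. The function names and example lines are hypothetical. The intron is taken as the gap between the two blocks, in zero-based half-open coordinates.

```python
def parse_junction_line(line):
    """Extract the intron and score from one two-block BED line."""
    fields = line.rstrip("\n").split("\t")
    start = int(fields[1])
    # Block lists may carry a trailing comma, so empty items are dropped.
    block_sizes = [int(x) for x in fields[10].split(",") if x]
    block_starts = [int(x) for x in fields[11].split(",") if x]
    return {
        "chrom": fields[0],
        "strand": fields[5],
        "score": int(fields[4]),                 # number of spanning alignments
        "intron_start": start + block_sizes[0],  # first base after the left block
        "intron_end": start + block_starts[1],   # first base of the right block
    }


def read_junctions(lines, min_score=1):
    """Yield parsed junctions with at least min_score spanning alignments."""
    for line in lines:
        if not line.strip() or line.startswith(("track", "#")):
            continue
        junction = parse_junction_line(line)
        if junction["score"] >= min_score:
            yield junction


example = [
    "track name=junctions",
    "chr1\t100\t400\tJUNC00000001\t12\t+\t100\t400\t255,0,0\t2\t30,40\t0,260",
    "chr1\t500\t900\tJUNC00000002\t1\t-\t500\t900\t255,0,0\t2\t20,25\t0,375",
]
for junction in read_junctions(example, min_score=2):
    print(junction["chrom"], junction["intron_start"], junction["intron_end"], junction["score"])
```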