News and updates
New releases and related tools will be announced through the Bowtie mailing list.
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 2009, 10:R25.
Kim D and Salzberg SL. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biology 2011, 12:R72.
Kim D, Pertea G, Trapnell C, Pimentel H, Kelley R, Salzberg SL. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology 2013, 14:R36.
What is TopHat?
TopHat is a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie. TopHat runs on Linux and OS X.
What types of reads can I use TopHat with?
TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies. In TopHat 1.1.0, we began supporting Applied Biosystems' Colorspace format. The software is optimized for reads 75bp or longer.
How does TopHat find junctions?
TopHat can find splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping information, TopHat builds a database of possible splice junctions and then maps the reads against these junctions to confirm them.
Short read sequencing machines can currently produce reads 100bp or longer, but many exons are shorter than this and so would be missed in the initial mapping. TopHat solves this problem mainly by splitting all input reads into smaller segments, which are then mapped independently. The segment alignments are put back together in a final step of the program to produce the end-to-end read alignments.
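The segment-splitting step described above can be sketched in a few lines of Python. This is only an illustration, not TopHat's actual code: the function name split_into_segments is hypothetical, the 25bp segment length is an assumed value, and attaching leftover bases to the last segment is an assumption made to keep the example simple.

```python
def split_into_segments(read, segment_length=25):
    """Cut a read into consecutive (offset, sequence) segments.

    Assumption for this sketch: bases left over after the last full
    segment are attached to the final segment rather than dropped.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    n_segments = max(1, len(read) // segment_length)
    segments = []
    for i in range(n_segments):
        start = i * segment_length
        # The final segment runs to the end of the read.
        end = len(read) if i == n_segments - 1 else start + segment_length
        segments.append((start, read[start:end]))
    return segments


read = "ACGT" * 25  # hypothetical 100bp read
segments = split_into_segments(read)
```

Each segment keeps its offset within the read, which is what allows independently mapped segments to be stitched back into an end-to-end alignment.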
TopHat generates its database of possible splice junctions from two sources of evidence. The first and strongest source of evidence for a splice junction is when two segments from the same read (for reads of at least 45bp) are mapped at a certain distance on the same genomic sequence, or when an internal segment fails to map, again suggesting that such reads span multiple exons. With this approach, "GT-AG", "GC-AG" and "AT-AC" introns will be found ab initio. The second source is pairings of "coverage islands", which are distinct regions of piled-up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron. We recommend this second option (--coverage-search) only for short reads (< 45bp) and a small number of reads (<= 10 million). This latter option will only report alignments across "GT-AG" introns.
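To make the intron motifs above concrete, the following Python sketch reads the donor and acceptor dinucleotides of a candidate intron and checks them against the three motifs named in the text. It is an illustration only: the function names and the toy genome string are hypothetical, and the sketch looks at the forward strand only, which is a simplification.

```python
# The three donor/acceptor pairs named in the text above.
SUPPORTED_MOTIFS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_motif(genome, intron_start, intron_end):
    """Return (donor, acceptor) for the intron genome[intron_start:intron_end].

    Coordinates are 0-based, end-exclusive, forward strand only
    (an assumption made for this sketch).
    """
    intron = genome[intron_start:intron_end].upper()
    if len(intron) < 4:
        raise ValueError("intron too short to carry both dinucleotides")
    return intron[:2], intron[-2:]


def is_supported_motif(genome, intron_start, intron_end):
    """True if the candidate intron starts and ends with a listed motif."""
    return intron_motif(genome, intron_start, intron_end) in SUPPORTED_MOTIFS


# Hypothetical toy sequence: exon, intron (GT...AG), exon.
genome = "CCCAAG" + "GTAAGTTTTTTTCAG" + "GGATCC"
motif = intron_motif(genome, 6, 21)
```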
To use TopHat, you will need the following programs in your PATH:
Because TopHat outputs and handles alignments in BAM format, you will need to download and install the SAM tools. You may want to take a look at the Getting started guide for more detailed installation instructions, including installation of SAM tools and Boost.
You will also need Python version 2.6 or higher.
Obtaining and installing TopHat
You can download the latest source release and precompiled binaries for Linux and Mac OS X here. See the Getting started
guide for detailed instructions about installing TopHat from the binary
package or building TopHat and its dependencies from source.
To install TopHat from source package, unpack the tarball and change directory to the package directory as follows:
tar zxvf tophat-2.0.0.tar.gz
cd tophat-2.0.0
Configure the package, specifying the install path and the library dependencies as needed (see the Getting started guide for details):
./configure --prefix=<install_prefix> --with-boost=<boost_install_prefix> --with-bam=<samtools_install_prefix>
Finally, build and install TopHat:
make
make install
As detailed in the Getting started guide, if you want to install TopHat 2 without overwriting a previous version of TopHat already installed on your system, specify a new, separate <install_prefix> for the ./configure command above. After the 'make install' step, copy the tophat2 script from <install_prefix>/bin to a directory that is in your shell's PATH, so that you can invoke the new version of TopHat with the command 'tophat2'.
Below you will find a detailed list of command-line options you can
use to control TopHat. Beginning users should take a look at the
Getting started guide for a tutorial on
installing and running TopHat and its prerequisites.
Usage: tophat [options]* <genome_index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]
When running TopHat with paired reads it is critical that the *_1 files and the *_2
files appear in separate comma-delimited lists, and that the order of the files in the two lists is the same.
tophat [options]* <genome_index_base> PE_reads_1.fq.gz,SE_reads.fa PE_reads_2.fq.gz
Starting with version 2.0.10, TopHat accepts mixed input file formats (FASTA/FASTQ).
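Because the order of the two file lists matters, it can help to build them from explicit (mate 1, mate 2) pairs rather than by hand. The Python sketch below does this; the helper name build_read_arguments is hypothetical, and placing unpaired files after the paired ones in the first list is an assumption based on the example command above.

```python
def build_read_arguments(paired_files, unpaired_files=()):
    """Build the comma-delimited read arguments for a tophat command line.

    paired_files   -- sequence of (mate1_file, mate2_file) tuples
    unpaired_files -- optional sequence of single-end read files

    Keeping each pair in one tuple guarantees that the two lists
    stay in the same order.
    """
    left = [mate1 for mate1, _ in paired_files] + list(unpaired_files)
    right = [mate2 for _, mate2 in paired_files]
    if not left:
        raise ValueError("at least one reads file is required")
    args = [",".join(left)]
    if right:
        args.append(",".join(right))
    return args


read_args = build_read_arguments(
    [("PE_reads_1.fq.gz", "PE_reads_2.fq.gz")],
    ["SE_reads.fa"],
)
```

The returned list can be appended to the options and genome index base when assembling the full command.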
NOTE: TopHat can align reads that are up to 1024 bp long,
and it handles paired-end reads and unpaired reads at once, but we do not recommend mixing different types of reads in the same TopHat run. For example,
mixing 100bp single end reads and 2x27bp paired reads in the same TopHat run may give sub-optimal results. If you'd like to combine
results from data sets with different types of RNA-Seq reads, you can follow a protocol like this:
The following is a detailed description of the options used to control the TopHat script.
Bowtie 2 specific options:
Bowtie 2 provides many options so that users can have more flexibility as to how reads are mapped. TopHat 2 allows users to pass many of these options to Bowtie 2 by preceding the Bowtie 2 option name with the --b2- prefix. Please refer to the Bowtie2 website for detailed information.
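As a small illustration of the naming rule, the Python sketch below turns a Bowtie 2 option name into its --b2- prefixed form. The helper name is hypothetical, and the sketch only rewrites the name: it does not check whether TopHat actually accepts the resulting option, so the option list in this manual remains the authority.

```python
def to_tophat_b2_option(bowtie2_option):
    """Rewrite a Bowtie 2 option name with TopHat's --b2- prefix.

    The leading dashes of the Bowtie 2 name are replaced by '--b2-'.
    This does not validate that TopHat supports the option.
    """
    name = bowtie2_option.lstrip("-")
    if not name:
        raise ValueError("empty option name")
    return "--b2-" + name


preset = to_tophat_b2_option("--very-sensitive")
```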
Reads can be aligned to potential fusion transcripts if the --fusion-search option is specified. The fusion alignments are reported in SAM format using custom fields XF and XP (see the output format) and some additional information about fusions will be reported (see fusions.out). Once mapping is done, you can run tophat-fusion-post to filter out fusion transcripts (see the TopHat-Fusion website for more details).
The options below allow you to validate your own list of known transcripts or junctions with your RNA-Seq data. Note that the chromosome names in the files provided with the options below must match the names in the Bowtie index. These names are case-sensitive.
The options below allow you to validate your own indels with your RNA-Seq data. Note that the chromosome names in the files provided with the options below must match the names in the Bowtie index. These names are case-sensitive.
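A mismatch in chromosome names is easy to catch before starting a run. The Python sketch below lists names that appear in an annotation or junction file but not in the index. It assumes the chromosome name is the first tab-separated column of each record, and that the index names have already been collected into a list (for instance from the output of bowtie2-inspect); the function name and sample records are hypothetical.

```python
def unmatched_chromosomes(annotation_lines, index_names):
    """Return sorted chromosome names used in the file but absent from the index.

    The comparison is exact, so 'Chr2' and 'chr2' count as different names.
    Assumes the chromosome name is the first tab-separated column.
    """
    known = set(index_names)
    missing = set()
    for line in annotation_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        chrom = line.split("\t", 1)[0]
        if chrom not in known:
            missing.add(chrom)
    return sorted(missing)


# Hypothetical records: the second one differs from the index only by case.
records = [
    "chr1\tsrc\texon\t1\t10",
    "Chr2\tsrc\texon\t5\t9",
    "# comment line",
]
problems = unmatched_chromosomes(records, ["chr1", "chr2"])
```

An empty result means every name in the file was found in the index.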
The tophat script produces a number of files in the directory in which it was invoked. Most of these files are internal, intermediate files that are generated for use within the pipeline. The output files you will likely want to look at are: