Releases
| TopHat 1.0.12 (BETA) | 10/28/09 |
Pre-built indexes
| H. sapiens, UCSC hg18 | 2.7 GB |
| H. sapiens, UCSC hg19 | 2.7 GB |
| M. musculus, UCSC mm9 | 2.4 GB |
All indexes are for assemblies, not contigs. Unplaced or unlocalized sequences and alternate haplotype assemblies are excluded.
Some unzip programs cannot handle archives >2 GB. If you have problems downloading or unzipping a >2 GB index, try downloading in two parts.
Check .zip file integrity with MD5s.
Pre-built indexes are compatible with Bowtie versions 0.9.8 and later. For older indexes, please contact us.
Publications
Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics doi:10.1093/bioinformatics/btp120
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10:R25.
Manual
What is TopHat?
TopHat is a program that aligns RNA-Seq reads to a genome in order to identify exon-exon splice junctions. It is built on the ultrafast short read mapping program Bowtie. TopHat runs on Linux and OS X.

What types of reads can I use TopHat with?
TopHat was designed to work with reads produced by the Illumina Genome Analyzer, although users have been successful in using TopHat with reads from other technologies. The software has been extended to support longer reads and paired-end reads from the latest Illumina machines, and is optimized for reads 75bp or longer.

Currently, TopHat does not allow short insertions and deletions (a few nucleotides long) in the alignments it reports. Support for insertions and deletions will eventually be added. TopHat also does not natively support Applied Biosystems' Colorspace format.

Finally, current versions of TopHat expect the reads to be the same length, and mixing paired-end and single-end reads in the same run is not supported. If you have applied your own trimming procedure to Illumina reads, or if you are using TopHat with a sequencing technology that produces variable-length reads, please ensure that the reads input to TopHat are the same length. This limitation is an engineering rather than an algorithmic one and will be addressed in a future release.

How does TopHat find junctions?
TopHat finds splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions, and then maps the reads against these junctions to confirm them.

Short read sequencing machines can currently produce reads 100bp or longer, but many exons are shorter than this, and so would be missed in the initial mapping. TopHat solves this problem by splitting all input reads into smaller segments, and then mapping them independently.
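The splitting step can be illustrated with a minimal Python sketch. This is not TopHat's own code: the function name `split_read`, the 25bp segment length, and the handling of a shorter final segment are all assumptions made for illustration. It only shows the bookkeeping of cutting a read into consecutive segments while remembering each segment's offset, so that per-segment alignments could later be related back to the full read.

```python
def split_read(read, segment_length=25):
    """Split a read into consecutive, non-overlapping segments.

    segment_length=25 is an illustrative assumption, not a value taken
    from this manual. Returns a list of (offset, segment) pairs; the last
    segment may be shorter than segment_length.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [(start, read[start:start + segment_length])
            for start in range(0, len(read), segment_length)]


read = "ACGT" * 19  # a hypothetical 76bp read
segments = split_read(read)

# The segments cover the read exactly once, in order.
assert "".join(segment for _, segment in segments) == read
for offset, segment in segments:
    print(offset, len(segment))
```

Each segment would then be mapped on its own, which is how a read that spans a short exon can still contribute evidence.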
The segment alignments are "glued" back together in a final step of the program to produce the end-to-end read alignments.

TopHat generates its database of possible splice junctions from three sources of evidence. The first source is pairings of "coverage islands", which are distinct regions of piled up reads in the initial mapping. Neighboring islands are often spliced together in the transcriptome, so TopHat looks for ways to join these with an intron.

The second source is only used when TopHat is run with paired-end reads. When reads in a pair come from different exons of a transcript, they will generally be mapped far apart in the genome coordinate space. When this happens, TopHat tries to "close" the gap between them by looking for subsequences of the genomic interval between mates with a total length about equal to the expected distance between mates. The "introns" in this subsequence are added to the database.

The third, and strongest, source of evidence for a splice junction is when two segments from the same read are mapped far apart, or when an internal segment fails to map. With long (>=75bp) reads, "GT-AG", "GC-AG" and "AT-AC" introns can be found ab initio. With shorter reads, TopHat only reports alignments across "GT-AG" introns.

Prerequisites
To use TopHat, you will need the Bowtie programs in your PATH.
You will also need Python version 2.4 or higher.

Obtaining and installing TopHat
You can download the source release here. To install TopHat, unpack the tarball and change to the package directory as follows:

    tar zxvf tophat-1.0.7.tar.gz
    cd tophat-1.0.7/

Now build the package:

    ./configure --prefix=/path/to/install/directory/
    make

Finally, install TopHat:

    make install

Below, you will find a detailed list of command-line options you can use to control TopHat. Beginning users should take a look at the Getting started guide for a tutorial on running TopHat.
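Before a first run, it can help to confirm that the prerequisites are actually visible from your shell. The sketch below is one possible check, not part of TopHat. The program list is an assumption: `bowtie` and `bowtie-inspect` are named in this manual, and `bowtie-build` is added as another program distributed with Bowtie. It is written for a current Python 3 interpreter (it relies on `shutil.which`), and the version check only inspects the interpreter running this script, which may differ from the one TopHat uses.

```python
import shutil
import sys

# Assumed program names; adjust to match your Bowtie installation.
REQUIRED_PROGRAMS = ("bowtie", "bowtie-inspect", "bowtie-build")


def missing_programs(programs=REQUIRED_PROGRAMS):
    """Return the programs that cannot be found on the current PATH."""
    return [name for name in programs if shutil.which(name) is None]


def python_is_new_enough(minimum=(2, 4)):
    """Compare the running interpreter with the minimum version stated above."""
    return sys.version_info[:2] >= minimum


if __name__ == "__main__":
    missing = missing_programs()
    if missing:
        print("Not found in PATH: " + ", ".join(missing))
    else:
        print("All listed Bowtie programs were found in PATH")
    print("Python >= 2.4: %s" % python_is_new_enough())
```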
Using the tophat junction mapper
The following is a detailed description of the options used to control the tophat script:

Usage: tophat [options]* <index_base> <reads1_1[,...,readsN_1]> [reads1_2,...readsN_2]

When running TopHat with paired ends, it is critical that the *_1 files and the *_2 files appear in separate comma-separated lists, and that the order of the files in the two lists is the same.

NOTE: TopHat can align reads that are up to 1024 bp, and it handles paired-end reads, but we do not recommend mixing several "types" of reads in the same TopHat run. For example, mixing 100bp single-end reads and 2x27bp paired ends into the same TopHat run will give bad results. If you'd like to combine results from several "flavors" of RNA-Seq reads, you can run first with one of your sets, and feed the junctions produced by that run into future TopHat runs as externally supplied junctions with the -j option (see below).
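The ordering requirement for paired-end files is easy to get wrong when typing long file lists by hand. As a hedged illustration, the Python helper below assembles an argument list that follows the usage line above. The helper name `build_tophat_command` and all file names are hypothetical, and the function only builds the command; it does not run TopHat.

```python
def build_tophat_command(index_base, left_reads, right_reads=None, options=()):
    """Build a tophat argument list following the usage line in this manual.

    left_reads holds the *_1 files and right_reads the matching *_2 files,
    in the same order. Each list becomes one comma-separated argument.
    """
    left_reads = list(left_reads)
    if not left_reads:
        raise ValueError("at least one reads file is required")
    command = ["tophat"] + list(options) + [index_base, ",".join(left_reads)]
    if right_reads is not None:
        right_reads = list(right_reads)
        if len(right_reads) != len(left_reads):
            raise ValueError("the *_1 and *_2 lists must have the same length")
        command.append(",".join(right_reads))
    return command


# Hypothetical file names for two paired-end lanes.
cmd = build_tophat_command(
    "your_index",
    ["lane1_1.fq", "lane2_1.fq"],
    ["lane1_2.fq", "lane2_2.fq"],
)
print(" ".join(cmd))
# prints: tophat your_index lane1_1.fq,lane2_1.fq lane1_2.fq,lane2_2.fq
```

Keeping the mates as two parallel lists makes a length mismatch detectable, although it cannot detect two files listed in the wrong order.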
Supplying your own junctions: The options below allow you to validate your own junctions with your RNA-Seq data. Note that the chromosome names in the files provided with the options below must match the names in the Bowtie index. These names are case-sensitive.
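Because the match is case-sensitive, a name such as "Chr2" will not match an index record called "chr2". The following Python sketch shows one way such mismatches could be caught before a run. It assumes the chromosome names from your file and the record names of the index are already available as lists; the function name `check_reference_names` and the example names are hypothetical.

```python
def check_reference_names(file_names, index_names):
    """Classify names as exact matches, case-only mismatches, or unknown."""
    index_set = set(index_names)
    by_lower = {}
    for name in index_names:
        # Remember one index name per lowercase spelling for suggestions.
        by_lower.setdefault(name.lower(), name)

    report = {"ok": [], "case_mismatch": [], "unknown": []}
    for name in sorted(set(file_names)):
        if name in index_set:
            report["ok"].append(name)
        elif name.lower() in by_lower:
            report["case_mismatch"].append((name, by_lower[name.lower()]))
        else:
            report["unknown"].append(name)
    return report


index_names = ["chr1", "chr2", "chrX"]   # hypothetical index records
junction_names = ["chr1", "Chr2", "2"]   # hypothetical names from a junction file
report = check_reference_names(junction_names, index_names)
print(report["case_mismatch"])  # names that differ from an index record only by case
print(report["unknown"])        # names with no counterpart in the index
```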
If you choose to supply TopHat with a GFF3 file of gene annotation, the program will look for the junctions between exons in the annotated transcripts. Note that all elements must have ID tags, and all elements except genes must have Parent tags. Transcript records are called "mRNA", not "transcript". Most importantly, the values in the first column, which indicates the chromosome or contig on which the feature is located, must match a reference sequence record in the Bowtie index you are using with TopHat. You can get a list of the records in a Bowtie index by typing:

    bowtie-inspect --names your_index

TopHat Output
The tophat script produces a number of files in the directory in which it was invoked. Most of these files are internal, intermediate files that are generated for use within the pipeline. The output files you will likely want to look at are:
