TopHat

A spliced read mapper for RNA-Seq

  

TopHat is a fast splice junction mapper for RNA-Seq reads. It aligns RNA-Seq reads to mammalian-sized genomes using the ultra high-throughput short read aligner Bowtie, and then analyzes the mapping results to identify splice junctions between exons.

TopHat is a collaborative effort between the University of Maryland Center for Bioinformatics and Computational Biology and the University of California, Berkeley Departments of Mathematics and Molecular and Cell Biology.
Open Source Software

Related Tools

  • Cufflinks: Isoform assembly and quantitation for RNA-Seq
  • Bowtie: Ultrafast short read alignment

Pre-built indexes

H. sapiens, UCSC hg18 2.7 GB
H. sapiens, UCSC hg19 2.7 GB
M. musculus, UCSC mm9 2.4 GB

All indexes are for assemblies, not contigs. Unplaced or unlocalized sequences and alternate haplotype assemblies are excluded.

Some unzip programs cannot handle archives >2 GB. If you have problems downloading or unzipping a >2 GB index, try downloading in two parts.

Check .zip file integrity with MD5s.

Pre-built indexes are compatible with Bowtie versions 0.9.8 and later. For older indexes, please contact us.


Bowtie index update 11/14/2009

In response to user requests, Ben Langmead was kind enough to rebuild the Bowtie indexes for human and mouse from UCSC assembly FASTA files. Because UCSC FASTA files have simple record names, such as "chrX", TopHat runs against them are easier to visualize with the UCSC genome browser or the Integrative Genomics Viewer. We recommend that users who need indexes other than human or mouse build them from UCSC FASTA files.


TopHat 1.0.12 (BETA) release 10/28/2009

This release includes both critical fixes and new features, including:

  • A serious bug introduced in 1.0.11 that hurt sensitivity for short (~36bp) reads has been fixed.
  • TopHat now automatically deletes the intermediate files it produces during each run, which can be very large. You can preserve them by specifying --keep-tmp at the beginning of a run.
  • A new optional search algorithm for short (~36bp) reads, designed to improve junction detection sensitivity, is now available with --butterfly-search.
  • TopHat no longer calculates gene expression. Users interested in expression calculations should consider using Cufflinks for gene- and isoform-level expression calculations.
  • Numerous performance enhancements and reductions in memory usage. For reads 75bp or longer, memory usage is dramatically lower, and should scale much better for runs with hundreds of millions of reads.
  • The manual has been updated to better describe the types of reads TopHat expects. The manual previously stated, incorrectly, that TopHat doesn't look for "GC-AG" and "AT-AC" introns; this has been corrected.


IMPORTANT - Bowtie update 10/13/2009

Until recently, there was a bug that could cause TopHat to report no alignments or junctions with some Bowtie indexes (including some indexes downloadable from this site). All users are strongly encouraged to upgrade to Bowtie 0.11.0 or later, and the next update to TopHat will force this upgrade.


TopHat 1.0.11 (BETA) and Cufflinks release 9/26/2009

We're pleased to announce the release of a sister tool to TopHat, called Cufflinks. TopHat aligns your RNA-Seq reads; Cufflinks assembles those alignments into transcripts and also calculates isoform and gene level expression in your samples.

This TopHat release contains a number of stability improvements, fixes, and some substantial performance increases. The disk footprint is also reduced, though it's still large, and further reductions are coming in future releases.

We advise all users to adopt Cufflinks to compute expression values. Cufflinks contains a sophisticated algorithm for this calculation that is far more accurate than TopHat's method. In an upcoming release, the RPKM calculation in TopHat will be removed to simplify maintenance.


1.0.10 (BETA) release 7/30/2009

This is a fix release. Notable changes:

  • More SAM compliance fixes.
  • Reduced the frequency of certain types of false junctions through improved spliced alignment filtering.


Minor update to 1.0.9 7/10/2009

Version 1.0.9 of TopHat, released on 7/8/2009, had an incorrect default value for --max-intron. The default is now 500,000, as intended.


1.0.9 (BETA) release - 7/8/2009

This release includes both fixes and new features. This upgrade requires Bowtie 0.10.0.0 or later. Other changes include:

  • Substantially improved sensitivity for reads shorter than 75bp
  • An optional "gap-filling" phase to map multireads from transcribed repeats
  • Fixed some SAM compliance issues
  • Optional (limited) search for alignments that involve microexons
  • Complex index record names no longer crash the pipeline.
  • The command line options have been overhauled, and the meaning of the -a/--min-anchor option has changed. Please see the manual for further details.
  • Closure search is now off by default for all read types
  • Coverage search is off by default for reads 75bp or longer
  • Previous versions could report spliced alignments with gaps longer than --max-intron, if any were found. The --max-intron and --min-intron limits are now strictly enforced.
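
To illustrate what strict enforcement of the intron limits means for SAM output, here is a minimal Python sketch that checks the gaps in a spliced alignment against a minimum and maximum intron length. It relies on the SAM convention that an N operation in the CIGAR string marks a skipped reference region (an intron). The function names are hypothetical and this is not TopHat's own filtering code; the limits are supplied by the caller.

```python
import re

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def intron_lengths(cigar):
    """Return the lengths of all N (skipped region) operations in a CIGAR string."""
    return [int(length) for length, op in CIGAR_OP.findall(cigar) if op == "N"]


def within_intron_limits(cigar, min_intron, max_intron):
    """True if every spliced gap lies within [min_intron, max_intron].

    Unspliced alignments have no N operations and pass trivially.
    """
    return all(min_intron <= gap <= max_intron for gap in intron_lengths(cigar))
```

For example, with a maximum of 500,000 (the intended 1.0.9 default) and an illustrative minimum of 50, the alignment "30M1200N45M" passes, while "30M600000N45M" is rejected.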

Bowtie updated to 0.10.0.0

IMPORTANT: TopHat 1.0.8 is incompatible with Bowtie 0.10.0.0, which was released this week. While the release of TopHat 1.0.9, which is imminent, will fix the incompatibilities, users are encouraged to stick with Bowtie 0.9.9.3 for now.


1.0.8 (BETA) release - 5/25/2009

This is mostly a fix release, but all users are encouraged to upgrade, as some of the bugs fixed were fairly major. Other notable improvements include:

  • If you have reads 50bp or longer, TopHat will look for GC-AG and AT-AC introns
  • Logging has been improved
  • Fewer false positives in gene families with tandem copies

Known issues:

  • Some users have reported pipeline crashes when using Bowtie indexes with long or complex record names. This will be fixed in the next release, but for now, using an index with simple names (no spaces or pipes) is a workaround. We recommend names like "chr12" to avoid problems.
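
One way to check a reference before building an index from it is to scan the FASTA headers for spaces or pipes. The Python sketch below does this. It assumes that the entire header line after ">" is used as the record name, and the function names are hypothetical.

```python
def record_names(fasta_lines):
    """Yield the record name from each FASTA header line (text after '>')."""
    for line in fasta_lines:
        if line.startswith(">"):
            yield line[1:].rstrip("\r\n")


def complex_record_names(fasta_lines):
    """Return the record names that contain a space or a pipe character."""
    return [name for name in record_names(fasta_lines)
            if " " in name or "|" in name]
```

Passing an open file handle works because iterating over it yields lines; an empty result means every record name is simple.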


1.0 (BETA) release - 5/4/2009

TopHat has been almost entirely redesigned and rewritten to handle "second-generation" RNA-Seq data. Reads longer than 50bp and paired end reads are substantially more powerful for finding splice junctions, and TopHat needed new algorithms to take advantage of them. While this release should be considered a beta, and still contains bugs, it has been under development for several months and has been tested by several groups on both first- and second-generation RNA-Seq data in multiple organisms. Longer and/or paired end reads provide a dramatic leap in sensitivity and specificity. Notable improvements include:

  • Paired-end RNA-Seq read support
  • Long read support
  • Improved SAM output
  • No longer depends on Maq
  • Mismatches near splicing anchors now allowed
  • Much more of the pipeline is multithreaded, yielding a massive performance boost
  • Compiles under GCC 4.3


TopHat paper published - 3/16/2009

Our paper on discovering splice junctions has appeared in Bioinformatics.


0.8.3 release - 3/12/2009

This release contains the following enhancements and fixes:

  • Reporting now has a smaller memory footprint
  • A possible source of erroneous alignments due to hashing collisions has been eliminated
  • The install scripts now correctly detect whether to build TopHat with 64-bit compiler flags.


TopHat paper accepted - 3/1/2009

Our paper on discovering splice junctions has been accepted at Bioinformatics, and should appear soon.


0.8.2 release - 3/1/2009

This release contains the following enhancements and fixes:

  • TopHat now reports the alignments it finds in the SAM format. The SAM tools were written primarily by Heng Li at Sanger, and will allow TopHat users to call expressed SNPs from their RNA-Seq reads. The SAM tools themselves are still under development, so TopHat's SAM support should be considered experimental.
  • You can now specify a list of junctions for TopHat to check in a raw format, without using a GFF file of genes
  • The new -o option allows you to change where TopHat puts its output, instead of always writing to "./tophat_out"


0.8.1 release - 1/30/2009

This release contains the following enhancements and fixes:

  • New experimental support for user-supplied annotations. TopHat will accept a GFF file, and will look for junctions contained in the GFF file. TopHat will also perform a basic RPKM calculation on the regions in the annotation, normalized to those annotations only (rather than the whole map). The file must contain "gene", "exon" and "mRNA" records, in the normal record ID, Parent hierarchy. Users are encouraged to treat GFF support as unstable and interpret their results with caution.
  • Several minor bugfixes.
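
For readers unfamiliar with RPKM (reads per kilobase of region per million mapped reads), the following Python sketch shows the standard formula, with the total taken over the annotated regions only, as described above. It illustrates the general definition under that assumption and is not TopHat's actual implementation. The function names and the input layout are hypothetical.

```python
def rpkm(read_count, region_length_bp, total_reads):
    """Standard RPKM: reads per kilobase of region per million counted reads."""
    if region_length_bp <= 0 or total_reads <= 0:
        raise ValueError("region length and total read count must be positive")
    return read_count * 1e9 / (region_length_bp * total_reads)


def annotation_rpkm(regions):
    """Compute RPKM per region, normalizing to reads in annotated regions only.

    `regions` maps a region name to a (read_count, length_in_bp) pair.
    """
    total = sum(count for count, _ in regions.values())
    return {name: rpkm(count, length, total)
            for name, (count, length) in regions.items()}
```

Because the denominator counts only reads that fall in annotated regions, these values are not directly comparable to RPKM values normalized to all mapped reads.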

TopHat 0.8.1 uses some code kindly provided by Robert Bradley. The code originally came from Rob's statistical alignment package FSA.


0.8.0 release - 1/19/2009

This release contains the following enhancements and fixes:

  • Dramatic reduction in false positives.
  • TopHat now estimates a minor isoform frequency for each splice junction, and filters infrequent events to cut down dramatically on the false positives. By default, a minor isoform must occur at a frequency of at least 15 percent of that of the major isoform.
  • The new output file coverage.wig is a UCSC wigglegram of alignment coverage.
  • TopHat supports multithreading, though not all stages of the pipeline use multiple threads.
  • TopHat now allows reads to have multiple alignments, and it suppresses alignments for reads that have more than a user-specified number (10, by default).
  • The memory exhaustion problem associated with converting Bowtie alignments to Maq has been fixed.
  • You are no longer required to concatenate your reads into a single input file.
  • TopHat will attempt to automatically determine seed length, quality scale, and FASTA/FASTQ format from your input reads.
  • If you are missing a Maq binary fasta file for your reference, one will be created in the output directory using bowtie-inspect. You can copy this file to the location of your bowtie index to avoid this step in your next run.
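
Two of the filters described above can be sketched in a few lines of Python. The first suppresses all alignments for reads with more than a given number of hits (10 by default); the second applies the 15 percent minor isoform threshold. How TopHat estimates isoform frequencies is not described here, so the second function simply takes support values as given inputs. All names and data layouts are illustrative assumptions.

```python
from collections import defaultdict


def suppress_multireads(alignments, max_multihits=10):
    """Group (read_id, alignment) pairs by read and drop reads with too many hits.

    Reads with more than `max_multihits` alignments are removed entirely.
    """
    by_read = defaultdict(list)
    for read_id, alignment in alignments:
        by_read[read_id].append(alignment)
    return {read_id: hits for read_id, hits in by_read.items()
            if len(hits) <= max_multihits}


def passes_minor_isoform_filter(junction_support, major_support, min_fraction=0.15):
    """True if a junction's support is at least `min_fraction` of the major isoform's.

    The support values are assumed to be precomputed (hypothetical inputs).
    """
    if major_support <= 0:
        return False
    return junction_support / major_support >= min_fraction
```

A read with 3 alignments would be kept in full, while a read with 11 alignments would be dropped under the default limit.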


0.7.2 release - 12/05/2008

The following issues have been fixed:

  • Bowtie 0.9.8 renamed bowtie-convert to bowtie-maqconvert, and TopHat is now compatible with both the new and old name.
  • Minor cosmetic improvements in the TopHat output log.
  • Improved checking in the installer to emit sensible error messages when compiling on Solaris. Solaris is currently not supported, but hopefully will be in the next release.

Known issues:

  • TopHat can exhaust memory when run with many (> 50 million) reads on some machines. This will be fixed in the next release.


0.7.1 release - 11/08/2008

The following issues have been fixed:

  • Maq 0.7.0 changed the Maq map file format. Bowtie 0.9.7 supports both the new and old mapping formats, and so does TopHat. TopHat now checks the version of Maq on the system and uses the correct format.
  • Minor command line interface improvements.
  • The -X option has been added to allow the use of FASTQ files that are scaled on the Solexa quality scale, as opposed to Phred (the default). Note that TopHat doesn't support FASTQ-int; only ASCII-encoded qualities are used.
  • The -D option has been added, allowing users to specify when to look for junctions within single islands, as opposed to just between two distinct islands.
  • The -Q option allows the user to specify a Phred quality character below which the island consensus caller will use the reference base call. That is, TopHat will not allow SNPs to be called where base quality drops below a certain threshold.
  • TopHat now includes Heng Li's fq_all2std.pl format conversion script to make installation easier.
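
The difference between the Solexa and Phred quality scales mentioned for the -X option can be made concrete with the standard conversion formula, Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The Python sketch below applies it to an ASCII-encoded quality string, assuming the Solexa FASTQ convention of an ASCII offset of 64. It illustrates the scales only and is not the code TopHat or fq_all2std.pl uses; the function names are hypothetical.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality score to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def decode_solexa_qualities(quality_string, offset=64):
    """Decode an ASCII Solexa quality string into rounded Phred scores.

    The offset of 64 is the assumed Solexa FASTQ encoding.
    """
    return [round(solexa_to_phred(ord(char) - offset)) for char in quality_string]
```

The two scales nearly coincide at high qualities and diverge at low ones: a Solexa score of 40 maps to a Phred score of about 40, while a Solexa score of 0 maps to about 3.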


0.7.0 release - 10/27/2008

The first public release of TopHat is now available for download. To use TopHat, you will need to install Bowtie and Maq. Both are open source and freely available under the Artistic license. When you install Bowtie, you should also install the Bowtie index for the genome in your RNA-Seq experiment, if one is available. If there is no pre-built index for the organism you're interested in, you can follow the Bowtie manual's section on how to build one yourself.

Because this is the first release, the manual is very limited. Only the basic options have been described. However, we will be updating it frequently, so please check back. If you find something unclear, or have questions about how TopHat works, please email Cole Trapnell. We will be posting a list of frequently asked questions soon.

In this release, TopHat does not consider mate pairing between reads. You can analyze paired-end RNA-Seq data with TopHat, but the program won't yet make use of the mate information. Use of mate pair information is our top development priority. Check back soon for a release with full paired-end support.