Samantha Halliday
John Parkinson
The study of entire microbial communities using mRNA shotgun sequencing (metatranscriptomics) offers a unique view of gene activity across a large number of strains simultaneously. For a metatranscriptomic analysis to be thorough, it is necessary to assign both taxonomic and functional identities to each read or transcript. In many studies, this information is computed using a single alignment or alignment pipeline, such as MG-RAST, but these approaches face a number of challenges prompted by the incredible diversity of bacteria. High-quality taxonomic assignments make it possible to build compartmentalized gene network reconstructions (that properly isolate the cytosol of each cell), enable spore detection (e.g., of vancomycin-resistant Clostridians, which may instigate disorders such as autism and obesity), provide cleanly isolated bacterial exomes for easy assembly, and open the door to other sophisticated analyses that depend on understanding microbial gene activity in situ.
Gist (Generative Inference of Sequence Taxonomy) began as a project to develop a sequence classifier based on arbitrary input classes, and has developed into a high-precision, noise-tolerant taxonomic classifier focused specifically on the problem of annotating short (76 nt or longer), unassembled reads with little or no quality filtering. Gist implements many of the techniques that have been published in recent years as promising methods for classifying metatranscriptomic data (alignment via BWA, composition analysis using Naive Bayes and Nearest Neighbour) as well as a number of new or unusual techniques (on-the-fly gene translation using FragGeneScan, support for priors specified as expected abundances, composition analysis using Gaussian mixture models and Expected Codelta Correlation, and a neural-network-based approach for balancing method weights). Together, these allow Gist to match or exceed the performance of all existing methods on the datasets tested thus far.
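To make two of the ideas above concrete, the following is a minimal sketch of Naive Bayes composition analysis combined with priors given as expected abundances. It is not Gist's implementation: the k-mer length, the function names (`kmers`, `train`, `classify`), the Laplace smoothing, and the toy reference sequences are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import product

K = 3  # k-mer length; an illustrative choice, not a Gist parameter


def kmers(seq, k=K):
    """Yield every overlapping k-mer that contains only A, C, G, or T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield kmer


def train(references, k=K):
    """Estimate Laplace-smoothed log k-mer frequencies for each class."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    model = {}
    for label, seq in references.items():
        counts = Counter(kmers(seq, k))
        total = sum(counts.values()) + len(vocab)  # add-one smoothing
        model[label] = {m: math.log((counts[m] + 1) / total) for m in vocab}
    return model


def classify(read, model, abundances, k=K):
    """Return the best class and all log-posterior scores for one read."""
    total_abundance = sum(abundances.values())
    scores = {}
    for label, log_freqs in model.items():
        # Prior: the expected relative abundance of this class.
        log_prior = math.log(abundances[label] / total_abundance)
        # Likelihood: k-mers are treated as independent (the "naive" step).
        log_likelihood = sum(log_freqs[m] for m in kmers(read, k))
        scores[label] = log_prior + log_likelihood
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical toy references standing in for real genomes.
    references = {
        "at_rich": "ATATTTAAATATATTTAAATTTATAT",
        "gc_rich": "GCGCGGCCGCGCGGGCCCGCGGCGCC",
    }
    model = train(references)
    label, _ = classify("ATATTTAA", model, {"at_rich": 1, "gc_rich": 1})
    print(label)
```

When a read carries little compositional signal, the abundance prior dominates the score, which is one reason an expected-abundance prior can help with short, noisy reads.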
Getting Gist
The current version of Gist is 0.7.17. Its source code can be downloaded from GitHub. This version is functionally complete, but not yet fully documented. Due to minor implementation restrictions, this version only runs under Linux. See README.md for bare-bones installation instructions, usage tips, and licence information.
Classes and Datasets
A comprehensive database for Gist is still in development. In the meantime, users will need to construct smaller databases based on their expectations about the environment. The soon-to-be-released Genepuddle pipeline will expedite this; until then, users are encouraged to get acquainted with Flux Simulator.
Citing Gist
A manuscript is currently in preparation for submission. An early preprint can be found at bioRxiv.
Credits
Gist was written and is maintained by Samantha Halliday (rhetorica@cs.toronto.edu) with advice from her supervisor, John Parkinson (john.parkinson@utoronto.ca). Please feel free to send Samantha mail if you have any questions or requests.