Next-generation sequencing (NGS)-based approaches to the study of microbial communities present a number of unique challenges in data analysis, necessitating the development of specialized tools and pipelines to process the raw information effectively.
Removing or filtering contaminants from RNA-seq reads so that they are suitable for alignment and analysis usually takes several steps. Users must type several commands, and specify the parameters of each command, for every sequence file.
Taxonomic classification is a sub-task of metagenome analysis. It involves identifying the most likely species of origin (and sometimes the location on the chromosome) of a given fragment of DNA (or of RNA, in metatranscriptomics).
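As a concrete illustration of the task, the sketch below assigns a read to the reference whose k-mer composition best explains it, using one multinomial Naïve Bayes model per species. It is deliberately the simplest kind of compositional classifier and is not the method used by GIST. The k-mer length, the pseudocount, the uniform prior over species, the function names, and the toy reference sequences are all assumptions made for the example.

```python
import math
from collections import Counter
from itertools import product

K = 3  # k-mer length; an illustrative choice, not a recommended value
ALPHABET = "ACGT"


def kmers(seq, k=K):
    """Yield every overlapping k-mer that consists only of A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in ALPHABET for base in kmer):
            yield kmer


def train(reference, k=K, pseudocount=1.0):
    """Return smoothed log-probabilities of every k-mer in one reference."""
    counts = Counter(kmers(reference, k))
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts.values()) + pseudocount * len(vocabulary)
    return {km: math.log((counts[km] + pseudocount) / total) for km in vocabulary}


def classify(read, models, k=K):
    """Return (best_label, scores); scores are per-label log-likelihoods.

    A uniform prior over labels is assumed, so no prior term is added.
    """
    scores = {
        label: sum(model[km] for km in kmers(read, k))
        for label, model in models.items()
    }
    return max(scores, key=scores.get), scores


# Toy references, invented for illustration only.
references = {
    "species_A": "ATATATATTTAATATAAATTATATATAT" * 4,
    "species_B": "GCGCGGCCGCGCCGGCGCGCGGCGCCGC" * 4,
}
models = {label: train(seq) for label, seq in references.items()}
label, scores = classify("ATATTTAATATA", models)
print(label, scores)
```

Because each species is summarized by a single k-mer distribution, a read from a region with atypical composition is scored against a model that does not describe it.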
One part of this, compositional analysis, is generally done with very primitive classifiers. As recently as 2012, studies have been published using a single Naïve Bayes distribution to represent the whole genome for each species. Such an approach is frail and unable to adapt to the realities of horizontal gene transfer, bacterial evolution, and instrument error rates. Other popular methods, such as interpolated Markov models (IMMs) and Gaussian-priorized Nearest Neighbours, offer improvements over Naïve Bayes but present daunting running times or
GIST (Generative Inference of Sequence Taxonomy) addresses these limitations by using an ensemble of statistical methods instead of relying on a single set of assumptions about gene space. It combines newer search methods such as BWA with comprehensive statistical techniques such as mixture models and expected codelta, and considers both nucleotide and amino acid information, to provide much more precise and robust predictions.
Existing metagenomic (and metatranscriptomic) pipelines either use pristine reference sequences for validation or use no validation at all. Genepuddle aims to provide a better alternative: it shapes FluxSimulator output according to expression levels, then combines the results of multiple runs to generate a mixed-species and mixed-strain environment.
Genepuddle can be found in the Gist
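The idea of combining several simulation runs into one mixed community can be sketched as follows. This is not Genepuddle's code. It is a minimal illustration that assumes the reads from each run are already loaded into memory and that the mixing proportions are given as relative abundances. The function `mix_reads`, the source labels, and the toy reads are hypothetical.

```python
import random


def mix_reads(reads_by_source, abundances, n_reads, seed=0):
    """Draw n_reads labelled reads, picking each read's source in
    proportion to its relative abundance.

    Returns a list of (source, read) pairs so that the true origin of
    every read stays available for validating a classifier.
    """
    rng = random.Random(seed)  # fixed seed keeps the mixture reproducible
    sources = list(reads_by_source)
    weights = [abundances[s] for s in sources]
    mixed = []
    for _ in range(n_reads):
        source = rng.choices(sources, weights=weights, k=1)[0]
        mixed.append((source, rng.choice(reads_by_source[source])))
    return mixed


# Toy stand-ins for the reads produced by three separate simulation runs.
runs = {
    "strain_1": ["ACGTAC", "CGTACG"],
    "strain_2": ["TTGACA", "TGACAT"],
    "species_X": ["GGGCCC"],
}
abundances = {"strain_1": 0.5, "strain_2": 0.3, "species_X": 0.2}

sample = mix_reads(runs, abundances, n_reads=10)
for source, read in sample[:3]:
    print(source, read)
```

Keeping the source label next to each read is the design choice that matters here: a mixed sample is only useful for validation if the ground truth survives the mixing step.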
Gargle is a high-precision sequence aligner that uses as much information as possible about the sequences to find the best possible fit.