Archive for the ‘Sequence Classification’ Category

Open-Source Sequence Clustering Methods Improve the State Of the Art

Wednesday, February 24th, 2016

Open-Source Sequence Clustering Methods Improve the State Of the Art by Evguenia Kopylova et al.


Sequence clustering is a common early step in amplicon-based microbial community analysis, when raw sequencing reads are clustered into operational taxonomic units (OTUs) to reduce the run time of subsequent analysis steps. Here, we evaluated the performance of recently released state-of-the-art open-source clustering software products, namely, OTUCLUST, Swarm, SUMACLUST, and SortMeRNA, against current principal options (UCLUST and USEARCH) in QIIME, hierarchical clustering methods in mothur, and USEARCH’s most recent clustering algorithm, UPARSE. All the latest open-source tools showed promising results, reporting up to 60% fewer spurious OTUs than UCLUST, indicating that the underlying clustering algorithm can vastly reduce the number of these derived OTUs. Furthermore, we observed that stringent quality filtering, such as is done in UPARSE, can cause a significant underestimation of species abundance and diversity, leading to incorrect biological results. Swarm, SUMACLUST, and SortMeRNA have been included in the QIIME 1.9.0 release.

IMPORTANCE Massive collections of next-generation sequencing data call for fast, accurate, and easily accessible bioinformatics algorithms to perform sequence clustering. A comprehensive benchmark is presented, including open-source tools and the popular USEARCH suite. Simulated, mock, and environmental communities were used to analyze sensitivity, selectivity, species diversity (alpha and beta), and taxonomic composition. The results demonstrate that recent clustering algorithms can significantly improve accuracy and preserve estimated diversity without the application of aggressive filtering. Moreover, these tools are all open source, apply multiple levels of multithreading, and scale to the demands of modern next-generation sequencing data, which is essential for the analysis of massive multidisciplinary studies such as the Earth Microbiome Project (EMP) (J. A. Gilbert, J. K. Jansson, and R. Knight, BMC Biol 12:69, 2014).

Bioinformatics has specialized clustering issues, but improvements in clustering algorithms are likely to benefit other fields as well.

Not to mention garage gene hackers, who may benefit more directly.
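
For readers new to the idea of clustering reads into OTUs, here is a minimal sketch of a generic greedy, centroid-based scheme. It is not the algorithm of any tool named in the abstract. The function names, the position-by-position identity measure (which assumes equal-length, already aligned reads), and the 0.97 default threshold are all illustrative assumptions of mine.

```python
def identity(a, b):
    """Fraction of matching positions between two reads.

    Assumption: reads are equal length and already aligned, so a simple
    position-by-position comparison stands in for a real alignment.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-length reads")
    if not a:
        raise ValueError("reads must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_cluster(reads, threshold=0.97):
    """Assign each read to the first centroid it matches, else seed a new cluster.

    The threshold default is an illustrative value, not taken from the abstract.
    Returns a dict mapping each centroid read to the list of reads in its cluster.
    """
    clusters = {}  # insertion-ordered: centroid -> member reads
    for read in reads:
        for centroid in clusters:
            if identity(read, centroid) >= threshold:
                clusters[centroid].append(read)
                break
        else:
            # No existing centroid was close enough, so this read seeds a cluster.
            clusters[read] = [read]
    return clusters


if __name__ == "__main__":
    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
    for centroid, members in greedy_cluster(reads, threshold=0.8).items():
        print(centroid, len(members))
```

Note that the outcome of a greedy scheme like this depends on the order in which reads arrive, since the first read seen becomes the centroid of its cluster.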

A Brief Survey on Sequence Classification

Monday, December 6th, 2010

A Brief Survey on Sequence Classification Authors: Zhengzheng Xing, Jian Pei, Eamonn Keogh


Sequence classification has a broad range of applications such as genomic analysis, information retrieval, health informatics, finance, and abnormal detection. Different from the classification task on feature vectors, sequences do not have explicit features. Even with sophisticated feature selection techniques, the dimensionality of potential features may still be very high and the sequential nature of features is difficult to capture. This makes sequence classification a more challenging task than classification on feature vectors. In this paper, we present a brief review of the existing work on sequence classification. We summarize the sequence classification in terms of methodologies and application domains. We also provide a review on several extensions of the sequence classification problem, such as early classification on sequences and semi-supervised learning on sequences.

Excellent survey article on sequence classification, which, as the authors note, is a rapidly developing field of research.

This article was published in the “newsletter” of the ACM Special Interest Group on Knowledge Discovery and Data Mining. Far more substantive material than I am accustomed to seeing in any “newsletter.”

The ACM has very attractive student discounts, and if you are serious about being an information professional, it is one of the organizations that I would recommend in addition to the usual library suspects.
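
The abstract's point that sequences lack explicit features, and that the space of candidate features can be very large, is easy to see with k-mer counting, one common way to turn a sequence into a feature vector. The sketch below is my own illustration and is not taken from the survey. The function name, the DNA alphabet, and the default k are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_features(seq, k=2, alphabet="ACGT"):
    """Return a count vector over every possible k-mer of the alphabet.

    The vector has len(alphabet) ** k entries, so its dimensionality grows
    exponentially with k. Windows containing symbols outside the alphabet
    are counted internally but never appear in the output vector.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    return [counts[kmer] for kmer in vocabulary]


if __name__ == "__main__":
    for k in (1, 2, 3, 5):
        print(k, len(kmer_features("ACGTACGT", k=k)))
```

Even with a four-letter alphabet, the vector already has 1,024 dimensions at k = 5, and the counts keep only local order within each window, which hints at why the sequential nature of the data is hard to capture.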