Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers

Yang, B; Peng, Y; Leung, HCM; Yiu, SM; Chen, J; Chin, FYL

Publisher Website: 10.1145/1651318.1651322
Scopus: eid_2-s2.0-74049111901

Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers

Title	Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers
Authors	Yang, B; Peng, Y; Leung, HCM; Yiu, SM; Chen, J; Chin, FYL
Keywords	Binning; DNA composition features; k-mean clustering; l-mer; Metagenomics; Modified Chebychev distance
Issue Date	2009
Publisher	ACM.
Citation	The 3rd ACM International Workshop on Data and Text Mining in Bioinformatics (DTMBIO '09), co-located with the 18th ACM International Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China, 2-6 November 2009. In Conference Proceedings, 2009, p. 3-10. DOI: http://dx.doi.org/10.1145/1651318.1651322
Abstract	With the rapid development of genome sequencing techniques, traditional methods for studying microorganisms, which rely on isolation and cultivation, are gradually being replaced by metagenomics, also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of the DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as 'binning'. Existing binning methods are supervised or semi-supervised approaches that rely heavily on the reference genomes of known microorganisms and on phylogenetic marker genes. Because reference genomes are of limited availability and marker genes are biased and unstable, these methods may not be applicable in all cases. In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in reads). Our experiments show that the approach can accurately bin DNA fragments of various lengths and relative species abundance ratios without any reference or training datasets. Another highlight of our approach is error robustness: the binning accuracy decreases by less than 1% as the sequencing error rate increases from 0% to 5%, a range that extends well beyond the typical sequencing error rate of existing commercial sequencing platforms, which is less than 2%. The source code of our software tool, the reference genomes of the species used to generate the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/. Copyright 2009 ACM.
Persistent Identifier	http://hdl.handle.net/10722/93158
ISBN	978-1-60558-803-2
References	References in Scopus
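
The abstract defines l-mers as substrings of length l in reads and bases the binning on their distribution. As a rough illustration of that idea only, the sketch below computes the relative frequency of every possible l-mer in a single read and compares two reads with the standard Chebyshev (maximum-difference) distance. The function names `lmer_frequencies` and `chebyshev_distance` are hypothetical; this is not the authors' MetaCluster code, it does not perform their error-robust l-mer selection, and the plain Chebyshev distance stands in for the "Modified Chebychev distance" named in the keywords, whose definition this record does not give.

```python
from collections import Counter
from itertools import product


def lmer_frequencies(read, l):
    """Return a dict mapping every DNA l-mer to its relative frequency in `read`."""
    read = read.upper()
    # Count each overlapping substring of length l made up only of A/C/G/T.
    counts = Counter(
        read[i:i + l]
        for i in range(len(read) - l + 1)
        if set(read[i:i + l]) <= set("ACGT")
    )
    total = sum(counts.values())
    # Enumerate all 4**l possible l-mers so that every read is described
    # by a vector over the same feature space.
    all_lmers = ("".join(p) for p in product("ACGT", repeat=l))
    return {m: (counts[m] / total if total else 0.0) for m in all_lmers}


def chebyshev_distance(a, b):
    """Standard Chebyshev distance: the largest per-l-mer frequency difference.

    Assumption: used here as a stand-in; it is not the modified distance
    of the paper.
    """
    return max(abs(a[m] - b[m]) for m in a)


if __name__ == "__main__":
    # Two made-up reads, for illustration only.
    f1 = lmer_frequencies("ACGTACGTAC", 2)
    f2 = lmer_frequencies("AAAATTTTAA", 2)
    print(chebyshev_distance(f1, f2))
```

Frequency vectors of this kind could then be passed to a clustering routine such as k-means, which the keywords mention, to group reads without reference genomes.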

DC Field	Value	Language
dc.contributor.author	Yang, B	en_HK
dc.contributor.author	Peng, Y	en_HK
dc.contributor.author	Leung, HCM	en_HK
dc.contributor.author	Yiu, SM	en_HK
dc.contributor.author	Chen, J	en_HK
dc.contributor.author	Chin, FYL	en_HK
dc.date.accessioned	2010-09-25T14:52:39Z	-
dc.date.available	2010-09-25T14:52:39Z	-
dc.date.issued	2009	en_HK
dc.identifier.citation	The 3rd ACM International Workshop on Data and Text Mining in Bioinformatics (DTMBIO '09), co-located with the 18th ACM International Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China, 2-6 November 2009. In Conference Proceedings, 2009, p. 3-10	en_HK
dc.identifier.isbn	978-1-60558-803-2	-
dc.identifier.uri	http://hdl.handle.net/10722/93158	-
dc.description.abstract	With the rapid development of genome sequencing techniques, traditional methods for studying microorganisms, which rely on isolation and cultivation, are gradually being replaced by metagenomics, also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of the DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as 'binning'. Existing binning methods are supervised or semi-supervised approaches that rely heavily on the reference genomes of known microorganisms and on phylogenetic marker genes. Because reference genomes are of limited availability and marker genes are biased and unstable, these methods may not be applicable in all cases. In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in reads). Our experiments show that the approach can accurately bin DNA fragments of various lengths and relative species abundance ratios without any reference or training datasets. Another highlight of our approach is error robustness: the binning accuracy decreases by less than 1% as the sequencing error rate increases from 0% to 5%, a range that extends well beyond the typical sequencing error rate of existing commercial sequencing platforms, which is less than 2%. The source code of our software tool, the reference genomes of the species used to generate the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/. Copyright 2009 ACM.	en_HK
dc.language	eng	en_HK
dc.publisher	ACM.	-
dc.relation.ispartof	DTMBIO '09: Proceedings of the third international workshop on Data and text mining in bioinformatics	en_HK
dc.subject	Binning	en_HK
dc.subject	DNA composition features	en_HK
dc.subject	k-mean clustering	en_HK
dc.subject	l-mer	en_HK
dc.subject	Metagenomics	en_HK
dc.subject	Modified Chebychev distance	-
dc.title	Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers	en_HK
dc.type	Conference_Paper	en_HK
dc.identifier.email	Yang, B: byang@cs.hku.hk	en_HK
dc.identifier.email	Leung, HCM: cmleung2@cs.hku.hk	en_HK
dc.identifier.email	Yiu, SM: smyiu@cs.hku.hk	en_HK
dc.identifier.email	Chin, FYL: chin@cs.hku.hk	-
dc.identifier.authority	Leung, HCM=rp00144	en_HK
dc.identifier.authority	Yiu, SM=rp00207	en_HK
dc.identifier.authority	Chin, FYL=rp00105	en_HK
dc.description.nature	link_to_OA_fulltext	-
dc.identifier.doi	10.1145/1651318.1651322	en_HK
dc.identifier.scopus	eid_2-s2.0-74049111901	en_HK
dc.identifier.hkuros	166457	en_HK
dc.relation.references	http://www.scopus.com/mlt/select.url?eid=2-s2.0-74049111901&selection=ref&src=s&origin=recordpage	en_HK
dc.identifier.spage	3	en_HK
dc.identifier.epage	10	en_HK
dc.publisher.place	United States	-
dc.identifier.scopusauthorid	Chin, FYL=7005101915	en_HK
dc.identifier.scopusauthorid	Chen, JC=35301763400	en_HK
dc.identifier.scopusauthorid	Yiu, SM=7003282240	en_HK
dc.identifier.scopusauthorid	Leung, HCM=35233742700	en_HK
dc.identifier.scopusauthorid	Peng, Y=8713314400	en_HK
dc.identifier.scopusauthorid	Yang, B=7404472246	en_HK
dc.identifier.citeulike	6647886	-
dc.customcontrol.immutable	sml 140806	-
