
Conference Paper: Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers

Title: Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers
Authors: Yang, B; Peng, Y; Leung, HCM; Yiu, SM; Chen, J; Chin, FYL
Keywords: Binning; DNA composition features; k-mean clustering; l-mer; Metagenomics; Modified Chebychev distance
Issue Date: 2009
Publisher: ACM.
Citation: The 3rd ACM International Workshop on Data and Text Mining in Bioinformatics (DTMBIO '09), co-located with the 18th ACM International Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China, 2-6 November 2009. In Conference Proceedings, 2009, p. 3-10
Abstract: With the rapid development of genome sequencing techniques, traditional methods for studying microorganisms, which are based on isolation and cultivation, are gradually being replaced by metagenomics, also known as environmental genomics. The first step in metagenomics, which is still a major bottleneck, is the taxonomic characterization of the DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as 'binning'. Existing binning methods are supervised or semi-supervised approaches that rely heavily on the reference genomes of known microorganisms and on phylogenetic marker genes. Due to the limited availability of reference genomes and the bias and instability of marker genes, these methods may not be applicable in all cases. In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in reads). Our experiments show that our approach can accurately bin DNA fragments of various lengths and relative species abundance ratios without any reference or training datasets. Another highlight of our approach is error robustness. The binning accuracy decreases by less than 1% as the sequencing error rate increases from 0% to 5%, a range that extends well beyond the typical sequencing error rate of existing commercial sequencing platforms, which is less than 2%. The source code of our software tool, the reference genomes of the species used for generating the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/. Copyright 2009 ACM.
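The composition feature named in the abstract, the distribution of l-mers in a read, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default l = 4, and the use of all 4^l l-mers (rather than the paper's error-robust selected subset) are assumptions made for illustration, and the plain Chebyshev distance stands in for the "modified Chebychev distance" listed in the keywords, whose definition is not given in this record.

```python
from collections import Counter
from itertools import product


def lmer_frequencies(read: str, l: int = 4) -> dict[str, float]:
    """Return the relative frequency of every DNA l-mer in one read.

    Hypothetical helper: it uses all 4**l l-mers, not a selected subset.
    """
    read = read.upper()
    # Count each window of length l, skipping windows with ambiguous bases.
    counts = Counter(
        read[i:i + l]
        for i in range(len(read) - l + 1)
        if set(read[i:i + l]) <= set("ACGT")
    )
    total = sum(counts.values())
    # Emit every possible l-mer in a fixed order so that vectors computed
    # from different reads share the same keys and can be compared.
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=l)
    }


def chebyshev_distance(u: dict[str, float], v: dict[str, float]) -> float:
    """Plain Chebyshev distance: the largest per-l-mer frequency gap.

    Assumption: a stand-in for the paper's modified distance.
    """
    return max(abs(u[k] - v[k]) for k in u)
```

Frequency vectors of this kind could then be passed to a clustering routine such as k-means, which the keywords name as the clustering step; how the paper selects its l-mers and modifies the distance is described in the full text, not here.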
Persistent Identifier: http://hdl.handle.net/10722/93158
ISBN: 978-1-60558-803-2

 

DC Field | Value | Language
dc.contributor.author | Yang, B | en_HK
dc.contributor.author | Peng, Y | en_HK
dc.contributor.author | Leung, HCM | en_HK
dc.contributor.author | Yiu, SM | en_HK
dc.contributor.author | Chen, J | en_HK
dc.contributor.author | Chin, FYL | en_HK
dc.date.accessioned | 2010-09-25T14:52:39Z | -
dc.date.available | 2010-09-25T14:52:39Z | -
dc.date.issued | 2009 | en_HK
dc.identifier.citation | The 3rd ACM International Workshop on Data and Text Mining in Bioinformatics (DTMBIO '09), co-located with the 18th ACM International Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China, 2-6 November 2009. In Conference Proceedings, 2009, p. 3-10 | en_HK
dc.identifier.isbn | 978-1-60558-803-2 | -
dc.identifier.uri | http://hdl.handle.net/10722/93158 | -
dc.language | eng | en_HK
dc.publisher | ACM. | -
dc.relation.ispartof | DTMBIO '09: Proceedings of the third international workshop on Data and text mining in bioinformatics | en_HK
dc.subject | Binning | en_HK
dc.subject | DNA composition features | en_HK
dc.subject | k-mean clustering | en_HK
dc.subject | l-mer | en_HK
dc.subject | Metagenomics | en_HK
dc.subject | Modified Chebychev distance | -
dc.title | Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers | en_HK
dc.type | Conference_Paper | en_HK
dc.identifier.email | Yang, B: byang@cs.hku.hk | en_HK
dc.identifier.email | Leung, HCM: cmleung2@cs.hku.hk | en_HK
dc.identifier.email | Yiu, SM: smyiu@cs.hku.hk | en_HK
dc.identifier.email | Chin, FYL: chin@cs.hku.hk | -
dc.identifier.authority | Leung, HCM=rp00144 | en_HK
dc.identifier.authority | Yiu, SM=rp00207 | en_HK
dc.identifier.authority | Chin, FYL=rp00105 | en_HK
dc.description.nature | link_to_OA_fulltext | -
dc.identifier.doi | 10.1145/1651318.1651322 | en_HK
dc.identifier.scopus | eid_2-s2.0-74049111901 | en_HK
dc.identifier.hkuros | 166457 | en_HK
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-74049111901&selection=ref&src=s&origin=recordpage | en_HK
dc.identifier.spage | 3 | en_HK
dc.identifier.epage | 10 | en_HK
dc.publisher.place | United States | -
dc.identifier.scopusauthorid | Chin, FYL=7005101915 | en_HK
dc.identifier.scopusauthorid | Chen, JC=35301763400 | en_HK
dc.identifier.scopusauthorid | Yiu, SM=7003282240 | en_HK
dc.identifier.scopusauthorid | Leung, HCM=35233742700 | en_HK
dc.identifier.scopusauthorid | Peng, Y=8713314400 | en_HK
dc.identifier.scopusauthorid | Yang, B=7404472246 | en_HK
dc.identifier.citeulike | 6647886 | -
dc.customcontrol.immutable | sml 140806 | -
