Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1145/1651318.1651322
- Scopus: eid_2-s2.0-74049111901
Conference Paper: Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers
Title | Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers |
---|---|
Authors | Yang, B; Peng, Y; Leung, HCM; Yiu, SM; Chen, J; Chin, FYL |
Keywords | Binning; DNA composition features; k-mean clustering; l-mer; Metagenomics; Modified Chebychev distance |
Issue Date | 2009 |
Publisher | ACM. |
Citation | The 3rd ACM International Workshop on Data and Text Mining in Bioinformatics (DTMBIO '09), co-located with the 18th ACM International Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China, 2-6 November 2009. In Conference Proceedings, 2009, p. 3-10 |
Abstract | With the rapid development of genome sequencing techniques, traditional methods for studying microorganisms, which are based on isolation and cultivation, are gradually being replaced by metagenomics, also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of the DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as 'binning'. Existing binning methods are supervised or semi-supervised approaches that rely heavily on the reference genomes of known microorganisms and phylogenetic marker genes. Due to the limited availability of reference genomes and the bias and instability of marker genes, these methods may not be applicable in all cases. In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in reads). Our experiments show that our approach can accurately bin DNA fragments of various lengths and relative species abundance ratios without any reference or training datasets. Another highlight of our approach is error robustness. The binning accuracy decreases by less than 1% as the sequencing error rate increases from 0% to 5%, which is much higher than the typical sequencing error rate of existing commercial sequencing platforms (less than 2%). The source code of our software tool, the reference genomes of the species used to generate the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/. Copyright 2009 ACM. |
Persistent Identifier | http://hdl.handle.net/10722/93158 |
ISBN | 978-1-60558-803-2 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Yang, B | en_HK |
dc.contributor.author | Peng, Y | en_HK |
dc.contributor.author | Leung, HCM | en_HK |
dc.contributor.author | Yiu, SM | en_HK |
dc.contributor.author | Chen, J | en_HK |
dc.contributor.author | Chin, FYL | en_HK |
dc.date.accessioned | 2010-09-25T14:52:39Z | - |
dc.date.available | 2010-09-25T14:52:39Z | - |
dc.date.issued | 2009 | en_HK |
dc.identifier.citation | The 3rd ACM International Workshop on Data and Text Mining in Bioinformatics (DTMBIO '09), co-located with the 18th ACM International Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China, 2-6 November 2009. In Conference Proceedings, 2009, p. 3-10 | en_HK |
dc.identifier.isbn | 978-1-60558-803-2 | - |
dc.identifier.uri | http://hdl.handle.net/10722/93158 | - |
dc.description.abstract | With the rapid development of genome sequencing techniques, traditional methods for studying microorganisms, which are based on isolation and cultivation, are gradually being replaced by metagenomics, also known as environmental genomics. The first step of metagenomics, which is still a major bottleneck, is the taxonomic characterization of the DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred to as 'binning'. Existing binning methods are supervised or semi-supervised approaches that rely heavily on the reference genomes of known microorganisms and phylogenetic marker genes. Due to the limited availability of reference genomes and the bias and instability of marker genes, these methods may not be applicable in all cases. In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in reads). Our experiments show that our approach can accurately bin DNA fragments of various lengths and relative species abundance ratios without any reference or training datasets. Another highlight of our approach is error robustness. The binning accuracy decreases by less than 1% as the sequencing error rate increases from 0% to 5%, which is much higher than the typical sequencing error rate of existing commercial sequencing platforms (less than 2%). The source code of our software tool, the reference genomes of the species used to generate the test datasets, and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/. Copyright 2009 ACM. | en_HK |
dc.language | eng | en_HK |
dc.publisher | ACM. | - |
dc.relation.ispartof | DTMBIO '09: Proceedings of the third international workshop on Data and text mining in bioinformatics | en_HK |
dc.subject | Binning | en_HK |
dc.subject | DNA composition features | en_HK |
dc.subject | k-mean clustering | en_HK |
dc.subject | l-mer | en_HK |
dc.subject | Metagenomics | en_HK |
dc.subject | Modified Chebychev distance | - |
dc.title | Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers | en_HK |
dc.type | Conference_Paper | en_HK |
dc.identifier.email | Yang, B: byang@cs.hku.hk | en_HK |
dc.identifier.email | Leung, HCM: cmleung2@cs.hku.hk | en_HK |
dc.identifier.email | Yiu, SM: smyiu@cs.hku.hk | en_HK |
dc.identifier.email | Chin, FYL: chin@cs.hku.hk | - |
dc.identifier.authority | Leung, HCM=rp00144 | en_HK |
dc.identifier.authority | Yiu, SM=rp00207 | en_HK |
dc.identifier.authority | Chin, FYL=rp00105 | en_HK |
dc.description.nature | link_to_OA_fulltext | - |
dc.identifier.doi | 10.1145/1651318.1651322 | en_HK |
dc.identifier.scopus | eid_2-s2.0-74049111901 | en_HK |
dc.identifier.hkuros | 166457 | en_HK |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-74049111901&selection=ref&src=s&origin=recordpage | en_HK |
dc.identifier.spage | 3 | en_HK |
dc.identifier.epage | 10 | en_HK |
dc.publisher.place | United States | - |
dc.identifier.scopusauthorid | Chin, FYL=7005101915 | en_HK |
dc.identifier.scopusauthorid | Chen, JC=35301763400 | en_HK |
dc.identifier.scopusauthorid | Yiu, SM=7003282240 | en_HK |
dc.identifier.scopusauthorid | Leung, HCM=35233742700 | en_HK |
dc.identifier.scopusauthorid | Peng, Y=8713314400 | en_HK |
dc.identifier.scopusauthorid | Yang, B=7404472246 | en_HK |
dc.identifier.citeulike | 6647886 | - |
dc.customcontrol.immutable | sml 140806 | - |