MegaPath: Low-Similarity Pathogen Detection from Metagenomic NGS Data

Li, D; Leung, HCM; Wong, CK; Zhang, Y; Law, WC; Xin, Y; Luo, R; Ting, HF; Lam, TW

Publisher Website: 10.1109/ICCABS.2018.8541953

Appears in Collections:
- Computer Science: Conference papers

Conference Paper: MegaPath: Low-Similarity Pathogen Detection from Metagenomic NGS Data

Title	MegaPath: Low-Similarity Pathogen Detection from Metagenomic NGS Data
Authors	Li, D; Leung, HCM; Wong, CK; Zhang, Y; Law, WC; Xin, Y; Luo, R; Ting, HF; Lam, TW
Issue Date	2018
Publisher	IEEE. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1800307
Citation	2018 IEEE 8th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), Las Vegas, NV, USA, 18-20 October 2018. DOI: http://dx.doi.org/10.1109/ICCABS.2018.8541953
Abstract	Detecting the pathogen, that is, the causal bacterium or virus, of an infection such as pneumonia is an important step in diagnosis. Traditional pathogen detection is time-consuming because an infectious disease may be caused by a wide range of pathogens, which must be checked one by one. This delays treatment or even leads to mistreatment of patients. Unbiased next-generation sequencing (NGS) can detect DNA fragments (reads) from all species in a metagenomic sample containing a mixture of species. These NGS reads can be classified into taxa by comparing them with a collection of reference genome sequences, and a pathogen can be detected if some reads match it. In clinical diagnosis, it is important that a classifier detects a significant number of reads supporting the potential pathogen and reports as few false classifications as possible, so that the pathogen receives a high abundance rank. Otherwise, the pathogen cannot be distinguished from background noise, and doctors must spend a long time going through a long list of candidates to verify its presence. Existing metagenomic classifiers do not perform well at detecting low-similarity pathogens, i.e., pathogens whose genomes are not similar to the reference. This is because most classifiers construct a characteristic profile (e.g., k-mers) for each reference and assign reads to species by comparing them with the profiles. When the characteristic profile does not match the genome of a low-similarity pathogen, this approach fails and produces many incorrect or nonspecific classifications. Some tools assign reads to reference sequences by local or semi-global alignment. The analysis time is long (over 4 hours for a typical dataset of 1 Gb), but more reads from the pathogen can be assigned correctly. However, the alignment scores of reads are still low for a low-similarity pathogen. These reads cannot be assigned to the pathogen specifically, so the number of reads supporting the pathogen remains too low. To detect low-similarity pathogens, we introduce MegaPath for NGS-based pathogen detection. It makes two major contributions. First, instead of assigning each read to a reference sequence one by one, MegaPath analyzes all aligned reads globally to determine a subset of reads with confident alignments. MegaPath then reassigns non-specifically aligned reads to species with confident alignments and discards unconfident alignments to avoid potential false classifications. This increases the number of reads supporting the pathogen and reduces the number of false positive assignments. Second, MegaPath adopts a fast alignment-based approach using an enhanced maximum-exact-match prefix seeding strategy and a SIMD-accelerated Smith-Waterman algorithm. Take a metagenomic NGS sample of cerebrospinal fluid (CSF) [1] as an example. The similarity of the pathogen to the reference is 18.7%. Centrifuge [2] and Kraken [4], which are based on characteristic profiles, detect 31 and 6 reads from the pathogen, respectively, and give the pathogen abundance ranks of 710 and 384, respectively. Thus, the doctor needs to go through a list of 300+ species to find the pathogen. With an alignment process taking 4 hours, SURPI [3] detects 76 reads for the pathogen and its rank rises to 245. With better alignment tools and global analysis of reads, MegaPath takes less than one hour to detect 608 reads for the pathogen, which it ranks 33rd. Thus, MegaPath has the best performance among existing software, with a reasonable running time. Experimental results for more datasets can be found in the full paper. In addition to detecting pathogens with known reference sequences, MegaPath can also detect pathogens without any similar DNA-level sequences in the reference database, using de novo assembly and protein alignment.
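The global analysis described in the abstract (find reads with confident alignments, reassign non-specifically aligned reads to species that have confident support, discard the rest) can be illustrated with a minimal sketch. This is not MegaPath's actual implementation: the function name `reassign_reads`, the input layout, and the thresholds `min_score`, `margin`, and `min_support` are all hypothetical choices made for illustration only.

```python
from collections import Counter


def reassign_reads(alignments, min_score=50, margin=10, min_support=2):
    """Toy global reassignment (hypothetical; not the published method).

    alignments maps read_id -> list of (species, alignment_score).
    Returns (assignment, ranking): read_id -> species, and species
    ordered by number of assigned reads (most reads first).
    """
    confident = {}   # read_id -> species with a clear best hit
    ambiguous = {}   # read_id -> {species: best score} for unclear reads

    for read_id, hits in alignments.items():
        # Keep only the best score per species for this read.
        best = {}
        for species, score in hits:
            best[species] = max(score, best.get(species, score))
        if not best:
            continue
        ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
        top_species, top_score = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
        # Assumed rule: confident if the score is high and clearly
        # separated from the next-best species.
        if top_score >= min_score and top_score - runner_up >= margin:
            confident[read_id] = top_species
        else:
            ambiguous[read_id] = best

    # Species backed by enough confident reads are treated as trusted.
    support = Counter(confident.values())
    trusted = {s for s, n in support.items() if n >= min_support}

    # Keep confident reads only for trusted species (assumed rule).
    assignment = {r: s for r, s in confident.items() if s in trusted}

    # Reassign ambiguous reads to a trusted species they also align to;
    # reads with no trusted candidate are discarded.
    for read_id, best in ambiguous.items():
        candidates = [s for s in best if s in trusted]
        if candidates:
            assignment[read_id] = max(
                candidates, key=lambda s: (best[s], support[s], s)
            )

    counts = Counter(assignment.values())
    ranking = [s for s, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    return assignment, ranking


if __name__ == "__main__":
    demo = {
        "r1": [("A", 90), ("B", 40)],
        "r2": [("A", 80)],
        "r3": [("A", 45), ("B", 44)],  # ambiguous, rescued by species A
        "r4": [("C", 30), ("D", 29)],  # ambiguous, no trusted candidate
        "r5": [("B", 95)],             # confident, but B lacks support
    }
    print(reassign_reads(demo))
```

In this toy input, the ambiguous read `r3` is rescued because species `A` already has two confident reads, while `r4` and `r5` are dropped. The sketch only shows why a global pass can raise the read count for a weakly matching pathogen; real thresholds and tie-breaking would have to come from the full paper.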
Persistent Identifier	http://hdl.handle.net/10722/274110
ISBN	9781538685204

DC Field	Value	Language
dc.contributor.author	Li, D	-
dc.contributor.author	Leung, HCM	-
dc.contributor.author	Wong, CK	-
dc.contributor.author	Zhang, Y	-
dc.contributor.author	Law, WC	-
dc.contributor.author	Xin, Y	-
dc.contributor.author	Luo, R	-
dc.contributor.author	Ting, HF	-
dc.contributor.author	Lam, TW	-
dc.date.accessioned	2019-08-18T14:55:17Z	-
dc.date.available	2019-08-18T14:55:17Z	-
dc.date.issued	2018	-
dc.identifier.citation	2018 IEEE 8th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), Las Vegas, NV, USA, 18-20 October 2018	-
dc.identifier.isbn	9781538685204	-
dc.identifier.uri	http://hdl.handle.net/10722/274110	-
dc.language	eng	-
dc.publisher	IEEE. The Journal's web site is located at http://ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1800307	-
dc.relation.ispartof	IEEE International Conference on Computational Advances in Bio and Medical Sciences Proceedings	-
dc.relation.ispartof	2018 IEEE 8th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS)	-
dc.rights	IEEE International Conference on Computational Advances in Bio and Medical Sciences Proceedings. Copyright © IEEE.	-
dc.rights	©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.	-
dc.title	MegaPath: Low-Similarity Pathogen Detection from Metagenomic NGS Data	-
dc.type	Conference_Paper	-
dc.identifier.email	Leung, HCM: cmleung3@hku.hk	-
dc.identifier.email	Zhang, Y: yifanz@hku.hk	-
dc.identifier.email	Xin, Y: yxinbal@HKUCC-COM.hku.hk	-
dc.identifier.email	Luo, R: rbluo@cs.hku.hk	-
dc.identifier.email	Ting, HF: hfting@cs.hku.hk	-
dc.identifier.email	Lam, TW: twlam@cs.hku.hk	-
dc.identifier.authority	Leung, HCM=rp00144	-
dc.identifier.authority	Luo, R=rp02360	-
dc.identifier.authority	Ting, HF=rp00177	-
dc.identifier.authority	Lam, TW=rp00135	-
dc.description.nature	link_to_subscribed_fulltext	-
dc.identifier.doi	10.1109/ICCABS.2018.8541953	-
dc.identifier.hkuros	302243	-
dc.identifier.volume	2018	-
dc.publisher.place	United States	-
