Natural language processing algorithms for randomized trials

Wang, Fan; 王帆

Appears in Collections:
- HKU Theses Online
- Public Health: Theses

postgraduate thesis: Natural language processing algorithms for randomized trials

Title	Natural language processing algorithms for randomized trials
Authors	Wang, Fan 王帆
Advisors	Pang, HMH; Wu, JTK
Issue Date	2020
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Wang, F. [王帆]. (2020). Natural language processing algorithms for randomized trials. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	The randomized controlled trial is the ‘gold standard’ study design in clinical and health research. Adequate reporting of trials can maximize the value of trial findings to clinical and health care. Failure to comply with the CONSORT reporting guideline for randomized trials could complicate study interpretation and potentially affect subsequent research. The objectives of this thesis are 1) to develop an automated reporting checklist generation tool using natural language processing, 2) to apply machine learning-based prediction algorithms to check for compliance with CONSORT reporting item 18, and 3) to use machine learning to predict whether a published study is positive or negative using only the content of the study’s abstract. This thesis explores rule-based and machine learning-based natural language processing algorithms for randomized trials. The first study used published journal articles as training, testing, and validation sets to develop, refine, and evaluate our rule-based tool. A total of 158 articles reporting randomized controlled trials were selected from high impact factor journals in the following categories: 1) Medicine, general and internal; 2) Oncology; and 3) Cardiac and cardiovascular systems. A graphical user interface for the tool was built using Java. To evaluate the performance of our method, we calculated an accuracy metric defined as the number of correct assessments divided by the number of all assessments. Two case studies of randomized trials are provided to illustrate the tool. Of the 30 fully implemented items, 28 (93%) had more than 90% accuracy on the validation set, showing that the tool performed well on fully implemented reporting items. The second study used machine learning methods to 1) detect reporting of CONSORT Item 18 for ancillary analysis, and 2) use the study abstract to predict whether a study is positive. We used three levels of feature extraction engines: word vectors, TF-IDF vectors, and word embeddings. The results of several prediction classifiers, including naïve Bayes, linear logistic regression, support vector machine, random forests, gradient boosting, and a convolutional neural network, were compared. The results of the two sub-studies showed that the performance of feature extraction engines and prediction models varied across tasks. A workflow for the text classification task was also proposed. Given this framework, our work is broadly applicable to articles in medical and health categories beyond the three we focused on. The results from the two studies showed that natural language processing algorithms could be applied to assist in the reporting of randomized trials and to make better use of the information from randomized trials. The use of natural language processing could help users save substantial time when generating the CONSORT checklist, narrow the search scope, and reduce the manual effort of screening for appropriate articles. The findings of the thesis also provide guidance for future applications of artificial intelligence and machine learning techniques to other medical and health literature.
Degree	Master of Philosophy
Subject	Natural language processing (Computer science); Clinical trials; Machine learning; Artificial intelligence - Medical applications
Dept/Program	Public Health
Persistent Identifier	http://hdl.handle.net/10722/302556
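
The abstract defines accuracy as the number of correct assessments divided by all assessments. The following is a minimal Python sketch of that calculation, not the thesis's own code. The function name `accuracy` and the five example assessments are hypothetical and chosen only for illustration; the 28-of-30 figure is taken from the abstract.

```python
def accuracy(predicted, reference):
    """Return the share of assessments where the tool matches the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        raise ValueError("at least one assessment is required")
    # Count positions where the tool's assessment equals the manual one.
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(predicted)


# Hypothetical data: one checklist item assessed on five articles.
tool_says = [True, True, False, True, False]
reviewer_says = [True, True, False, False, False]
item_accuracy = accuracy(tool_says, reviewer_says)  # 4 of 5 assessments agree

# Figure reported in the abstract: 28 of 30 fully implemented items
# exceeded 90% accuracy on the validation set.
percent_items_above_90 = round(28 / 30 * 100)
```

Computing the metric per reporting item, as sketched here, matches how the abstract summarizes results (items above a 90% threshold); how the thesis aggregated assessments in detail is not stated in this record.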

DC Field	Value	Language
dc.contributor.advisor	Pang, HMH	-
dc.contributor.advisor	Wu, JTK	-
dc.contributor.author	Wang, Fan	-
dc.contributor.author	王帆	-
dc.date.accessioned	2021-09-07T03:41:27Z	-
dc.date.available	2021-09-07T03:41:27Z	-
dc.date.issued	2020	-
dc.identifier.citation	Wang, F. [王帆]. (2020). Natural language processing algorithms for randomized trials. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/302556	-
dc.description.abstract	The randomized controlled trial is the ‘gold standard’ study design in clinical and health research. Adequate reporting of trials can maximize the value of trial findings to clinical and health care. Failure to comply with the CONSORT reporting guideline for randomized trials could complicate study interpretation and potentially affect subsequent research. The objectives of this thesis are 1) to develop an automated reporting checklist generation tool using natural language processing, 2) to apply machine learning-based prediction algorithms to check for compliance with CONSORT reporting item 18, and 3) to use machine learning to predict whether a published study is positive or negative using only the content of the study’s abstract. This thesis explores rule-based and machine learning-based natural language processing algorithms for randomized trials. The first study used published journal articles as training, testing, and validation sets to develop, refine, and evaluate our rule-based tool. A total of 158 articles reporting randomized controlled trials were selected from high impact factor journals in the following categories: 1) Medicine, general and internal; 2) Oncology; and 3) Cardiac and cardiovascular systems. A graphical user interface for the tool was built using Java. To evaluate the performance of our method, we calculated an accuracy metric defined as the number of correct assessments divided by the number of all assessments. Two case studies of randomized trials are provided to illustrate the tool. Of the 30 fully implemented items, 28 (93%) had more than 90% accuracy on the validation set, showing that the tool performed well on fully implemented reporting items. The second study used machine learning methods to 1) detect reporting of CONSORT Item 18 for ancillary analysis, and 2) use the study abstract to predict whether a study is positive. We used three levels of feature extraction engines: word vectors, TF-IDF vectors, and word embeddings. The results of several prediction classifiers, including naïve Bayes, linear logistic regression, support vector machine, random forests, gradient boosting, and a convolutional neural network, were compared. The results of the two sub-studies showed that the performance of feature extraction engines and prediction models varied across tasks. A workflow for the text classification task was also proposed. Given this framework, our work is broadly applicable to articles in medical and health categories beyond the three we focused on. The results from the two studies showed that natural language processing algorithms could be applied to assist in the reporting of randomized trials and to make better use of the information from randomized trials. The use of natural language processing could help users save substantial time when generating the CONSORT checklist, narrow the search scope, and reduce the manual effort of screening for appropriate articles. The findings of the thesis also provide guidance for future applications of artificial intelligence and machine learning techniques to other medical and health literature.	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Natural language processing (Computer science)	-
dc.subject.lcsh	Clinical trials	-
dc.subject.lcsh	Machine learning	-
dc.subject.lcsh	Artificial intelligence - Medical applications	-
dc.title	Natural language processing algorithms for randomized trials	-
dc.type	PG_Thesis	-
dc.description.thesisname	Master of Philosophy	-
dc.description.thesislevel	Master	-
dc.description.thesisdiscipline	Public Health	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2020	-
dc.identifier.mmsid	991044291215103414	-
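
The abstract lists TF-IDF vectors among the feature extraction engines applied to study abstracts. As a rough illustration of what such features look like, here is a self-contained Python sketch. It is an assumption-laden example, not the thesis's implementation: the tokenizer, the smoothed IDF formula ln((1 + N) / (1 + df)) + 1, the function names, and the two example sentences are all hypothetical choices, and the record does not state which weighting or library the author used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with raw term counts scaled by smoothed IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vocabulary = sorted(doc_freq)
    # Assumed weighting: smoothed IDF, ln((1 + N) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        for term in vocabulary
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] * idf[term] for term in vocabulary])
    return vocabulary, vectors


# Hypothetical one-sentence "abstracts", invented for illustration only.
abstracts = [
    "The treatment significantly reduced mortality",
    "The treatment did not reduce mortality",
]
vocabulary, vectors = tfidf_vectors(abstracts)
```

Terms shared by both sentences (such as "treatment") receive the minimum weight, while terms unique to one sentence (such as "significantly" or "not") are weighted more heavily. Vectors of this kind could then be passed to any of the classifiers named in the abstract.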
