Appears in Collections:
postgraduate thesis: Natural language processing algorithms for randomized trials
Title | Natural language processing algorithms for randomized trials |
---|---|
Authors | Wang, Fan (王帆) |
Advisors | Pang, HMH; Wu, JTK |
Issue Date | 2020 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Wang, F. [王帆]. (2020). Natural language processing algorithms for randomized trials. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | The randomized controlled trial is the ‘gold standard’ study design in clinical and health research. Adequate reporting of trials can maximize the value of the trial findings to clinical and health care. Failure to comply with the CONSORT reporting guideline for randomized trials could complicate study interpretation and potentially affect subsequent research. The objectives of this thesis are 1) to develop an automated reporting checklist generation tool using natural language processing, 2) to apply machine learning-based prediction algorithms to check for compliance with CONSORT reporting item 18, and 3) to use machine learning to predict whether a published study is positive or negative using only the content of the study’s abstract.
This thesis explores rule-based and machine learning approaches to natural language processing for randomized trials. The first study used published journal articles as training, testing, and validation sets to develop, refine, and evaluate our rule-based tool. A total of 158 articles reporting randomized controlled trials were selected from high impact factor journals in the following categories: 1) Medicine, general and internal; 2) Oncology; and 3) Cardiac and cardiovascular systems. A graphical user interface for the tool was built in Java. To evaluate the performance of our method, we calculated accuracy, defined as the number of correct assessments divided by the total number of assessments. Two case studies of randomized trials are provided to illustrate the tool. Of the 30 fully implemented items, 28 (93%) had more than 90% accuracy on the validation set, showing that the tool performed well on the fully implemented reporting items.
The second study used machine learning methods to 1) detect reporting of CONSORT Item 18 on ancillary analyses, and 2) predict from the study abstract whether a study is positive. We used three levels of feature extraction: word vectors, TF-IDF vectors, and word embeddings. Results from several classifiers, including naïve Bayes, linear logistic regression, support vector machines, random forests, gradient boosting, and a convolutional neural network, were compared. The results of the two sub-studies showed that the performance of the feature extraction methods and prediction models varied across tasks. A workflow for the text classification task was also proposed. Given this framework, our work is broadly applicable to articles in medical and health categories beyond the three that we focused on.
The results of the two studies showed that natural language processing algorithms can assist in the reporting of randomized trials and make better use of the information from randomized trials. Natural language processing could save users substantial time when generating the CONSORT checklist, narrow the search scope, and reduce the manual effort of screening for appropriate articles. The findings of the thesis also provide guidance for future applications of artificial intelligence and machine learning techniques to other medical and health literature. |
Degree | Master of Philosophy |
Subject | Natural language processing (Computer science); Clinical trials; Machine learning; Artificial intelligence - Medical applications |
Dept/Program | Public Health |
Persistent Identifier | http://hdl.handle.net/10722/302556 |
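
The abstract defines accuracy as the number of correct assessments divided by all assessments, and summarizes the result as the share of checklist items exceeding 90% accuracy (28 of 30, about 93%). The following is a minimal Python sketch of that calculation only. The function names, the item labels, and the toy assessments are illustrative assumptions; they are not taken from the thesis or from its Java tool.

```python
from collections import defaultdict


def per_item_accuracy(assessments):
    """Return {item: correct / total} for each checklist item.

    `assessments` is an iterable of (item, tool_label, reference_label)
    tuples, where the labels say whether the item was judged as reported.
    This data layout is an assumption made for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, tool_label, reference_label in assessments:
        total[item] += 1
        if tool_label == reference_label:
            correct[item] += 1
    return {item: correct[item] / total[item] for item in total}


def share_above(accuracies, threshold=0.90):
    """Fraction of items whose accuracy is strictly greater than `threshold`."""
    if not accuracies:
        raise ValueError("no items to summarize")
    passing = sum(1 for value in accuracies.values() if value > threshold)
    return passing / len(accuracies)


# Made-up toy data: two checklist items assessed on a handful of articles.
toy = [
    ("1a", True, True), ("1a", True, True),
    ("1a", False, False), ("1a", True, False),
    ("18", True, True), ("18", False, False),
]

acc = per_item_accuracy(toy)
print(acc)               # {'1a': 0.75, '18': 1.0}
print(share_above(acc))  # 0.5
```

Applied to the figures quoted in the abstract, the same ratio gives 28 / 30 ≈ 0.93, which matches the reported 93%.
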
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Pang, HMH | - |
dc.contributor.advisor | Wu, JTK | - |
dc.contributor.author | Wang, Fan | - |
dc.contributor.author | 王帆 | - |
dc.date.accessioned | 2021-09-07T03:41:27Z | - |
dc.date.available | 2021-09-07T03:41:27Z | - |
dc.date.issued | 2020 | - |
dc.identifier.citation | Wang, F. [王帆]. (2020). Natural language processing algorithms for randomized trials. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/302556 | - |
dc.description.abstract | The randomized controlled trial is the ‘gold standard’ study design in clinical and health research. Adequate reporting of trials can maximize the value of the trial findings to clinical and health care. Failure to comply with the CONSORT reporting guideline for randomized trials could complicate study interpretation and potentially affect subsequent research. The objectives of this thesis are 1) to develop an automated reporting checklist generation tool using natural language processing, 2) to apply machine learning-based prediction algorithms to check for compliance with CONSORT reporting item 18, and 3) to use machine learning to predict whether a published study is positive or negative using only the content of the study’s abstract. This thesis explores rule-based and machine learning approaches to natural language processing for randomized trials. The first study used published journal articles as training, testing, and validation sets to develop, refine, and evaluate our rule-based tool. A total of 158 articles reporting randomized controlled trials were selected from high impact factor journals in the following categories: 1) Medicine, general and internal; 2) Oncology; and 3) Cardiac and cardiovascular systems. A graphical user interface for the tool was built in Java. To evaluate the performance of our method, we calculated accuracy, defined as the number of correct assessments divided by the total number of assessments. Two case studies of randomized trials are provided to illustrate the tool. Of the 30 fully implemented items, 28 (93%) had more than 90% accuracy on the validation set, showing that the tool performed well on the fully implemented reporting items. The second study used machine learning methods to 1) detect reporting of CONSORT Item 18 on ancillary analyses, and 2) predict from the study abstract whether a study is positive. We used three levels of feature extraction: word vectors, TF-IDF vectors, and word embeddings. Results from several classifiers, including naïve Bayes, linear logistic regression, support vector machines, random forests, gradient boosting, and a convolutional neural network, were compared. The results of the two sub-studies showed that the performance of the feature extraction methods and prediction models varied across tasks. A workflow for the text classification task was also proposed. Given this framework, our work is broadly applicable to articles in medical and health categories beyond the three that we focused on. The results of the two studies showed that natural language processing algorithms can assist in the reporting of randomized trials and make better use of the information from randomized trials. Natural language processing could save users substantial time when generating the CONSORT checklist, narrow the search scope, and reduce the manual effort of screening for appropriate articles. The findings of the thesis also provide guidance for future applications of artificial intelligence and machine learning techniques to other medical and health literature. | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.subject.lcsh | Natural language processing (Computer science) | - |
dc.subject.lcsh | Clinical trials | - |
dc.subject.lcsh | Machine learning | - |
dc.subject.lcsh | Artificial intelligence - Medical applications | - |
dc.title | Natural language processing algorithms for randomized trials | - |
dc.type | PG_Thesis | - |
dc.description.thesisname | Master of Philosophy | - |
dc.description.thesislevel | Master | - |
dc.description.thesisdiscipline | Public Health | - |
dc.description.nature | published_or_final_version | - |
dc.date.hkucongregation | 2020 | - |
dc.identifier.mmsid | 991044291215103414 | - |