postgraduate thesis: Document analysis with text mining approaches in digital forensics
Title | Document analysis with text mining approaches in digital forensics |
---|---|
Authors | Yang, Min (楊敏) |
Issue Date | 2017 |
Publisher | The University of Hong Kong (Pokfulam, Hong Kong) |
Citation | Yang, M. [楊敏]. (2017). Document analysis with text mining approaches in digital forensics. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. |
Abstract | Textual evidence is important to digital investigation because it provides valuable information for criminal analysis. However, discovering valuable information in massive data is challenging. In this dissertation, we employ text mining techniques to analyze textual data in digital forensics. Specifically, we study three problems: information extraction, authorship attribution and sentiment analysis.
First, information extraction (IE) automatically extracts useful information, patterns and trends from massive text data; it is increasingly important in digital investigation as the volume of potential digital evidence has grown rapidly. We propose a two-stage information extraction framework that may help digital investigators find evidence more efficiently. In the first stage, we apply named entity recognition to the collected text data to extract personal names, locations and organizations. In the second stage, we use association rule mining to identify relations among the extracted named entities. We validate the effectiveness of the framework on the Enron email dataset. Experimental results show that the proposed framework can help investigators find relevant information in the text data effectively and efficiently.
Second, an increasing number of criminal activities are committed by anonymously spreading falsehoods and illegal content on the Internet, which makes it difficult to trace and identify criminals in cybercrime investigation. Consequently, automatic authorship attribution of digital data has become essential in digital investigation. Although much progress has been made, traditional authorship attribution approaches are seldom used in forensic examination because of their low accuracy. In this thesis, we propose a novel authorship attribution model that combines the profile-based approach and the instance-based approach. Instead of asserting that a given text was written by a particular author, our approach aims to reduce the number of candidate authors and narrow the scope of suspects with high accuracy. Our experimental results demonstrate that our algorithm can successfully output a small number of candidate authors with high accuracy.
Finally, people tend to express their emotions on opinion-rich websites, such as online review sites, forums, blogs and microblogging sites. Performing sentiment analysis on these online posts is important to digital investigation since the posts usually represent the senders’ emotional fingerprints. We propose the LCCT (Lexicon-based and Corpus-based, Co-Training) model for semi-supervised sentiment classification. Our method combines lexicon-based learning with corpus-based learning in a unified co-training framework. The proposed model is capable of incorporating both domain-specific and domain-independent knowledge. Compared to state-of-the-art sentiment classification methods, the LCCT model exhibits better performance on different datasets in both English and Chinese. |
Degree | Doctor of Philosophy |
Subject | Data mining; Computer crimes - Investigation |
Dept/Program | Computer Science |
Persistent Identifier | http://hdl.handle.net/10722/240676 |
HKU Library Item ID | b5855006 |
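
The second stage of the information extraction framework described in the abstract (association rule mining over extracted named entities) can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes the named entity recognition step has already produced one set of entity strings per document, restricts rules to entity pairs, and uses hypothetical names (`mine_entity_rules`, `min_support`, `min_confidence`) and made-up sample data rather than the Enron dataset.

```python
from collections import Counter
from itertools import combinations


def mine_entity_rules(docs, min_support=0.5, min_confidence=0.8):
    """Return pairwise rules (antecedent, consequent, support, confidence).

    docs: list of sets of entity strings, one set per document. These are
    assumed to come from a named entity recognition step (not shown here).
    """
    n = len(docs)
    single = Counter()  # number of documents mentioning each entity
    pair = Counter()    # number of documents mentioning both entities
    for entities in docs:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))

    rules = []
    for (a, b), count in pair.items():
        support = count / n
        if support < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a.
        for x, y in ((a, b), (b, a)):
            confidence = count / single[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


# Hypothetical entity sets, one per email.
emails = [
    {"Alice", "Bob", "Acme Corp"},
    {"Alice", "Bob"},
    {"Alice", "Houston"},
    {"Bob", "Alice", "Houston"},
]
rules = mine_entity_rules(emails, min_support=0.5, min_confidence=0.9)
for antecedent, consequent, support, confidence in rules:
    print(f"{antecedent} -> {consequent} (support={support}, confidence={confidence})")
```

With this sample data, the rule "Bob -> Alice" passes both thresholds because every email that mentions Bob also mentions Alice, while "Alice -> Bob" is rejected on confidence. A full association rule miner would also consider itemsets larger than pairs.
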
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Yang, Min | - |
dc.contributor.author | 楊敏 | - |
dc.date.accessioned | 2017-05-09T23:14:54Z | - |
dc.date.available | 2017-05-09T23:14:54Z | - |
dc.date.issued | 2017 | - |
dc.identifier.citation | Yang, M. [楊敏]. (2017). Document analysis with text mining approaches in digital forensics. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | - |
dc.identifier.uri | http://hdl.handle.net/10722/240676 | - |
dc.language | eng | - |
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | - |
dc.relation.ispartof | HKU Theses Online (HKUTO) | - |
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | - |
dc.rights | The author retains all proprietary rights (such as patent rights) and the right to use in future works. | - |
dc.subject.lcsh | Data mining | - |
dc.subject.lcsh | Computer crimes - Investigation | - |
dc.title | Document analysis with text mining approaches in digital forensics | - |
dc.type | PG_Thesis | - |
dc.identifier.hkul | b5855006 | - |
dc.description.thesisname | Doctor of Philosophy | - |
dc.description.thesislevel | Doctoral | - |
dc.description.thesisdiscipline | Computer Science | - |
dc.description.nature | published_or_final_version | - |
dc.identifier.mmsid | 991022190589703414 | - |
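
The abstract's idea of narrowing the suspect list, rather than naming a single author, can be sketched as a top-k ranking that mixes a profile-based score with an instance-based score. This is only an illustration under stated assumptions, not the model proposed in the thesis: character trigram counts, cosine similarity, the equal-weight average, and the function name `top_k_candidates` are all choices made here for the example, and the sample texts are invented.

```python
import math
from collections import Counter


def ngrams(text, n=3):
    """Count character n-grams of a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_candidates(known, unknown, k=2):
    """Rank authors and return the k most likely candidates.

    known: dict mapping author name to a non-empty list of texts.
    unknown: the text whose author is in question.
    """
    query = ngrams(unknown)
    scores = {}
    for author, texts in known.items():
        # Profile-based: pool all of the author's texts into one profile.
        profile = Counter()
        for t in texts:
            profile.update(ngrams(t))
        profile_score = cosine(profile, query)
        # Instance-based: compare against each text, keep the best match.
        instance_score = max(cosine(ngrams(t), query) for t in texts)
        # Assumption: equal weighting of the two scores.
        scores[author] = (profile_score + instance_score) / 2
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Hypothetical writing samples.
known = {
    "author_a": ["see you at the usual place tonight", "bring the usual files"],
    "author_b": ["Kindly confirm receipt of the invoice.", "Regards, and thank you."],
    "author_c": ["lol ok c u l8r", "omg no wayyy"],
}
candidates = top_k_candidates(known, "meet at the usual place, bring files", k=2)
print(candidates)
```

Returning a short list instead of one name trades precision for recall: the true author is more likely to be among the k candidates than to be ranked first, which matches the abstract's stated goal of reducing the number of candidate authors.
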