Document analysis with text mining approaches in digital forensics

Yang, Min; 楊敏

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Document analysis with text mining approaches in digital forensics

Title	Document analysis with text mining approaches in digital forensics
Authors	Yang, Min 楊敏
Issue Date	2017
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Yang, M. [楊敏]. (2017). Document analysis with text mining approaches in digital forensics. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	Textual evidence is important to digital investigation because it provides valuable information for criminal analysis. However, discovering valuable information in massive data is challenging. In this dissertation, we employ text mining techniques to analyze textual data in digital forensics. Specifically, we study three problems: information extraction, authorship attribution and sentiment analysis. First, information extraction (IE) aims to automatically extract useful information, patterns and trends from massive text data, and it is increasingly important in digital investigation as the volume of potential digital evidence has grown rapidly. We propose a two-stage information extraction framework that may help digital investigators find evidence more efficiently. In the first stage, we apply a named entity recognition approach to the collected text data to extract personal names, locations and organizations. In the second stage, we use association rule mining to identify relations among the extracted named entities. We validate the effectiveness of the framework on the Enron email dataset. Experimental results show that the proposed framework can help investigators find relevant information in the text data effectively and efficiently. Second, an increasing number of criminal activities are committed by anonymously spreading falsehoods and illegal content on the Internet, which makes it difficult to trace and identify criminals in cybercrime investigation. Consequently, automatic authorship attribution of digital data has become essential in digital investigation. Although much progress has been made, traditional authorship attribution approaches are seldom used in forensic examination because of their low accuracy. In this thesis, we propose a novel authorship attribution model that combines the profile-based approach and the instance-based approach. Instead of asserting that a given text was written by a particular author, our approach aims to reduce the number of candidate authors and narrow the scope of suspects with high accuracy. Our experimental results demonstrate that our algorithm can successfully output a small number of candidate authors with high accuracy. Finally, people tend to express their emotions on opinion-rich websites, such as online review sites, forums, blogs and microblogging sites. Performing sentiment analysis on these online posts is important to digital investigation since the posts usually represent the senders’ emotional fingerprints. We propose the LCCT (Lexicon-based and Corpus-based, Co-Training) model for semi-supervised sentiment classification. Our method combines lexicon-based learning with corpus-based learning in a unified co-training framework. The proposed model is capable of incorporating both domain-specific and domain-independent knowledge. Compared with state-of-the-art sentiment classification methods, the LCCT model exhibits better performance on different datasets in both English and Chinese.
Degree	Doctor of Philosophy
Subject	Data mining; Computer crimes - Investigation
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/240676
HKU Library Item ID	b5855006
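
The second stage of the information extraction framework described in the abstract, association rule mining over extracted named entities, can be illustrated with a minimal sketch. It assumes the named entity recognition stage has already produced one set of entities per document. The sample entity sets, the function name `mine_pair_rules`, and the thresholds are hypothetical and are not taken from the thesis. The sketch only mines pairwise rules by support and confidence, which is a simplification of general association rule mining and may differ from the method the thesis actually uses.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(docs, min_support=0.5, min_confidence=0.8):
    """Mine pairwise rules "lhs -> rhs" from per-document entity sets.

    Returns a list of (lhs, rhs, support, confidence) tuples.
    support    = fraction of documents containing both entities
    confidence = documents containing both / documents containing lhs
    """
    n = len(docs)
    if n == 0:
        return []

    item_counts = Counter()  # documents mentioning each entity
    pair_counts = Counter()  # documents mentioning both entities of a pair
    for entities in docs:
        unique = sorted(set(entities))  # sort so each pair has one canonical order
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))

    # Strongest rules first; ties broken alphabetically for stable output.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical output of the named entity recognition stage (not real data).
docs = [
    {"Alice", "Acme Corp", "Houston"},
    {"Alice", "Acme Corp"},
    {"Bob", "Houston"},
    {"Alice", "Acme Corp", "Bob"},
]

for lhs, rhs, support, confidence in mine_pair_rules(docs):
    print(f"{lhs} -> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

With these sample sets, only the pair of "Alice" and "Acme Corp" co-occurs often enough to pass both thresholds, so an investigator would see that link surfaced while one-off co-occurrences are filtered out.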

DC Field	Value	Language
dc.contributor.author	Yang, Min	-
dc.contributor.author	楊敏	-
dc.date.accessioned	2017-05-09T23:14:54Z	-
dc.date.available	2017-05-09T23:14:54Z	-
dc.date.issued	2017	-
dc.identifier.citation	Yang, M. [楊敏]. (2017). Document analysis with text mining approaches in digital forensics. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/240676	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.subject.lcsh	Data mining	-
dc.subject.lcsh	Computer crimes - Investigation	-
dc.title	Document analysis with text mining approaches in digital forensics	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5855006	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.mmsid	991022190589703414	-
