Imbalanced data problems in digital investigation

Qin, Shengzhi; 秦圣智

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Imbalanced data problems in digital investigation

Title	Imbalanced data problems in digital investigation
Authors	Qin, Shengzhi 秦圣智
Advisors	Chow, KP
Issue Date	2023
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Qin, S. [秦圣智]. (2023). Imbalanced data problems in digital investigation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract	As artificial intelligence advances, more and more digital investigation tasks employ machine learning models. However, imbalanced data frequently arises in these tasks and significantly degrades model performance. Common solutions to imbalanced data problems can be improved for various digital investigation scenarios. In this thesis, we discuss solutions to the imbalanced data problem for three digital investigation scenarios that differ in the severity of the imbalance, and we propose a corresponding solution for each to improve performance. The first imbalanced scenario is caused by a shortage of labels for the minority class. In some digital investigation tasks, manually labeling data is costly and slow, which can leave the training data imbalanced. Automated public opinion analysis of social media is discussed as the first sample scenario. Location data is the minority class in the corpus, and high-quality training labels for locations are difficult to obtain, so data with location labels is heavily underrepresented in the training data. Using a location dictionary, we propose a semi-supervised learning technique that automatically labels more locations in the dataset, raising the proportion of location labels and reducing the effect of the imbalance. The second imbalanced scenario is the relative imbalance problem. Relative imbalance refers to an imbalanced scenario in which samples of the minority class are not scarce. Such scenarios are common in digital investigation tasks. Android malware classification is the sample scenario for the relatively imbalanced task, since malware samples for training are not rare.
However, existing solutions to the relative imbalance problem cannot achieve a low false positive rate and a high malware identification rate at the same time. We present an imbalance tuning method that better fits the relatively imbalanced scenario of the malware classification task. The method trains with balanced data and then tunes with the imbalance ratio found in the real world, which can achieve better performance. The third imbalanced scenario is the absolute imbalance problem. Absolute imbalance refers to an imbalanced scenario in which minority class samples are extremely scarce. In this scenario, unsupervised learning is used, because only negative data is available for model training and no positive data is present. In some digital investigation scenarios, malicious data is hard to obtain. Anomaly detection in a water treatment system is discussed as the sample scenario. We propose the Traceable Time-series Prediction Anomaly Detection (TTPAD) method, which performs unsupervised real-time anomaly detection through time-series prediction. This method addresses the absolute imbalance problem because it uses only data from the water treatment system's regular operation. In digital investigation, there is no one-size-fits-all solution to imbalance issues, as the appropriate solution depends on the specific problem. The experimental results show that, for each imbalanced scenario, the corresponding solution we propose effectively improves performance.
Degree	Doctor of Philosophy
Subject	Forensic sciences - Data processing; Criminal investigation - Data processing; Machine learning; Data sets
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/341604
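
The abstract describes the third scenario only in outline: a predictor is fitted on normal-operation data alone, and readings that deviate from the prediction are treated as anomalies. The sketch below illustrates that general idea of prediction-based anomaly detection; it is not the TTPAD method itself, whose details are not given in this record. The moving-average predictor, the three-sigma threshold, the function names, and the synthetic readings are all assumptions made for illustration.

```python
from statistics import mean, stdev


def predict_next(window):
    """Hypothetical stand-in predictor: the mean of the most recent readings."""
    return mean(window)


def fit_threshold(normal_series, k=3, n_sigmas=3.0):
    """Derive an error threshold from normal-operation data only (assumed rule)."""
    errors = [
        abs(normal_series[i] - predict_next(normal_series[i - k:i]))
        for i in range(k, len(normal_series))
    ]
    return mean(errors) + n_sigmas * stdev(errors)


def detect(stream, threshold, k=3):
    """Return the indices whose prediction error exceeds the threshold."""
    history = list(stream[:k])  # the first k readings only seed the predictor
    flagged = []
    for i in range(k, len(stream)):
        predicted = predict_next(history[-k:])
        if abs(stream[i] - predicted) > threshold:
            flagged.append(i)
            # Keep the prediction instead of the anomalous reading so that
            # one spike does not distort the following predictions.
            history.append(predicted)
        else:
            history.append(stream[i])
    return flagged


# Synthetic sensor readings, made up for this example.
normal = [5.0, 5.1, 4.9, 5.0, 5.2, 5.0, 4.8, 5.1, 5.0, 4.9, 5.1, 5.0]
threshold = fit_threshold(normal)

live = [5.0, 5.1, 4.9, 5.0, 9.0, 5.0, 5.1]
flagged = detect(live, threshold)  # flags only the spike at index 4
```

No anomalous samples are needed at any point: the threshold comes entirely from the prediction errors observed during normal operation, which is what makes this family of approaches usable when positive data is unavailable.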

DC Field	Value	Language
dc.contributor.advisor	Chow, KP	-
dc.contributor.author	Qin, Shengzhi	-
dc.contributor.author	秦圣智	-
dc.date.accessioned	2024-03-18T09:56:20Z	-
dc.date.available	2024-03-18T09:56:20Z	-
dc.date.issued	2023	-
dc.identifier.citation	Qin, S. [秦圣智]. (2023). Imbalanced data problems in digital investigation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.	-
dc.identifier.uri	http://hdl.handle.net/10722/341604	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.subject.lcsh	Forensic sciences - Data processing	-
dc.subject.lcsh	Criminal investigation - Data processing	-
dc.subject.lcsh	Machine learning	-
dc.subject.lcsh	Data sets	-
dc.title	Imbalanced data problems in digital investigation	-
dc.type	PG_Thesis	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.date.hkucongregation	2024	-
dc.identifier.mmsid	991044781604003414	-
