
postgraduate thesis: Imbalanced data problems in digital investigation

Title: Imbalanced data problems in digital investigation
Authors: Qin, Shengzhi (秦圣智)
Advisor(s): Chow, KP
Issue Date: 2023
Publisher: The University of Hong Kong (Pokfulam, Hong Kong)
Citation: Qin, S. [秦圣智]. (2023). Imbalanced data problems in digital investigation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR.
Abstract: As artificial intelligence advances, more and more digital investigation tasks employ machine learning models. However, imbalanced data frequently arises in these tasks and has a significant impact on how well a machine learning model performs. The common solutions to imbalanced data problems can be improved in various digital investigation scenarios. In this thesis, we discuss solutions to the imbalanced data problem for three digital investigation scenarios that differ in the severity of the imbalance, and we propose a corresponding solution for each to improve performance.

The first imbalanced scenario is caused by a shortage of labels for the minority class. In some digital investigation tasks, manually labeling data is costly and slow, which can cause an imbalance in the training data. Automated public opinion analysis of social media is discussed as the first sample scenario. Location data is the minority class in the corpus, and high-quality training labels for locations are difficult to obtain, so data with location labels is heavily underrepresented in the training data. On the basis of a location dictionary, we suggest a semi-supervised learning technique that automatically labels more locations in the dataset, raising the ratio of location labels and thereby reducing the effect of the imbalance problem.

The second imbalanced scenario is the relative imbalance problem. Relative imbalance refers to an imbalanced scenario in which samples of the minority class are not scarce. Such scenarios are common in digital investigation tasks. Android malware classification is the sample scenario for the relatively imbalanced task, since malware samples for training are not rare. However, existing solutions to the relative imbalance problem cannot achieve a low false positive rate and a high malware identification rate at the same time. We demonstrate an imbalance tuning method that better fits the relatively imbalanced scenario of the malware classification task. The method trains with balanced data and tunes with the imbalance ratio found in the real world, which could achieve better performance.

The third imbalanced scenario is the absolute imbalance problem. Absolute imbalance refers to an imbalanced scenario in which minority class samples are extremely scarce. In this scenario, unsupervised learning is used, as only negative data is available for model training and no positive data is present. In some digital investigation scenarios, malicious data is hard to obtain. Anomaly detection in a water treatment system is discussed as the sample scenario. We propose the Traceable Time-series Prediction Anomaly Detection (TTPAD) method, which performs unsupervised real-time anomaly detection through time series prediction. This method addresses absolute imbalance because it uses only data from the water treatment system's regular operation.

In digital investigation, there is no one-size-fits-all solution to imbalance issues, as the appropriate solution may vary depending on the specific problem. The experimental results show that, for each of the imbalanced scenarios, the corresponding solution we propose can effectively improve performance.
Degree: Doctor of Philosophy
Subject: Forensic sciences - Data processing; Criminal investigation - Data processing; Machine learning; Data sets
Dept/Program: Computer Science
Persistent Identifier: http://hdl.handle.net/10722/341604

 

DC Field | Value | Language
dc.contributor.advisor | Chow, KP | -
dc.contributor.author | Qin, Shengzhi | -
dc.contributor.author | 秦圣智 | -
dc.date.accessioned | 2024-03-18T09:56:20Z | -
dc.date.available | 2024-03-18T09:56:20Z | -
dc.date.issued | 2023 | -
dc.identifier.citation | Qin, S. [秦圣智]. (2023). Imbalanced data problems in digital investigation. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. | -
dc.identifier.uri | http://hdl.handle.net/10722/341604 | -
dc.language | eng | -
dc.publisher | The University of Hong Kong (Pokfulam, Hong Kong) | -
dc.relation.ispartof | HKU Theses Online (HKUTO) | -
dc.rights | The author retains all proprietary rights, (such as patent rights) and the right to use in future works. | -
dc.rights | This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. | -
dc.subject.lcsh | Forensic sciences - Data processing | -
dc.subject.lcsh | Criminal investigation - Data processing | -
dc.subject.lcsh | Machine learning | -
dc.subject.lcsh | Data sets | -
dc.title | Imbalanced data problems in digital investigation | -
dc.type | PG_Thesis | -
dc.description.thesisname | Doctor of Philosophy | -
dc.description.thesislevel | Doctoral | -
dc.description.thesisdiscipline | Computer Science | -
dc.description.nature | published_or_final_version | -
dc.date.hkucongregation | 2024 | -
dc.identifier.mmsid | 991044781604003414 | -
