Conference Paper: Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods
Title | Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods |
---|---|
Authors | Hu, Jiamian; Hong, Yuanyuan; Chen, Yihua; Wang, He; Yasuhara, Moriaki |
Issue Date | 10-Dec-2024 |
Abstract | We present Noisy Ostracods, a noisy dataset for genus and species classification of crustacean ostracods with specialists’ annotations. Of the 71466 specimens collected, 5.58% are estimated to be noisy (possibly problematic) at the genus level. The dataset was created to address a real-world challenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods dataset has diverse noise from multiple sources. |
Persistent Identifier | http://hdl.handle.net/10722/351748 |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Hu, Jiamian | - |
dc.contributor.author | Hong, Yuanyuan | - |
dc.contributor.author | Chen, Yihua | - |
dc.contributor.author | Wang, He | - |
dc.contributor.author | Yasuhara, Moriaki | - |
dc.date.accessioned | 2024-11-25T00:35:20Z | - |
dc.date.available | 2024-11-25T00:35:20Z | - |
dc.date.issued | 2024-12-10 | - |
dc.identifier.uri | http://hdl.handle.net/10722/351748 | - |
dc.description.abstract | We present Noisy Ostracods, a noisy dataset for genus and species classification of crustacean ostracods with specialists’ annotations. Of the 71466 specimens collected, 5.58% are estimated to be noisy (possibly problematic) at the genus level. The dataset was created to address a real-world challenge: creating a clean fine-grained taxonomy dataset. The Noisy Ostracods dataset has diverse noise from multiple sources. First, the noise is open-set, including new classes discovered during curation that were not part of the original annotation. The dataset also has pseudo-classes, where annotators misclassified samples that should belong to an existing class into a new pseudo-class. The Noisy Ostracods dataset is highly imbalanced, with an imbalance factor ρ = 22429. This presents a unique challenge for robust machine learning methods, as existing approaches have not been extensively evaluated on fine-grained classification tasks with such diverse real-world noise. Initial experiments using current robust learning techniques have not yielded significant performance improvements on the Noisy Ostracods dataset compared to cross-entropy training on the raw, noisy data. Similarly, noise detection methods have underperformed in error hit rate compared to naive cross-validation ensembling for identifying problematic labels. These findings suggest that the fine-grained, imbalanced nature and complex noise characteristics of the dataset present considerable challenges for existing noise-robust algorithms. By openly releasing the Noisy Ostracods dataset, our goal is to encourage further research into the development of noise-resilient machine learning methods capable of effectively handling diverse, real-world noise in fine-grained classification tasks. The dataset, along with its evaluation protocols, can be accessed at https://github.com/H-Jamieu/Noisy_ostracods. | - |
dc.language | eng | - |
dc.relation.ispartof | Neural Information Processing Systems (NeurIPS), 2024 (10/12/2024-15/12/2024, Vancouver, Canada) | - |
dc.title | Noisy Ostracods: A Fine-Grained, Imbalanced Real-World Dataset for Benchmarking Robust Machine Learning and Label Correction Methods | - |
dc.type | Conference_Paper | - |
dc.description.nature | preprint | - |