Article: ForesTexter: An efficient random forest algorithm for imbalanced text categorization

Title: ForesTexter: An efficient random forest algorithm for imbalanced text categorization
Authors: Wu, Qingyao; Ye, Yunming; Zhang, Haijun; Ng, Michael K.; Ho, Shen Shyang
Keywords: Imbalanced classification
Text categorization
SVM
Stratified sampling
Random forests
Issue Date: 2014
Citation: Knowledge-Based Systems, 2014, v. 67, p. 105-116
Abstract: In this paper, we propose a new random forest (RF) based ensemble method, ForesTexter, to solve imbalanced text categorization problems. RF has shown great success in many real-world applications. However, the problem of learning from text data with class imbalance is a relatively new challenge that needs to be addressed. An RF algorithm tends to use simple random sampling of features when building its decision trees. As a result, it selects many subspaces that contain few, if any, informative features for the minority class. Furthermore, the Gini measure for data splitting is considered to be skew-sensitive and biased towards the majority class. Due to the inherently complex characteristics of imbalanced text datasets, learning RF from such data requires new approaches to overcome challenges related to feature subspace selection and cut-point choice while performing node splitting. To this end, we propose a new tree induction method that selects splits, both feature subspace selection and splitting criterion, for RF on imbalanced text data. The key idea is to stratify features into two groups and to generate effective term weighting for the features. One group contains positive features for the minority class and the other contains negative features for the majority class. Then, for feature subspace selection, we effectively select features from each group based on the term weights. The advantage of our approach is that each subspace contains adequate informative features for both minority and majority classes. One difference between our proposed tree induction method and the classical RF method is that our method uses a Support Vector Machine (SVM) classifier to split the training data into smaller and more balanced subsets at each tree node, and then successively retrains the SVM classifiers on the data partitions to refine the model while moving down the tree.
In this way, we force the classifiers to learn from refined feature subspaces and data subsets to fit the imbalanced data better. Hence, the tree model becomes more robust for text categorization tasks with imbalanced datasets. Experimental results on various benchmark imbalanced text datasets (Reuters-21578, Ohsumed, and imbalanced 20 Newsgroups) consistently demonstrate the effectiveness of our proposed ForesTexter method. The performance of our proposed approach is competitive with the standard random forest and different variants of SVM algorithms. © 2014 Elsevier B.V. All rights reserved.
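The stratified feature-subspace selection described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the signed term weights and the half-and-half split between groups are simplifying assumptions standing in for the paper's term-weighting scheme.

```python
import random

def stratified_subspace(term_weights, subspace_size, seed=None):
    """Select a feature subspace by stratified sampling.

    term_weights maps feature -> signed weight: positive weights mark
    features informative for the minority class, non-positive weights
    mark features associated with the majority class (a simplification
    of the paper's term-weighting scheme).
    """
    rng = random.Random(seed)
    positive = [f for f, w in term_weights.items() if w > 0]
    negative = [f for f, w in term_weights.items() if w <= 0]
    # Draw roughly half of the subspace from each group, so that every
    # subspace contains informative features for both classes -- unlike
    # plain random sampling, which may pick no minority-class features.
    half = subspace_size // 2
    chosen = rng.sample(positive, min(half, len(positive)))
    chosen += rng.sample(negative, min(subspace_size - len(chosen), len(negative)))
    return chosen

# Hypothetical term weights for a small vocabulary.
weights = {"goal": 0.9, "win": 0.4, "the": -0.1, "a": -0.3, "of": -0.2}
subspace = stratified_subspace(weights, 4, seed=0)
```

With only two positive features available, every 4-feature subspace is guaranteed to contain both of them plus two majority-class features, which is the property the paper's subspace selection aims for.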
Persistent Identifier: http://hdl.handle.net/10722/276999
ISSN: 0950-7051
2021 Impact Factor: 8.139
2020 SCImago Journal Rankings: 1.587
ISI Accession Number: WOS:000340221600009

 

DC Field | Value
dc.contributor.author | Wu, Qingyao
dc.contributor.author | Ye, Yunming
dc.contributor.author | Zhang, Haijun
dc.contributor.author | Ng, Michael K.
dc.contributor.author | Ho, Shen Shyang
dc.date.accessioned | 2019-09-18T08:35:18Z
dc.date.available | 2019-09-18T08:35:18Z
dc.date.issued | 2014
dc.identifier.citation | Knowledge-Based Systems, 2014, v. 67, p. 105-116
dc.identifier.issn | 0950-7051
dc.identifier.uri | http://hdl.handle.net/10722/276999
dc.language | eng
dc.relation.ispartof | Knowledge-Based Systems
dc.subject | Imbalanced classification
dc.subject | Text categorization
dc.subject | SVM
dc.subject | Stratified sampling
dc.subject | Random forests
dc.title | ForesTexter: An efficient random forest algorithm for imbalanced text categorization
dc.type | Article
dc.description.nature | link_to_subscribed_fulltext
dc.identifier.doi | 10.1016/j.knosys.2014.06.004
dc.identifier.scopus | eid_2-s2.0-84905560384
dc.identifier.volume | 67
dc.identifier.spage | 105
dc.identifier.epage | 116
dc.identifier.isi | WOS:000340221600009
dc.identifier.issnl | 0950-7051
