Feature Weighting Random Forest for Detection of Hidden Web Search Interfaces

Ye, YM; Li, HB; Deng, XB; Huang, JZ

Appears in Collections:
- E-Business Technology Institute: Journal/Magazine Articles

Title	Feature Weighting Random Forest for Detection of Hidden Web Search Interfaces
Authors	Ye, YM; Li, HB; Deng, XB; Huang, JZ
Keywords	Search Interface Detection; Random Forest; Hidden Web; Form Classification
Issue Date	2008
Publisher	The Association for Computational Linguistics and Chinese Language Processing. The Journal's web site is located at http://www.aclclp.org.tw/journal/index.php
Citation	International Journal of Computational Linguistics and Chinese Language Processing, 2008, v. 13 n. 4, p. 387-404
Abstract	Search interface detection is an essential task for extracting information from the hidden Web. The challenge for this task is that search interface data is represented in high-dimensional and sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to solving this problem. In this approach, we have extended the random forest algorithm with a weighted feature selection method to build the individual classifiers. With this improved random forest algorithm (IRFA), each classifier can be learned from a weighted subset of the feature space so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We have compared our ensemble approach with other well-known classification algorithms, such as SVM, C4.5, Naïve Bayes, and the original random forest algorithm (RFA). The experimental results have shown that our method is more effective in detecting search interfaces of the hidden Web.
Persistent Identifier	http://hdl.handle.net/10722/197392
ISSN	1027-376X
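
The abstract describes building each tree from a weighted subset of the feature space. The sketch below illustrates that general idea only; it is not the authors' IRFA. The abstract does not state the weighting formula, so the score used here (absolute difference between per-class feature means) is an assumption chosen for simplicity, and the names feature_weights and sample_weighted_subspace are hypothetical.

```python
import random


def feature_weights(rows, labels):
    """Assumed weighting: |mean of feature in class 1 - mean in class 0|.

    This stands in for whatever weighting the paper actually uses.
    """
    n_features = len(rows[0])
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    weights = []
    for j in range(n_features):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        # A small floor keeps every feature drawable with nonzero probability.
        weights.append(abs(mean_pos - mean_neg) + 1e-6)
    return weights


def sample_weighted_subspace(weights, k, rng):
    """Draw k distinct feature indices, each draw proportional to remaining weights."""
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining], k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# Toy data: four binary form features, label 1 = search interface (made up).
rows = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
labels = [1, 1, 0, 0]

weights = feature_weights(rows, labels)
rng = random.Random(0)
# One feature subspace per tree; each tree would then be trained on its subspace.
subspaces = [sample_weighted_subspace(weights, 2, rng) for _ in range(5)]
print(weights)
print(subspaces)
```

In a standard random forest the candidate features are drawn uniformly; here, features that separate the classes better are drawn more often, which is one plausible reading of "weighted feature selection" for sparse, high-dimensional form data.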

DC Field	Value	Language
dc.contributor.author	Ye, YM	en_US
dc.contributor.author	Li, HB	en_US
dc.contributor.author	Deng, XB	en_US
dc.contributor.author	Huang, JZ	en_US
dc.date.accessioned	2014-05-23T02:48:57Z	-
dc.date.available	2014-05-23T02:48:57Z	-
dc.date.issued	2008	en_US
dc.identifier.citation	International Journal of Computational Linguistics and Chinese Language Processing, 2008, v. 13 n. 4, p. 387-404	en_US
dc.identifier.issn	1027-376X	-
dc.identifier.uri	http://hdl.handle.net/10722/197392	-
dc.description.abstract	Search interface detection is an essential task for extracting information from the hidden Web. The challenge for this task is that search interface data is represented in high-dimensional and sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to solving this problem. In this approach, we have extended the random forest algorithm with a weighted feature selection method to build the individual classifiers. With this improved random forest algorithm (IRFA), each classifier can be learned from a weighted subset of the feature space so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We have compared our ensemble approach with other well-known classification algorithms, such as SVM, C4.5, Naïve Bayes, and the original random forest algorithm (RFA). The experimental results have shown that our method is more effective in detecting search interfaces of the hidden Web.	-
dc.language	eng	en_US
dc.publisher	The Association for Computational Linguistics and Chinese Language Processing. The Journal's web site is located at http://www.aclclp.org.tw/journal/index.php	-
dc.relation.ispartof	International Journal of Computational Linguistics and Chinese Language Processing	en_US
dc.subject	Search Interface Detection	-
dc.subject	Random Forest	-
dc.subject	Hidden Web	-
dc.subject	Form Classification	-
dc.title	Feature Weighting Random Forest for Detection of Hidden Web Search Interfaces	en_US
dc.type	Article	en_US
dc.identifier.email	Huang, JZ: jhuang@eti.hku.hk	en_US
dc.description.nature	link_to_OA_fulltext	-
dc.identifier.hkuros	164907	en_US
dc.identifier.volume	13	en_US
dc.identifier.issue	4	-
dc.identifier.spage	387	en_US
dc.identifier.epage	404	en_US
dc.publisher.place	Taiwan	en_US
dc.identifier.issnl	1027-376X	-
