
Article: Feature Weighting Random Forest for Detection of Hidden Web Search Interfaces

Title: Feature Weighting Random Forest for Detection of Hidden Web Search Interfaces
Authors: Ye, YM; Li, HB; Deng, XB; Huang, JZ
Keywords: Search Interface Detection; Random Forest; Hidden Web; Form Classification
Issue Date: 2008
Publisher: The Association for Computational Linguistics and Chinese Language Processing. The Journal's web site is located at http://www.aclclp.org.tw/journal/index.php
Citation: International Journal of Computational Linguistics and Chinese Language Processing, 2008, v. 13 n. 4, p. 387-404
Abstract: Search interface detection is an essential task for extracting information from the hidden Web. The challenge for this task is that search interface data is represented in high-dimensional and sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to solving this problem. In this approach, we have extended the random forest algorithm with a weighted feature selection method to build the individual classifiers. With this improved random forest algorithm (IRFA), each classifier can be learned from a weighted subset of the feature space so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We have compared our ensemble approach with other well-known classification algorithms, such as SVM, C4.5, Naïve Bayes, and original random forest algorithm (RFA). The experimental results have shown that our method is more effective in detecting search interfaces of the hidden Web.
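The abstract says each tree is learned from a weighted subset of the feature space but does not give the weighting formula. The sketch below is only an illustration of that general idea, not the authors' IRFA: it assumes per-feature weights are already available (the values shown are hypothetical) and draws a distinct feature subset for each tree so that higher-weighted features are chosen more often. The function names `sample_weighted_features` and `draw_subspaces` are made up for this example.

```python
import random


def sample_weighted_features(weights, k, rng):
    """Pick up to k distinct feature indices, favoring larger weights.

    Uses the Efraimidis-Spirakis key trick: each feature draws
    key = u ** (1 / w) with u uniform in [0, 1); the k largest keys win.
    Features with a non-positive weight are never selected.
    """
    keyed = []
    for index, w in enumerate(weights):
        if w > 0:
            u = rng.random()
            keyed.append((u ** (1.0 / w), index))
    keyed.sort(reverse=True)
    return sorted(index for _, index in keyed[:k])


def draw_subspaces(weights, n_trees, k, seed=0):
    """Draw one weighted feature subset per tree (assumed ensemble setup)."""
    rng = random.Random(seed)
    return [sample_weighted_features(weights, k, rng) for _ in range(n_trees)]


if __name__ == "__main__":
    # Hypothetical weights for five form features; feature 1 is uninformative.
    weights = [0.9, 0.0, 0.5, 0.1, 0.7]
    for tree_id, subset in enumerate(draw_subspaces(weights, n_trees=3, k=2)):
        print(tree_id, subset)
```

Each subset would then be handed to an ordinary decision tree learner. If fewer than `k` features have a positive weight, the returned subset is correspondingly shorter.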
Persistent Identifier: http://hdl.handle.net/10722/197392
ISSN: 1027-376X

 

| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Ye, YM | en_US |
| dc.contributor.author | Li, HB | en_US |
| dc.contributor.author | Deng, XB | en_US |
| dc.contributor.author | Huang, JZ | en_US |
| dc.date.accessioned | 2014-05-23T02:48:57Z | - |
| dc.date.available | 2014-05-23T02:48:57Z | - |
| dc.date.issued | 2008 | en_US |
| dc.identifier.citation | International Journal of Computational Linguistics and Chinese Language Processing, 2008, v. 13 n. 4, p. 387-404 | en_US |
| dc.identifier.issn | 1027-376X | - |
| dc.identifier.uri | http://hdl.handle.net/10722/197392 | - |
| dc.description.abstract | Search interface detection is an essential task for extracting information from the hidden Web. The challenge for this task is that search interface data is represented in high-dimensional and sparse features with many missing values. This paper presents a new multi-classifier ensemble approach to solving this problem. In this approach, we have extended the random forest algorithm with a weighted feature selection method to build the individual classifiers. With this improved random forest algorithm (IRFA), each classifier can be learned from a weighted subset of the feature space so that the ensemble of decision trees can fully exploit the useful features of search interface patterns. We have compared our ensemble approach with other well-known classification algorithms, such as SVM, C4.5, Naïve Bayes, and original random forest algorithm (RFA). The experimental results have shown that our method is more effective in detecting search interfaces of the hidden Web. | - |
| dc.language | eng | en_US |
| dc.publisher | The Association for Computational Linguistics and Chinese Language Processing. The Journal's web site is located at http://www.aclclp.org.tw/journal/index.php | - |
| dc.relation.ispartof | International Journal of Computational Linguistics and Chinese Language Processing | en_US |
| dc.subject | Search Interface Detection | - |
| dc.subject | Random Forest | - |
| dc.subject | Hidden Web | - |
| dc.subject | Form Classification | - |
| dc.title | Feature Weighting Random Forest for Detection of Hidden Web Search Interfaces | en_US |
| dc.type | Article | en_US |
| dc.identifier.email | Huang, JZ: jhuang@eti.hku.hk | en_US |
| dc.description.nature | link_to_OA_fulltext | - |
| dc.identifier.hkuros | 164907 | en_US |
| dc.identifier.volume | 13 | en_US |
| dc.identifier.issue | 4 | - |
| dc.identifier.spage | 387 | en_US |
| dc.identifier.epage | 404 | en_US |
| dc.publisher.place | Taiwan | en_US |
| dc.identifier.issnl | 1027-376X | - |
