Scalable, Generic, And Adaptive Systems For Focused Crawling

Gouriten, G; Maniu, S; Senellart, P

Links for fulltext

(May Require Subscription)

Publisher Website: 10.1145/2631775.2631795
Scopus: eid_2-s2.0-84907405833

Supplementary

Citations:
- Scopus: 0
Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Scalable, Generic, And Adaptive Systems For Focused Crawling

Title	Scalable, Generic, And Adaptive Systems For Focused Crawling
Authors	Gouriten, G; Maniu, S; Senellart, P
Keywords	focused crawling; graph exploration; multi-armed bandits
Issue Date	2014
Publisher	ACM.
Citation	The 25th ACM Conference on Hypertext and Social Media (HT'14), Santiago, Chile, 1-4 September 2014. In the Proceedings of the 25th ACM Conference on Hypertext and Social Media, 2014, p. 35-45. DOI: http://dx.doi.org/10.1145/2631775.2631795
Abstract	Focused crawling is the process of exploring a graph iteratively, focusing on parts of the graph relevant to a given topic. It occurs in many situations such as a company collecting data on competition, a journalist surfing the Web to investigate a political scandal, or an archivist recording the activity of influential Twitter users during a presidential election. In all these applications, users explore a graph (e.g., the Web or a social network), nodes are discovered one by one, the total number of exploration steps is constrained, some nodes are more valuable than others, and the objective is to maximize the total value of the crawled subgraph. In this article, we introduce scalable, generic, and adaptive systems for focused crawling. Our first effort is to define an abstraction of focused crawling applicable to a large domain of real-world scenarios. We then propose a generic algorithm, which allows us to identify and optimize the relevant subsystems. We prove the intractability of finding an optimal exploration, even when all the information is available. Taking this intractability into account, we investigate how the crawler can be steered in several experimental graphs. We show the good performance of a greedy strategy and the importance of being able to run at each step a new estimation of the crawling frontier. We then discuss this estimation through heuristics, self-trained regression, and multi-armed bandits. Finally, we investigate their scalability and efficiency in different real-world scenarios and by comparing with state-of-the-art systems.
Persistent Identifier	http://hdl.handle.net/10722/201108
ISBN	9781450329545

DC Field	Value	Language
dc.contributor.author	Gouriten, G	en_US
dc.contributor.author	Maniu, S	en_US
dc.contributor.author	Senellart, P	en_US
dc.date.accessioned	2014-08-21T07:13:35Z	-
dc.date.available	2014-08-21T07:13:35Z	-
dc.date.issued	2014	en_US
dc.identifier.citation	The 25th ACM Conference on Hypertext and Social Media (HT'14), Santiago, Chile, 1-4 September 2014. In the Proceedings of the 25th ACM Conference on Hypertext and Social Media, 2014, p. 35-45	en_US
dc.identifier.isbn	9781450329545	-
dc.identifier.uri	http://hdl.handle.net/10722/201108	-
dc.description.abstract	Focused crawling is the process of exploring a graph iteratively, focusing on parts of the graph relevant to a given topic. It occurs in many situations such as a company collecting data on competition, a journalist surfing the Web to investigate a political scandal, or an archivist recording the activity of influential Twitter users during a presidential election. In all these applications, users explore a graph (e.g., the Web or a social network), nodes are discovered one by one, the total number of exploration steps is constrained, some nodes are more valuable than others, and the objective is to maximize the total value of the crawled subgraph. In this article, we introduce scalable, generic, and adaptive systems for focused crawling. Our first effort is to define an abstraction of focused crawling applicable to a large domain of real-world scenarios. We then propose a generic algorithm, which allows us to identify and optimize the relevant subsystems. We prove the intractability of finding an optimal exploration, even when all the information is available. Taking this intractability into account, we investigate how the crawler can be steered in several experimental graphs. We show the good performance of a greedy strategy and the importance of being able to run at each step a new estimation of the crawling frontier. We then discuss this estimation through heuristics, self-trained regression, and multi-armed bandits. Finally, we investigate their scalability and efficiency in different real-world scenarios and by comparing with state-of-the-art systems.	-
dc.language	eng	en_US
dc.publisher	ACM.	en_US
dc.relation.ispartof	Proceedings of the 25th ACM Conference on Hypertext and Social Media	en_US
dc.subject	focused crawling	-
dc.subject	graph exploration	-
dc.subject	multi-armed bandits	-
dc.title	Scalable, Generic, And Adaptive Systems For Focused Crawling	en_US
dc.type	Conference_Paper	en_US
dc.identifier.email	Maniu, S: smaniu@cs.hku.hk	en_US
dc.identifier.doi	10.1145/2631775.2631795	-
dc.identifier.scopus	eid_2-s2.0-84907405833	-
dc.identifier.hkuros	232987	en_US
dc.identifier.spage	35	-
dc.identifier.epage	45	-
dc.publisher.place	New York	-
