Explore or Exploit? Effective Strategies for Disambiguating Large Databases

Cheng, CK; Lo, E; Yang, X; Luk, MH; Li, X; Xie, X

Appears in Collections:
- Computer Science: Conference papers

Conference Paper: Explore or Exploit? Effective Strategies for Disambiguating Large Databases

Title	Explore or Exploit? Effective Strategies for Disambiguating Large Databases
Authors	Cheng, CK; Lo, E; Yang, X; Luk, MH; Li, X; Xie, X
Issue Date	2010
Publisher	Very Large Data Bases (VLDB). The Proceedings' web site is located at http://www.vldb.org/pvldb/index.html
Citation	The 36th International Conference on Very Large Data Bases (VLDB), Singapore, 13 - 17 September 2010. In Proceedings of the VLDB Endowment, 2010, v. 3 n. 1, p. 815-825
Abstract	Data ambiguity is inherent in applications such as data integration, location-based services, and sensor monitoring. In many situations, it is possible to “clean”, or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement errors, but context information (e.g., what a user is doing) can be used to reduce the imprecision of the location value. In order to obtain a database with a higher quality, we study how to disambiguate a database by appropriately selecting candidates to clean. This problem is challenging because cleaning involves a cost, is limited by a budget, may fail, and may not remove all ambiguities. Moreover, the statistical information about how likely database objects can be cleaned may not be precisely known. We tackle these challenges by proposing two types of algorithms. The first type makes use of greedy heuristics to make sensible decisions; however, these algorithms do not make use of cleaning information and require user input for parameters to achieve high cleaning effectiveness. We propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness. Experimental evaluations on real and synthetic datasets validate the effectiveness and efficiency of our approaches.
Persistent Identifier	http://hdl.handle.net/10722/224244
ISSN	2150-8097
2023 Impact Factor	2.6
2023 SCImago Journal Rankings	2.666

DC Field	Value	Language
dc.contributor.author	Cheng, CK	-
dc.contributor.author	Lo, E	-
dc.contributor.author	Yang, X	-
dc.contributor.author	Luk, MH	-
dc.contributor.author	Li, X	-
dc.contributor.author	Xie, X	-
dc.date.accessioned	2016-03-30T07:06:46Z	-
dc.date.available	2016-03-30T07:06:46Z	-
dc.date.issued	2010	-
dc.identifier.citation	The 36th International Conference on Very Large Data Bases (VLDB), Singapore, 13 - 17 September 2010. In Proceedings of the VLDB Endowment, 2010, v. 3 n. 1, p. 815-825	-
dc.identifier.issn	2150-8097	-
dc.identifier.uri	http://hdl.handle.net/10722/224244	-
dc.description.abstract	Data ambiguity is inherent in applications such as data integration, location-based services, and sensor monitoring. In many situations, it is possible to “clean”, or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement errors, but context information (e.g., what a user is doing) can be used to reduce the imprecision of the location value. In order to obtain a database with a higher quality, we study how to disambiguate a database by appropriately selecting candidates to clean. This problem is challenging because cleaning involves a cost, is limited by a budget, may fail, and may not remove all ambiguities. Moreover, the statistical information about how likely database objects can be cleaned may not be precisely known. We tackle these challenges by proposing two types of algorithms. The first type makes use of greedy heuristics to make sensible decisions; however, these algorithms do not make use of cleaning information and require user input for parameters to achieve high cleaning effectiveness. We propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness. Experimental evaluations on real and synthetic datasets validate the effectiveness and efficiency of our approaches.	-
dc.language	eng	-
dc.publisher	Very Large Data Bases (VLDB). The Proceedings' web site is located at http://www.vldb.org/pvldb/index.html	-
dc.relation.ispartof	Very Large Data Bases (VLDB) Conference	-
dc.title	Explore or Exploit? Effective Strategies for Disambiguating Large Databases	-
dc.type	Conference_Paper	-
dc.identifier.email	Cheng, CK: ckcheng@cs.hku.hk	-
dc.identifier.email	Yang, X: sailingwood@gmail.com	-
dc.identifier.email	Li, X: thinking.xiang@gmail.com	-
dc.identifier.email	Xie, X: xiexike@hotmail.com	-
dc.identifier.authority	Cheng, CK=rp00074	-
dc.description.nature	link_to_OA_fulltext	-
dc.identifier.hkuros	175921	-
dc.identifier.volume	3	-
dc.identifier.issue	1	-
dc.identifier.spage	815	-
dc.identifier.epage	825	-
dc.publisher.place	Singapore	-
dc.publisher.place	United States	-
dc.identifier.issnl	2150-8097	-
