Conference Paper: Explore or Exploit? Effective Strategies for Disambiguating Large Databases
Title | Explore or Exploit? Effective Strategies for Disambiguating Large Databases |
---|---|
Authors | Cheng, CK; Lo, E; Yang, X; Luk, MH; Li, X; Xie, X |
Issue Date | 2010 |
Publisher | Very Large Data Bases (VLDB). The Proceedings' web site is located at http://www.vldb.org/pvldb/index.html |
Citation | The 36th International Conference on Very Large Data Bases (VLDB), Singapore, 13 - 17 September 2010. In Proceedings of the VLDB Endowment, 2010, v. 3 n. 1, p. 815-825 |
Abstract | Data ambiguity is inherent in applications such as data integration, location-based services, and sensor monitoring. In many situations, it is possible to “clean”, or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement errors, but context information (e.g., what a user is doing) can be used to reduce the imprecision of the location value. In order to obtain a database with a higher quality, we study how to disambiguate a database by appropriately selecting candidates to clean. This problem is challenging because cleaning involves a cost, is limited by a budget, may fail, and may not remove all ambiguities. Moreover, the statistical information about how likely database objects can be cleaned may not be precisely known. We tackle these challenges by proposing two types of algorithms. The first type makes use of greedy heuristics to make sensible decisions; however, these algorithms do not make use of cleaning information and require user input for parameters to achieve high cleaning effectiveness. We propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness. Experimental evaluations on real and synthetic datasets validate the effectiveness and efficiency of our approaches. |
Persistent Identifier | http://hdl.handle.net/10722/224244 |
ISSN | 2150-8097 (2023 Impact Factor: 2.6; 2023 SCImago Journal Rankings: 2.666) |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Cheng, CK | - |
dc.contributor.author | Lo, E | - |
dc.contributor.author | Yang, X | - |
dc.contributor.author | Luk, MH | - |
dc.contributor.author | Li, X | - |
dc.contributor.author | Xie, X | - |
dc.date.accessioned | 2016-03-30T07:06:46Z | - |
dc.date.available | 2016-03-30T07:06:46Z | - |
dc.date.issued | 2010 | - |
dc.identifier.citation | The 36th International Conference on Very Large Data Bases (VLDB), Singapore, 13 - 17 September 2010. In Proceedings of the VLDB Endowment, 2010, v. 3 n. 1, p. 815-825 | - |
dc.identifier.issn | 2150-8097 | - |
dc.identifier.uri | http://hdl.handle.net/10722/224244 | - |
dc.description.abstract | Data ambiguity is inherent in applications such as data integration, location-based services, and sensor monitoring. In many situations, it is possible to “clean”, or remove, ambiguities from these databases. For example, the GPS location of a user is inexact due to measurement errors, but context information (e.g., what a user is doing) can be used to reduce the imprecision of the location value. In order to obtain a database with a higher quality, we study how to disambiguate a database by appropriately selecting candidates to clean. This problem is challenging because cleaning involves a cost, is limited by a budget, may fail, and may not remove all ambiguities. Moreover, the statistical information about how likely database objects can be cleaned may not be precisely known. We tackle these challenges by proposing two types of algorithms. The first type makes use of greedy heuristics to make sensible decisions; however, these algorithms do not make use of cleaning information and require user input for parameters to achieve high cleaning effectiveness. We propose the Explore-Exploit (or EE) algorithm, which gathers valuable information during the cleaning process to determine how the remaining cleaning budget should be invested. We also study how to fine-tune the parameters of EE in order to achieve optimal cleaning effectiveness. Experimental evaluations on real and synthetic datasets validate the effectiveness and efficiency of our approaches. | - |
dc.language | eng | - |
dc.publisher | Very Large Data Bases (VLDB). The Proceedings' web site is located at http://www.vldb.org/pvldb/index.html | - |
dc.relation.ispartof | Very Large Data Bases (VLDB) Conference | - |
dc.title | Explore or Exploit? Effective Strategies for Disambiguating Large Databases | - |
dc.type | Conference_Paper | - |
dc.identifier.email | Cheng, CK: ckcheng@cs.hku.hk | - |
dc.identifier.email | Yang, X: sailingwood@gmail.com | - |
dc.identifier.email | Li, X: thinking.xiang@gmail.com | - |
dc.identifier.email | Xie, X: xiexike@hotmail.com | - |
dc.identifier.authority | Cheng, CK=rp00074 | - |
dc.description.nature | link_to_OA_fulltext | - |
dc.identifier.hkuros | 175921 | - |
dc.identifier.volume | 3 | - |
dc.identifier.issue | 1 | - |
dc.identifier.spage | 815 | - |
dc.identifier.epage | 825 | - |
dc.publisher.place | Singapore | - |
dc.publisher.place | United States | - |
dc.identifier.issnl | 2150-8097 | - |