Cleaning algorithms for novel applications

Mo, Luyi; 莫璐怡

DOI: 10.5353/th_b5699933

Appears in Collections:
- HKU Theses Online
- Computer Science: Theses

postgraduate thesis: Cleaning algorithms for novel applications

Title	Cleaning algorithms for novel applications
Authors	Mo, Luyi 莫璐怡
Issue Date	2015
Publisher	The University of Hong Kong (Pokfulam, Hong Kong)
Citation	Mo, L. [莫璐怡]. (2015). Cleaning algorithms for novel applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5699933
Abstract	The information managed in emerging applications, such as location-based services, sensor networks, and crowdsourcing systems, is usually imperfect. In many situations, the data can be cleaned (that is, its imperfection removed or reduced) by performing appropriate operations. In this thesis, we study the cleaning problem under limited resources for two novel applications: querying probabilistic data, and collecting data from human intelligence tasks in crowdsourcing environments. Probabilistic databases have recently been developed to handle uncertain data. For example, the temperature readings in a sensor network may be uncertain because the latest reading is not available from every sensor at every moment. A probabilistic database can capture the distributions of the readings' actual values and supports the evaluation of probabilistic queries on the data. However, data uncertainty may lead to ambiguous query results. Performing cleaning operations on the data, for example, probing some sensors for their latest readings, can reduce the ambiguity in query results. In this thesis, we first study how to quantify the ambiguity of the results returned by a probabilistic top-k query. We develop efficient algorithms to compute the quality of this query under the possible world semantics. We further address the cleaning of a probabilistic database in order to improve top-k query quality. Specifically, we take into account that a cleaning operation incurs a cost and may fail. We propose optimal cleaning algorithms as well as several heuristics to select the data to clean under a limited budget. In a crowdsourcing system, Human Intelligence Tasks (HITs) (e.g., translating sentences, matching photos, tagging videos with keywords) can be conveniently specified to collect data. HITs are made available to a large pool of workers, who are paid upon completing the HITs they have selected. Since workers may be casual Internet users, their answers are rarely perfect. If more workers are employed to perform a HIT, the quality of the HIT’s answer can be statistically improved. Hence, choosing the number of workers (the plurality) assigned to each HIT is an effective way to reduce (or clean) the imperfection of the collected data (i.e., the HITs’ answers). In this thesis, we address the important problem of determining the plurality of each HIT so that the overall answer quality is optimized. We propose a dynamic programming (DP) algorithm for solving the plurality assignment problem (PAP). We identify two interesting properties, namely monotonicity and diminishing return, which are satisfied by a HIT if the quality of its answer increases monotonically, at a decreasing rate, with its plurality. We show that, for HITs that satisfy the two properties (e.g., multiple-choice-question HITs), the PAP is approximable, and we propose an efficient greedy algorithm for this case.
Degree	Doctor of Philosophy
Subject	Database management; Data mining
Dept/Program	Computer Science
Persistent Identifier	http://hdl.handle.net/10722/223008
HKU Library Item ID	b5699933
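
The greedy idea mentioned in the abstract for the plurality assignment problem can be illustrated with a minimal sketch. It is not the thesis's algorithm: the quality function below (the chance that at least one of k workers answers correctly) is a hypothetical stand-in, chosen only because it satisfies monotonicity and diminishing return, and the names `quality`, `greedy_plurality`, `accuracies`, `costs`, and `budget` are assumptions made for this example.

```python
import heapq


def quality(p, k):
    """Hypothetical answer quality of a HIT with k workers of accuracy p.

    1 - (1 - p)^k increases with k at a decreasing rate, so it satisfies
    monotonicity and diminishing return.
    """
    return 1.0 - (1.0 - p) ** k


def greedy_plurality(accuracies, costs, budget):
    """Assign workers one at a time to the HIT with the best gain per cost.

    accuracies[i]: assumed worker accuracy for HIT i
    costs[i]:      assumed payment for one worker on HIT i
    budget:        total amount that may be spent
    Returns the plurality (number of workers) chosen for each HIT.
    """
    plurality = [0] * len(accuracies)
    heap = []  # max-heap on gain per cost, via negated keys
    for i, (p, c) in enumerate(zip(accuracies, costs)):
        gain = quality(p, 1) - quality(p, 0)
        heapq.heappush(heap, (-gain / c, i))

    remaining = budget
    while heap:
        _, i = heapq.heappop(heap)
        if costs[i] > remaining:
            # The budget only shrinks, so this HIT can never be afforded again.
            continue
        plurality[i] += 1
        remaining -= costs[i]
        k = plurality[i]
        gain = quality(accuracies[i], k + 1) - quality(accuracies[i], k)
        heapq.heappush(heap, (-gain / costs[i], i))
    return plurality


print(greedy_plurality([0.6, 0.9], [1, 1], 4))  # prints [3, 1]
```

In this example the less reliable HIT receives three of the four workers: after one worker, the second HIT's marginal gain drops to 0.09, which is below every remaining gain of the first HIT that fits in the budget.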

DC Field	Value	Language
dc.contributor.author	Mo, Luyi	-
dc.contributor.author	莫璐怡	-
dc.date.accessioned	2016-02-17T23:14:30Z	-
dc.date.available	2016-02-17T23:14:30Z	-
dc.date.issued	2015	-
dc.identifier.citation	Mo, L. [莫璐怡]. (2015). Cleaning algorithms for novel applications. (Thesis). University of Hong Kong, Pokfulam, Hong Kong SAR. Retrieved from http://dx.doi.org/10.5353/th_b5699933	-
dc.identifier.uri	http://hdl.handle.net/10722/223008	-
dc.language	eng	-
dc.publisher	The University of Hong Kong (Pokfulam, Hong Kong)	-
dc.relation.ispartof	HKU Theses Online (HKUTO)	-
dc.rights	This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.	-
dc.rights	The author retains all proprietary rights, (such as patent rights) and the right to use in future works.	-
dc.subject.lcsh	Database management	-
dc.subject.lcsh	Data mining	-
dc.title	Cleaning algorithms for novel applications	-
dc.type	PG_Thesis	-
dc.identifier.hkul	b5699933	-
dc.description.thesisname	Doctor of Philosophy	-
dc.description.thesislevel	Doctoral	-
dc.description.thesisdiscipline	Computer Science	-
dc.description.nature	published_or_final_version	-
dc.identifier.doi	10.5353/th_b5699933	-
dc.identifier.mmsid	991018967399703414	-
