File Download
There are no files associated with this item.
Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1016/j.datak.2008.04.001
- Scopus: eid_2-s2.0-45249100209
- WOS: WOS:000258448400005
- Find via
Supplementary
- Citations:
- Appears in Collections:
Article: Extracting k most important groups from data efficiently
Title | Extracting k most important groups from data efficiently |
---|---|
Authors | |
Keywords | Optimization and performance |
Issue Date | 2008 |
Publisher | Elsevier BV. The Journal's web site is located at http://www.elsevier.com/locate/datak |
Citation | Data And Knowledge Engineering, 2008, v. 66 n. 2, p. 289-310 How to Cite? |
Abstract | We study an important data analysis operator, which extracts the k most important groups from data (i.e., the k groups with the highest aggregate values). In a data warehousing context, an example of the above query is "find the 10 combinations of product-type and month with the largest sum of sales". The problem is challenging as the potential number of groups can be much larger than the memory capacity. We propose on-demand methods for efficient top-k groups processing, under limited memory size. In particular, we design top-k groups retrieval techniques for three representative scenarios as follows. For the scenario with data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), that exploits available memory for efficient top-k groups computation. Regarding the scenario with unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions. Next, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms. © 2008 Elsevier B.V. All rights reserved. |
Persistent Identifier | http://hdl.handle.net/10722/60621 |
ISSN | 2023 Impact Factor: 2.7 2023 SCImago Journal Rankings: 0.691 |
ISI Accession Number ID | |
References |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Yiu, ML | en_HK |
dc.contributor.author | Mamoulis, N | en_HK |
dc.contributor.author | Hristidis, V | en_HK |
dc.date.accessioned | 2010-05-31T04:15:08Z | - |
dc.date.available | 2010-05-31T04:15:08Z | - |
dc.date.issued | 2008 | en_HK |
dc.identifier.citation | Data And Knowledge Engineering, 2008, v. 66 n. 2, p. 289-310 | en_HK |
dc.identifier.issn | 0169-023X | en_HK |
dc.identifier.uri | http://hdl.handle.net/10722/60621 | - |
dc.description.abstract | We study an important data analysis operator, which extracts the k most important groups from data (i.e., the k groups with the highest aggregate values). In a data warehousing context, an example of the above query is "find the 10 combinations of product-type and month with the largest sum of sales". The problem is challenging as the potential number of groups can be much larger than the memory capacity. We propose on-demand methods for efficient top-k groups processing, under limited memory size. In particular, we design top-k groups retrieval techniques for three representative scenarios as follows. For the scenario with data physically ordered by measure, we propose the write-optimized multi-pass sorted access algorithm (WMSA), that exploits available memory for efficient top-k groups computation. Regarding the scenario with unordered data, we develop the recursive hash algorithm (RHA), which applies hashing with early aggregation, coupled with branch-and-bound techniques and derivation heuristics for tight score bounds of hash partitions. Next, we design the clustered groups algorithm (CGA), which accelerates top-k groups processing for the case where data is clustered by a subset of group-by attributes. Extensive experiments with real and synthetic datasets demonstrate the applicability and efficiency of the proposed algorithms. © 2008 Elsevier B.V. All rights reserved. | en_HK |
dc.language | eng | en_HK |
dc.publisher | Elsevier BV. The Journal's web site is located at http://www.elsevier.com/locate/datak | en_HK |
dc.relation.ispartof | Data and Knowledge Engineering | en_HK |
dc.subject | Optimization and performance | en_HK |
dc.title | Extracting k most important groups from data efficiently | en_HK |
dc.type | Article | en_HK |
dc.identifier.email | Mamoulis, N:nikos@cs.hku.hk | en_HK |
dc.identifier.authority | Mamoulis, N=rp00155 | en_HK |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1016/j.datak.2008.04.001 | en_HK |
dc.identifier.scopus | eid_2-s2.0-45249100209 | en_HK |
dc.identifier.hkuros | 150167 | en_HK |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-45249100209&selection=ref&src=s&origin=recordpage | en_HK |
dc.identifier.volume | 66 | en_HK |
dc.identifier.issue | 2 | en_HK |
dc.identifier.spage | 289 | en_HK |
dc.identifier.epage | 310 | en_HK |
dc.identifier.isi | WOS:000258448400005 | - |
dc.publisher.place | Netherlands | en_HK |
dc.identifier.scopusauthorid | Yiu, ML=8589889600 | en_HK |
dc.identifier.scopusauthorid | Mamoulis, N=6701782749 | en_HK |
dc.identifier.scopusauthorid | Hristidis, V=6507537461 | en_HK |
dc.identifier.issnl | 0169-023X | - |