File Download
There are no files associated with this item.
Supplementary
-
Citations:
- Scopus: 0
- Appears in Collections:
Article: Constructing compressed suffix arrays with large alphabets
Title | Constructing compressed suffix arrays with large alphabets |
---|---|
Authors | |
Issue Date | 2003 |
Publisher | Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/ |
Citation | Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 How to Cite? |
Abstract | Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003. |
Persistent Identifier | http://hdl.handle.net/10722/93476 |
ISSN | 2023 SCImago Journal Rankings: 0.606 |
References |
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Hon, WK | en_HK |
dc.contributor.author | Lam, TW | en_HK |
dc.contributor.author | Sadakane, K | en_HK |
dc.contributor.author | Sung, WK | en_HK |
dc.date.accessioned | 2010-09-25T15:02:19Z | - |
dc.date.available | 2010-09-25T15:02:19Z | - |
dc.date.issued | 2003 | en_HK |
dc.identifier.citation | Lecture Notes In Computer Science (Including Subseries Lecture Notes In Artificial Intelligence And Lecture Notes In Bioinformatics), 2003, v. 2906, p. 240-249 | en_HK |
dc.identifier.issn | 0302-9743 | en_HK |
dc.identifier.uri | http://hdl.handle.net/10722/93476 | - |
dc.description.abstract | Recent research in compressing suffix arrays has resulted in two breakthrough indexing data structures, namely, compressed suffix arrays (CSA) [7] and FM-index [5]. Either of them makes it feasible to store a full-text index in the main memory even for a piece of text data with a few billion characters (such as human DNA). However, constructing such indexing data structures with limited working memory (i.e., without constructing suffix arrays) is not a trivial task. This paper addresses this problem. Currently, only CSA admits a space-efficient construction algorithm [15]. For a text T of length n over an alphabet Ó, this algorithm requires O(|∑|n log n) time and (2Ho + 1 + ε)n bits of working space, where Ho is the 0-th order empirical entropy of T and ε is any non-zero constant. This algorithm is good enough when the alphabet size |∑| is small. It is not practical for text data containing protein, Chinese or Japanese, where the alphabet may include up to a few thousand characters. The main contribution of this paper is a new algorithm which can construct CSA in O(n log n) time using (Ho + 2 + ε)n bits of working space. Note that the running time of our algorithm is independent of the alphabet size and the space requirement is smaller as it is likely that Ho > 1. This paper also makes contribution to the space-efficient construction of FM-index. We show that FM-index can indeed be constructed from CSA directly in O(n) time. © Springer-Verlag Berlin Heidelberg 2003. | en_HK |
dc.language | eng | en_HK |
dc.publisher | Springer Verlag. The Journal's web site is located at http://springerlink.com/content/105633/ | en_HK |
dc.relation.ispartof | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) | en_HK |
dc.title | Constructing compressed suffix arrays with large alphabets | en_HK |
dc.type | Article | en_HK |
dc.identifier.email | Lam, TW:twlam@cs.hku.hk | en_HK |
dc.identifier.authority | Lam, TW=rp00135 | en_HK |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.scopus | eid_2-s2.0-35248823623 | en_HK |
dc.identifier.hkuros | 91576 | en_HK |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-35248823623&selection=ref&src=s&origin=recordpage | en_HK |
dc.identifier.volume | 2906 | en_HK |
dc.identifier.spage | 240 | en_HK |
dc.identifier.epage | 249 | en_HK |
dc.publisher.place | Germany | en_HK |
dc.identifier.scopusauthorid | Hon, WK=7004282818 | en_HK |
dc.identifier.scopusauthorid | Lam, TW=7202523165 | en_HK |
dc.identifier.scopusauthorid | Sadakane, K=7005716583 | en_HK |
dc.identifier.scopusauthorid | Sung, WK=13310059700 | en_HK |
dc.identifier.issnl | 0302-9743 | - |