Links for fulltext
(May Require Subscription)
- Publisher Website: 10.1016/j.ipm.2008.06.005
- Scopus: eid_2-s2.0-55949105178
- WOS: WOS:000264937000009
Article: Characteristics of character usage in Chinese Web searching
Title | Characteristics of character usage in Chinese Web searching |
---|---|
Authors | Chau, M; Lu, Y; Fang, X; Yang, CC |
Keywords | Character usage; Chinese; Search log analysis; Web mining; Zipf distribution |
Issue Date | 2009 |
Publisher | Elsevier Ltd. The Journal's web site is located at http://www.elsevier.com/locate/infoproman |
Citation | Information Processing And Management, 2009, v. 45 n. 1, p. 115-130 |
Abstract | The use of non-English Web search engines has become prevalent. Given the popularity of Chinese Web searching and the unique characteristics of the Chinese language, it is imperative to conduct studies focusing on the analysis of Chinese Web search queries. In this paper, we report our research on the character usage of Chinese search logs from a Web search engine in Hong Kong. By examining the distribution of search query terms, we found that users tended to use more diversified terms and that the usage of characters in search queries was quite different from the character usage of general online information in Chinese. After studying the Zipf distribution of n-grams with different values of n, we found that the unigram curve was the most curved of all, while the bigram curve followed the Zipf distribution best, and that the curves of n-grams with larger n (n = 3-6) had similar structures, with β-values in the range of 0.66-0.86. The distribution of combined n-grams was also studied. All analyses were performed on the data both before and after the removal of function terms and incomplete terms, and yielded similar findings. We believe the findings from this study provide some insights for further research in non-English Web searching and will assist in the design of more effective Chinese Web search engines. © 2008 Elsevier Ltd. All rights reserved. |
Persistent Identifier | http://hdl.handle.net/10722/60215 |
ISSN | 0306-4573 (2023 Impact Factor: 7.4; 2023 SCImago Journal Rankings: 2.134) |
ISI Accession Number ID | WOS:000264937000009 |
Funding Information | This research has been supported in part by a Seed Funding for Basic Research (PI: M. Chau) from the University of Hong Kong. We would like to thank Timmy Yu from Timway Hong Kong Search Engine Limited for his help in providing the search log data used in this study. We also thank Jackey Ng, Raygen Lam, and Ivan Ho from the University of Hong Kong for their help in data processing. |
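
The abstract refers to fitting a Zipf distribution to character n-grams and reporting β-values. As a rough illustration only, the sketch below counts character n-grams in a list of queries and estimates β from a least-squares line on the log-log rank/frequency plot. The function names (`char_ngrams`, `ngram_counts`, `zipf_beta`), the choice to form n-grams within whitespace-separated terms, and the least-squares fitting method are all assumptions made for this example; the record does not state how the authors segmented queries or estimated β.

```python
import math
from collections import Counter


def char_ngrams(query, n):
    """Return contiguous character n-grams of a query.

    Assumption: n-grams are formed within each whitespace-separated
    term and never span a space. The record does not say how the
    original study handled term boundaries.
    """
    grams = []
    for term in query.split():
        grams.extend(term[i:i + n] for i in range(len(term) - n + 1))
    return grams


def ngram_counts(queries, n):
    """Count character n-grams over an iterable of query strings."""
    counts = Counter()
    for query in queries:
        counts.update(char_ngrams(query, n))
    return counts


def zipf_beta(counts):
    """Estimate beta in f(r) ~ C / r**beta.

    Fits an ordinary least-squares line to (log rank, log frequency)
    and returns the negated slope. This is one common estimator, not
    necessarily the one used in the paper.
    """
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct n-grams")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx
```

With a real search log, one would call `zipf_beta(ngram_counts(queries, n))` for each n of interest and compare the resulting values; a β close to 1 corresponds to the classic Zipf curve.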
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Chau, M | en_HK |
dc.contributor.author | Lu, Y | en_HK |
dc.contributor.author | Fang, X | en_HK |
dc.contributor.author | Yang, CC | en_HK |
dc.date.accessioned | 2010-05-31T04:06:03Z | - |
dc.date.available | 2010-05-31T04:06:03Z | - |
dc.date.issued | 2009 | en_HK |
dc.identifier.citation | Information Processing And Management, 2009, v. 45 n. 1, p. 115-130 | en_HK |
dc.identifier.issn | 0306-4573 | en_HK |
dc.identifier.uri | http://hdl.handle.net/10722/60215 | - |
dc.language | eng | en_HK |
dc.publisher | Elsevier Ltd. The Journal's web site is located at http://www.elsevier.com/locate/infoproman | en_HK |
dc.relation.ispartof | Information Processing and Management | en_HK |
dc.subject | Character usage | en_HK |
dc.subject | Chinese | en_HK |
dc.subject | Search log analysis | en_HK |
dc.subject | Web mining | en_HK |
dc.subject | Zipf distribution | en_HK |
dc.title | Characteristics of character usage in Chinese Web searching | en_HK |
dc.type | Article | en_HK |
dc.identifier.email | Chau, M: mchau@hkucc.hku.hk | en_HK |
dc.identifier.authority | Chau, M=rp01051 | en_HK |
dc.description.nature | link_to_subscribed_fulltext | - |
dc.identifier.doi | 10.1016/j.ipm.2008.06.005 | en_HK |
dc.identifier.scopus | eid_2-s2.0-55949105178 | en_HK |
dc.identifier.hkuros | 159270 | en_HK |
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-55949105178&selection=ref&src=s&origin=recordpage | en_HK |
dc.identifier.volume | 45 | en_HK |
dc.identifier.issue | 1 | en_HK |
dc.identifier.spage | 115 | en_HK |
dc.identifier.epage | 130 | en_HK |
dc.identifier.isi | WOS:000264937000009 | - |
dc.publisher.place | United Kingdom | en_HK |
dc.identifier.scopusauthorid | Chau, M=7006073763 | en_HK |
dc.identifier.scopusauthorid | Lu, Y=26321148700 | en_HK |
dc.identifier.scopusauthorid | Fang, X=9273103600 | en_HK |
dc.identifier.scopusauthorid | Yang, CC=7407740308 | en_HK |
dc.identifier.issnl | 0306-4573 | - |