Article: Characteristics of character usage in Chinese Web searching

Title: Characteristics of character usage in Chinese Web searching
Authors: Chau, M; Lu, Y; Fang, X; Yang, CC
Keywords: Character usage; Chinese; Search log analysis; Web mining; Zipf distribution
Issue Date: 2009
Publisher: Elsevier Ltd. The Journal's web site is located at http://www.elsevier.com/locate/infoproman
Citation: Information Processing and Management, 2009, v. 45 n. 1, p. 115-130
Abstract: The use of non-English Web search engines has been prevalent. Given the popularity of Chinese Web searching and the unique characteristics of Chinese language, it is imperative to conduct studies with focuses on the analysis of Chinese Web search queries. In this paper, we report our research on the character usage of Chinese search logs from a Web search engine in Hong Kong. By examining the distribution of search query terms, we found that users tended to use more diversified terms and that the usage of characters in search queries was quite different from the character usage of general online information in Chinese. After studying the Zipf distribution of n-grams with different values of n, we found that the curve of unigram is the most curved one of all while the bigram curve follows the Zipf distribution best, and that the curves of n-grams with larger n (n = 3-6) had similar structures with β-values in the range of 0.66-0.86. The distribution of combined n-grams was also studied. All the analyses are performed on the data both before and after the removal of function terms and incomplete terms and similar findings are revealed. We believe the findings from this study have provided some insights into further research in non-English Web searching and will assist in the design of more effective Chinese Web search engines. © 2008 Elsevier Ltd. All rights reserved.
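The abstract describes a rank-frequency (Zipf) analysis of character n-grams in search queries. The sketch below is only an illustration of that kind of analysis, not the authors' procedure: the abstract does not say how β was estimated, so the least-squares fit of log frequency against log rank is an assumption, as are the function names (`char_ngrams`, `rank_frequency`, `zipf_beta`), the choice to drop whitespace before forming n-grams, and the two made-up sample queries.

```python
import math
from collections import Counter


def char_ngrams(query, n):
    """Return the overlapping character n-grams of a query (whitespace dropped; an assumption)."""
    chars = [c for c in query if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def rank_frequency(queries, n):
    """Count n-grams over all queries and return their frequencies in descending order."""
    counts = Counter()
    for q in queries:
        counts.update(char_ngrams(q, n))
    return sorted(counts.values(), reverse=True)


def zipf_beta(freqs):
    """Estimate beta in f(r) ~ C / r**beta by least squares on log f versus log r.

    This fitting method is an assumption; the abstract does not specify one.
    """
    if len(freqs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # the log-log slope is -beta


# Hypothetical sample queries, for illustration only.
sample_queries = ["香港天氣", "香港地圖"]
bigram_freqs = rank_frequency(sample_queries, 2)
```

On a real search log, `zipf_beta(rank_frequency(queries, n))` could be computed for each n and the resulting values compared across n-gram lengths, which is the kind of comparison the abstract reports.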
Persistent Identifier: http://hdl.handle.net/10722/60215
ISSN: 0306-4573
2023 Impact Factor: 7.4
2023 SCImago Journal Rankings: 2.134
ISI Accession Number ID: WOS:000264937000009
Funding Agency: University of Hong Kong
Funding Information:

This research has been supported in part by a Seed Funding for Basic Research (PI: M. Chau) from the University of Hong Kong. We would like to thank Timmy Yu from Timway Hong Kong Search Engine Limited for his help in providing the search log data used in this study. We also thank Jackey Ng, Raygen Lam, and Ivan Ho from the University of Hong Kong for their help in data processing.


DC Field | Value | Language
dc.contributor.author | Chau, M | en_HK
dc.contributor.author | Lu, Y | en_HK
dc.contributor.author | Fang, X | en_HK
dc.contributor.author | Yang, CC | en_HK
dc.date.accessioned | 2010-05-31T04:06:03Z | -
dc.date.available | 2010-05-31T04:06:03Z | -
dc.date.issued | 2009 | en_HK
dc.identifier.citation | Information Processing And Management, 2009, v. 45 n. 1, p. 115-130 | en_HK
dc.identifier.issn | 0306-4573 | en_HK
dc.identifier.uri | http://hdl.handle.net/10722/60215 | -
dc.description.abstract | The use of non-English Web search engines has been prevalent. Given the popularity of Chinese Web searching and the unique characteristics of Chinese language, it is imperative to conduct studies with focuses on the analysis of Chinese Web search queries. In this paper, we report our research on the character usage of Chinese search logs from a Web search engine in Hong Kong. By examining the distribution of search query terms, we found that users tended to use more diversified terms and that the usage of characters in search queries was quite different from the character usage of general online information in Chinese. After studying the Zipf distribution of n-grams with different values of n, we found that the curve of unigram is the most curved one of all while the bigram curve follows the Zipf distribution best, and that the curves of n-grams with larger n (n = 3-6) had similar structures with β-values in the range of 0.66-0.86. The distribution of combined n-grams was also studied. All the analyses are performed on the data both before and after the removal of function terms and incomplete terms and similar findings are revealed. We believe the findings from this study have provided some insights into further research in non-English Web searching and will assist in the design of more effective Chinese Web search engines. © 2008 Elsevier Ltd. All rights reserved. | en_HK
dc.language | eng | en_HK
dc.publisher | Elsevier Ltd. The Journal's web site is located at http://www.elsevier.com/locate/infoproman | en_HK
dc.relation.ispartof | Information Processing and Management | en_HK
dc.subject | Character usage | en_HK
dc.subject | Chinese | en_HK
dc.subject | Search log analysis | en_HK
dc.subject | Web mining | en_HK
dc.subject | Zipf distribution | en_HK
dc.title | Characteristics of character usage in Chinese Web searching | en_HK
dc.type | Article | en_HK
dc.identifier.email | Chau, M: mchau@hkucc.hku.hk | en_HK
dc.identifier.authority | Chau, M=rp01051 | en_HK
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1016/j.ipm.2008.06.005 | en_HK
dc.identifier.scopus | eid_2-s2.0-55949105178 | en_HK
dc.identifier.hkuros | 159270 | en_HK
dc.relation.references | http://www.scopus.com/mlt/select.url?eid=2-s2.0-55949105178&selection=ref&src=s&origin=recordpage | en_HK
dc.identifier.volume | 45 | en_HK
dc.identifier.issue | 1 | en_HK
dc.identifier.spage | 115 | en_HK
dc.identifier.epage | 130 | en_HK
dc.identifier.isi | WOS:000264937000009 | -
dc.publisher.place | United Kingdom | en_HK
dc.identifier.scopusauthorid | Chau, M=7006073763 | en_HK
dc.identifier.scopusauthorid | Lu, Y=26321148700 | en_HK
dc.identifier.scopusauthorid | Fang, X=9273103600 | en_HK
dc.identifier.scopusauthorid | Yang, CC=7407740308 | en_HK
dc.identifier.issnl | 0306-4573 | -
