Article: Out-domain Chinese new word detection with statistics-based character embedding

Title: Out-domain Chinese new word detection with statistics-based character embedding
Authors: LIANG, Y; YANG, M; Zhu, J; Yiu, SM
Keywords: Chinese character embedding; Chinese new word detection; Chinese word boundary detection
Issue Date: 2019
Publisher: Cambridge University Press. The Journal's web site is located at http://journals.cambridge.org/action/displayJournal?jid=NLE
Citation: Natural Language Engineering, 2019, v. 25 n. 2, p. 239-255
Abstract: Unlike English and other Western languages, many Asian languages such as Chinese and Japanese do not delimit words by space. Word segmentation and new word detection are therefore key steps in processing these languages. Chinese word segmentation can be considered as a part-of-speech (POS)-tagging problem. We can segment corpus by assigning a label for each character which indicates the position of the character in a word (e.g., “B” for word beginning, and “E” for the end of the word, etc.). Chinese word segmentation seems to be well studied. Machine learning models such as conditional random field (CRF) and bi-directional long short-term memory (LSTM) have shown outstanding performances on this task. However, the segmentation accuracies drop significantly when applying the same approaches to out-domain cases, in which high-quality in-domain training data are not available. An example of out-domain applications is the new word detection in Chinese microblogs for which the availability of high-quality corpus is limited. In this paper, we focus on out-domain Chinese new word detection. We first design a new method Edge Likelihood (EL) for Chinese word boundary detection. Then we propose a domain-independent Chinese new word detector (DICND); each Chinese character is represented as a low-dimensional vector in the proposed framework, and segmentation-related features of the character are used as the values in the vector.
Persistent Identifier: http://hdl.handle.net/10722/277570
ISSN: 1351-3249
2022 Impact Factor: 2.5
2020 SCImago Journal Rankings: 0.290
ISI Accession Number ID: WOS:000462866100002
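
The abstract describes segmentation as labeling each character with its position in a word ("B" for word beginning, "E" for word end, etc.). The following is a minimal sketch of that labeling idea only, not the paper's EL or DICND method. The "M" (middle) and "S" (single-character word) tags, the function names, and the sample sentence are assumptions added for illustration; the abstract names only "B" and "E".

```python
def tag_characters(words):
    """Convert a pre-segmented sentence into (character, tag) pairs.

    Tags: B = word beginning, E = word end (both named in the abstract);
    M = middle, S = single-character word (assumed here for illustration).
    """
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append((word, "S"))
            continue
        for i, ch in enumerate(word):
            if i == 0:
                tag = "B"
            elif i == len(word) - 1:
                tag = "E"
            else:
                tag = "M"
            tagged.append((ch, tag))
    return tagged


def words_from_tags(tagged):
    """Rebuild words from (character, tag) pairs by cutting after E or S."""
    words, current = [], ""
    for ch, tag in tagged:
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # keep any trailing characters that never reached an E/S tag
        words.append(current)
    return words


# Hypothetical example sentence, already segmented into three words.
sentence = ["我", "喜欢", "自然语言"]
tags = tag_characters(sentence)
print(tags)
print(words_from_tags(tags))
```

A tagger such as a CRF or a bi-directional LSTM would predict the tag sequence from raw text; the second function shows how a predicted sequence maps back to word boundaries.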

 

DC Field | Value | Language
dc.contributor.author | LIANG, Y | -
dc.contributor.author | YANG, M | -
dc.contributor.author | Zhu, J | -
dc.contributor.author | Yiu, SM | -
dc.date.accessioned | 2019-09-20T08:53:34Z | -
dc.date.available | 2019-09-20T08:53:34Z | -
dc.date.issued | 2019 | -
dc.identifier.citation | Natural Language Engineering, 2019, v. 25 n. 2, p. 239-255 | -
dc.identifier.issn | 1351-3249 | -
dc.identifier.uri | http://hdl.handle.net/10722/277570 | -
dc.language | eng | -
dc.publisher | Cambridge University Press. The Journal's web site is located at http://journals.cambridge.org/action/displayJournal?jid=NLE | -
dc.relation.ispartof | Natural Language Engineering | -
dc.rights | Natural Language Engineering. Copyright © Cambridge University Press. | -
dc.rights | This article has been published in a revised form in [Journal] [http://doi.org/XXX]. This version is free to view and download for private research and study only. Not for re-distribution, re-sale or use in derivative works. © copyright holder. | -
dc.subject | Chinese character embedding | -
dc.subject | Chinese new word detection | -
dc.subject | Chinese word boundary detection | -
dc.title | Out-domain Chinese new word detection with statistics-based character embedding | -
dc.type | Article | -
dc.identifier.email | Yiu, SM: smyiu@cs.hku.hk | -
dc.identifier.authority | Yiu, SM=rp00207 | -
dc.description.nature | link_to_subscribed_fulltext | -
dc.identifier.doi | 10.1017/S1351324918000463 | -
dc.identifier.scopus | eid_2-s2.0-85061363509 | -
dc.identifier.hkuros | 305931 | -
dc.identifier.volume | 25 | -
dc.identifier.issue | 2 | -
dc.identifier.spage | 239 | -
dc.identifier.epage | 255 | -
dc.identifier.isi | WOS:000462866100002 | -
dc.publisher.place | United Kingdom | -
dc.identifier.issnl | 1351-3249 | -
