Conference Paper: Is there Core Chinese Word List? - Exploratory Network Analysis of Modern Standard Chinese vocabulary
| Title | Is there Core Chinese Word List? - Exploratory Network Analysis of Modern Standard Chinese vocabulary 中文核心字詞研究——《現代漢語詞典》詞語網絡分析 |
|---|---|
| Authors | Lam, Wai Ip |
| Issue Date | 18-Feb-2024 |
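| Abstract | It is common in language studies to identify a limited number of frequently used words as basic or core words. Native and non-native-speaking students learn basic words to expand their vocabulary and develop their language proficiency. The usual method of building such a list is to measure word frequency in a corpus of texts. However, the resulting list is sensitive to the size of the corpus and the types of text chosen. The current study explores an alternative method: building a basic or core Chinese word list by network analysis of word formation in the double-syllable and multi-syllable words of Modern Standard Chinese, instead of relying on a corpus of texts. Network analysis is a method for examining relational data, and here it helps identify basic or core Chinese words. The study retrieves all double-syllable and multi-syllable words, excluding reduplicated words, from "The Contemporary Chinese Dictionary (现代汉语词典)". Two networks of Modern Standard Chinese word formation are modeled. A directed network, G1 = (V, E1), where V = { characters/morphemes of words } and E1 = { adjacency of word-formation of V }, models the adjacency relations between morphemes in double-syllable and multi-syllable words. An undirected network, G2 = (V, E2), where V remains the same but E2 = { word-formation of V }, models the word formation of all double-syllable and multi-syllable words. The two networks are analyzed with network analysis software (mainly UCINET and the igraph package for R) to calculate network metrics, including density, several types of centrality (degree, closeness, betweenness), and several types of cohesion such as k-core and community, to shed new light on the features of the Chinese word-formation system. The study found that the double-syllable and multi-syllable words of Modern Standard Chinese follow a power law distribution, which means that a tiny number of characters/morphemes (fewer than 500) form almost half of all words (more than 30000). These characters/morphemes can be taken as the core Chinese words of Modern Standard Chinese. The study echoes both the character-based and the word-based perspectives on the Chinese language. Chinese abstract (translated): Frequently used characters (常用字) are an important concept in corpus linguistics and language teaching. The concept points out that characters and words are not equal: some are more common and more important than others, and should be learned first. The traditional method takes a massive corpus as its material and analyzes usage frequency to find frequently used characters. The corpus must be massive; otherwise the observed frequencies are biased, which reduces the validity of the resulting list. The price of a massive corpus is very high cost, and however large the corpus is, the subjectivity and representativeness of deciding which materials to include cannot be avoided. This study uses network analysis as a new method for computing frequently used morphemes, or frequently used characters for short. It takes The Contemporary Chinese Dictionary (《現代漢語詞典》, 7th edition) as its source. Of the 70414 entries collected there, the 11428 monosyllabic entries are removed, and the remaining 58986 double-syllable and multi-syllable words form the data for analysis. Two word networks are developed to model Modern Chinese. (1) A directed network, G1 = (V, E1), where V = { morphemes in double-syllable and multi-syllable words } and E1 = { immediate adjacency between elements of V within those words }, analyzes which morphemes can stand immediately next to which others; for example, in 「核心詞」 (core word), 「核」 and 「心」 are adjacent, as are 「心」 and 「詞」, but 「核」 and 「詞」 are not. (2) An undirected network, G2 = (V, E2), where V is the same and E2 = { word-formation relations between elements of V that can form words together }, analyzes which morphemes can form words with which others; for example, in 「核心詞」, 「核」, 「心」 and 「詞」 together form one word. The undirected network amounts to all the double-syllable and multi-syllable words in the dictionary, while the directed network captures the order of morphemes within those words. Both networks are analyzed with several network measures, including degree centrality, betweenness centrality, k-core, and community detection algorithms, to examine the power law distribution of morphemes across words, that is, the fact that the vast majority of Modern Chinese words are formed from a very small number of morphemes, and so to identify the most basic morphemes of Modern Chinese, namely the core monomorphemic words (核心單純詞), or 字詞 for short. The results show that, in forming words, morphemes build a multi-layered structure that extends from a core to the periphery: the directed network divides into 36 layers and the undirected network into 50 layers. Morphemes in the outer layers, and the words they form, are rarer, such as 「殆、殇、殍、毂……」; morphemes closer to the core are more common, such as 「子、人、心、生……」. In both the directed and the undirected network, the innermost layer contains fewer than 500 morphemes, yet it covers almost all of the roughly seventy thousand words. |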
| Persistent Identifier | http://hdl.handle.net/10722/369175 |
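
The two network definitions in the abstract can be illustrated with a minimal Python sketch. This is not the study's code (the abstract names UCINET and the igraph package for R). It assumes that each character is one morpheme, reads E2 as pairwise co-occurrence of morphemes within a word, and uses a crude stand-in for the reduplication filter. The function name `build_networks` and the toy word list are hypothetical.

```python
from itertools import combinations


def build_networks(words):
    """Return (directed_edges, undirected_edges) as sets of character pairs."""
    directed = set()    # E1: (a, b) means a is immediately followed by b in some word
    undirected = set()  # E2: frozenset({a, b}) means a and b occur in the same word
    for word in words:
        chars = list(word)
        # Assumption: skip monosyllabic words and, as a crude stand-in for the
        # reduplication filter, words made of one repeated character.
        if len(chars) < 2 or len(set(chars)) == 1:
            continue
        # Directed edges: each character points to the one that follows it.
        for a, b in zip(chars, chars[1:]):
            if a != b:
                directed.add((a, b))
        # Undirected edges: every pair of distinct characters in the word.
        for a, b in combinations(sorted(set(chars)), 2):
            undirected.add(frozenset((a, b)))
    return directed, undirected


# Toy input for illustration only; not taken from the dictionary data.
toy_words = ["核心詞", "核心", "人心", "人生", "心", "人人"]
e1, e2 = build_networks(toy_words)
```

With this toy list, `e1` contains 核→心 and 心→詞 but not 核→詞, while `e2` links all three characters of 「核心詞」, which matches the example given in the abstract.
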
| DC Field | Value | Language |
|---|---|---|
| dc.contributor.author | Lam, Wai Ip | - |
| dc.date.accessioned | 2026-01-21T00:35:11Z | - |
| dc.date.available | 2026-01-21T00:35:11Z | - |
| dc.date.issued | 2024-02-18 | - |
| dc.identifier.uri | http://hdl.handle.net/10722/369175 | - |
| dc.language | chi | - |
| dc.relation.ispartof | The 13th International Conference on Han Characters Education and Research (14/02/2024-18/02/2024, Hawaii) | - |
| dc.title | Is there Core Chinese Word List? - Exploratory Network Analysis of Modern Standard Chinese vocabulary | - |
| dc.title | 中文核心字詞研究——《現代漢語詞典》詞語網絡分析 | - |
| dc.type | Conference_Paper | - |
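
The layered core-to-periphery structure reported in the abstract comes from k-core analysis. The following Python sketch shows one way to compute core numbers by repeatedly peeling off the lowest-degree node, and one possible coverage measure. It is an illustration under assumptions, not the study's procedure: the abstract does not say how coverage was calculated, so `coverage` (the share of words containing at least one morpheme at or above a given core number) is a hypothetical definition, as are the function names and the toy word list.

```python
from collections import defaultdict


def core_numbers(edges):
    """Core number of every node in an undirected simple graph (peeling)."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:
            adj[a].add(b)
            adj[b].add(a)
    degree = {v: len(ns) for v, ns in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # Remove the remaining node with the smallest current degree.
        v = min(remaining, key=lambda n: degree[n])
        k = max(k, degree[v])  # the core level never decreases while peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


def coverage(words, core, min_core):
    """Share of multi-character words with a character of core number >= min_core."""
    multi = [w for w in words if len(w) >= 2]
    hit = [w for w in multi if any(core.get(c, 0) >= min_core for c in w)]
    return len(hit) / len(multi)


# Toy two-character words for illustration only; not the dictionary data.
toy_words = ["人心", "心中", "中人", "人生", "生詞"]
edges = [(w[0], w[1]) for w in toy_words]
core = core_numbers(edges)
share = coverage(toy_words, core, 2)
```

In this toy graph 人, 心 and 中 form the innermost layer (core number 2), 生 and 詞 sit in the outer layer (core number 1), and four of the five words contain an innermost-layer character, so `share` is 0.8.
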
