
Chinese Characters Sorting

From EverybodyWiki Bios & Wiki



English dictionaries and indexes are normally sorted in alphabetical order for quick lookup. Chinese, however, is written with tens of thousands of different characters rather than an alphabet, so lexicographical sorting has to rely on more complicated methods. [lower-alpha 1]

The sorting methods adopted by Chinese dictionaries can be traditionally divided into three categories: (1) form-based sorting, including stroke-based sorting and component-based sorting, which further includes radical-based sorting, etc., (2) sound-based sorting, including Pinyin-based sorting and Bopomofo-based sorting, and (3) meaning-based sorting. [1] In modern Chinese, we also have frequency lists where words or characters are sorted according to their frequencies of use in a text corpus.

Chinese dictionaries include character dictionaries (zidian) and word dictionaries (cidian). Chinese word sorting is based on character sorting. Single-character words are arranged by character sorting directly, and multi-character words can be sorted character by character in a similar way. The following sections introduce the sorting methods currently in use in more detail, focusing on the more popular and effective ones and giving examples of their application.

Form-based sorting

In this category of sorting methods, words are arranged according to various features of the forms or shapes of Chinese characters. Compared with sound-based sorting, form-based sorting has the advantages of (a) allowing a word to be looked up without knowing its pronunciation, and (b) collating large character sets effectively without support from other methods. Form-based sorting has two subcategories: stroke-based sorting and component-based sorting.

Stroke-based sorting

Strokes (pinyin: bǐhuà; traditional Chinese: 筆畫; simplified Chinese: 笔画) are the most basic writing units of Chinese characters. The main methods of sorting Chinese characters by their strokes include:

Stroke-count sorting

This method arranges characters in ascending order of stroke count, so a character with fewer strokes is placed before those with more. For example, the different characters in "漢字筆畫, 汉字笔画" (Chinese character strokes) are sorted into "汉(5)字(6)画(8)笔(10)[筆(12)畫(12)]漢(14)", where stroke counts are given in brackets. (Note that 筆 and 畫 both have 12 strokes, so their relative order cannot be determined by stroke-count sorting alone.) Stroke-count sorting was used in the Kangxi Dictionary, compiled in the 1710s, to arrange the radicals and the characters under each radical. [2]
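As a minimal sketch, stroke-count sorting can be expressed as a sort on a single integer key. The table `STROKE_COUNT` and the function name are hypothetical; the counts are the ones quoted in the example above, not a full character database.

```python
# Stroke counts quoted in the example above (illustrative table only).
STROKE_COUNT = {"汉": 5, "字": 6, "画": 8, "笔": 10, "筆": 12, "畫": 12, "漢": 14}


def sort_by_stroke_count(chars):
    """Sort characters by ascending stroke count.

    sorted() is stable, so characters with equal counts (筆 and 畫) simply
    keep their input order: this method alone cannot rank them.
    """
    return sorted(chars, key=lambda ch: STROKE_COUNT[ch])


# Remove duplicates while keeping first-seen order, then sort.
unique_chars = list(dict.fromkeys("漢字筆畫汉字笔画"))
result = "".join(sort_by_stroke_count(unique_chars))
print(result)  # 汉字画笔筆畫漢
```

The tie between 筆 and 畫 is left unresolved on purpose; it is exactly the gap that the combined methods described next are meant to close.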

Stroke-count-stroke-order sorting

This is a combination of stroke-count sorting and stroke-order sorting. Characters are first arranged by stroke count in ascending order. Stroke-order sorting is then used to sort characters with the same number of strokes. These characters are first arranged by their first strokes according to an order of stroke groups (such as “heng (横), shu (竖), pie (撇), dian (点), zhe (折)”, or “dian (点), heng (横), shu (竖), pie (撇), zhe (折)”). If the first strokes belong to the same group, the characters are sorted by their second strokes in the same way, and so on. In the example of the previous section, 筆 and 畫 both have 12 strokes. 筆 starts with the stroke ㇓ of the pie (撇) group and 畫 starts with ㇕ of the zhe (折) group; since pie precedes zhe in the group order, 筆 comes before 畫. Hence the different characters in "汉字笔画, 漢字筆畫" are finally sorted into "汉(5)字(6)画(8)笔(10)筆(12)畫(12)漢(14)", where each character has a unique position.
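The two-level rule can be sketched as a composite sort key: stroke count first, then the sequence of stroke-group ranks. The names below are hypothetical, and only the first stroke of each character is listed, which is enough for this example; a real implementation would need each character's full stroke sequence.

```python
# One of the stroke-group orders mentioned above.
GROUP_ORDER = ["heng", "shu", "pie", "dian", "zhe"]
GROUP_RANK = {group: rank for rank, group in enumerate(GROUP_ORDER)}

# (stroke count, leading stroke groups). Assumption: only the first stroke
# is recorded here, as a stand-in for the complete stroke order.
CHAR_DATA = {
    "汉": (5, ["dian"]),
    "字": (6, ["dian"]),
    "画": (8, ["heng"]),
    "笔": (10, ["pie"]),
    "筆": (12, ["pie"]),
    "畫": (12, ["zhe"]),
    "漢": (14, ["dian"]),
}


def stroke_key(ch):
    """Key: stroke count, then stroke-group ranks compared stroke by stroke."""
    count, strokes = CHAR_DATA[ch]
    return (count, [GROUP_RANK[s] for s in strokes])


# 畫 is deliberately placed before 筆 in the input to show the tie-break.
ordered = "".join(sorted("漢畫筆字汉笔画", key=stroke_key))
print(ordered)  # 汉字画笔筆畫漢
```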

Stroke-count-stroke-order sorting was used in Xinhua Zidian (新华字典, Xinhua Chinese Character Dictionary) and Xiandai Hanyu Cidian (现代汉语词典, Contemporary Chinese Word Dictionary) before the national standard for stroke-based sorting was released in 1999.

GB13000.1 Character Set Chinese Character Order (Stroke-Based Order)

GB13000.1 Character Set Chinese Character Order (Stroke-Based Order) (GB13000.1字符集汉字字序(笔画序)规范)[3] is a standard for sorting Chinese characters by strokes, released by the National Language Commission of China in 1999. It is an enhanced version of stroke-count-stroke-order sorting. According to this standard, characters are first sorted by stroke count, then by stroke order (using the five families of heng, shu, pie, dian and zhe). Characters with the same stroke count and stroke order are then sorted by the primary-secondary stroke order. For example, 子 and 孑 have the same five-group stroke order (㇐ and ㇀ both belong to the heng family), but under the primary-secondary rule the primary stroke ㇐ precedes the secondary stroke ㇀, so 子 comes before 孑. If two characters have the same stroke count, stroke order and primary-secondary strokes, they are sorted by the mode of stroke combination: separation precedes connection, and connection precedes intersection. For example, 八 comes before 人, and 人 comes before 乂. The standard contains further rules for still finer sorting.
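The layered rules described above can be sketched as a four-part sort key. This is only an illustration of the idea, not the standard's actual data or algorithm: the per-character records are hand-written assumptions for the five characters mentioned, and the combination mode of 子 and 孑 is a placeholder that is never reached in the comparison.

```python
FAMILY_RANK = {"heng": 0, "shu": 1, "pie": 2, "dian": 3, "zhe": 4}
COMBINATION_RANK = {"separate": 0, "connected": 1, "intersecting": 2}

# Each stroke is (family, is_secondary). Hand-written, assumed records.
CHAR_DATA = {
    "八": ([("pie", 0), ("dian", 0)], "separate"),
    "人": ([("pie", 0), ("dian", 0)], "connected"),
    "乂": ([("pie", 0), ("dian", 0)], "intersecting"),
    # Combination mode below is a placeholder; the keys differ earlier.
    "子": ([("zhe", 0), ("shu", 0), ("heng", 0)], "connected"),
    "孑": ([("zhe", 0), ("shu", 0), ("heng", 1)], "connected"),
}


def gb_style_key(ch):
    """Key: stroke count, stroke families, primary/secondary flags, combination."""
    strokes, combination = CHAR_DATA[ch]
    return (
        len(strokes),
        [FAMILY_RANK[family] for family, _ in strokes],
        [secondary for _, secondary in strokes],
        COMBINATION_RANK[combination],
    )


ordered = "".join(sorted("乂孑人子八", key=gb_style_key))
print(ordered)  # 八人乂子孑
```

Because Python compares tuples element by element, each later rule is consulted only when all earlier ones tie, which matches the "then if still equal" structure of the standard as summarized here.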

YES sorting

YES [lower-alpha 2] is a simplified stroke-based sorting method that avoids stroke counting and grouping without compromising accuracy. It has been applied to the indexing of all the characters in Xinhua Zidian and Xiandai Hanyu Cidian. In this joint index, a reader can look up a Chinese character to find its pinyin and Unicode, in addition to its page numbers in the two popular dictionaries.[4]

Component-based sorting

In this category, characters are sorted by one or more components.

Radical-based sorting

A radical (bùshǒu, 部首, or section head) is a common component shared by a group of characters. The radical usually lies on the upper part or left side of a character and helps to represent its meaning.[5] [6] For example, 江(river),湖(lake),海(sea) all have the radical of 氵(水,water), which indicates they are related to water; 推(push),拉(pull),打(beat) share the radical of 扌(手, hand), and are actions involving hands. In a radical-based method, all the characters sharing a radical are put under that radical to form a radical family or section. Different families are arranged by their leading radicals by stroke-based sorting, and characters inside a family are also sorted by their strokes.

In many contemporary dictionaries, including Xinhua Zidian, Xiandai Hanyu Cidian and the Oxford Chinese Dictionary[7], the radical-based character lookup system consists of three indexes or tables: a radical index, a character lookup index, and an index of characters whose radicals are difficult to identify, all sorted in stroke-based order. To look up a character (such as 家, home) in a dictionary (e.g., Xinhua Zidian, 12th edition), first identify its radical (the component 宀 at the top). Count the radical's strokes (3) and find it in the radical index, which is in stroke-based order. The number to its right (p. 49) is the page of its radical family in the character lookup index. On that page, count the strokes in the rest of the character (apart from the radical 宀, 家 has 7 strokes) and find the target character within the family, which is also in stroke-based order. The number to its right (217) is the page of the character's entry in the main body of the dictionary (entries in the main body of Xinhua Zidian are sorted by Pinyin). A character whose radical is difficult to identify can be looked up in the Index of Characters with Radicals Difficult to Find, which is in stroke-based order.
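The two-step lookup can be pictured as two nested tables. The sketch below is a toy model with hypothetical names; it holds only the single entry from the example above (radical 宀 on page 49, 家 on page 217) and says nothing about how any dictionary actually stores its indexes.

```python
# Radical index: radical -> (stroke count, page of its family in the lookup index).
RADICAL_INDEX = {"宀": (3, 49)}

# Character lookup index: radical -> residual stroke count -> {character: entry page}.
CHARACTER_INDEX = {"宀": {7: {"家": 217}}}


def look_up(char, radical, residual_strokes):
    """Return (family page, entry page) for a character.

    The caller supplies the radical and the number of strokes left after
    removing it, just as a reader would count them by hand.
    """
    _, family_page = RADICAL_INDEX[radical]
    entry_page = CHARACTER_INDEX[radical][residual_strokes][char]
    return family_page, entry_page


pages = look_up("家", radical="宀", residual_strokes=7)
print(pages)  # (49, 217)
```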

The first radical system in history was created by the Chinese scholar Xu Shen in his dictionary Shuowen Jiezi (说文解字,説文解字) almost two thousand years ago, in the Eastern Han Dynasty. This dictionary, which uses a total of 540 radicals, is still available today. Another milestone is the Kangxi radical system employed in the Kangxi Dictionary of 1716, in the era of Emperor Kangxi, which reduced the number of radicals to 214. The Kangxi radical sorting method is still in use in China, Japan and Korea. It is also the basis for the official ordering of Unicode CJK Unified Ideographs. The latest standard radical table of Chinese Mainland is the Table of Indexing Chinese Character Components, with a list of 201 radicals.

Four-corner sorting

Chinese characters are written in the form of a square block. The Four-Corner Method assigns a 4-digit code to a character, each digit representing one corner of the block. The four corner digits appear in the sequence of "upper-left, upper-right, lower-left and lower-right". For example, the code of character 顏 (meaning "face") is 0128, where the first digit 0 represents the upper-left component 亠 , 1 for the upper right 一, 2 for the lower-left ㇓, and 8 represents the lower-right 八.

A fifth digit can be added to represent an extra part above the lower-right corner to gain higher sorting accuracy. For example the extended code of character 佳 is 24214, where the fifth digit represents component 十 above the final 一 in the lower-right corner.

When a set of characters is encoded in four-corner codes, the characters are sorted in ascending order by their first four digits (followed by the fifth digit, if present).
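A minimal sketch of this ordering, using the two codes quoted above: the key is the four-digit code followed by the optional fifth digit. The table and function names are hypothetical, and placing a code without a fifth digit before one with a fifth digit is an assumption of this sketch.

```python
# Four-corner codes from the examples above.
FOUR_CORNER = {"顏": "0128", "佳": "24214"}


def four_corner_key(ch):
    """Key: first four digits, then the fifth digit ('' if absent)."""
    code = FOUR_CORNER[ch]
    return (code[:4], code[4:])


ordered = sorted(["佳", "顏"], key=four_corner_key)
print(ordered)  # ['顏', '佳']
```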

Cangjie-code sorting

In this method, Chinese characters are arranged alphabetically by their codes in the Cangjie input method. The Cangjie code of a character is a string of English letters, each representing a selected Cangjie component of the character. For example, the Cangjie codes of the characters in 漢字排檢法 (Methods for Chinese character sorting and retrieving) are 漢(ETLO)字(JND)排(QLMY)檢(DOMO)法(EGI), and the characters can be sorted into the Cangjie-code order 檢(DOMO)法(EGI)漢(ETLO)字(JND)排(QLMY).
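Since Cangjie codes are plain strings of Latin letters, this ordering reduces to an ordinary string sort. The following sketch uses only the five codes quoted above; the table name is hypothetical.

```python
# Cangjie codes quoted in the example above.
CANGJIE = {"漢": "ETLO", "字": "JND", "排": "QLMY", "檢": "DOMO", "法": "EGI"}

# Sort characters alphabetically by their Cangjie code.
ordered = "".join(sorted("漢字排檢法", key=lambda ch: CANGJIE[ch]))
print(ordered)  # 檢法漢字排
```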

Sound-based sorting

There are two sound representation systems in Mandarin Chinese or Putonghua, i.e., Pinyin and Bopomofo. Accordingly we have two methods of sound-based sorting for modern standard Chinese.

Pinyin-based sorting

In this method, Chinese characters are sorted alphabetically by their Pinyin (pīnyīn, 拼音). For example, 汉字拼音排序法 (Pinyin sorting method of Chinese characters) is sorted into "法(fǎ)汉(hàn)排(pái)拼(pīn)序(xù)音(yīn)字(zì)", with pinyin in brackets. Syllables with the same letters are ordered by tone, in the order of tone 1, tone 2, tone 3, tone 4 and tone 5 (light tone), such as "妈(mā), 麻(má), 马(mǎ), 骂(mà), 吗(ma)". Characters with the same sound, i.e., the same Pinyin letters and tone, are normally sorted by a stroke-based method.
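A minimal sketch of this rule stores each reading as a pair of toneless letters and a tone number, so that letters are compared first and tones second. The table is hand-written from the examples above and the names are hypothetical; the stroke-based tie-break for identical sounds is left out.

```python
# (letters without tone marks, tone number); tone 5 is the light tone.
PINYIN = {
    "汉": ("han", 4), "字": ("zi", 4), "拼": ("pin", 1), "音": ("yin", 1),
    "排": ("pai", 2), "序": ("xu", 4), "法": ("fa", 3),
    "妈": ("ma", 1), "麻": ("ma", 2), "马": ("ma", 3), "骂": ("ma", 4), "吗": ("ma", 5),
}


def pinyin_key(ch):
    """Key: letters first, then tone."""
    return PINYIN[ch]


by_letters = "".join(sorted("汉字拼音排序法", key=pinyin_key))
by_tone = "".join(sorted("吗骂马麻妈", key=pinyin_key))
print(by_letters)  # 法汉排拼序音字
print(by_tone)     # 妈麻马骂吗
```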

Words of multiple characters can be sorted in two different ways [8]. One is to sort character by character: words are sorted by their first characters, and if the first characters are the same, by their second characters, and so on. For example, 归并(guībìng),归还(guīhuán),规划(guīhuà),鬼话(guǐhuà),桂花(guìhuā). This method is used in Xiandai Hanyu Cidian. The other is to sort by the pinyin letters of the whole word, and then by tones when the letters of two words are the same. For example, 归并(guībìng),规划(guīhuà),鬼话(guǐhuà),桂花(guìhuā),归还(guīhuán). This method is used in the ABC Chinese–English Dictionary.
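The difference between the two approaches comes down to the shape of the sort key, as the following sketch suggests. Each syllable is recorded as (letters, tone, stroke count); the stroke counts serve as the tie-break between same-sound characters such as 归 and 规. The table and function names are hypothetical, and the data covers only the five example words.

```python
# word -> list of (letters, tone, stroke count of the character).
WORDS = {
    "归并": [("gui", 1, 5), ("bing", 4, 6)],
    "归还": [("gui", 1, 5), ("huan", 2, 7)],
    "规划": [("gui", 1, 8), ("hua", 4, 6)],
    "鬼话": [("gui", 3, 9), ("hua", 4, 8)],
    "桂花": [("gui", 4, 10), ("hua", 1, 7)],
}


def char_by_char_key(word):
    """Compare the first characters fully, then the second characters, and so on."""
    return WORDS[word]


def whole_word_key(word):
    """Compare the letters of the whole word first, then the tone sequence."""
    syllables = WORDS[word]
    letters = "".join(s[0] for s in syllables)
    tones = [s[1] for s in syllables]
    return (letters, tones)


words = ["桂花", "鬼话", "规划", "归还", "归并"]
char_order = sorted(words, key=char_by_char_key)
word_order = sorted(words, key=whole_word_key)
print(char_order)  # ['归并', '归还', '规划', '鬼话', '桂花']
print(word_order)  # ['归并', '规划', '鬼话', '桂花', '归还']
```

In the second ordering 归还 moves to the end because "guihuan" sorts after "guihua" as a whole string, regardless of which character the word starts with.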

Pinyin-based sorting is very convenient for looking up words whose pronunciation and Pinyin spelling you know, but it cannot be used to find words that you do not know how to pronounce.

Bopomofo-based sorting

Bopomofo, or Phonetic Symbols (zhùyīn fúhào, 注音符號, 注音符号), is a Chinese phonetic system created by the Commission on the Unification of Pronunciation (讀音統一會) in 1913, and formally issued by the Ministry of Education of the Chinese Government in 1918. It consists of a table (or alphabet) of 37 letters or symbols in the order of "ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩ" and 5 tone diacritics of “ˉ, ˊ, ˇ, ˋ, ˙”.

Chinese characters can be sorted by the Bopomofo spellings of their sounds, following the order of the alphabet table: first by letters, then by tones in the order of first tone, second tone, third tone, fourth tone and fifth tone (also called the neutral or light tone). For example, the Bopomofo order of the characters in “注音字母排序法 (Bopomofo-based sorting)” is “排(ㄆㄞˊ)母(ㄇㄨˇ)法(ㄈㄚˇ)序(ㄒㄩˋ)注(ㄓㄨˋ)字(ㄗˋ)音(ㄧㄣ)”. Characters with the same sound are normally sorted by a stroke-based method.
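Because the Bopomofo table has its own letter order, a sort key can map each symbol to its position in that table instead of relying on code point order. The sketch below assumes readings written with a trailing tone mark and an unmarked first tone, as in the example above; the names are hypothetical.

```python
# Letter order of the Bopomofo table as listed above.
ALPHABET = "ㄅㄆㄇㄈㄉㄊㄋㄌㄍㄎㄏㄐㄑㄒㄓㄔㄕㄖㄗㄘㄙㄚㄛㄜㄝㄞㄟㄠㄡㄢㄣㄤㄥㄦㄧㄨㄩ"
# Tone marks in tone order; a reading without a mark is treated as first tone.
TONE_MARKS = "ˉˊˇˋ˙"

READINGS = {
    "注": "ㄓㄨˋ", "音": "ㄧㄣ", "字": "ㄗˋ", "母": "ㄇㄨˇ",
    "排": "ㄆㄞˊ", "序": "ㄒㄩˋ", "法": "ㄈㄚˇ",
}


def bopomofo_key(ch):
    """Key: positions of the letters in the table, then the tone index."""
    reading = READINGS[ch]
    tone = 0  # assumption: unmarked syllables carry the first tone
    if reading[-1] in TONE_MARKS:
        tone = TONE_MARKS.index(reading[-1])
        reading = reading[:-1]
    return ([ALPHABET.index(symbol) for symbol in reading], tone)


ordered = "".join(sorted("注音字母排序法", key=bopomofo_key))
print(ordered)  # 排母法序注字音
```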

The first dictionary sorted in Bopomofo is 國語辭典 (Guoyu Dictionary) published in 1937,[9] followed by many other dictionaries. Bopomofo is more popular in Taiwan than in Chinese Mainland, where Pinyin is predominant.

Dialect-sound sorting

In addition to the sounds of standard Chinese, Chinese characters can be sorted by the sounds of dialects as well, for example by the Jyutping (Cantonese Pinyin) of the Cantonese dialect popular in Hong Kong.

In Jyutping, the sound of a Chinese character is represented by a string of English letters followed by a number from 1 to 6 representing the tone. For instance, the Jyutping order of the characters in “粵拼排檢法 (Jyutping-based sorting and retrieving)” is “法[faat3]檢[gim2]粵[jyut6]排[paai4]拼[ping3]”, where the Jyutping spellings are in square brackets.
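A Jyutping spelling already ends in its tone digit, so a sort key can be obtained by splitting off the last character. The sketch below uses only the five readings quoted above, and the names are hypothetical.

```python
# Jyutping readings quoted in the example above.
JYUTPING = {"粵": "jyut6", "拼": "ping3", "排": "paai4", "檢": "gim2", "法": "faat3"}


def jyutping_key(ch):
    """Key: letters first, then the tone number (the trailing digit)."""
    reading = JYUTPING[ch]
    return (reading[:-1], int(reading[-1]))


ordered = "".join(sorted("粵拼排檢法", key=jyutping_key))
print(ordered)  # 法檢粵排拼
```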

The most serious limitation of sound-based sorting methods is that they cannot be used to look up words whose pronunciation is unknown. That is why dictionaries collated by sound often provide indexes in form-based orders as well.

Meaning-based sorting

Meaning-based sorting, also called semantics-based sorting, arranges characters and words in a hierarchical structure of semantic categories. The earliest surviving Chinese dictionary, Erya (dating from the 3rd century BC), is arranged by semantic classification. Its words are divided into 19 categories, each with a large number of entries. An entry is a list of synonyms explained by a commonly used word. For instance, in the entry "林、烝、天、帝、皇、王、后、辟、公、侯,君也。", the ending 君也 says that the preceding words are synonyms of "君 (king)".[10]

Modern semantically sorted dictionaries include "同义词词林" [11] and "实用广州话分类词典" [12]. Their classification systems are much more accurate and detailed than those of the ancient dictionaries, but they still need indexes of radicals or strokes. This suggests that meaning-based sorting is not powerful enough to function as an independent sorting method.

Semantics-based sorting involves these questions: What are the categories and subcategories to use? How to put a word into its category and subcategories? How to arrange the categories and subcategories in order? How to arrange the words in the lowest subcategories in order? And the answers to these questions may vary between the user and compiler of the dictionary, and that will lead to difficulties in word lookup.

In fact, radical-based sorting is meaning-based to a certain degree, because in many cases the radical represents the semantic category of a character, e.g., radical 氵(water) in character 江(river), 扌(hand) in 推 (push), 木(wood) in 椅(chair). [13] [14]

Frequency-based sorting

This category of sorting methods sorts Chinese characters by their frequency of use, normally in descending order, so that the most frequently used character is at the top of the list. A frequency list is created from a text corpus. In corpus linguistics, the frequency of a character is the ratio of its number of occurrences in the corpus to the total number of characters in the corpus, usually expressed as a percentage.
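The definition above translates directly into a short counting routine. The following is a sketch on a toy six-character string, not on any real corpus; the function name is hypothetical, and restricting the count to the basic CJK Unified Ideographs block is a simplifying assumption.

```python
from collections import Counter


def frequency_list(text):
    """Return [(character, percentage)] in descending order of frequency."""
    # Simplification: count only characters in the basic CJK block.
    chars = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    total = len(chars)
    counts = Counter(chars)
    # Sort by descending count; break ties by code point for determinism.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(ch, n / total * 100) for ch, n in ranked]


toy_corpus = "的的的不不一"
freq = frequency_list(toy_corpus)
for ch, pct in freq:
    print(ch, f"{pct:.2f}%")
```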

The first frequency list of Chinese characters based on a corpus was created by Chen Heqin (陳鶴琴).[15] In the 1920s, he and his assistants spent two years manually counting the characters in a corpus of 554,478 characters, and obtained 4,261 different characters with frequency information. [16] The top 10 characters in their frequency list are (in descending order):

"的(of), 不(no, not), 一(one, a/an), 了(had, done), 是 (be), 我(I, me), 上(on, up), 他(he, him), 有(have, has), 人(person, people)".

In 2001, the Chinese University of Hong Kong published a number of frequency lists on the Web,[17] entitled "Hong Kong, Mainland China and Taiwan Chinese Frequency: a trans-regional diachronic survey". The frequency data came from a large corpus with a number of sub-corpora representing the Chinese language in the three regions of Hong Kong, Mainland China and Taiwan, in the two periods of the 1960s and the 1980s/1990s. Each sub-corpus includes about 5,000 different characters, as shown by their frequency lists.

From the data of these frequency lists, we can discover that the 100 most frequently-used characters in the 1980/90's cover (i.e., have an accumulated frequency of) 41.00% of the Hong Kong texts of that period, 41.34% of the Mainland texts, and 41.88% of the Taiwan texts. That is more than 4 out of every 10 characters for the three regions. The 1000 most frequently-used characters in the 1980/90's cover 89.25% of the Hong Kong texts of that period, 90.26% of the Mainland texts, and 88.74% of the Taiwan texts. And similar results can also be found from the frequency lists of the 1960s.
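The coverage figures above are accumulated frequencies of the top-ranked characters. As a sketch of that calculation, the function below sums the counts of the `n` most frequent characters and divides by the corpus size. It runs on a toy string, so its numbers have nothing to do with the survey's; the function name is hypothetical.

```python
from collections import Counter


def coverage(text, n):
    """Percentage of the text covered by its n most frequent characters."""
    chars = [c for c in text if not c.isspace()]
    counts = Counter(chars)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(chars) * 100


toy_corpus = "的的的不不一"
top1 = coverage(toy_corpus, 1)
top2 = coverage(toy_corpus, 2)
top3 = coverage(toy_corpus, 3)
print(f"{top1:.2f} {top2:.2f} {top3:.2f}")
```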

As a matter of fact, both meaning-based sorting and frequency-based sorting are employed in other languages as well, though often at word level, not at character level.

Sorting of words

A Chinese word consists of one or more characters. Single-character words can be sorted by character sorting, and multi-character words can be sorted character by character in a similar way. For example, according to the methods for Pinyin-based, radical-based and stroke-based sorting used in Xiandai Hanyu Cidian (7th edition), the five words [爱,好,好事,好人,好人家] would be arranged in the following orders:

  • Pinyin-based sorting: "爱(ài), 好(hǎo), 好人(hǎorén), 好人家(hǎorénjiā), 好事(hǎoshì)".
  • Radical-based sorting: "好(radical 女 of 3 strokes), 好事(事: radical 一 of 1 stroke), 好人(人: radical 人 of 2 strokes), 好人家, 爱(radical 爪 of 4 strokes)".
  • Stroke-based sorting: "好(6 strokes), 好人(人:2 strokes), 好人家, 好事(事:8 strokes), 爱(10 strokes)".
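The three orders above can be reproduced with one generic rule: build a key for each character, then compare words as sequences of character keys. The sketch below makes several simplifying assumptions: the per-character table is hand-written for these five characters, radicals are ranked only by their stroke counts followed by the residual stroke count, and all names are hypothetical.

```python
# Per character: pinyin (letters, tone), radical stroke count,
# residual stroke count (strokes outside the radical), total stroke count.
CHAR_DATA = {
    "爱": {"pinyin": ("ai", 4), "radical": 4, "residual": 6, "strokes": 10},
    "好": {"pinyin": ("hao", 3), "radical": 3, "residual": 3, "strokes": 6},
    "事": {"pinyin": ("shi", 4), "radical": 1, "residual": 7, "strokes": 8},
    "人": {"pinyin": ("ren", 2), "radical": 2, "residual": 0, "strokes": 2},
    "家": {"pinyin": ("jia", 1), "radical": 3, "residual": 7, "strokes": 10},
}


def pinyin_key(word):
    return [CHAR_DATA[ch]["pinyin"] for ch in word]


def radical_key(word):
    # Simplification: a radical is represented only by its stroke count.
    return [(CHAR_DATA[ch]["radical"], CHAR_DATA[ch]["residual"]) for ch in word]


def stroke_key(word):
    return [CHAR_DATA[ch]["strokes"] for ch in word]


words = ["爱", "好", "好事", "好人", "好人家"]
by_pinyin = sorted(words, key=pinyin_key)
by_radical = sorted(words, key=radical_key)
by_stroke = sorted(words, key=stroke_key)
print(by_pinyin)   # ['爱', '好', '好人', '好人家', '好事']
print(by_radical)  # ['好', '好事', '好人', '好人家', '爱']
print(by_stroke)   # ['好', '好人', '好人家', '好事', '爱']
```

Python compares lists element by element and treats a shorter list as smaller when it is a prefix of a longer one, which is why 好 precedes 好人 and 好人 precedes 好人家 in every ordering.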

Computer sorting

Chinese texts can be automatically sorted on the computer as well. For example, on Microsoft Windows and Office[18], users can sort their Chinese characters or words in the optional orders of:

  • Unicode order. (This is generally speaking Kangxi Radical order.)
  • Pinyin order. (More popular in Chinese Mainland)
  • Bopomofo order. (More popular in Taiwan)
  • Stroke-based order. (More widely used in Hong Kong)
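As a small illustration of the first option, Python's built-in string comparison orders characters by Unicode code point, so a plain `sorted()` call yields the Unicode order without any extra data. This sketch does not cover the Pinyin, Bopomofo or stroke orders, which need locale-aware collation data beyond the standard comparison.

```python
chars = "汉字笔画"

# Default string comparison in Python is by Unicode code point.
unicode_order = sorted(chars)

for ch in unicode_order:
    print(ch, f"U+{ord(ch):04X}")
```

In this output 字 precedes 汉, and 画 precedes 笔, following the code point blocks of their radicals rather than their pronunciation or stroke count.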

Notes

  1. Chinese dictionary#Traditional Chinese lexicography (paragraph 2).
  2. YES is the acronym of Yi Er San, the pinyin of 一二三, which is the Chinese name of the sorting method.

References

  1. Su, Peicheng (苏培成) (2014). 现代汉字学纲要 (Essentials of Modern Chinese Characters) (in Chinese) (3rd ed.). Beijing: 商务印书馆 (The Commercial Press, Shangwu). pp. 183–207. ISBN 978-7-100-10440-1.
  2. Su 2014, p. 187.
  3. "《GB13000.1字符集汉字字序(笔画序)规范》" (PDF) (in Chinese). 中华人民共和国教育部 国家语言文字工作委员会. October 1, 1999.
  4. Zhang, Xiaoheng et al. (张小衡, 李笑通) (2013). 一二三笔顺检字手册 (Handbook of the YES Sorting Method) (in Chinese). Beijing: 语文出版社 (The Language Press). ISBN 978-7-80241-670-3.
  5. Su 2014, pp. 185–188.
  6. Qiu, Xigui (裘锡圭) (2013). 文字学概要 (Chinese Writing) (in Chinese) (2nd ed.). Beijing: 商务印书馆 (Shangwu). ISBN 978-7-100-09369-9.
  7. Kleeman, Julie (and Harry Yu) (2010). Oxford Chinese Dictionary. Beijing: Oxford University Press. ISBN 978-0-19-920761-9.
  8. Su 2014, pp. 201–202.
  9. Zhongguo cidian bianzuanchu 中國辭典編纂處, eds. Guoyu cidian (國語辭典 "Dictionary of the National Language"). 8 vols. Shanghai: Commercial Press. 1937.
  10. Su 2014, p. 184.
  11. Mei, Jiaju (梅家驹等) (1996). 同义词词林 (Dictionary of Synonyms) (in Chinese). Shanghai: 上海辞书出版社 (Shanghai Dictionary Press). ISBN 978-7-532-60396-1.
  12. Mai, Yun (麦耘,谭步云) (1997). 实用广州话分类词典 (Practical Cantonese Classified Dictionary) (in Chinese). Guangzhou: 广东人民出版社 (Guangdong People's Press). ISBN 978-9-620-70305-8.
  13. Su 2014, p. 190.
  14. Qiu 2013, pp. 102–108.
  15. Su 2014, p. 35.
  16. Chen, Heqin (陳鶴琴) (1928). 語體文應用字彙 (Applied Lexis of Vernacular Chinese) (in Chinese). Beijing: Shangwu (The Commercial Press).
  17. "Chinese Character Frequency Statistics for Hong Kong, Mainland China and Taiwan - A Trans-Regional, Diachronic Survey: 香港、大陸、台灣 - 跨地區、跨年代漢語常用字頻統計".
  18. "FAQ: Sorting Chinese characters in Word and Excel :: Pinyin Joe".

This article "Chinese Characters Sorting" is from Wikipedia. The list of its authors can be seen in its historical and/or the page Edithistory:Chinese Characters Sorting. Articles copied from Draft Namespace on Wikipedia could be seen on the Draft Namespace of Wikipedia and not main one.