The Pitfalls and Complexities of
Chinese-to-Chinese Conversion
President, The CJK Dictionary Institute
Chief of Software Development, CJK Dictionary Institute
Standard Chinese is written in two forms: Simplified Chinese (SC), used in the People’s Republic of China (PRC) and Singapore; and Traditional Chinese (TC), used in Taiwan, Hong Kong, Macau, and among most overseas Chinese. A common fallacy is that there is a straightforward correspondence between the two systems, and that conversion between them merely requires mapping from one character set to another, such as from GB 2312-80 to Big Five.
Although many code conversion tools are mere implementations of this fallacy, nothing can be further from the truth. There are major differences between the systems on various levels: character sets, encoding methods, orthography (choice of characters), vocabulary (choice of words), and even semantics (word meanings).
With the growing importance of East Asia in the world economy, localization and translation companies face an urgent need to convert between SC and TC, but must contend with such obstacles as: (1) current conversion tools which produce unacceptable results; (2) the lack of knowledge to develop good conversion tools; (3) no access to high quality dictionary data; and (4) the high cost of manual conversion.
In 1996, the CJK Dictionary Institute (CJKI) launched a project to investigate these issues in depth, and to build a comprehensive SC↔TC database (now at three million SC and TC entries) whose goal is to enable conversion software to achieve near 100% accuracy.
This paper explains the complex issues involved, and shows how this new, Unicode-based technology can significantly reduce the time and costs of Chinese localization and translation projects.
The forms of Chinese characters (汉字 han4zi4) underwent a great deal of change over the several thousand years of their history. Many calligraphic styles, variant forms, and typeface designs have evolved over the years. Some of the full, complex forms were elevated to the rank of “correct characters” (正字 zheng4zi4), while the bewildering plethora of variants were often relegated to the status of “vulgar forms” (俗字 su2zi4).
Soon after the establishment of the People’s Republic of China in 1949, the new regime launched a vigorous campaign to implement large-scale written language reforms. In the 1950s, Mao Zedong and Zhou Enlai led the way by announcing that character simplification was a high priority task. In 1952, the Committee on Language Reform was established to study the problem in depth, and to undertake the task of compiling lists of simplified characters.
As a result of these activities, various written language reforms were undertaken, the most important of which include: the development of a standardized romanization system known as pinyin, limiting the number of characters in daily use, and the drastic simplification of thousands of character forms. Although at one point the ultimate goal was to abolish the use of Chinese characters altogether and to replace them with a romanized script, this policy was abandoned in favor of character form simplification.
Various simplified character lists were published in the subsequent years, the most well-known of which is the “definitive” Comprehensive List of Simplified Characters (简化字总表 jian3hua4zi4zong3biao3) published in 1964, which was reissued several times with minor revisions. The latest edition, published in 1986, lists 2244 simplified characters [Zongbiao 1986].
Taiwan and Hong Kong, and most overseas Chinese, did not follow the path of simplification. Taiwan, in particular, has adhered fairly strictly to the traditional forms. The Taiwanese Ministry of Education has published various character lists, such as the 常用國字標準字體表 (chang2yong4guo2zi4biao1zhun3zi4ti3biao3), which enumerates 4808 characters, as guidelines for correct character forms.
Although the most important difference between Simplified Chinese and Traditional Chinese lies in character form, there are, as we shall see, also differences in character sets, encoding methods, and choice of vocabulary.
From a practical point of view, the term Simplified Chinese typically refers to a Chinese text that meets the following conditions: (1) it is written in the simplified character forms; (2) it uses a simplified character set, such as GB 2312-80; and (3) it is encoded in an encoding method such as EUC-CN.
Similarly, the term Traditional Chinese typically refers to a Chinese text that meets the following conditions: (1) it is written in the traditional character forms; (2) it uses a traditional character set, such as Big Five; and (3) it is encoded in an encoding method such as Big Five.
Only the first of these is a necessary condition. “Simplified” Chinese, by definition, cannot be written with the traditional character forms, except in those cases where a traditional form has no corresponding simplified form. Similarly, “Traditional” Chinese must not be written in the simplified forms, with some minor exceptions, such as in certain proper nouns. Character sets and encoding methods are less restricted, as described in section 1.4 below.
There is also some variation in vocabulary usage. Taiwanese texts, for example, may include some PRC-style vocabulary, while Singaporean texts may follow Taiwanese-style, rather than PRC-style, computer terminology. Nevertheless, on the whole, the terms Simplified Chinese and Traditional Chinese are used as defined above.
The language reforms in the PRC have had a major impact on the Chinese written language. From the point of view of processing Chinese data, the most relevant issues are:
Item (2) above is the central issue in SC-to-TC conversion, and is what this paper focuses on. The “classical” example given in such discussions is that of the traditional characters 發 and 髮, etymologically two distinct characters, which were merged into the single simplified form 发. The table below shows these and other examples of SC forms that map to multiple TC forms.
|SC Source||TC Target||Meaning||TC Example|
|发 fa1||發||emit||出發 start off|
|发 fa4||髮||hair||頭髮 hair|
|干 gan1||乾||dry||乾燥 dry|
|干 gan4||幹||trunk||精幹 able, strong|
|干 gan1||干||intervene||干涉 interfere with|
|干 gan4||榦||tree trunk||楨榦 central figure|
|面 mian4||麵||noodles||湯麵 noodle soup|
|面 mian4||面||face||面具 mask|
|后 hou4||後||after||後天 day after tomorrow|
|后 hou4||后||queen||王后 queen|
As can be seen, successfully converting such SC forms to their corresponding TC forms depends on the context, usually the word, in which they occur. Often, the conversion cannot be done by merely mapping one codepoint to another, but must be based on larger linguistic units, such as words.
There are hundreds of other simplified forms that correspond to two or more traditional ones, leading to ambiguous, one-to-many mappings that depend on the context. In this paper, such mappings may be referred to as polygraphic, since one simplified character, or graph, may correspond to more than one traditional (graphic) character, or vice versa.
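For illustration, such one-to-many mappings can be represented as a small lookup structure. The following Python sketch (the entries are drawn from the table above; the structure itself is hypothetical) shows why word-level context is needed to resolve a polygraphic character:

```python
# One-to-many SC -> TC character mappings drawn from the table above.
# A real table contains hundreds of such polygraphic entries.
POLYGRAPHIC = {
    "发": ["發", "髮"],              # emit / hair
    "干": ["乾", "幹", "干", "榦"],  # dry / trunk / intervene / tree trunk
    "后": ["後", "后"],              # after / queen
}

# The word context selects the correct target: the same SC character
# converts differently in different words.
WORD_CONTEXT = {
    "出发": "出發",   # start off
    "头发": "頭髮",   # hair
    "后天": "後天",   # day after tomorrow
    "王后": "王后",   # queen
}

def tc_candidates(sc_char):
    """All TC candidates for one SC character (itself if unambiguous)."""
    return POLYGRAPHIC.get(sc_char, [sc_char])

# 发 alone is ambiguous; the word it occurs in is not.
assert tc_candidates("发") == ["發", "髮"]
assert WORD_CONTEXT["出发"][1] != WORD_CONTEXT["头发"][1]
```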
This paper does not aim to present a detailed treatment of Chinese character sets and encoding methods. This can be found in Ken Lunde’s outstanding book CJKV Information Processing [Lunde 1999]. This section gives only a brief overview of some of the important issues, since our main goal is to deal with the higher level linguistic issues.
SC typically uses the GB 2312-80 (GB0) character set, or its expanded version called GBK, and is typically encoded in EUC-CN. For Internet data transmission, it is often encoded in HZ, or in the older zW. TC is typically encoded in Big Five, and less frequently in EUC-TW based on the Taiwanese CNS 11643-1992 (Chinese National Standard) character set.
In Japan, some wordprocessors handle Chinese characters via the JIS X 0208:1997 character set plus extensions. Similarly, it is possible to encode Chinese in the Korean character set KS X 1001:1992. However, in neither case are sufficient numbers of TC or SC characters available to adequately serve for general Chinese usage. This by no means exhausts the list of character sets used to encode Chinese (CCCII, an older Taiwanese standard, is still in use), which shows how complicated the situation is.
From the point of view of SC↔TC code conversion, one major issue is that GB 2312-80 is incompatible with Big Five. The former contains 6763 characters, as opposed to 13,053 characters in the latter. Approximately one-third of the GB 2312-80 characters are simplified forms not present in Big Five. This leads to many missing characters on both sides, as shown in the table below.
|Hanzi||GB0 (EUC)||Big Five||Unicode|
The difficulties in SC↔TC conversion are not limited to the GB 2312-80 and Big Five character sets. In fact, Big Five contains only a subset of traditional forms, while GB 2312-80, surprisingly, does not contain some simplified forms, as shown in the table below.
|SC Unicode||SC Source||TC Target||TC Unicode|
The international standard ISO-2022:1994 [ISO 1994] attempted to address these incompatibility issues by establishing a portmanteau encoding system in which escape sequence mechanisms signal a switch between character sets, but this fell short of a complete solution.
The advent of the international character set Unicode/ISO 10646 [Unicode 1996] has solved many of the problems associated with SC↔TC code conversion. With a Unicode-enabled system, it is possible to represent all Big Five and GB 2312-80 codepoints, and to display them in the same document, since Unicode is a superset of both these standards. This greatly simplifies SC↔TC conversion at the codepoint level. Although there are some issues that still need to be addressed (e.g. numerous characters have been excluded from the current version [Meyer 1998]), Unicode has effectively solved the problems caused by incompatibility between the Big Five and GB 2312-80 character sets.
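This can be illustrated with the GB 2312-80 and Big Five codecs that ship with, for example, Python (the codepoints are those of the 国/國 example used later in this paper):

```python
# SC 国 and TC 國 ('country') have incompatible legacy codepoints, but
# both decode into Unicode, which is a superset of both character sets.
sc = b"\xb9\xfa".decode("gb2312")   # GB 2312-80 (EUC) 0xB9FA
tc = b"\xb0\xea".decode("big5")     # Big Five 0xB0EA
assert (sc, tc) == ("国", "國")

# In Unicode the two forms are distinct codepoints in one coded space,
# so they can appear side by side in the same document.
assert (ord(sc), ord(tc)) == (0x56FD, 0x570B)
```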
The process of automatically converting SC to TC (and, to a lesser extent, TC to SC) is full of complexities and pitfalls. The conversion can be implemented on four levels, in increasing order of sophistication, from a simplistic code conversion that generates numerous errors, to a sophisticated approach that takes the semantic and syntactic context into account and aims to achieve near-perfect results. Each of these levels is described below.
|Level 1||Code||Character-to-character, code-based substitution|
|Level 2||Orthographic||Word-to-word, character-based conversion|
|Level 3||Lexemic||Word-to-word, lexicon-based conversion|
|Level 4||Contextual||Word-to-word, context-based translation|
The easiest, but most unreliable, way to convert SC to TC, or vice versa, is to do so on a codepoint-to-codepoint basis; that is, to do a simple substitution by replacing a source codepoint of one character set (such as GB 2312-80 0xB9FA for SC 国) with a target codepoint of another character set (such as Big Five 0xB0EA for TC 國) by looking the source up in a hard-coded, one-to-one mapping table.
This kind of conversion can be described as character-to-character, code-based substitution, and is referred to as code conversion, because the units participating in the conversion process are limited to single codepoints. That is, the text stream is not parsed into higher level linguistic units, but is treated merely as a sequence of code values of discrete multiple-byte characters.
The following is an example of a one-to-one code mapping table.
|SC Source||GB0 (EUC)||TC Target||BIG5||Omitted Candidates|
|干||B8C9||幹||A47A||乾 干 榦|
Since such tables map each source character to only one target character, the other possible candidates (shown in the “Omitted Candidates” column) are ignored, which frequently results in incorrect conversion.
For example, an SC string such as 头发 ‘hair’ is not treated as a single unit, but is converted character by character. Since SC 头 maps only to TC 頭, the conversion succeeds. On the other hand, since SC 发 ‘hair’ maps to both TC 髮 ‘hair’ and TC 發 ‘emit’, the conversion may fail. That is, if the table maps 发 to 發, which is often the case, the result will be the nonsensical 頭發 (‘head’ + ‘emit’). Conversely, if the table maps 发 to 髮, 头发 will correctly be converted to 頭髮, but other common words, such as SC 出发 ‘depart’, will be converted to the nonsensical 出髮 (‘go out’ + ‘hair’).
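This failure mode can be reproduced in a few lines. In the Python sketch below (with a deliberately tiny, hypothetical table), whichever single target is hard-coded for 发, one of the two words must convert incorrectly:

```python
# A naive code converter: a hard-coded one-to-one table, applied
# character by character with no segmentation into words.
NAIVE_TABLE = {"头": "頭", "出": "出", "发": "發"}  # 发 -> 發 chosen arbitrarily

def code_convert(sc_text):
    return "".join(NAIVE_TABLE.get(ch, ch) for ch in sc_text)

assert code_convert("出发") == "出發"   # correct: 'depart'
assert code_convert("头发") == "頭發"   # wrong: 'hair' should be 頭髮
```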
These problems are compounded if each element of a compound word maps to more than one character (polygraphic compounds), since the number of permutations grows geometrically, as shown in the table below.
|SC Source||Meaning||Correct TC||Other TC Candidates|
|出发||start off||出發||出髮 齣髮 齣發|
|干燥||dry||乾燥||干燥 幹燥 榦燥|
|暗里||secretly||暗裡||暗里 闇里 闇裡 暗裏 闇裏|
|千里||long distance||千里||韆里 千裡 韆裡 千裏 韆裏|
|秋千||a swing||鞦韆||秋千 秋韆 鞦千|
It is self-evident that, when there are several candidates to choose from, there is a high probability that a one-to-one code converter will output the incorrect combination. This demonstrates that code conversion cannot be relied upon to give accurate results without (often significant) human intervention.
Code conversion can be implemented in three different ways, in increasing order of sophistication:
Although this approach frequently leads to correct results, it is likely to fail in the many cases where the second (or third) alternative of multiple target mappings is itself of high frequency, as in the case of 发, which maps to both TC 發 and 髮.
We have investigated several systems based on the frequency approach, and found numerous errors and omissions. The greatest difficulty in building a frequency-based code converter is that accurate and comprehensive mapping tables, based on reliable statistics, did not hitherto exist, and require extensive research to develop. Appendix C shows an example of incorrect mappings found in a well-known converter, compared with the mapping tables developed by the CJKI.
Some major Chinese electronic dictionaries and wordprocessors, which claim to support TC, seem to be based on the simplistic approach. Some Chinese input systems take an approach that combines both (1) and (2). Approach (3), which is implemented in one of our in-house code converters, is rarely found.
To sum up, code conversion has the following disadvantages:
The next level of sophistication in SC↔TC conversion can be described as word-to-word, character-based conversion. We call this orthographic conversion, because the units participating in the conversion process consist of orthographic units: that is, characters or meaningful combinations of characters that are treated as single entries in dictionaries and mapping tables.
In this paper, we refer to these as word-units. Word-units represent meaningful linguistic units such as single-character words (free forms), word elements such as affixes (bound morphemes), multi-character compound words (free and bound), and even larger units such as idiomatic phrases. For brevity, we will sometimes use word as a synonym for word-unit if no confusion is likely to arise.
Orthographic conversion is carried out on a word-unit basis in four steps:
For example, the SC phrase 梳头发 (shu1 tou2fa0) ‘comb one’s hair’, is first segmented into the word-units 梳 ‘comb’ (single-character free morpheme) and 头发 ‘hair’ (two-character compound), each is looked up in the mapping table, and they are converted to the target string 梳頭髮. The important point is that 头发 is not decomposed, but is treated as a single word-unit. (Actually, this example is complicated by the fact that 梳頭 ‘comb one’s hair’ is also a legitimate word-unit.)
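The steps just described can be sketched as follows (the table is a hypothetical toy; real tables hold millions of entries, and a production segmenter must also handle overlaps such as 梳頭 itself being a word-unit, which greedy matching alone would mishandle):

```python
# Sketch of orthographic conversion: (1) segment the SC text into
# word-units by greedy longest match against the mapping table,
# (2) look each unit up, (3) convert, (4) concatenate.
ORTHO_TABLE = {"梳": "梳", "头发": "頭髮", "出发": "出發"}

def segment(text, table):
    """Greedy longest-match segmentation against the table's keys."""
    units, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest span first
            if text[i:j] in table or j == i + 1:
                units.append(text[i:j])
                i = j
                break
    return units

def ortho_convert(text, table):
    return "".join(table.get(u, u) for u in segment(text, table))

assert segment("梳头发", ORTHO_TABLE) == ["梳", "头发"]
assert ortho_convert("梳头发", ORTHO_TABLE) == "梳頭髮"
```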
The following is an example of an orthographic (word-unit) mapping table. Appendix A gives a more detailed table.
|SC Word-Unit||TC Word-Unit||Pinyin||Meaning|
It is important to note that in both code conversion and orthographic conversion, the results must be in orthographic correspondence with the source. That is, the source and target are merely orthographic variants of the same underlying lexeme (see section 2.3.1 below). This means that each source character must be either identical to, or in exact one-to-one correspondence with, the target character.
For example, in converting SC 计算机 (ji4suan4ji1) to TC 計算機 ‘computer’, 计 corresponds to 計, 算 corresponds to 算 (identical glyph), and 机 corresponds to 機 on a one-to-one basis. No attempt is made to “translate” SC 计算机 to TC 電腦 (dian4nao3), as is done in lexemic (Level 3) conversion.
Orthographic conversion works well as long as the source and target words are in orthographic correspondence, as in the case of SC 头发 and TC 頭髮. Unfortunately, Taiwan, Hong Kong, and the PRC have sometimes taken different paths in coining technical terminology. As a result, there are numerous cases where SC and TC have entirely different words for the same concept. Probably the best known of these is computer, which is normally 计算机 (ji4suan4ji1) in SC but always 電腦 (dian4nao3) in TC.
The next level of sophistication in SC↔TC conversion is to take these differences into account by “translating” from one to the other, which can be described as word-to-word, lexicon-based conversion. We call this lexemic conversion, because the units participating in the conversion process consist of semantic units, or lexemes.
A lexeme is a basic unit of vocabulary, such as a single-character word, affix, or compound word. In this paper, it also denotes larger units, such as idiomatic phrases. For practical purposes, it is similar to the word-units used in orthographic conversion, but the term lexeme is used here to emphasize the semantic nature of the conversion process.
In a sense, converting one lexeme to another is like translating between two languages, but we call it lexemic conversion, not “translation,” since it is limited to words and phrases of closely-related varieties of a single standard language, and no change is made in the word order (as is done in normal bilingual translation).
Let us take the SC string 信息处理 (xin1xi4 chu3li3) ‘information processing’, as an example. It is first segmented into the lexemes 信息 and 处理, each is looked up in a lexemic mapping table, and they are then converted to the target string 資訊處理 (zi1xun4 chu3li3).
It is important to note that 信息 and 資訊 are not in orthographic correspondence; that is, they are distinct lexemes in their own right, not just orthographic variants of the same lexeme. This is not unlike the difference between American English ‘gasoline’ and British English ‘petrol’.
The difference between 处理 and 處理, on the other hand, is analogous to the difference between American English ‘color’ and the British English ‘colour’, which are orthographic variants of the same lexeme. This analogy to English must not be taken too literally, since the English and Chinese writing systems are fundamentally different.
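The two-stage lookup implied by this distinction can be sketched as follows (both toy tables are hypothetical; entries are taken from the examples above):

```python
# Lexemic conversion sketch: each segmented lexeme is looked up in a
# lexemic table first (a true SC -> TC "translation", like gasoline ->
# petrol); lexemes not found there are mere orthographic variants and
# fall back to the orthographic table (like color -> colour).
LEXEMIC = {"信息": "資訊", "计算机": "電腦"}   # distinct lexemes
ORTHO   = {"处理": "處理", "头发": "頭髮"}      # orthographic variants

def lexemic_convert(lexemes):
    return "".join(LEXEMIC.get(x, ORTHO.get(x, x)) for x in lexemes)

assert lexemic_convert(["信息", "处理"]) == "資訊處理"
assert lexemic_convert(["计算机"]) == "電腦"
```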
Lexemic conversion is similar to orthographic conversion, but differs from it in two important ways:
The following is an example of a lexemic mapping table.
|English||SC Lexeme||SC Pinyin||TC Lexeme||TC Pinyin|
As can be seen, the above table maps the semantic content of the lexemes of one variety of Chinese to the other, and in that respect is identical in structure to a bilingual glossary.
Another aspect of lexemic conversion is the treatment of proper nouns. The conversion of proper nouns from SC to TC, and vice versa, poses special problems, both in the segmentation process, and in the compilation of mapping tables. A major difficulty is that many non-Chinese (and even some Chinese) proper nouns are not in orthographic correspondence. In such cases, both code converters and orthographic converters will invariably produce incorrect results.
The principal issues in converting proper nouns are:
Following is an example of a mapping table for non-Chinese names that are not in orthographic correspondence.
|English||SC Source||Correct TC||Incorrect TC|
There are numerous other examples of this kind. These differences are not only extremely interesting in themselves, but have practical consequences. That is, since code and orthographic converters ignore them, they produce the unacceptable results shown in the “Incorrect TC” column above.
Below is an example of two-dimensional mappings, as explained in item (3) above:
|SC Source||Pinyin||TC as Name||TC as Word|
|发||fa1||發||發 髮|
|周||zhou1||周||周 週 賙|
This means that SC 发, when used as a name, must always be converted to TC 發, never to TC 髮. This is quite difficult, since the segmenter must be intelligent enough to distinguish between a character used as a word as opposed to a proper noun. This is a complex issue that deserves a paper in its own right.
The highest level of sophistication in SC↔TC conversion can be described as word-to-word, context-based translation. We call this contextual conversion, because the semantic and syntactic context must be analyzed to correctly convert certain ambiguous polysemous lexemes that map to multiple target lexemes.
As we have seen, orthographic converters have a major advantage over code converters in that they process word-units, rather than single codepoints. Thus SC 特征 (te4zheng1) ‘characteristic’, for example, is correctly converted to TC 特徵 (not to the incorrect 特征). Similarly, lexemic converters process lexemes. For example, SC 光盘 (guang1pan2) ‘CD-ROM’ is converted to the lexemically equivalent TC 光碟 (guang1die2), not to its orthographically equivalent but incorrect 光盤.
This works well most of the time, but there are special cases in which a polysemous SC lexeme maps to multiple TC lexemes, any of which may be correct, depending on the semantic context. We will refer to these as ambiguous polygraphic compounds.
One-to-many mappings of polysemous SC compounds occur both on the orthographic level and the lexemic level. SC 文件 (wen2jian4) is a case in point. In the sense of ‘document’, it maps to itself, that is, to TC 文件; but in the sense of ‘data file’, it maps to TC 檔案 (dang4'an4). This could occur in the TC-to-SC direction too. For example, TC 資料 (zi1liao4) maps to SC 资料 in the sense of ‘material(s); means’, but to SC 数据 (shu4ju4) in the sense of ‘data’.
To our knowledge, converters that can automatically convert ambiguous polygraphic compounds do not exist. This requires sophisticated technology that is similar to that used in bilingual machine translation. Such a system would typically be capable of parsing the text stream into phrases, identifying their syntactic functions, segmenting the phrases into lexemes and identifying their parts of speech, and performing semantic analysis to determine the specific sense in which an ambiguous polygraphic compound is used.
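Since no converter can yet perform this disambiguation automatically, the sketch below fakes the semantic-analysis step with a crude keyword heuristic, purely to show the shape of the problem (the cues and entries are hypothetical):

```python
# Contextual conversion, toy version: an ambiguous polygraphic compound
# maps to multiple TC lexemes, and the surrounding sentence must be
# analyzed to pick the right one. Here "analysis" is a keyword check.
AMBIGUOUS = {
    # SC compound -> (default TC sense, {context cue: alternative TC})
    "文件": ("文件", {"电脑": "檔案", "磁盘": "檔案"}),  # document vs. file
}

def contextual_convert(compound, sentence):
    default, alternatives = AMBIGUOUS[compound]
    for cue, alt in alternatives.items():
        if cue in sentence:       # crude stand-in for semantic analysis
            return alt
    return default

assert contextual_convert("文件", "请签署这份文件") == "文件"    # document
assert contextual_convert("文件", "删除电脑上的文件") == "檔案"  # data file
```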
The CJKI is currently developing a “pseudo-contextual” conversion system that offers a partial solution to this difficult task. It does not do syntactic and semantic analysis, but aims to achieve a high level of accuracy by a semi-automatic process that requires user interaction. To this end we are:
The following is an example of a mapping table for ambiguous polygraphic compounds, both on the orthographic and the lexemic levels.
|SC Source||TC Alternative 1||TC Alternative 2|
|编制||編制 organize; establish||編製 make by knitting|
|制作||制作 creation (music etc.)||製作 manufacture|
|白干||白幹 do in vain||白干 strong liquor|
|阴干||陰乾 let pickles dry||陰干 even numbers|
|文件||檔案 (data) file||文件 document|
Our ultimate goal is to develop a contextual converter that will achieve near-perfect conversion accuracy. Such a converter should, among other things, be able to:
The following is an SC sentence that will no doubt confuse even the most sophisticated conversion engine:
Hey, Fa! Could you please send this fax?
Fa nodded his head and sent the fax.
The most advanced converters today could not possibly do better than:
A Chinese speaker will find it humorous that the converter confused the independent SC words 头 (tou2) ‘head’ and 发 (fa1) ‘send’ with the compound word 头发 (tou2fa0) ‘hair’. The ideal contextual converter should be able to identify these as independent words that happen to be contiguous, and, hopefully, should be able to generate the correct:
Ironically, a simplistic code converter, precisely because it does not recognize word-units, will probably give the correct results in this case, but for the wrong reasons! Admittedly, this is a contrived example. But it is a perfectly natural Chinese sentence, and clearly demonstrates the pitfalls and complexities of Chinese-to-Chinese conversion.
Following is an example of SC-to-TC lexemic (Level 3) conversion.
Simplified Chinese (普通话简体字)
Traditional Chinese (臺灣的國語繁體字)
According to the Computer Weekly, the director of the Georgia Software Research Institute William Kennedy, and the director of Canton University’s Information Processing Institute Professor Dongfeng Zhou, held a press conference in Hong Kong on the topics “The Internet Today” and “The Future of the Information Superhighway.” They also discussed the plans of both institutes to build a “Database of Internet Information.”
The above passage, which is an example of SC-to-TC lexemic conversion, has several interesting features that demonstrate the principal challenges that must be overcome to achieve near-perfect conversion. Below we will examine the various issues related to the conversion process for each of the first three levels.
Let us first consider what would happen if the above passage were converted with a plain code converter. We did this with a popular wordprocessor developed by a Chinese university, and got the following (highly unacceptable) results:
The above brief passage contains six orthographic errors, enclosed in braces, and eleven lexemic errors, enclosed in square brackets: 29 out of 105 characters, or about 28%, were converted incorrectly. For now, we will ignore the lexemic errors (such as 计算机 being converted to 計算機), all of which were incorrect. The table below shows the orthographic errors (“TC Result”), the correct TC equivalents, and other potential candidates.
|SC Source||TC Result||Correct TC||Correct||Other Candidates|
|发表||發表||發表||yes||發表 髮表 發錶 髮錶|
SC compound words consisting of characters that map to only one TC character have only one TC candidate, and were therefore converted with 100% accuracy. Some compounds containing polygraphic characters, such as SC 发 (which maps to TC 發 and 髮), were sometimes converted correctly, as in the case of 发表 to 發表. But in other cases, as in SC 周 (which maps to TC 周, 週 and 賙), they were often converted incorrectly, as happened with 周报 being converted to 周報, as well as in five other cases.
The above analysis demonstrates how unreliable code conversion can be.
The failure to convert SC 周报, 并且 and other words correctly could be resolved by using Level 2 orthographic conversion. Such compounds are recognized as word-units by the segmenter, are looked up in the orthographic mapping tables, and are then unambiguously converted to their correct TC equivalents.
The following is an example of a table that maps SC word-units to TC word-units on the orthographic level.
|SC Source||TC Target||Pinyin||English|
|东丰||東豐||dong1feng1||Dongfeng (a name)|
Using such tables ensures correct conversion on a word-unit level, and avoids the problems inherent in one-to-one code converters.
As we have seen, code and orthographic converters are incapable of dealing with lexemic differences, such as between SC 计算机 and TC 電腦, since these are distinct lexemes for the same concept. There are also many non-Chinese proper nouns that are not transliterated with the same characters. For example, SC 佐治亚 (zuo3zhi4ya4), a phonetic transliteration of ‘Georgia’, should map to TC 喬治亞 (qiao2zhi4ya4), not to its orthographically equivalent 佐治亞.
As the “Correct” column in the table below shows, all the SC lexemes and proper nouns which are not in orthographic correspondence with their TC equivalents were converted incorrectly.
|English||SC Lexeme||SC Pinyin||TC Lexeme||TC Pinyin||Correct|
The above analysis demonstrates that the use of lexemic mapping tables is essential to the attainment of a high level of conversion accuracy.
The one-to-many mapping problem is not limited to the SC-to-TC direction. In fact, most of the difficulties encountered in SC-to-TC conversion are present in TC-to-SC conversion as well. However, the one-to-many mappings on the orthographic level are far less numerous in the TC-to-SC direction.
Nevertheless, we have found a few dozen polygraphic traditional characters that map to two simplified forms, as shown in the table below.
|TC Source||SC Target||Meaning||SC Example|
|徵 zheng1||征||go on journey||长征|
|徵 zhi3||徵||ancient note||宫商角徵羽|
|於 yu2||于||at, in||关于|
|於 yu2||於||Yu (a surname)||於先生|
Some of these characters, such as TC 著 which maps to SC 著 and 着, are of high frequency and are found in hundreds of compound words, so that TC-to-SC conversion is not as trivial as may appear at first sight.
It is worthwhile noting that TC-to-SC mappings are not always reversible. For example, SC 后 (hou4) ‘after; queen’ maps to both TC 後 (hou4) ‘after’ and to TC 后 (hou4) ‘queen’, whereas the TC surname 後 maps only to SC 後. This means that SC-to-TC mapping tables must be maintained separately from TC-to-SC mapping tables.
What is the extent of this problem? Let us look at some statistics. A number of surveys, such as [Xiandai 1986], have demonstrated that the 2000 most frequent SC characters account for approximately 97% of all characters occurring in contemporary SC corpora. Of these, 238 simplified forms, or almost 12%, are polygraphic; that is, they map to two or more traditional forms. This is a significant percentage, and is one of the principal difficulties in converting SC to TC accurately.
Going in the other direction, from TC to SC, the scope of the problem is much more limited, but we have found that about 20 of the 2000 most frequent Big Five characters, based on a corpus of more than 170 million TC characters [Huang 1994], map to multiple SC characters.
But these figures tell only part of the story, because they are based on single characters. To properly grasp the full magnitude of this problem, we must examine the occurrence of all word-units that contain polygraphic characters.
Some preliminary calculations based on our comprehensive Chinese lexical database, which currently contains approximately three million items, show that more than 20,000 of the approximately 97,000 most common SC word-units contain at least one polygraphic character, which leads to one-to-many SC-to-TC mappings. This represents an astounding 21%. A similar calculation for TC-to-SC mappings resulted in 3025, or about 3.5%, out of the approximately 87,000 most common TC word-units. These figures demonstrate that merely converting one codepoint to another, especially in the SC-to-TC direction, will lead to unacceptable results.
Since many high-frequency polygraphic characters are components of hundreds, or even thousands, of compound words, incorrect conversion will be a common occurrence unless the one-to-many mappings are disambiguated by (1) segmenting the byte stream into semantically meaningful units (word-units or lexemes) and, (2) analyzing the context to determine the correct choice out of the multiple candidates.
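The kind of statistic described above can be recomputed on any word-unit list in a few lines; the sketch below uses toy data (the real figures come from CJKI's three-million-entry database, not from this sample):

```python
# Share of word-units containing at least one polygraphic character,
# i.e. word-units whose SC -> TC conversion is one-to-many.
POLYGRAPHIC_CHARS = {"发", "干", "面", "后", "里"}
WORD_UNITS = ["出发", "头发", "干燥", "计算机", "信息", "处理", "暗里", "文件"]

ambiguous = [w for w in WORD_UNITS
             if any(ch in POLYGRAPHIC_CHARS for ch in w)]
assert ambiguous == ["出发", "头发", "干燥", "暗里"]
assert len(ambiguous) / len(WORD_UNITS) == 0.5   # cf. the 21% figure above
```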
In 1996, the Tokyo-based CJK Dictionary Institute (CJKI), which specializes in CJK computational lexicography [Halpern 1994, 1998], launched a project whose ultimate goal is to develop a Chinese-to-Chinese conversion system that gives near-perfect results. This has been a major undertaking that required considerable investment of funds and human resources.
To this end, we have engaged in the following research and development activities:
To achieve a high level of conversion accuracy, our mapping tables are comprehensive, and include approximately three million general vocabulary lexemes, technical terms, and proper nouns. They also include various other attributes, such as pinyin readings, grammatical information, part of speech, and semantic classification codes.
Below is a brief description of the principal components of the conversion system, especially of our mapping tables:
Chinese-to-Chinese conversion has become increasingly important to the localization, translation, and publishing industries, as well as to software developers aspiring to penetrate the East Asian market. But, as we have seen, the issues are complex and require a major effort to build mapping tables and to develop segmentation technology.
The CJK Dictionary Institute finds itself in a unique position to provide software developers with high quality Chinese lexical resources and reliable conversion technology, thereby eliminating expensive manual labor and significantly reducing costs. We are convinced that our ongoing research and development efforts in this area are inexorably leading us toward achieving the elusive goal of building the perfect converter.
We would like to extend our heartfelt gratitude to the various individuals who have contributed to this paper by reviewing it and offering their comments and constructive criticisms. This includes, in alphabetical order, Glenn Adams, James Breen, Sijin Cheng, Carl Hoffman, Timothy Huang, Ken Lunde, Dirk Meyer, Frank Qian, Tsuguya Sasaki, David Westbrook, and Christian Wittern. Several members of the review team are recognized authorities in the field of CJK information processing.
Special recognition is due to Glenn Adams and James Breen, who have reviewed the paper carefully and made many invaluable suggestions.
| GB Code | Source SC | Target TC | Big Five Codes |
|---------|-----------|-----------|----------------|
| B0B5 | 暗 | 暗 闇 | B774 EEEE |
| B2C5 | 才 | 才 纔 | A47E C5D7 |
| B3D4 | 吃 | 吃 喫 | A659 B3F0 |
| B5D6 | 抵 | 抵 牴 觝 | A9E8 ACBB DBD3 |
| B6AC | 冬 | 冬 鼕 | A556 C35D |
| B7E1 | 丰 | 豐 丰 風 | C2D7 A4A5 ADB7 |
| B8F6 | 个 | 個 箇 | ADD3 BAE7 |
| C0DB | 累 | 累 纍 | B2D6 F5EC |
| C3B9 | 霉 | 霉 黴 | BE60 C5F0 |
| CAAC | 尸 | 屍 尸 | ABCD A472 |
| D5F7 | 征 | 徵 征 | BC78 A9BA |
| DAD6 | 谥 | 諡 謚 | EBAC EEB0 |
| F3BD | 蠼 | 蠼 蠷 | F96E F8BE |
| Big Five | Source TC | Target SC | GB 2312-80 (EUC) |
|----------|-----------|-----------|------------------|
| ADB7 | 風 | 风 丰 | B7E7 B7E1 |
| B0AE | 乾 | 干 乾 | B8C9 C7AC |
| BC78 | 徵 | 征 徵 | D5F7 E1E7 |
| GB Code | SC Source | Incorrect TC | Correct TC |
|---------|-----------|--------------|------------|
| B7E1 | 丰 | 丰 豐 | 豐 丰 風 |
| D4C6 | 云 | 云 雲 | 雲 云 |
| BCB8 | 几 | 几 幾 | 幾 几 |
| B8B4 | 复 | 復 複 | 復 複 覆 复 |
| B8C9 | 干 | 干 幹 | 幹 乾 干 榦 |
| B5D6 | 抵 | 抵 | 抵 牴 觝 |
| BDDC | 杰 | 杰 傑 | 傑 杰 |
President, The CJK Dictionary Institute
Editor in Chief, Kanji Dictionary Publishing Society
Research Fellow, Showa Women’s University
Born in Germany in 1946, Jack Halpern has lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he spent sixteen years compiling the New Japanese-English Character Dictionary [Halpern 1990]. He is a professional lexicographer and writer who lectures widely on Japanese culture, won first prize in the International Speech Contest in Japanese, and founded the International Unicycling Federation.
Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK). He has also compiled the world’s first Unicode dictionary of CJK characters.
Following is a list of Jack Halpern’s principal publications in the field of CJK lexicography.
JOUNI KERMAN 華留萬陽貳 (ケルマン・ヨウニ)
Chief of Software Development, CJK Dictionary Publishing Society
Research Fellow, Showa Women’s University
Born in 1967 in Finland, Jouni Kerman has taken a broad interest in languages, computer programming, and economics since his early teens. Besides his native Finnish, he has studied English, Swedish, French, German, Italian, Japanese, Mandarin, and Cantonese. He received a Monbusho scholarship for advanced study of Japanese from the Japanese Ministry of Education in 1992, and in 1996 graduated from the Helsinki School of Economics and Business Administration with a master's degree.
In 1996, Jouni Kerman joined the Kanji Dictionary Publishing Society in Tokyo on a Research Fellow grant from Showa Women’s University to develop a page composition system for The Kodansha Kanji Learner’s Dictionary [Halpern 1999].
The CJK Dictionary Institute
The CJK Dictionary Institute (CJKI) consists of a group of linguists and other experts who specialize in CJK (Chinese, Japanese, Korean) lexicography. The Institute is headed by Jack Halpern, editor-in-chief of the New Japanese-English Character Dictionary (http://kanji.org), which has become a standard reference work for studying Japanese.
The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methods are used to compile and maintain a Unicode-based database that is serving as a source of data for:
The database currently contains about 2.1 million Japanese and 2.5 million Chinese items, including detailed grammatical, phonological, and semantic attributes for general vocabulary, technical terms, and proper nouns. We firmly believe that our database of proper nouns, which has over one million Japanese and one million Chinese items, is without peer in both quantity and quality. Our single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/datasrc.htm for a list of data resources.
CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.
Visit our website at:
|The CJK Dictionary Institute|
|Komine Building, 34-14, 2-chome, Tohoku, Niiza-shi|
|Saitama 352-0001 JAPAN|
|Phone: +81-48-473-3508 Fax: +81-48-486-5023|