This report provides an overview of the linguistic and orthographic issues related to the processing of Japanese texts, especially webpages. It has two basic aims:
The existence of orthographic variants and the morphological complexity of Japanese pose a special challenge to the developers of linguistic tools, especially in the field of information retrieval, since the same word or phrase can be written in multiple, often unpredictable, ways. This report focuses on the major types of orthographic variation in Japanese, and provides a brief analysis of the linguistic issues to be considered by software developers.
Japanese orthography is highly irregular. Because of the large number of orthographic variants and easily confused homophones, the Japanese writing system is an order of magnitude more complex than that of any other major language, including Chinese. A major factor is the complex interaction of the four scripts used to write Japanese, resulting in countless words that can be written in a variety of often unpredictable ways. For more information, see Outline of Japanese Writing System (from the author's New Japanese-English Character Dictionary).
Table 1 shows the orthographic variants of the word 取り扱い toriatsukai 'handling'. Study it carefully to get a good grasp of the various issues involved.
|toriatsukai||Type of variant|
|とり扱い||replace kanji with hiragana|
|取りあつかい||replace kanji with hiragana|
One of the most important types of orthographic variation in Japanese occurs in kana endings, called 送り仮名 okurigana, that are attached to a kanji base or stem, as shown in Table 2.
|English||Reading||"Standard" Form||Variant 1||Variant 2||Variant 3|
Okurigana variants are very common. Because usage is often unpredictable, they are a nuisance in any kind of Japanese language processing. When normalizing Japanese orthographic variants, special care must be taken to register all okurigana variants.
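As a rough illustration of what registering variants can look like, the sketch below maps every registered variant of a word to a single canonical form. The table name `VARIANTS` and the functions `build_index` and `normalize` are hypothetical and introduced only for this example. The single entry uses the word 取り扱い discussed above, with 取扱い and 取扱 added as okurigana variants for illustration. A real system would load a far larger, lexicographer-maintained table.

```python
# Hypothetical variant table: canonical form -> registered variants.
# 取扱い and 取扱 are okurigana variants added for illustration;
# the remaining forms replace kanji with hiragana.
VARIANTS = {
    "取り扱い": ["取扱い", "取扱", "とり扱い", "取りあつかい", "とりあつかい"],
}


def build_index(variants):
    """Invert the table so each variant points to its canonical form."""
    index = {}
    for canonical, forms in variants.items():
        index[canonical] = canonical
        for form in forms:
            index[form] = canonical
    return index


INDEX = build_index(VARIANTS)


def normalize(term):
    """Return the canonical form of a registered variant, else the term itself."""
    return INDEX.get(term, term)
```

A lookup table is used here rather than a rule because, as noted above, okurigana usage is often unpredictable. A variant that has not been registered is simply returned unchanged.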
Japanese is written in a mixture of four scripts: kanji (Chinese characters), two syllabic scripts called hiragana and katakana, and romaji (the Latin alphabet). Orthographic variation across scripts is as common as it is unpredictable, so that the same word can be written in hiragana, katakana or kanji, or even in a mixture of two scripts. The table below shows the major cross-script variation patterns in Japanese.
|Type of Variation||Var 1||Var 2||Var 3|
|Kanji vs. Hiragana||
|Kanji vs. Katakana||
|Kanji vs. hiragana vs. katakana||
|Katakana vs. hybrid||
|Kanji vs. katakana vs. hybrid||
|Kanji vs. hybrid||
|Hiragana vs. katakana||
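Of the patterns above, only the hiragana vs. katakana case can be handled mechanically without a dictionary. In Unicode, katakana code points U+30A1 to U+30F6 sit exactly 0x60 above their hiragana counterparts. The following minimal sketch folds katakana to hiragana on that basis. The function name `katakana_to_hiragana` is an assumption made for this example. Variants involving kanji still require mapping tables.

```python
KATAKANA_START = 0x30A1  # ァ
KATAKANA_END = 0x30F6    # ヶ
KANA_OFFSET = 0x60       # distance between the katakana and hiragana blocks


def katakana_to_hiragana(text):
    """Fold katakana to hiragana; all other characters pass through.

    The long vowel mark ー (U+30FC) and the katakana without hiragana
    counterparts (U+30F7 to U+30FA) lie outside the range and are unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if KATAKANA_START <= cp <= KATAKANA_END:
            out.append(chr(cp - KANA_OFFSET))
        else:
            out.append(ch)
    return "".join(out)
```

Applying the same fold to both the query and the indexed text lets a katakana spelling such as ネコ match its hiragana spelling ねこ.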
Though the Japanese writing system underwent major reforms in the postwar period and the character forms have by now been standardized, there is still a significant number of character form variants in common use, especially in proper names. Moreover, some classical works such as the Buddhist scriptures are written in the traditional character forms.
|Type of Variation||English||Reading||Standard||Variant|
|Abbreviated form||largely||oohaba ni||大幅に||大巾に|
|Variant form||10 years old||jussai||十才||十歳|
Recent years have seen a sharp increase in the use of katakana, a syllabary used mostly to write Western loanwords. Katakana orthography is often irregular, and it is quite common for the same word to be written in multiple ways.
The hiragana syllabary is used mostly to write grammatical elements and some native Japanese words. In recent years there has been a considerable increase in the use of hiragana. Though hiragana orthography is quite regular, there is a certain amount of irregularity.
Some of the major types of kana variation are shown in the table below.
|Type of Variation||English||Reading||Standard||Variant|
|Multiple kana||team||chiimu |
|づ vs ず||to continue||tsuzuku||つづく||つずく|
The above is only a brief introduction to the most important types of kana variation. There are various others, such as an optional middle dot (nakaguro) and small kana variants (クォ vs. クオ) in katakana words, the use of traditional (じ vs. ぢ) and historical (い vs. ゐ) kana, and more.
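Two of the variations just mentioned, the optional middle dot and small vs. full-size kana, can be neutralized with a simple matching key. The sketch below is one possible approach and not a complete treatment. The function name `kana_match_key` is hypothetical, and only the five small vowel katakana are folded. The key is lossy and is meant for comparison only, never for display.

```python
NAKAGURO = "\u30fb"  # ・ (katakana middle dot)

# Fold the small vowel katakana to full size so that クォ and クオ compare equal.
SMALL_TO_FULL = str.maketrans("ァィゥェォ", "アイウエオ")


def kana_match_key(text):
    """Build a comparison key that ignores nakaguro and small-vowel variants."""
    return text.replace(NAKAGURO, "").translate(SMALL_TO_FULL)
```

Two spellings are treated as the same word when their keys are equal, so ニュー・ヨーク matches ニューヨーク and クォーツ matches クオーツ.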
An important factor that contributes to the complexity of the Japanese writing system is the existence of a large number of homophones (words pronounced the same but written differently), especially kun (native Japanese) homophones. Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. Most kun homophones are close or even identical in meaning and thus easily confused; for example, noboru means 'go up' when written 上る but 'climb' when written 登る, as shown in the table below.
|Reading||Hom. 1||Hom. 2||Hom. 3||Hom. 4||Meaning|
In processing Japanese texts, a central problem with kun homophones is their variable orthography. Two or more characters are often partially or completely interchangeable in some senses, while the meanings of some homophones are identical or nearly identical. To make matters worse, the distinctions are sometimes so subtle that many authors ignore the kanji and use hiragana instead.
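Because authors may choose any of the kanji forms or fall back on hiragana, a search tool can expand a query term to every written form that shares its reading. The sketch below shows the idea with the noboru example from this section. The table `HOMOPHONES` and the function `expand` are hypothetical names, and the single entry stands in for a full homophone database.

```python
# Hypothetical homophone table: hiragana reading -> kanji forms.
HOMOPHONES = {
    "のぼる": ["上る", "登る"],
}


def expand(term):
    """Return all written forms sharing the term's reading, hiragana first."""
    for reading, forms in HOMOPHONES.items():
        if term == reading or term in forms:
            return [reading] + forms
    # Unregistered terms expand only to themselves.
    return [term]
```

Expansion of this kind trades precision for recall: a query for 登る also retrieves 上る, which is appropriate only where the senses overlap.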
Two advanced forms of variant expansion are synonym expansion and cross-language information retrieval (CLIR). The details are beyond the scope of this report. In a nutshell, synonym expansion generates a picklist of synonyms and other semantically related variants, while cross-language expansion generates a picklist of Japanese equivalents of a foreign-language source term. These are shown in the table below, along with some miscellaneous variant types such as abbreviations and loanwords.
For more information on synonym expansion and CLIR, see my article Cross-Synonym and Cross-Language Searching in Japanese.
One of the central components necessary for building tools for processing Japanese orthographic and other variants is a database of hard-coded mapping tables of orthographic variants and lexical databases for synonym and cross-language expansion, fine-tuned to the needs of variant expansion and normalization. Below is a list of components required for building such a database, which have been developed by our team of lexicographers.
See the following links for more information: