The Complexities of Japanese Homophones


Jack Halpern
CEO
The CJK Dictionary Institute, Inc.
株式会社 日中韓辭典研究所
Revised: June 23, 2001



Index to This Document
  1. Some Definitions
  2. Introduction
  3. Overview of Japanese Homophony
  4. Homophone Processing
  5. About the Author
  6. List of Publications
  7. The CJK Dictionary Institute

1. Some Definitions

A plethora of abstruse terms are used to describe the orthographic relations between words, including homograph, heteronym, homologue, heterograph and homonym, to name a few. Much confusion prevails, since these terms are often used inconsistently, even by professional linguists. This topic deserves a full paper in its own right. Here, we will keep it simple and define the most important terms.


  1. Homophone: One of two or more words that are pronounced the same but differ in writing and usually in meaning (e.g. principal and principle).
  2. Homograph: One of two or more words that are written the same but differ in pronunciation and (usually) in meaning (misleadingly also called heteronyms) (e.g. minute "60 seconds" and minute "very small").
  3. Homonym: One of two or more words that are identical in writing and/or pronunciation but differ in meaning (sometimes called homologues) (e.g. light "not heavy" and light "not dark").
  4. Orthographic Variant: One of two or more words that are written differently but are identical in pronunciation and meaning (sometimes called heterographs) (e.g. judgement and judgment).

2. Introduction

Japanese orthography is so highly irregular that it can be considered, without the slightest fear of being accused of hyperbole, to be a couple of orders of magnitude more complex and more irregular than that of any other major language, Chinese included. A major source of complexity in processing Japanese texts is the presence of an extremely large number of homophones.

This article presents a brief overview of Japanese homophony. Our aim is to demonstrate that even professional writers and sophisticated users are confused by the subtle distinctions between the numerous homophones in Japanese, and to assert that treatment of homophones in Japanese texts deserves special attention in the realms of NLP, MT, IR and IME applications.

Here is an example of how complex the problem is. Take the phrase Hi no sasanai yashiki (A Mansion with no Sunshine), which could be the title of a novel or a film. Here are twelve legitimate ways (some more likely than others) to write it.

  1. 日の差さない屋敷
  2. 日の射さない屋敷
  3. 日のささない屋敷
  4. 日の射さない邸
  5. 日の差さない邸
  6. 日のささない邸
  7. 陽の射さない屋敷
  8. 陽の差さない屋敷
  9. 陽のささない屋敷
  10. 陽の射さない邸
  11. 陽の差さない邸
  12. 陽のささない邸

We surveyed six native Japanese speakers, some of whom are professional translators and writers, asking them how they would write the above phrase. Surprisingly, we received six different answers, none of which matched the "standard" form found in dictionaries (#1 above). Clearly, even native speakers of Japanese cannot be expected to know which specific variant is used in an official title.
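The twelve spellings listed above are simply every combination of the alternatives for each part of the phrase: two for hi, three for sasanai and two for yashiki. The following minimal sketch reproduces the list under that assumption; the function and variable names are illustrative only.

```python
from itertools import product

# Alternatives for each segment of "hi no sasanai yashiki",
# taken from the twelve spellings listed in the text.
HI = ["日", "陽"]
SASANAI = ["差さない", "射さない", "ささない"]
YASHIKI = ["屋敷", "邸"]


def spelling_variants():
    """Return every combination of the per-segment alternatives."""
    return [
        hi + "の" + sasanai + yashiki
        for hi, sasanai, yashiki in product(HI, SASANAI, YASHIKI)
    ]


variants = spelling_variants()
for variant in variants:
    print(variant)
```

Because the counts multiply (2 x 3 x 2), adding a single further alternative for any one segment would enlarge the set of candidate spellings considerably.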

3. Overview of Japanese Homophony

An important factor that contributes to the complexity of the Japanese writing system is the existence of a large number of homophones. Kooki and kikoo, for instance, each represent about a dozen words in common use, and the only way to distinguish between such compounds as 機構 kikoo 'mechanism' and 帰港 kikoo 'returning to the harbor' is through the characters. Although on (Chinese-derived) homophones like the above may occasionally cause confusion in the spoken language, they are easily distinguished in the written language.
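One way to picture this is as an index from readings to written forms. The sketch below groups a tiny word list by reading and keeps only the readings shared by more than one word. The first two entries come from the text; 気候 'climate' and 学校 'school' are extra entries added for illustration, and the function name is hypothetical.

```python
from collections import defaultdict

# (written form, reading, gloss)
# 機構 and 帰港 come from the text; the other two are illustrative additions.
ENTRIES = [
    ("機構", "kikoo", "mechanism"),
    ("帰港", "kikoo", "returning to the harbor"),
    ("気候", "kikoo", "climate"),
    ("学校", "gakkoo", "school"),
]


def build_homophone_index(entries):
    """Map each reading to its written forms, keeping only true homophone sets."""
    by_reading = defaultdict(list)
    for form, reading, gloss in entries:
        by_reading[reading].append((form, gloss))
    # A reading with a single written form is not a homophone set.
    return {r: items for r, items in by_reading.items() if len(items) > 1}


index = build_homophone_index(ENTRIES)
for form, gloss in index["kikoo"]:
    print(form, gloss)
```

The written forms stay distinct even though the reading is shared, which is why on homophones are easy to tell apart on the page.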

On the other hand, the abundance of kun (native Japanese) homophones is a source of confusion even to professional writers and editors. Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. In extreme cases, such as the word sasu, a kun word can be written in dozens of ways, though only a few of these are in common use. Unlike on homophones, the majority of kun homophones are close or even identical in meaning and thus easily confused, as shown in the table below:

Kun Homophones

Easily Distinguished: hashi
  橋    bridge
  端    end, edge
  箸    chopsticks

Easily Confused: noboru
  上る  go up (steps, a hill)
  登る  climb, scale
  昇る  ascend, rise (up to the sky)

Another problem with kun homophones is their variable orthography. Two or more characters are often partially or completely interchangeable in some senses but not in others. For example, 解ける tokeru and 溶ける tokeru are interchangeable in the sense of 'melt, thaw' but not in the sense of 'come loose', which is written 解ける. On the other hand, the meanings of some homophones are identical or nearly identical. For example, yawarakai 'soft, subdued; gentle' is written 柔らかい or 軟らかい with exactly the same meaning.
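The point that interchangeability depends on the sense, not just on the reading, can be sketched as a lookup keyed by reading and sense. The data below restates the tokeru and yawarakai examples from the paragraph above; the table layout and the function name are assumptions made for illustration.

```python
# Acceptable written forms for each (reading, sense) pair,
# restating the tokeru and yawarakai examples from the text.
FORMS_BY_SENSE = {
    ("tokeru", "melt, thaw"): {"解ける", "溶ける"},
    ("tokeru", "come loose"): {"解ける"},
    ("yawarakai", "soft, subdued; gentle"): {"柔らかい", "軟らかい"},
}


def interchangeable(reading, sense, form_a, form_b):
    """True if both forms are acceptable spellings of the reading in this sense."""
    forms = FORMS_BY_SENSE.get((reading, sense), set())
    return form_a in forms and form_b in forms


print(interchangeable("tokeru", "melt, thaw", "解ける", "溶ける"))
print(interchangeable("tokeru", "come loose", "解ける", "溶ける"))
```

The same pair of characters yields a different answer for each sense, so a lookup keyed on the reading alone cannot capture the distinction.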

To make matters worse, the distinctions between some homophones are so subtle that many authors don't even try to select the most appropriate kanji and resort to the "easy solution" of using hiragana instead, making the meaning fuzzy and identification more difficult.
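A processing system first has to notice when an author has taken this hiragana route. The sketch below tests whether a written form consists solely of characters from the Unicode Hiragana block (U+3040 to U+309F); the function name is hypothetical, and a real system would also need to handle katakana and mixed spellings.

```python
def is_all_hiragana(text):
    """True if every character lies in the Unicode Hiragana block (U+3040-U+309F)."""
    return bool(text) and all("\u3040" <= ch <= "\u309f" for ch in text)


# Forms of sasu: only the pure-hiragana spelling hides the author's kanji choice.
for form in ["差す", "射す", "さす"]:
    print(form, is_all_hiragana(form))
```

A form that passes this test carries no kanji to indicate which sense was intended, so the sense has to be inferred from context.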

4. Homophone Processing

By "homophone processing" we mean such operations as cross-homophone searching, homophone disambiguation in IME systems, and homophone identification in MT applications. Homophone processing requires a semantically classified database of homophones and a homophone expansion algorithm.

The process of retrieving or identifying Japanese homophones is not, in itself, any more difficult than searching for such English homophones as right and write. But there are factors that make the processing of Japanese homophones far more challenging than that of homophones in any other language. From a text processing point of view, the major issue is that for many kun homophones, a universally accepted orthography does not exist. Theoretically, the choice of character should be based on meaning, but in fact it is often unpredictable and governed by personal preferences.

For example, when a search engine user enters a query that involves homophones, she can never be sure which particular one to select, since often there is no one right answer. The table below demonstrates why this is so by showing the complex semantic interrelations between the homophones for sasu.

Kun Homophones for sasu

No.  English              "Standard" Form  Sometimes also  Often also
1    to offer             差す             -               さす
2    to hold up           差す             -               さす
3    to pour into         差す             注す            さす
4    to color             差す             注す            さす
5    to shine on          差す             射す            さす
6    to aim at            指す             差す            -
6    to point to          指す             さす            -
7    to stab              刺す             さす            -
8    to leave unfinished  さす             止す            -
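The sasu table can be treated as data: each sense lists its written forms, with the "standard" form first. The sketch below uses that representation to find the senses a given form may express and to expand a form into every spelling that shares a sense with it. It is only one possible reading of what a homophone expansion step might look like, and the function names are hypothetical.

```python
# Each sense of sasu with its written forms, "standard" form first,
# transcribed from the table in the text.
SASU_SENSES = [
    ("to offer", ["差す", "さす"]),
    ("to hold up", ["差す", "さす"]),
    ("to pour into", ["差す", "注す", "さす"]),
    ("to color", ["差す", "注す", "さす"]),
    ("to shine on", ["差す", "射す", "さす"]),
    ("to aim at", ["指す", "差す"]),
    ("to point to", ["指す", "さす"]),
    ("to stab", ["刺す", "さす"]),
    ("to leave unfinished", ["さす", "止す"]),
]


def senses_for(form):
    """Return the senses that the given written form may express."""
    return [sense for sense, forms in SASU_SENSES if form in forms]


def expand(form):
    """Return all written forms sharing at least one sense with `form`."""
    result = []
    for _, forms in SASU_SENSES:
        if form in forms:
            for candidate in forms:
                if candidate not in result:
                    result.append(candidate)
    return result


print(senses_for("注す"))
print(expand("射す"))
print(expand("さす"))
```

Expanding the hiragana form さす pulls in every kanji form in the table, which illustrates why a hiragana spelling makes identification harder.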

To sum up, Japanese homophones have certain characteristics that present difficulties in Japanese text processing:

  1. Since many kun homophones are nearly synonymous or even identical in meaning, they are easily confused. As a result, there is no way to predict which particular homophone will appear in a text.
  2. The distinction between some homophones is so subtle that many authors sidestep the irksome task of selecting the appropriate kanji and resort to hiragana.
  3. Since Japanese has only a small stock of phonemes, the number of homophones is very large.

Implementing homophone processing technology requires a comprehensive database of semantically and etymologically classified homophones. Merely retrieving all homophones will do far more harm than good since it will match numerous irrelevant homophones, such as 変える kaeru 'to change' for 帰る kaeru 'to return'.
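The difference between retrieving all homophones and retrieving only semantically related ones can be sketched with a small classified lexicon. The kaeru entries 帰る and 変える come from the text; 還る, a variant spelling of 帰る, and the semantic group labels are added for illustration and are not taken from any actual database.

```python
# (written form, reading, semantic group). The group labels are hypothetical.
LEXICON = [
    ("帰る", "kaeru", "return"),
    ("還る", "kaeru", "return"),
    ("変える", "kaeru", "change"),
]


def _lookup(form):
    """Return (reading, group) for a written form, or raise KeyError."""
    for entry_form, reading, group in LEXICON:
        if entry_form == form:
            return reading, group
    raise KeyError(form)


def expand_naive(form):
    """All forms with the same reading, regardless of meaning."""
    reading, _ = _lookup(form)
    return [f for f, r, _ in LEXICON if r == reading]


def expand_semantic(form):
    """Only forms with the same reading and the same semantic group."""
    reading, group = _lookup(form)
    return [f for f, r, g in LEXICON if r == reading and g == group]


print(expand_naive("帰る"))
print(expand_semantic("帰る"))
```

The naive expansion returns 変える for a query on 帰る, which is exactly the kind of irrelevant match the semantic classification is meant to exclude.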


5. About the Author

JACK HALPERN     春遍雀來     (ハルペン・ジャック)

President, The CJK Dictionary Institute
Editor-in-Chief, Kanji Dictionary Publishing Society
Research Fellow, Showa Women’s University

Born in Germany in 1946, Jack Halpern has lived in six countries and knows twelve languages. Having become fascinated by kanji while living on an Israeli kibbutz, he came to Japan in 1973, where he spent sixteen years compiling the New Japanese-English Character Dictionary. He is a professional lexicographer and writer who lectures widely on Japanese culture, a first-prize winner in the International Speech Contest in Japanese, and the founder of the International Unicycling Federation.

Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK).

6. List of Publications

Following is a list of the author’s principal publications in the field of CJK lexicography.

7. The CJK Dictionary Institute

The CJK Dictionary Institute (CJKI) consists of a small group of researchers who specialize in CJK lexicography. The institute is headed by Jack Halpern, editor-in-chief of the New Japanese-English Character Dictionary, which has become a standard reference work for studying Japanese.

The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:

  1. Dozens of lexicographic works, including electronic dictionaries.
  2. Search engine applications, such as morphological analyzers and simplified to/from traditional Chinese conversion systems.
  3. CJK input method editors (IME) and front-end processors (FEP).
  4. Machine translation, online translation tools and speech technology software.
  5. Pedagogical, linguistic and computational lexicography research.

DESK currently has over two million Japanese and about 2.5 million simplified and traditional Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.

The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.


President
Jack Halpern
The CJK Dictionary Institute, Inc.
日中韓辭典研究所

34-14, 2-chome, Tohoku, Niiza-shi
Saitama 352-0001 JAPAN
Phone: +81-48-473-3508
Fax: +81-42-587-3318
Email: jack [at] cjki.org
WWW: http://www.cjk.org