The Challenges of Intelligent Japanese Searching
知的日本語検索の諸課題


Jack Halpern
President
©2004-2024 The CJK Dictionary Institute, Inc.





Index to This Document
  1. Introduction
  2. Cross-orthographic searching
  3. Cross-homophone searching
  4. Homograph disambiguation
  5. Cross-synonym searching
  6. Cross-language searching
  7. Morphological Analysis
  8. Lexical databases
  9. Conclusions

Abstract

The Japanese language, which is written in a mixture of four scripts, is said to have the most complex writing system in the world. Such factors as the lack of a standard orthography, the presence of numerous orthographic variants, and the morphological complexity of the language pose formidable challenges to the building of an intelligent Japanese search engine. This paper describes the linguistic issues that need to be addressed by advanced information retrieval technologies such as cross-language, cross-script, cross-orthographic, and cross-synonym searching, and demonstrates that lexical databases must play a central role in their implementation.



1. Introduction

1.1 Brief Outline of Japanese Writing System

It is often said that Japanese has the most complex writing system in the world. As we shall see below, this claim is fully justified. Contemporary Japanese is written in a mixture of four scripts, each of which has a distinct function.

  1. Thousands of logographic characters, called 漢字 kanji, derived from Chinese.
  2. A native syllabic script called 平仮名 hiragana.
  3. Another native syllabic script called 片仮名 katakana.
  4. Recently the Latin alphabet, called ローマ字 roomaji, has become increasingly common.

Kanji are used to write the core of the Japanese vocabulary. This includes words of Chinese origin, words coined in Japan on the Chinese model, such as 山脈 sanmyaku 'mountain range', as well as native Japanese words, such as 山 yama 'mountain'. Kanji have three basic properties: form, sound, and meaning. Each character may have several distinct pronunciations, called readings. A character may have one or several Chinese-derived on readings, one or several (sometimes dozens of) native Japanese kun readings, or both, and each reading may have numerous meanings associated with it.

Hiragana is used mostly to write grammatical elements, such as inflectional verb endings, and sometimes for writing native Japanese words. For example, in 見た mita the kanji 見 represents the stem of the verb 見る miru 'see' and た ta is a verb ending for forming the past tense. The hiragana endings attached to a kanji stem are called 送り仮名 okurigana.

Katakana is used mostly to write Western loanwords, such as プリンター purintaa 'printer', and onomatopoeic words, such as カチッと kachitto 'with a click'. The Latin alphabet is used for writing acronyms, for some loanwords instead of katakana, and for stylistic effects, especially in the names of shops and magazines.

A running Japanese text normally consists of a mixture of kanji and kana, as shown below:

漢字を組み合わせることによって多数の熟語が作り出せます。
Kanji o kumiawaseru koto ni yotte tasuu no jukugo ga tsukuridasemasu.
Numerous compound words can be formed by combining Chinese characters.

In the above sentence, case particles such as を o (object marker), as well as verb endings (-わせる -waseru in 組み合わせる kumiawaseru 'combine'), are written in hiragana, whereas nouns, such as 熟語 jukugo 'compound word', are written in kanji.

A fuller description of the Japanese writing system can be found in the front matter of the author's New Japanese-English Character Dictionary.


1.2 Intelligent Japanese Searching

Several factors contribute to the difficulties of Japanese information retrieval and query processing. To build a truly sophisticated, "intelligent" Japanese search engine, various challenges must be overcome. Here are some of the major issues:

  1. The lack of a standard, universally accepted, orthography; that is, the presence of a large number of orthographic variants and easily confused homophones.
  2. The morphological structure of the language, which poses a formidable challenge to the development of accurate segmentation and conflation technologies.
  3. The need to support advanced linguistic technologies such as cross-orthographic searching.
  4. Miscellaneous technical requirements such as transcoding between multiple character sets and encodings, support for Unicode, and input method editors.

Each of the above is a major issue that deserves a paper in its own right. In this paper, we will focus on one of the central linguistic issues; that is,

The lack of a standard, universally accepted, orthography.

To our knowledge, few if any search engines have addressed orthographic variation, nor any of the other linguistic issues described in this paper. Let us take a very quick look at the current state of search engine technology.

The earliest search engines, such as AltaVista, Yahoo!, and Excite, are often referred to as first-generation search engines. Such an engine searches its index for the search term entered by the user, then generates a page with a list of relevant, and often irrelevant, links ranked by frequency of occurrence of the search term.

Second-generation search engines, such as Direct Hit and Northern Light, take a more intelligent approach to ensure relevancy by ordering the search results by various criteria such as popularity, semantic categories, link frequency, and page ranks. An excellent example of this is Google, which ranks results by the number of links from pages which themselves have a high rank.

All of these engines, including the very few that claim to be third generation, support only a bare minimum of computational linguistic features. Here, we will define the direction of third-generation search engines by focusing on the future of Japanese search and retrieval technology, especially such advanced linguistic technologies as cross-language, cross-script, cross-orthographic, and cross-synonym searching (described below).


1.3 The Big Picture

Superficially, it would seem as if search engines need only search for the actual keywords provided by the user. In fact, from personal discussions with the executives of several leading search engine companies, it is clear that they deliberately follow a policy of "not casting a wide net", so that searching for "travelled to Britain" will not match "traveled to Britain", not to speak of "travel to the U.K."

Ostensibly, the justification for such a policy is to prevent flooding the user with irrelevant results. The real reason, no doubt, is that they do not possess the technology for linguistically sophisticated searching. What such a policy often does achieve is the proverbial "throwing the baby out with the bathwater", since many relevant results are indiscriminately ignored along with the irrelevant ones.

Now let us step back and have a closer look at the big picture, strictly from the user's point of view. That is, let us pose the most relevant question of all:

What, exactly, is it that a search engine user really wants?

In this paper we will demonstrate that, especially in the case of Japanese, the user is far more interested in the inherent meaning (semantic content) represented by the search term, rather than in the accidental form (written representation) of any of its orthographic variants.

In many of the major languages of the world, orthographic variation is not a major issue, since their orthographies tend to be stable. Though English is notorious for its spelling irregularities, spelling variants (such as 'judgement' vs. 'judgment', 'wordprocessor' vs. 'word processor') are more of an annoyance than a major obstacle. For the most part, users can expect the orthographic representation of the search term to have little or no variation.

Not so with Japanese. Japanese orthography is so highly irregular that it can be considered, without the slightest fear of being accused of hyperbole, to be a couple of orders of magnitude more complex and more irregular than any other major language, Chinese included (Simplified Chinese has a remarkably stable orthography).

Should the user of a Japanese search engine be required to be intimately familiar with these complexities? Obviously not. Ideally, users should only be concerned with quickly finding the information they are seeking, not with the intricacies of the Japanese writing system.

That, in a nutshell, is where the real power of an "intelligent" Japanese search engine comes in. It relieves the user of the burden of dealing with the details of how the search term should be written, and lets her focus on the real issue at hand: defining the content of the information to be retrieved.


2. Cross-Orthographic Searching

This section presents a brief overview of Japanese orthographic variation, focusing on those issues most relevant to information retrieval. The highly irregular Japanese orthography is a major obstacle to efficient searching. Our aim is to describe the complexities of the Japanese orthography, and to demonstrate that the intelligent search engine should be capable of retrieving all the orthographic variants of the search term; in other words, of performing cross-orthographic searching.


2.1 Orthographic Chaos?

The Japanese orthography is highly unstable, bordering on the chaotic. A major factor that contributes to this state of affairs is the complex interaction of the four scripts used to write Japanese, resulting in countless words that can be written in a variety of often unpredictable ways.

Study the table below, which shows the orthographic variants of the words 取り扱い toriatsukai 'handling' and 当たり外れ atarihazure 'hit or miss'.


Some Orthographical Variants
toriatsukai | atarihazure | Type of variant
取り扱い | 当たり外れ | "standard" form
取扱い | 当り外れ | okurigana variant
- | 当外れ | okurigana variant
取扱 | 当外 | all kanji
とり扱い | 当たりはずれ | replace kanji with hiragana
取りあつかい | あたり外れ | replace kanji with hiragana
とりあつかい | あたりはずれ | all hiragana

It is important to note that these variants are not contrived examples for the sake of illustration. All the above forms do occur in contemporary Japanese, though some are less frequent than others. Even in carefully edited publications, not to speak of sloppily written webpages, there is no reliable way to predict which particular variant will occur, as this often depends on the whim of the author or editorial policy.

How does such orthographic variation affect the search engine user? Let us examine this issue in a bit more depth, looking at it entirely from the user's point of view. Let us say that a user is looking for a novel called Hi no sasanai yashiki (A Mansion with no Sunshine). Here are twelve legitimate ways (some more likely than others) to write this phrase.

  1. 日の差さない屋敷
  2. 日の射さない屋敷
  3. 日のささない屋敷
  4. 日の射さない邸
  5. 日の差さない邸
  6. 日のささない邸
  7. 陽の射さない屋敷
  8. 陽の差さない屋敷
  9. 陽のささない屋敷
  10. 陽の射さない邸
  11. 陽の差さない邸
  12. 陽のささない邸

We did a quick survey of six native Japanese speakers, some of whom are professional translators and writers, asking them how they would write the above phrase. Surprisingly (or not surprisingly), we received six different answers, none of which matched the "standard" form found in dictionaries (#1 above). Clearly, users of Japanese search engines cannot possibly be expected to know which specific variant is used in the official title of the book.

Now let us say that the user is searching for the Japanese equivalent of the proverbial "A hen that lays golden eggs." Theoretically, the "standard" way of writing this is:

金の卵を産む鶏
Kin no tamago wo umu niwatori

In reality, three of the keywords above have the following common variants:


English | Reading | "Standard" Form | Variant 1 | Variant 2 | Variant 3
egg | tamago | 卵 | 玉子 | たまご | タマゴ
hen | niwatori | 鶏 | にわとり | ニワトリ | -
to lay | umu | 産む | 生む | - | -

Combining the permutations for the above three words yields 24 possible ways of writing the original phrase, as shown in the table below:


"Kin no tamago wo umu niwatori"
'A hen that lays golden eggs'
鶏 | にわとり | ニワトリ
1. 金の卵を産む鶏 | 9. 金の卵を産むにわとり | 17. 金の卵を産むニワトリ
2. 金の卵を生む鶏 | 10. 金の卵を生むにわとり | 18. 金の卵を生むニワトリ
3. 金の玉子を産む鶏 | 11. 金の玉子を産むにわとり | 19. 金の玉子を産むニワトリ
4. 金の玉子を生む鶏 | 12. 金の玉子を生むにわとり | 20. 金の玉子を生むニワトリ
5. 金のたまごを産む鶏 | 13. 金のたまごを産むにわとり | 21. 金のたまごを産むニワトリ
6. 金のたまごを生む鶏 | 14. 金のたまごを生むにわとり | 22. 金のたまごを生むニワトリ
7. 金のタマゴを産む鶏 | 15. 金のタマゴを産むにわとり | 23. 金のタマゴを産むニワトリ
8. 金のタマゴを生む鶏 | 16. 金のタマゴを生むにわとり | 24. 金のタマゴを生むニワトリ

Again, rest assured that this is not a contrived example designed to make a point. Not only is each of the variants for tamago, niwatori and umu of frequent occurrence in written Japanese -- they actually occur within the phrase in question, as can be easily verified by searching with your favorite search engine. Clearly, the user has no hope of finding all these variants unless the search engine can perform sophisticated cross-orthographic searching, as discussed below.
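The count of 24 follows directly from multiplying the number of spellings of each keyword (4 x 2 x 3). As a minimal sketch, and not a description of any existing search engine, the full set of spellings can be enumerated with a Cartesian product. The list names EGG, LAY and HEN and the function phrase_variants are hypothetical names chosen for this illustration; the spellings themselves are the ones discussed above.

```python
from itertools import product

# Spellings of each keyword, as discussed in the text above.
EGG = ["卵", "玉子", "たまご", "タマゴ"]   # tamago
LAY = ["産む", "生む"]                      # umu
HEN = ["鶏", "にわとり", "ニワトリ"]        # niwatori


def phrase_variants():
    """Return every spelling of 'kin no tamago wo umu niwatori'."""
    # HEN varies slowest and LAY fastest, so the order matches the
    # numbering 1..24 used in the table.
    return [
        f"金の{egg}を{lay}{hen}"
        for hen, egg, lay in product(HEN, EGG, LAY)
    ]


variants = phrase_variants()
print(len(variants))  # 24
```

A search engine would not normally issue 24 separate queries; a more realistic design is to reduce each indexed word and each query word to one canonical key, so that all spellings meet at the same index entry.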


2.2 Okurigana Variants

A major factor that exacerbates the difficulties of Japanese information retrieval is the orthographic variation that occurs in kana endings, called 送り仮名 okurigana, that are attached to a kanji base or stem, as shown in the table below:

Okurigana Variants
English | Reading | "Standard" Form | Variant 1 | Variant 2 | Variant 3
publish | kakiarawasu | 書き表す | 書き表わす | 書表わす | 書表す
perform | okonau | 行う | 行なう | - | -
Tokyo-bound | Tookyoo-yuki | 東京行き | 東京行 | - | -

Okurigana variants are very common. Though the Japanese government publishes guidelines, actual usage is unpredictable and depends on editorial policy or personal preference. The intelligent Japanese search engine should be able to retrieve all such variants, regardless of how the search term is input.
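One simple way to make such variants retrievable is to store them as groups and expand the query term to its whole group. The sketch below assumes a small hand-built table; OKURIGANA_GROUPS, VARIANT_INDEX and expand_okurigana are hypothetical names, and a production system would draw the groups from a full lexical database rather than from a literal list.

```python
# Hypothetical variant groups, copied from the okurigana table above.
OKURIGANA_GROUPS = [
    ["書き表す", "書き表わす", "書表わす", "書表す"],
    ["行う", "行なう"],
    ["東京行き", "東京行"],
]

# Map every spelling to its whole group so lookup is a single dict access.
VARIANT_INDEX = {
    form: group for group in OKURIGANA_GROUPS for form in group
}


def expand_okurigana(term):
    """Return all known okurigana variants of term, or just [term]."""
    return list(VARIANT_INDEX.get(term, [term]))


print(expand_okurigana("行なう"))
```

Because the lookup is keyed on every member of the group, the user may enter any variant and still retrieve the others.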

2.3 Cross-script Searching

As explained in Section 1, Japanese is written in a mixture of four different scripts. Orthographic variation across scripts is a common occurrence, so that the same word can be written in hiragana, katakana, kanji, or even in the Latin alphabet. To compound these difficulties, the same word is sometimes written in two scripts, such as hiragana and kanji.

Study the table below carefully; it shows all the major cross-script variation patterns that occur in Japanese:


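Cross-Script Orthographic Variation
No. | English | Reading | Kanji | Hiragana | Katakana | Latin | Hybrid
1 | many people | oozei | 大勢 | おおぜい | - | - | -
2 | say | iu (yuu) | 言う | いう | - | - | -
3 | sulfur | ioo | 硫黄 | - | イオウ | - | -
4 | cat | neko | 猫 | ねこ | ネコ | - | -
5 | kilo | kiroguramu | - | - | キログラム | kg | -
6 | shirt | waishatsu | - | - | ワイシャツ | - | Yシャツ
7 | skin | hifu | 皮膚 | - | ヒフ | - | 皮フ
8 | comet | suisei | 彗星 | - | - | - | すい星
9 | glittering | pikapika | - | ぴかぴか | ピカピカ | - | -
10 | open | oopun | - | - | オープン | open, OPEN | -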

The above table shows that almost any combination of scripts can occur: kanji vs. hiragana, hiragana vs. katakana, Latin vs. katakana, etc. Cross-script variation is as common as it is unpredictable. From an information retrieval point of view, what is particularly irksome is the recent tendency to write many common kun words (like #2 above), and even on words (like #1 above), in hiragana instead of kanji, based on the widespread misconception that hiragana is "easier" to read.

When inputting a keyword, the user cannot be expected to be aware of these script differences. It goes without saying that the intelligent search engine should be capable of performing cross-script searching; that is, should be able to retrieve all such variants, regardless of the script the keyword is provided in.
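The hiragana versus katakana part of this problem can be handled mechanically, because the hiragana block (U+3041 to U+3096) and the corresponding katakana block (U+30A1 to U+30F6) are laid out in parallel in Unicode, 0x60 code points apart. The following is a minimal sketch of folding both scripts to one key; the function names are hypothetical. Variation that involves kanji, the Latin alphabet or hybrid forms cannot be computed this way and needs a lexicon.

```python
HIRAGANA_START, HIRAGANA_END = 0x3041, 0x3096  # ぁ .. ゖ
KANA_OFFSET = 0x60  # hiragana code point + 0x60 = matching katakana


def to_katakana(text):
    """Convert hiragana characters to katakana; leave everything else."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


def same_word_across_kana(a, b):
    """True if a and b differ only in hiragana vs. katakana."""
    return to_katakana(a) == to_katakana(b)


print(to_katakana("ねこ"))                       # ネコ
print(same_word_across_kana("ぴかぴか", "ピカピカ"))  # True
```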


2.4 Kanji Variants

Though written Japanese underwent major reforms in the postwar period, resulting in the simplification and standardization of character forms, there is nevertheless a significant number of character form variants in common use, especially in proper names. Classical Japanese literature and religious texts such as the Buddhist scriptures are written almost exclusively in the traditional old forms.


Kanji Variants and Traditional Forms
English | Reading | Standard | Variant | Comment
largely | oohaba ni | 大幅に | 大巾に | abbreviated form
10 years old | jussai | 十歳 | 十才 | variant form
proper name | Nakajima | 中島 | 中嶋 | variant form
development | hattatsu | 発達 | 發達 | traditional form

Since the use of variant forms is not uncommon, the intelligent search engine should be able to retrieve all such forms. For searching classical texts and religious scriptures, retrieving the traditional forms based on the simplified forms is especially important.
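For character-level variants, one possible approach is to fold each variant character to its standard form before indexing and before querying, so that both spellings share one key. The sketch below uses a two-entry mapping purely for illustration (traditional 發 for 発, and the name variant 嶋 for 島); KANJI_FOLD and kanji_key are hypothetical names. Word-level abbreviations, where the substitute character has its own independent meaning, should be handled by a word lexicon rather than by character folding.

```python
# Illustrative mapping only: variant character -> standard character.
KANJI_FOLD = str.maketrans({
    "發": "発",  # traditional form
    "嶋": "島",  # variant form, common in proper names
})


def kanji_key(text):
    """Fold known variant characters to their standard forms."""
    return text.translate(KANJI_FOLD)


print(kanji_key("發達"))  # 発達
print(kanji_key("中嶋"))  # 中島
```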


2.5 Phonetic Substitutes

Japanese has numerous orthographic variants based on the principle of phonetic substitution. For example, 盲 is interchangeable with 妄 in such compounds as 妄想 (=盲想) moosoo 'wild idea', but not in 盲従 moojuu 'blind obedience'. One such variant, in this case 妄, is a phonetically replaced character, and the other, in this case 盲, is a phonetic replacement character. Such characters have the same reading and are often similar in meaning.


Phonetic Substitutes
English | Reading | Phonetic Replacement | Phonetically Replaced
fermentation | hakkoo | 発酵 | 醗酵
satire | fuushi | 風刺 | 諷刺
linking | renkei | 連係 | 連繋
wild idea | moosoo | 盲想 | 妄想
abuse | ranyoo | 乱用 | 濫用

Though the older, phonetically replaced characters are gradually going out of use, their frequency of occurrence is sufficiently high to warrant support by search engines.


2.6 Katakana Variants

The katakana syllabary is used mostly to write Western loanwords, onomatopoeic words, names of plants and animals, non-Japanese personal and place names, for emphasis, and for slang. Recent years have seen an enormous increase in katakana use, especially in technical terminology.

Unfortunately, katakana orthography is often irregular, so that the same word may be written in multiple ways. Basically, the katakana transliteration of a loanword is an attempt to approximate the pronunciation of its etymon (the foreign word from which it is derived). Although there are general guidelines for loanword orthography, in practice there is considerable variation.

Katakana variation can be classified into the following types:

  1. The presence or absence of a macron (a dash-like symbol that indicates long vowels).
  2. The presence or absence of nakaguro (a middle dot between katakana words).
  3. Replacing macrons with actual vowels to indicate long vowels.
  4. A single foreign sound may be transcribed by multiple kana characters.
  5. Miscellaneous katakana variants.

Typology of Katakana Variation
Variation Type | English | Reading | Standard Form | Variants
1. Macron | computer | konpyuuta, konpyuutaa | コンピュータ | コンピューター
1. Macron | user | yuuza, yuuzaa | ユーザー | ユーザ
2. Nakaguro | online | onrain | オンライン | オン・ライン
2. Nakaguro | ice cube | aisukyuubu | アイスキューブ | アイス・キューブ
3. Long vowels | eye shadow | aishadoo | アイシャドー | アイシャドウ
3. Long vowels | maid | meedo | メード | メイド
4. Multiple kana | Diesel | diizeru, jiizeru | ディーゼル | ジーゼル, ヂーゼル
4. Multiple kana | team | chiimu, tiimu | チーム | ティーム
4. Multiple kana | violin | baiorin, vaiorin | バイオリン | ヴァイオリン
5. Others | quota | kuootaa, kwootaa | クオーター | クォーター
5. Others | Jerusalem | ierusaremu | エルサレム | イェルサレム

The above is only a brief introduction to the complexities of katakana variation, which is as common as it is unpredictable. To relieve the user from the burden of guessing the correct variant, the intelligent Japanese search engine should be capable of retrieving all katakana variants of the search term.
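The first two variation types (word-final macron and nakaguro) are regular enough to be folded by rule. The sketch below is a deliberately conservative illustration, with the hypothetical function name katakana_key: it removes the middle dot and a single trailing macron, and nothing else. Types 3 to 5 depend on the individual word and are better served by a variant lexicon.

```python
NAKAGURO = "・"  # U+30FB katakana middle dot
MACRON = "ー"    # U+30FC prolonged sound mark


def katakana_key(word):
    """Fold nakaguro and a word-final macron into one matching key."""
    key = word.replace(NAKAGURO, "")
    # Drop one trailing macron, but never reduce the word to nothing.
    if len(key) > 1 and key.endswith(MACRON):
        key = key[:-1]
    return key


print(katakana_key("コンピューター") == katakana_key("コンピュータ"))      # True
print(katakana_key("アイス・キューブ") == katakana_key("アイスキューブ"))  # True
```

Rule-based folding of this kind can conflate distinct words that differ only in a final long vowel, so a real system would presumably check candidate matches against a dictionary.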

2.7 Hiragana Variants

As explained in Section 1, the hiragana syllabary is used mostly to write grammatical elements and some native Japanese words, such as adverbs and particles. In recent years there has been a considerable increase in the use of hiragana, both for stylistic effects and because of the popular belief that hiragana is easier to read than kanji.

Unfortunately, both for humans and computers, hiragana strings are considerably more difficult to segment than the equivalent kanji-kana texts. Since there are no delimiters between words, identifying the lexemes in a hiragana string is often a futile exercise in disambiguation.

If we discount the okurigana irregularities and cross-script variations explained in Section 2.2 and Section 2.3, the hiragana orthography is, in itself, quite regular. Nevertheless, there are a certain number of hiragana irregularities, as explained below:

  1. The use of traditional kana orthography for the case particles は, へ and を, instead of わ, え and お.
  2. The use of traditional お instead of う to indicate long o in certain words.
  3. The use of voiced ぢ and づ in place of じ and ず when the former are (1) preceded by ち and つ or (2) appear in voiced compounds derived from ち and つ. But for some words, as shown below, じ and ず are preferred. Actual usage is unpredictable.
  4. The use of historical kana orthography in prewar texts, classical literature and Buddhist scriptures.
  5. Miscellaneous hiragana variants, such as the use of the kana repetition symbol ゝ.

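Typology of Hiragana Variation
Variation Type | English | Reading | Standard Form | Variants
1. Particles | Hello | konnichi wa | こんにちは | こんにちわ
2. Traditional | way | toori | とおり | とうり
2. Traditional | big | ookii | おおきい | おうきい
3. ぢ and づ | to shrink | chijimu | ちぢむ | ちじむ
3. ぢ and づ | to continue | tsuzuku | つづく | つずく
3. ぢ and づ | nosebleed | hanaji | はなぢ | はなじ
3. ぢ and づ | to nod | unazuku | うなずく | うなづく
4. Historical | to use | mochiiru | もちいる | もちゐる
4. Historical | long oo | koo | こう | かう, かふ
4. Historical | smell | nioi | におい (匂い) | にほひ
5. Others | here | koko | ここ | こゝ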

Although hiragana variation is a relatively minor issue, the intelligent Japanese search engine should be capable of retrieving hiragana variants, regardless of how the keyword is input.
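Of the variation types listed in this section, the ぢ/じ and づ/ず alternation (type 3) is the easiest to neutralize, since each pair is pronounced identically. A minimal sketch, with the hypothetical names VOICED_FOLD and hiragana_key, is to fold ぢ to じ and づ to ず when building the matching key. The other types (particles, long o, historical orthography) depend on the word and would need lexicon support.

```python
# Fold the voiced pairs that are pronounced alike into one member.
VOICED_FOLD = str.maketrans({"ぢ": "じ", "づ": "ず"})


def hiragana_key(word):
    """Return a matching key that ignores the ぢ/じ and づ/ず distinction."""
    return word.translate(VOICED_FOLD)


print(hiragana_key("はなぢ") == hiragana_key("はなじ"))    # True
print(hiragana_key("うなづく") == hiragana_key("うなずく"))  # True
```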


3. Cross-Homophone Searching

This section presents a brief overview of Japanese homophony. Our aim is to demonstrate that even professional writers and sophisticated users are confused by the subtle distinctions between the numerous homophones in Japanese, and to assert that the intelligent search engine should be capable of optionally retrieving the various homophonic variants of the search term; in other words, of performing cross-homophone searching.


3.1. Some Definitions

A plethora of abstruse terms are used to describe the orthographic relations between words, including homograph, heteronym, homologue, heterograph and homonym, to name a few. Much confusion prevails, since these terms are often used inconsistently, even by professional linguists. This topic deserves a full paper in its own right. Here, we will keep it simple and define the most important terms.

  1. Homophone: One of two or more words that are pronounced the same but differ in writing and usually in meaning (e.g. principal and principle).
  2. Homograph: One of two or more words that are written the same but differ in pronunciation and (usually) in meaning (misleadingly also called heteronyms) (e.g. minute "60 seconds" and minute "very small").
  3. Homonym: One of two or more words that are identical in writing and/or pronunciation but differ in meaning (sometimes called homologues) (e.g. light "not heavy" and light "not dark").
  4. Orthographic Variant: One of two or more words that are written differently but are identical in pronunciation and meaning (sometimes called heterographs) (e.g. judgement and judgment).

3.2 Overview of Japanese Homophony

An important factor that contributes to the complexity of the Japanese writing system is the existence of a large number of homophones. Kooki and kikoo, for instance, each represent about a dozen words in common use, and the only way to distinguish between such compounds as 機構 kikoo 'mechanism' and 帰港 kikoo 'returning to the harbor' is through the characters. Although on (Chinese derived) homophones like the above may occasionally cause confusion in the spoken language, they are easily distinguished in the written language.

On the other hand, the abundance of kun (native Japanese) homophones is a source of confusion even to professional writers and editors. Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. In extreme cases, such as the word sasu, a kun word can be written in dozens of ways, though only several of these are in common use. Unlike on homophones, the majority of kun homophones are often close or even identical in meaning and thus easily confused, as shown in the table below:

Kun Homophones
Easily Distinguished (hashi) | Easily Confused (noboru)
橋 bridge | 上る go up (steps, a hill)
端 end, edge | 登る climb, scale
箸 chopsticks | 昇る ascend, rise (up to the sky)

Another problem with kun homophones is their variable orthography. Two or more characters are often partially or completely interchangeable in some senses but not in others. For example, 解ける tokeru and 溶ける tokeru are interchangeable in the sense of 'melt, thaw' but not in the sense of 'come loose', which is written 解ける. On the other hand, the meanings of some homophones are identical or nearly identical. For example, yawarakai 'soft, subdued; gentle' is written 柔らかい or 軟らかい with exactly the same meaning.

To make matters worse, the distinctions between some homophones are so subtle that many authors don't even try to select the most appropriate kanji and resort to the "easy solution" of using hiragana instead, making the meaning fuzzy and searching more difficult (see 2.3 Cross-script Searching for details).

3.3 Intelligent Homophone Searching

Cross-homophone searching requires a semantically classified database of homophones and a homophone expansion algorithm. The process of searching for Japanese homophones is not, in itself, any more difficult than searching for such English homophones as right and write. In English, however, cross-homophone searching is clearly undesirable. A user searching for right is most certainly not interested in finding write.

Not so in Japanese. From an information retrieval point of view, the major issue is that for many kun homophones, a universally-accepted orthography does not exist. Theoretically, the choice of character should be based on meaning, but in fact it is often unpredictable and governed by personal preferences.

This means that, when the user enters a query that involves homophones, she can never be sure which particular one to select, since often there is no one right answer. We have already seen in Section 2.1 above how hopelessly difficult it is for the user to select the appropriate homophones when searching for the book title Hi no sasanai yashiki. The table below demonstrates why this is so by showing the complex semantic interrelations between the homophones for sasu.

Kun Homophones for sasu
No. | English | "Standard" Form | Sometimes also | Often also
1 | to offer | 差す | - | さす
2 | to hold up | 差す | - | さす
3 | to pour into | 差す | 注す | さす
4 | to color | 差す | 注す | さす
5 | to shine on | 差す | 射す | さす
6 | to aim at | 指す | 差す | -
7 | to point to | 指す | さす | -
8 | to stab | 刺す | さす | -
9 | to leave unfinished | さす | 止す | -

To sum up, Japanese homophones have certain characteristics that exacerbate the difficulties of retrieving them:

  1. Since many kun homophones are nearly synonymous or even identical in meaning, they are easily confused. As a result, there is no way to predict which particular homophone will appear in a text.
  2. The distinction between some homophones is so subtle that many authors sidestep the irksome task of selecting the appropriate kanji and resort to hiragana.
  3. Since Japanese has only a small stock of phonemes, the number of homophones is very large.

As things stand now, the entire burden of homophone searching falls upon the user. The intelligent search engine, by performing cross-homophone searching at the user's request, will relieve the user of this burden by retrieving all the homophones in the relevant homophone group.

Implementing such technology requires a comprehensive database of semantically and etymologically classified homophones. Merely retrieving all homophones will do far more harm than good since it will match numerous irrelevant homophones, such as 変える kaeru 'to change' for 帰る kaeru 'to return'.
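The difference between expanding by reading alone and expanding by reading plus sense can be sketched as follows. The miniature database below is hypothetical and holds only a few entries taken from the examples in this section; HOMOPHONES and expand_homophones are names invented for the illustration. A spelling is expanded only to spellings that share a sense group with it, so 帰る never pulls in 変える even though both are read kaeru.

```python
# Hypothetical miniature database: reading -> sense -> spellings.
HOMOPHONES = {
    "さす": {
        "to shine on": ["差す", "射す", "さす"],
        "to stab": ["刺す", "さす"],
    },
    "かえる": {
        "to change": ["変える"],
        "to return": ["帰る"],
    },
}


def expand_homophones(term):
    """Return spellings sharing both reading and sense with term."""
    result = []
    for senses in HOMOPHONES.values():
        for spellings in senses.values():
            if term in spellings:
                for spelling in spellings:
                    if spelling not in result:
                        result.append(spelling)
    # Unknown terms are returned unchanged.
    return result or [term]


print(expand_homophones("射す"))
print(expand_homophones("帰る"))
```

An ambiguous hiragana query such as さす belongs to several sense groups, so it expands to the union of those groups; narrowing it further would require context.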


4. Homograph Disambiguation

4.1 Overview of Japanese Homography

Below is a brief overview of Japanese homography. A homograph is one of two or more words that are written the same but differ in pronunciation and (usually) in meaning, e.g. minute "60 seconds" and minute "very small".

Japanese has numerous kanji that have multiple on and kun readings, which gives rise to a large number of homographs. The table below lists some typical examples:


Japanese Homographs
Num. | Homograph | Reading | English
1 | 一時 | ichiji | one o'clock; temporarily
1 | 一時 | hitotoki | a while
1 | 一時 | ittoki | a moment; 12th part of day
2 | 一章 | isshoo | one chapter
2 | 一章 | kazuaki | a first name
3 | 仮名 | kana | kana syllabary
3 | 仮名 | kamei | fictitious name, pseudonym
3 | 仮名 | karina | alias, assumed name
3 | 仮名 | kemyoo | fictitious name, pseudonym
4 | 化学 | kagaku | chemistry
4 | 化学 | bakegaku | chemistry

Unlike English homographs, which differ in meaning, the meanings of Japanese homographs could be identical (化学 above), totally different (一章 above), or partially synonymous (一時 above).

4.2 Intelligent Homograph Searching

Since the number of homographs in Japanese is very large (we found over 20,000 in our databases), it follows that a failure to identify specific homographs will often lead to irrelevant results. However, it is self-evident that, since homographs are written in exactly the same way, retrieving the one semantically relevant to the search term is a very difficult task. This is called homograph disambiguation, and is an important issue in text-to-speech synthesis. This is similar to searching for a polysemous word used in a specific sense, such as for 'table' in the sense of "article of furniture," as opposed to 'table' in the sense of "rectangular array of data."

The truly intelligent search engine should be capable of performing homograph disambiguation.


5. Cross-Synonym Searching

The words of a language form a closely-linked network of interdependent units. The meaning of a word or expression cannot really be understood unless its relationships with other closely related words are taken into account. For example, such words as kill, murder, and execute share the meaning of 'put to death', but they differ in usage and connotation.

The Japanese language has an extraordinarily rich stock of synonyms and synonymous expressions. This section presents a brief overview of Japanese synonymy, and demonstrates that the user can greatly benefit from an intelligent search engine capable of retrieving synonyms of the search term; that is, of performing cross-synonym searching or synonym expansion.


5.1 Overview of Japanese Synonymy

From the point of view of building an intelligent search engine, the abundance of Japanese synonyms poses some interesting challenges. Below is a brief introduction to this complex subject, with focus on the different types of sense relations between synonyms and other kinds of semantically related words.

  1. Synonymy
    A relation between a set of words that are similar (near-synonyms) or identical (absolute-synonyms) in meaning.

    Relation | English | Reading | Japanese
    Shared concept | money | kane | 金
    Synonyms | currency | tsuuka | 通貨
    Synonyms | cash | genkin | 現金
    Synonyms | bank note | shihei | 紙幣

  2. Hyponymy and Hyperonymy
    A relation between a set of specific (subordinate) terms, called hyponyms, and a generic (superordinate) term, called the hyperonym. The hyperonym is more general and includes the senses of the hyponyms.

    Relation | English | Reading | Japanese
    Hyperonym | sound | oto | 音
    Hyponyms | voice | koe | 声
    Hyponyms | echo | hankyoo | 反響
    Hyponyms | noise | sooon | 騒音


  3. Meronymy
    A relation between a set of subordinate words, called meronyms, whose meanings are in a partitive (part-of) relation to a more comprehensive concept, called a holonym.

    Relation | English | Reading | Japanese
    Holonym | city | shi | 市
    Meronyms | ward | ku | 区
    Meronyms | town section | choo | 町
    Meronyms | town subsection | choome | 丁目

  4. Complementarity
    A relation between a set of words that contrast with each other and are mutually exclusive:

    Relation | English | Reading | Japanese
    Shared concept | siblings | kyoodai | 兄弟
    Complementary terms | older brother | ani | 兄
    Complementary terms | younger brother | otooto | 弟
    Complementary terms | older sister | ane | 姉
    Complementary terms | younger sister | imooto | 妹

  5. Antonymy
    A relation between words, called antonyms, of opposite meanings, such as 清潔な seiketsu na 'clean' and 汚い kitanai 'dirty'. Antonyms are probably not of interest in information retrieval.

5.2 The Semasiological Approach

Normally, the user of a dictionary starts out with a word or phrase and expects to find lexical information, such as a definition or a target language equivalent. Similarly, the user of a search engine starts out with a search term (keyword, phrase or Boolean expression) and expects to find cyberinformation, such as webpages, online databases and newsgroups relevant to the search term.

It is important to note that such a search operation has a well-defined direction: word-to-concept (lexeme-to-sense) or, in a search engine environment, keyword-to-cyberinformation. In lexicography, this way of searching is referred to as the semasiological approach. Clearly, this approach is based on the assumption that all the user wants is information including the specific search term provided in the search box.

As any search engine user knows, this is often not the case. Let us assume that a user wants to search for information on Kennedy's assassination. In Alta Vista, she might enter the string "+Kennedy +assassination." But surely this query will not retrieve such phrases as:

  1. "Kennedy was killed on ..."
  2. "The murder of Kennedy was ..."
  3. "JFK had to be eliminated because ..."

To locate such phrases with conventional search engines, the user must resort to the laborious task of building advanced Boolean queries, then spend much time wading through often irrelevant results.


5.3 The Onomasiological Approach

From the point of view of the user interested in the semantic content of the search results, rather than in their orthographic representation, the semasiological approach is clearly inadequate. When such a user searches for the keyword "Kennedy," surely she is interested in the referent represented by "John Kennedy", "JFK," or "President Kennedy", not just in the lexicalized manifestation of any particular synonym. Similarly, when searching for "assassination," surely the user is interested in finding information on the concept [cause to die], not just in finding any particular phrase such as "the murder of", "was killed by" and "The killing of."

The opposite of the semasiological approach is the onomasiological approach, which reverses the normal semantic paradigm (also known as the onomantic perspective). There is a long tradition of lexicographic works based on this approach, the best-known examples of which are thesauri and synonym dictionaries. These works make it possible to reverse the normal search direction; that is, instead of from word-to-concept, the user can search from concept-to-word.


5.4 Intelligent Synonym Searching

Though the usefulness of the onomasiological approach to dictionary consultation is indisputable, it has not yet become established in search engine information retrieval. The search strategy proposed here, based on the onomasiological approach, is called synonym expansion or cross-synonym searching. In a sense, the thematic search and topic search technologies currently implemented in web subject directories are also based on the onomasiological approach. But, as any search engine user knows, wading through multilevel hierarchies of subject directories is a time-consuming strategy that is too inefficient to be practical.

How does cross-synonym searching work? Obviously, the user still has to enter a search term, consisting of keywords, but with an important difference: the user need not be overly concerned with the specific wording of the query. A query consisting of any expression like "+kill +Kennedy", "JFK's assassination", or "The murder of John Kennedy" is expanded into the full set of synonyms and leads to the same or very similar search results.

To implement such technology, a comprehensive database of synonyms is required. A typical (partial) entry in such a database might look like this:


Concept: [to cause to die]
English            Reading           Japanese
to kill            korosu            殺す
to commit murder   satsujin o okasu  殺人を犯す
to execute         shokei suru       処刑する
to murder          satsugai suru     殺害する
to shoot to death  shasatsu suru     射殺する
to assassinate     ansatsu suru      暗殺する
to bump off        yaru              やる, 殺る
to butcher         barasu            ばらす

Semantically-classified databases like the above are useful not only for cross-synonym searching, but also in such increasingly important web technologies as the automated categorization of web resources and automatic query expansion (AQE). For cross-synonym searching to be truly effective, it should be combined with cross-orthographic searching and some of the other retrieval technologies described in this paper, as well as such technologies as query expansion with relevance feedback.
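
As a rough illustration of synonym expansion, the concept entry above can be treated as a lookup table keyed by concept, with a reverse index from each synonym back to its concept. The sketch below is only a minimal model under that assumption; the names CONCEPTS, TERM_TO_CONCEPT, expand_query and to_boolean_or are hypothetical and do not describe any actual database format.

```python
# Hypothetical concept-to-synonym table, seeded with the entry shown above.
CONCEPTS = {
    "to cause to die": [
        "殺す", "殺人を犯す", "処刑する", "殺害する",
        "射殺する", "暗殺する", "やる", "殺る", "ばらす",
    ],
}

# Reverse index: every synonym points back to its concept.
TERM_TO_CONCEPT = {
    term: concept
    for concept, terms in CONCEPTS.items()
    for term in terms
}


def expand_query(term):
    """Return every synonym that shares a concept with the given term."""
    concept = TERM_TO_CONCEPT.get(term)
    if concept is None:
        return [term]  # unknown term: search for it unchanged
    return list(CONCEPTS[concept])


def to_boolean_or(terms):
    """Render an expanded synonym set as a Boolean OR query string."""
    return " OR ".join(terms)


if __name__ == "__main__":
    print(to_boolean_or(expand_query("暗殺する")))
```

With such a table, a query for 暗殺する and a query for 殺す expand to the same set. A colloquial item such as やる has many other senses, so a practical system would presumably need to weight or filter the expanded terms rather than treat them all equally.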


6. Cross-Language Searching

Non-Japanese users, such as learners, and even native speakers, can greatly benefit from English-Japanese cross-language searching; that is, inputting an English query to retrieve webpages that include the equivalent word(s) in Japanese, as shown in the table below:


Cross-Language Searching
Search Term       Search Results   Reading
Japanese economy  日本(の)経済     Nihon (no) keizai
Tokyo             東京             Tookyoo
happy             幸福             koofuku
                  幸せ             shiawase
NEC               日本電気         Nihon Denki
                  NEC              en-ii-shii

Cross-language searching has the additional benefit of enabling users without a Japanese input method editor (IME) to retrieve Japanese webpages. This is especially useful when searching for katakana words from the corresponding English words. Since Japanese has countless katakana loanwords derived from English, many of which are of variable orthography, even users with a Japanese IME and native speakers may find it more convenient to input English keywords and have the search engine retrieve all katakana and Latin alphabet variants, as shown in the table below:


English to Katakana Conversion
Search keyword  Search results
computer        コンピュータ
                コンピューター
WWW             ワールドワイドウェブ
                ウェブ
                WWW
Diesel          ディーゼル
                ジーゼル

Cross-language searching, also known as cross-language information retrieval (CLIR), is a new research area that is becoming increasingly important as the World Wide Web undergoes rapid internationalization. The technical details of this are discussed in an article by Douglas W. Oard. Here, we will only mention that such technology requires access to a comprehensive English-Japanese lexicon designed to meet the needs of the search engine environment.
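
The lookup step behind the two tables above can be sketched as follows. This is a minimal illustration, assuming a tiny in-memory lexicon and plain substring matching; EN_JA_LEXICON, cross_language_terms and find_matching_documents are hypothetical names, and a real system would need a full lexicon together with proper segmentation and indexing.

```python
# Hypothetical English-to-Japanese lexicon built from the tables above.
EN_JA_LEXICON = {
    "tokyo": ["東京"],
    "happy": ["幸福", "幸せ"],
    "nec": ["日本電気", "NEC"],
    "computer": ["コンピュータ", "コンピューター"],
    "www": ["ワールドワイドウェブ", "ウェブ", "WWW"],
    "diesel": ["ディーゼル", "ジーゼル"],
}


def cross_language_terms(english_keyword):
    """Look up all Japanese renderings of an English keyword."""
    key = english_keyword.strip().lower()  # case-insensitive lookup
    return EN_JA_LEXICON.get(key, [])


def find_matching_documents(english_keyword, documents):
    """Return the documents that contain any Japanese equivalent."""
    variants = cross_language_terms(english_keyword)
    return [doc for doc in documents if any(v in doc for v in variants)]


if __name__ == "__main__":
    docs = ["新しいコンピューターを買った", "コンピュータの歴史", "東京は大きい"]
    print(find_matching_documents("Computer", docs))
```

Because the lexicon lists every orthographic variant of a loanword, a single English keyword retrieves pages written with either コンピュータ or コンピューター.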


7. Morphological Analysis

This section briefly describes morphological analysis, an essential component of any Japanese search engine, and miscellaneous search technologies not covered in other sections.

7.1 Morphological and Lexemic Analysis

To perform query processing and search and retrieval operations, a Japanese search engine must be capable of processing a Japanese text on two levels: morphological and lexemic. Morphological analysis refers to computational procedures such as stemming and conflation that operate on the morphemic level (described below). The more difficult lexemic analysis refers to identifying word boundaries by segmenting a text stream into meaningful semantic units (such as lexemes) for dictionary lookup and indexing purposes.

Segmentation and morphological analysis are central to Japanese search engine technology, and each deserves a paper in its own right. Below, we will briefly describe some of the issues.
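
To make the segmentation problem concrete, here is a minimal sketch of forward longest-match segmentation, a simple and widely known baseline. It is not the method proposed in this paper; the toy LEXICON and the segment function are hypothetical, and a production segmenter would rely on a large lexical database and on grammatical information.

```python
# Hypothetical toy lexicon; a real system would use a large lexical database.
LEXICON = {"東京", "東京都", "都", "に", "住む", "京都"}
MAX_WORD_LENGTH = max(len(word) for word in LEXICON)


def segment(text):
    """Split text into lexicon words using forward longest match."""
    tokens = []
    position = 0
    while position < len(text):
        longest = min(MAX_WORD_LENGTH, len(text) - position)
        # Try the longest candidate first, then shrink one character at a time.
        for length in range(longest, 0, -1):
            candidate = text[position:position + length]
            if candidate in LEXICON or length == 1:
                # Unknown single characters pass through unchanged.
                tokens.append(candidate)
                position += length
                break
    return tokens


if __name__ == "__main__":
    print(segment("東京都に住む"))
```

For 東京都に住む this yields 東京都, に and 住む rather than 東 followed by 京都. Longest match gets this case right, but it has no way of resolving genuinely ambiguous strings, which is one reason lexical and grammatical knowledge is needed.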

7.2 Conflation and Stemming

The Japanese language is agglutinative; that is, it forms words by putting together basic elements, called morphemes, that retain their original forms and meanings with little change during the combination process. Inflection in Japanese typically consists of adding conjugational endings to a stem to indicate various grammatical functions, such as tense. The resulting word is another word-form of the underlying lexeme, not a new word in itself, as shown in the table below (only basic forms are given):


Conjugation Paradigm for 書く kaku 'to write'
Category          Affirmative   Reading      Negative          Reading
Non-past          書く          kaku         書かない          kakanai
Non-past polite   書きます      kakimasu     書きません        kakimasen
Past              書いた        kaita        書かなかった      kakanakatta
Past polite       書きました    kakimashita  書きませんでした  kakimasen deshita
Gerund            書いて        kaite        書かないで        kakanaide
Continuative      書き          kaki
Conditional       書けば        kakeba       書かなければ      kakanakereba
Imperative        書け          kake         書くな            kaku na
Tentative         書こう        kakoo
Tentative polite  書きましょう  kakimashoo

A major issue in the indexing and retrieval of Japanese texts is the extensive morphological variation in word-forms. A Japanese search engine must not only be capable of segmenting the search term into meaningful semantic units (such as lexemes and multi-word terms), but must also be capable of ignoring morphological variants like those shown in the above table.

A computational procedure designed to match morphological variants by reducing them to a single form for retrieval purposes is called a conflation algorithm. A procedure for processing a word, often by removing the inflectional endings to find its stem, is called stemming. Conflation and stemming make it possible to retrieve any inflectional form from any of the others, ensuring that potentially relevant documents are not lost.

Because of the morphological complexity of Japanese, it goes without saying that an intelligent Japanese search engine must be capable of both conflation and stemming, since they are prerequisites for implementing the various search technologies described in this paper, such as cross-orthographic and cross-language searching.
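
A conflation step for the 書く paradigm above might be sketched as suffix stripping checked against a lexicon. The example is a deliberately narrow illustration: it assumes a single conjugation class (verbs whose dictionary form ends in く), the ending list is copied from the table, and KU_VERB_ENDINGS, KNOWN_LEMMAS and conflate are hypothetical names. A real conflation algorithm needs rules for every conjugation class and for irregular verbs.

```python
# Endings taken from the 書く paradigm above, for verbs ending in く.
# Hypothetical and incomplete: one table per conjugation class would be needed.
KU_VERB_ENDINGS = [
    "く", "かない", "きます", "きません", "いた", "かなかった",
    "きました", "きませんでした", "いて", "かないで", "き",
    "けば", "かなければ", "け", "くな", "こう", "きましょう",
]
# Longest endings first, so that きませんでした is tried before き.
KU_VERB_ENDINGS.sort(key=len, reverse=True)

KNOWN_LEMMAS = {"書く"}  # hypothetical one-word lexicon


def conflate(word_form):
    """Reduce an inflected form to its dictionary form, if recognized."""
    for ending in KU_VERB_ENDINGS:
        if word_form.endswith(ending):
            lemma = word_form[: len(word_form) - len(ending)] + "く"
            if lemma in KNOWN_LEMMAS:
                return lemma
    return word_form  # not recognized: leave unchanged


if __name__ == "__main__":
    for form in ["書きませんでした", "書いた", "書こう", "聞いた"]:
        print(form, "->", conflate(form))
```

The lexicon check matters: without it, any word that happens to end in き or け would be rewritten. Here 聞いた is left unchanged only because 聞く is absent from the toy lexicon, which illustrates how closely conflation depends on lexical data.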


7.3 Regular Expressions

Adding double-byte-enabled regular expression functionality to Japanese search engines will provide users with a tool for highly flexible searching far more powerful than Boolean expressions. A detailed treatment of regular expressions is outside the scope of this paper. A full analysis can be found in an excellent book on the subject, Mastering Regular Expressions by Jeffrey Friedl.

Below are a few examples of some metacharacters used in regular expressions (based on the POSIX standard):


Some Metacharacters in Regular Expressions
Metacharacter  Description
.              any single character
[ ]            any one of the characters in the brackets
[^]            any one character except those after the caret
^              beginning of line
$              end of line
\<             beginning of word
\>             end of word
\t             tab character
\n             newline character
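
The following sketch shows how a few of these metacharacters behave on Japanese text, using Python's standard re module as one convenient engine. The patterns and sample text are illustrative assumptions only. Python strings are sequences of Unicode characters, so no special double-byte handling is needed, and the re module writes word boundaries as \b rather than \< and \>.

```python
import re

# コンピュータ with an optional trailing long-vowel mark (ー).
computer = re.compile(r"コンピューター?")

# A run of one or more katakana characters, including the long-vowel mark.
katakana_run = re.compile(r"[ァ-ヶー]+")

# A line that begins with 書 followed by any single character.
starts_with_kaku = re.compile(r"^書.", re.MULTILINE)

text = "コンピュータとコンピューターは同じ語です。\n書く人"

if __name__ == "__main__":
    print(computer.findall(text))
    print(katakana_run.findall("ウェブとディーゼル"))
    print(starts_with_kaku.findall(text))
```

A single pattern such as コンピューター? covers both katakana spellings of "computer", which hints at how regular expressions could complement the variant databases discussed in this paper.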


7.4 Miscellaneous Search Techniques

This paper covers the major issues in building an intelligent Japanese search engine, but is by no means exhaustive. There are various other possibilities, such as:

  1. Loanword conversion. Conversion of katakana loanwords to native words; for example, the search term ドリンク dorinku 'drink' can be used to retrieve the corresponding kun and on equivalents 飲み物 nomimono and 飲料 inryoo. This is a limited implementation of cross-synonym searching.
  2. Lexeme-based retrieval. Perform lexeme-based, rather than word-based, retrieval. For example, in searching for the keyword "high school", exclude webpages in which "high" is separated from "school" since these are unrelated to the lexeme 'high school'.
  3. Character normalization. The relatively trivial normalization of character types, such as ignoring half-width/full-width differences (e.g. OPEN/ＯＰＥＮ) and various symbols and punctuation marks.
  4. Syntactic phrases. Retrieving syntactic phrases, such as 研究をする kenkyuu o suru 'to conduct research', from their lexemic equivalents (研究する), and vice versa.
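
Of the techniques listed above, character normalization is the easiest to illustrate. One possible approach, assumed here purely for illustration, is Unicode compatibility normalization (NFKC), available in Python's standard unicodedata module; the function name normalize_width is hypothetical.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width Latin letters and half-width katakana to standard forms."""
    # NFKC replaces compatibility characters with their standard equivalents.
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    print(normalize_width("ＯＰＥＮ"))
    print(normalize_width("ｺﾝﾋﾟｭｰﾀ"))
```

NFKC also folds many other compatibility characters, such as circled digits and enclosed ideographs, so a search engine would need to decide whether that broader folding is wanted or whether a narrower custom mapping is preferable.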

8. Lexical Databases

Because of the morphological complexity and highly irregular orthography of Japanese, developing the advanced retrieval technologies required for intelligent Japanese searching cannot be based on algorithmic and statistical methods alone. To be effective, such methods must be supplemented by large-scale, up-to-date lexical databases designed to meet the specific needs of search engine applications.

The CJK Dictionary Institute (CJKI), which specializes in CJK computational lexicography, is engaged in the continuous expansion of a comprehensive CJK lexical database called DESK (more information below). Currently, DESK has over two million Japanese and one million Chinese entries, and includes a rich set of grammatical and semantic attributes required for developing information retrieval applications, input method editors, and electronic dictionaries.

Below is a brief description of the principal database components useful for developing intelligent Japanese search engines:

  1. General Vocabulary. A comprehensive database of about 450,000 entries covering general vocabulary. The rich set of grammatical attributes is fine-tuned to support search engine applications, especially morphological analyzers and word segmenters.
  2. Technical Terminology. A comprehensive Japanese-English-Japanese database of over 320,000 entries covering a broad spectrum of fields ranging from computer science to biotechnology.
  3. Katakana Loanwords. About 50,000 loanwords and other Japanese words written in katakana, with special focus on computer and Internet terminology.
  4. Japanese Names. About 600,000 Japanese (and Chinese) personal and place names semantically classified and ranked by frequency.
  5. Western Names. An English-Japanese database of about 60,000 non-Japanese personal and place names, semantically classified and accompanied by English equivalents.
  6. Japanese Companies. About 600,000 Japanese company and organization names ranked by frequency with English equivalents when appropriate.
  7. Orthographic Variants. A database of about 60,000 orthographic variants, with full coverage of okurigana, kanji, and kana variants, designed to support cross-script and cross-orthographic searching.
  8. Homophone Groups. A database of semantically classified homophone groups designed to support cross-homophone searching.
  9. Homograph Groups. A database of about 34,000 homographs designed to support homograph disambiguation.
  10. Synonym Groups. A database of semantically classified synonym groups consisting of kanji synonyms, homonyms and meronyms serving as a basis for a Japanese thesaurus designed to support cross-synonym searching.
  11. English-Japanese Dictionary. An English-Japanese lexical database of over 100,000 entries covering general vocabulary and important proper names. This can be expanded to cover Western names and technical terms.
  12. Kanji-English Dictionary. A Kanji-English database that includes the comprehensive features of New Japanese-English Character Dictionary, which has become a standard reference work in Japanese education circles.
  13. Kanji Database. A single-character database that covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes.
  14. Orthographic Variation Rules. A comprehensive collection of rules of katakana, hiragana, and kanji orthographic variation that can be used to generate variants not listed in the database.

9. Conclusions

As we have seen, cross-orthographic searching is essential for intelligent Japanese information retrieval and query processing. However, the current lineup of first- and second-generation Japanese search engines is incapable of cross-orthographic searching, not to speak of the other, more advanced, retrieval technologies discussed in this paper. We have also seen that, because of the complexities and irregularities of the Japanese writing system, the implementation of intelligent retrieval technologies requires not only computational linguistic tools such as morphological analyzers, but also lexical databases fine-tuned to the needs of Japanese search engines.

Below is an outline of our vision for the future directions of third-generation Japanese search engine technology. The minimum requirements for what we shall refer to as a Level 1 Intelligent Japanese Search Engine (IJSE) are as follows:

  1. A linguistically sophisticated Japanese Morphological Analyzer capable of segmenting a Japanese text stream into meaningful units such as lexemes.
  2. Stemming and conflation technology for canonicalization and dictionary lookup.
  3. A comprehensive database and algorithms to support cross-orthographic searching. This could include some or all of the following variation types:

    1. Okurigana variants
    2. Cross-script variation
    3. Kanji variants
    4. Phonetic substitutes
    5. Katakana variants
    6. Hiragana variants

The requirements for what we shall refer to as a Level 2 IJSE, in addition to those for a Level 1 IJSE, are as follows:

  1. Support for regular expressions.
  2. Databases and algorithms to support cross-homophone searching.
  3. Databases and algorithms to support cross-synonym searching.
  4. Databases and algorithms to support cross-language searching.

Going one step further in the quest for the ideal Japanese search engine, here are some suggested directions for what we shall refer to as a Level 3 IJSE:

  1. A Voice User Interface (VUI) is especially useful for native Japanese users, the majority of whom have little or no keyboard experience.
  2. Databases and algorithms to support homograph disambiguation.
  3. Japanese to/from English web translation tools seamlessly integrated into the user interface.

The lack of sophisticated tools to cope with the complexities of the Japanese script places users of Japanese search engines and major portals such as e-commerce sites at a distinct disadvantage. The information retrieval industry in general, and the search engine industry in particular, is in urgent need of third-generation retrieval technology capable of meeting the challenges of intelligent Japanese searching.

The CJK Dictionary Institute finds itself in a unique position to provide the comprehensive, high quality lexical resources and the software infrastructure required for building intelligent Japanese information retrieval technology.

About the Author

JACK HALPERN     春遍雀來     (ハルペン・ジャック)

President, The CJK Dictionary Institute
Editor-in-Chief, Kanji Dictionary Publishing Society
Research Fellow, Showa Women’s University

Born in Germany in 1946, Jack Halpern lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he compiled the New Japanese-English Character Dictionary for sixteen years. He is a professional lexicographer/writer and lectures widely on Japanese culture, is winner of first prize in the International Speech Contest in Japanese, and is founder of the International Unicycling Federation.

Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK). He has also compiled the world’s first Unicode dictionary of CJK characters.

List of Publications

Following is a list of the author’s principal publications in the field of CJK lexicography.

The CJK Dictionary Institute

The CJK Dictionary Institute (CJKI) consists of a small group of researchers that specialize in CJK lexicography. The society is headed by Jack Halpern, editor-in-chief of the New Japanese-English Character Dictionary, which has become a standard reference work for studying Japanese.

The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:

  1. Dozens of lexicographic works, including electronic dictionaries.
  2. Search engine applications, such as morphological analyzers and simplified to/from traditional Chinese conversion systems.
  3. CJK input method editors (IME) and front-end processors (FEP).
  4. Machine translation, online translation tools and speech technology software.
  5. Pedagogical, linguistic and computational lexicography research.

DESK currently has over two million Japanese and about one million Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.

The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.


President
Jack Halpern
The CJK Dictionary Institute, Inc.
日中韓辭典研究所

34-14, 2-chome, Tohoku
Niiza-shi, Saitama 352-0001 JAPAN
Phone: +81-48-473-3508
Fax: +81-48-486-5032
http://www.cjk.org