The Challenges of Intelligent Japanese Searching
知的日本語検索の諸課題

Jack Halpern
President
©2004-2024 The CJK Dictionary Institute, Inc.

Index to This Document

Introduction
Cross-orthographic searching
Cross-homophone searching
Homograph disambiguation
Cross-synonym searching
Cross-language searching
Morphological Analysis
Lexical databases
Conclusions

About the Author
List of Publications
The CJK Dictionary Institute

Abstract

The Japanese language, which is written in a mixture of four scripts, is said to have the most complex writing system in the world. Such factors as the lack of a standard orthography, the presence of numerous orthographic variants, and the morphological complexity of the language pose formidable challenges to the building of an intelligent Japanese search engine. This paper describes the linguistic issues that need to be addressed by advanced information retrieval technologies such as cross-language, cross-script, cross-orthographic, and cross-synonym searching, and demonstrates that lexical databases must play a central role in their implementation.

1. Introduction

1.1 Brief Outline of Japanese Writing System

It is often said that Japanese has the most complex writing system in the world. As we shall see below, this claim is fully justified. Contemporary Japanese is written in a mixture of four scripts, each of which has a distinct function.

Thousands of logographic characters, called 漢字 kanji, derived from Chinese.
A native syllabic script called 平仮名 hiragana.
Another native syllabic script called 片仮名 katakana.
Recently the Latin alphabet, called ローマ字 roomaji, has become increasingly common.

Kanji are used to write the core of the Japanese vocabulary. This includes words of Chinese origin, words coined in Japan on the Chinese model, such as 山脈 sanmyaku 'mountain range', as well as native Japanese words, such as 山 yama 'mountain'. Kanji have three basic properties: form, sound, and meaning. Each character may be pronounced according to several distinct pronunciations, called readings. A character may have one or several Chinese derived on readings, or one or several (sometimes dozens) native Japanese kun readings, and each reading may have numerous meanings associated with it.

Hiragana is used mostly to write grammatical elements, such as inflectional verb endings, and sometimes for writing native Japanese words. For example, in 見た mita the kanji 見 represents the stem of the verb 見る miru 'see' and た ta is a verb ending for forming the past tense. The hiragana endings attached to a kanji stem are called 送り仮名 okurigana.

Katakana is used mostly to write Western loanwords, such as プリンター purintaa 'printer', and onomatopoeic words, such as カチッと kachitto 'with a click'. The Latin alphabet is used for writing acronyms, for some loanwords instead of katakana, and for stylistic effects, especially in the names of shops and magazines.

A running Japanese text normally consists of a mixture of kanji and kana, as shown below:

漢字を組み合わせることによって多数の熟語が作り出せます。
Kanji o kumiawaseru koto ni yotte tasuu no jukugo ga tsukuridasemasu.
Numerous compound words can be formed by combining Chinese characters.

In the above sentence, case particles such as を o (object marker), as well as verb endings (-わせる -waseru in 組み合わせる kumiawaseru 'combine'), are written in hiragana, whereas nouns, such as 熟語 jukugo 'compound word', are written in kanji.
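The script mixture in the sample sentence can be made visible with a small sketch that groups consecutive characters by script. The function names and the coarse Unicode block ranges used here are illustrative assumptions, not part of any particular search engine; punctuation and rarer kanji outside the basic CJK block simply fall into "other".

```python
def script_of(ch: str) -> str:
    """Classify one character by Unicode block (coarse, illustrative ranges)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "latin"
    return "other"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters of the same script into (script, run) pairs."""
    runs: list[tuple[str, str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and runs[-1][0] == script:
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


sentence = "漢字を組み合わせることによって多数の熟語が作り出せます。"
for script, run in script_runs(sentence):
    print(script, run)
```

Note that script runs are not words: 組み合わせる is a single verb, yet it is split into several kanji and hiragana runs, which is one reason why segmentation requires more than script boundaries.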

A fuller description of the Japanese writing system can be found in the front matter of the author's New Japanese-English Character Dictionary.

1.2 Intelligent Japanese Searching

Several factors contribute to the difficulties of Japanese information retrieval and query processing. To build a truly sophisticated, "intelligent" Japanese search engine, various challenges must be overcome. Here are some of the major issues:

The lack of a standard, universally accepted, orthography; that is, the presence of a large number of orthographic variants and easily confused homophones.
The morphological structure of the language, which poses a formidable challenge to the development of accurate segmentation and conflation technologies.
The need to support advanced linguistic technologies such as cross-orthographic searching.
Miscellaneous technical requirements such as transcoding between multiple character sets and encodings, support for Unicode, and input method editors.

Each of the above is a major issue that deserves a paper in its own right. In this paper, we will focus on one of the central linguistic issues; that is,

The lack of a standard, universally accepted, orthography.

To our knowledge, few if any search engines have addressed orthographic variation, nor any of the other linguistic issues described in this paper. Let us take a very quick look at the current state of search engine technology.

The earliest search engines, such as AltaVista, Yahoo!, and Excite, are often referred to as first-generation search engines. Such an engine searches its index for the search term entered by the user, then generates a page with a list of relevant, and often irrelevant, links ranked by frequency of occurrence of the search term.

Second-generation search engines, such as Direct Hit and Northern Light, take a more intelligent approach to ensure relevancy by ordering the search results by various criteria such as popularity, semantic categories, link frequency, and page ranks. An excellent example of this is Google, which ranks results by the number of links from pages which themselves have a high rank.

None of these engines, including the very few that claim to be third generation, supports more than a bare minimum of computational linguistic features. Here, we will define the direction of third-generation search engines by focusing on the future of Japanese search and retrieval technology, especially such advanced linguistic technologies as cross-language, cross-script, cross-orthographic, and cross-synonym searching (described below).

1.3 The Big Picture

Superficially, it would seem as if search engines need only search for the actual keywords provided by the user. In fact, from personal discussions with the executives of several leading search engine companies, it is clear that they deliberately follow a policy of "not casting a wide net", so that searching for "travelled to Britain" will not match "traveled to Britain", not to speak of "travel to the U.K."

Ostensibly, the justification for such a policy is to prevent flooding the user with irrelevant results. The real reason, no doubt, is that they do not possess the technology for linguistically sophisticated searching. What such a policy often does achieve is the proverbial "throwing the baby out with the bathwater", since many relevant results are indiscriminately ignored along with the irrelevant ones.

Now let us step back and have a closer look at the big picture, strictly from the user's point of view. That is, let us pose the most relevant question of all:

What, exactly, is it that a search engine user really wants?

In this paper we will demonstrate that, especially in the case of Japanese, the user is far more interested in the inherent meaning (semantic content) represented by the search term, rather than in the accidental form (written representation) of any of its orthographic variants.

In many of the major languages of the world, orthographic variation is not a major issue, since their orthographies tend to be stable. Though English is notorious for its spelling irregularities, spelling variants (such as 'judgement' vs. 'judgment', 'wordprocessor' vs. 'word processor') are more of an annoyance than a major obstacle. For the most part, users can expect the orthographic representation of the search term to have little or no variation.

Not so with Japanese. Japanese orthography is so highly irregular that it can be considered, without the slightest fear of being accused of hyperbole, to be a couple of orders of magnitude more complex and more irregular than any other major language, Chinese included (Simplified Chinese has a remarkably stable orthography).

Should the user of a Japanese search engine be required to be intimately familiar with these complexities? Obviously not. Ideally, users should only be concerned with quickly finding the information they are seeking, not with the intricacies of the Japanese writing system.

That, in a nutshell, is where the real power of an "intelligent" Japanese search engine comes in. It relieves the user of the burden of dealing with the details of how the search term should be written, and lets her focus on the real issue at hand: defining the content of the information to be retrieved.

2. Cross-Orthographic Searching

This section presents a brief overview of Japanese orthographic variation, focusing on those issues most relevant to information retrieval. The highly irregular Japanese orthography is a major obstacle to efficient searching. Our aim is to describe the complexities of the Japanese orthography, and to demonstrate that the intelligent search engine should be capable of retrieving all the orthographic variants of the search term; in other words, of performing cross-orthographic searching.

2.1 Orthographic Chaos?

The Japanese orthography is highly unstable, bordering on the chaotic. A major factor that contributes to this state of affairs is the complex interaction of the four scripts used to write Japanese, resulting in countless words that can be written in a variety of often unpredictable ways.

Study the table below, which shows the orthographic variants of the words 取り扱い toriatsukai 'handling' and 当たり外れ atarihazure 'hit or miss'.

Some Orthographical Variants
toriatsukai	atarihazure	Type of variant
取り扱い	当たり外れ	"standard" form
取扱い	当り外れ	okurigana variant
	当外れ	okurigana variant
取扱	当外	all kanji
とり扱い	当たりはずれ	replace kanji with hiragana
取りあつかい	あたり外れ	replace kanji with hiragana
とりあつかい	あたりはずれ	all hiragana

It is important to note that these variants are not contrived examples for the sake of illustration. All the above forms do occur in contemporary Japanese, though some are less frequent than others. Even in carefully edited publications, not to speak of sloppily written webpages, there is no reliable way to predict which particular variant will occur, as this often depends on the whim of the author or editorial policy.

How does such orthographic variation affect the search engine user? Let us examine this issue in a bit more depth, looking at it entirely from the user's point of view. Let us say that a user is looking for a novel called Hi no sasanai yashiki (A Mansion with no Sunshine). Here are twelve legitimate ways (some more likely than others) of writing this phrase.

日の差さない屋敷
日の射さない屋敷
日のささない屋敷
日の射さない邸
日の差さない邸
日のささない邸
陽の射さない屋敷
陽の差さない屋敷
陽のささない屋敷
陽の射さない邸
陽の差さない邸
陽のささない邸

We did a quick survey of six native Japanese speakers, some of whom are professional translators and writers, asking them how they would write the above phrase. Surprisingly (or not surprisingly), we received six different answers, none of which matched the "standard" form found in dictionaries (#1 above). Clearly, users of Japanese search engines cannot possibly be expected to know which specific variant is used in the official title of the book.

Now let us say that the user is searching for the Japanese equivalent of the proverbial "A hen that lays golden eggs." Theoretically, the "standard" way of writing this is:

金の卵を産む鶏
Kin no tamago wo umu niwatori

In reality, three of the keywords above have the following common variants:

English	Reading	"Standard" Form	Variant 1	Variant 2	Variant 3
egg	tamago	卵	玉子	たまご	タマゴ
hen	niwatori	鶏	にわとり	ニワトリ
to lay	umu	産む	生む

Combining the permutations for the above three words yields 24 possible ways of writing the original phrase, as shown in the table below:

"Kin no tamago wo umu niwatori"
'A hen that lays golden eggs'
鶏	にわとり	ニワトリ
1. 金の卵を産む鶏	9. 金の卵を産むにわとり	17. 金の卵を産むニワトリ
2. 金の卵を生む鶏	10. 金の卵を生むにわとり	18. 金の卵を生むニワトリ
3. 金の玉子を産む鶏	11. 金の玉子を産むにわとり	19. 金の玉子を産むニワトリ
4. 金の玉子を生む鶏	12. 金の玉子を生むにわとり	20. 金の玉子を生むニワトリ
5. 金のたまごを産む鶏	13. 金のたまごを産むにわとり	21. 金のたまごを産むニワトリ
6. 金のたまごを生む鶏	14. 金のたまごを生むにわとり	22. 金のたまごを生むニワトリ
7. 金のタマゴを産む鶏	15. 金のタマゴを産むにわとり	23. 金のタマゴを産むニワトリ
8. 金のタマゴを生む鶏	16. 金のタマゴを生むにわとり	24. 金のタマゴを生むニワトリ

Again, rest assured that this is not a contrived example designed to make a point. Not only is each of the variants for tamago, niwatori and umu of frequent occurrence in written Japanese -- they actually occur within the phrase in question, as can be easily verified by searching with your favorite search engine. Clearly, the user has no hope of finding all these variants unless the search engine can perform sophisticated cross-orthographic searching, as discussed below.
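The arithmetic behind the 24 forms (3 variants of niwatori × 4 of tamago × 2 of umu) can be reproduced with a Cartesian product. This is only a sketch: the variant lists are copied from the table above and hard-coded for illustration, whereas a real system would presumably draw them from a lexical database.

```python
from itertools import product

# Variant sets copied from the table above (illustrative, hard-coded).
TAMAGO = ["卵", "玉子", "たまご", "タマゴ"]
UMU = ["産む", "生む"]
NIWATORI = ["鶏", "にわとり", "ニワトリ"]


def expand_phrase() -> list[str]:
    """Generate every written form of 'kin no tamago wo umu niwatori'."""
    # niwatori varies slowest and umu fastest, matching the table's numbering.
    return [
        f"金の{egg}を{lay}{hen}"
        for hen, egg, lay in product(NIWATORI, TAMAGO, UMU)
    ]


variants = expand_phrase()
print(len(variants))
for number, form in enumerate(variants, start=1):
    print(number, form)
```

Because the count is a product, each additional keyword with variants multiplies the number of forms, which is why asking the user to enumerate them by hand is unrealistic.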

2.2 Okurigana Variants

A major factor that exacerbates the difficulties of Japanese information retrieval is the orthographic variation that occurs in kana endings, called 送り仮名 okurigana, that are attached to a kanji base or stem, as shown in the table below:

Okurigana variants are very common. Though the Japanese government publishes guidelines, actual usage is unpredictable and depends on editorial policy or personal preference. The intelligent Japanese search engine should be able to retrieve all such variants, regardless of how the search term is input.

Okurigana Variants
English	Reading	"Standard" Form	Variant 1	Variant 2	Variant 3
publish	kakiarawasu	書き表す	書き表わす	書表わす	書表す
perform	okonau	行う	行なう
Tokyo-bound	Tookyoo-yuki	東京行き	東京行
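One way to make such variants retrievable is to conflate them at both indexing and query time, so that every variant maps to the same key. The sketch below assumes a hand-built variant table (here just the three rows shown above); the names `OKURIGANA_GROUPS`, `canonical` and `matches` are hypothetical.

```python
# Variant groups taken from the table above; the first entry in each group
# is the "standard" form. A production table would be far larger.
OKURIGANA_GROUPS = [
    ["書き表す", "書き表わす", "書表わす", "書表す"],
    ["行う", "行なう"],
    ["東京行き", "東京行"],
]

# Map every form (standard or variant) to the standard form of its group.
STANDARD_FORM = {
    form: group[0] for group in OKURIGANA_GROUPS for form in group
}


def canonical(term: str) -> str:
    """Return the standard form of a term, or the term itself if unknown."""
    return STANDARD_FORM.get(term, term)


def matches(query: str, token: str) -> bool:
    """True if query and token are okurigana variants of the same word."""
    return canonical(query) == canonical(token)


print(canonical("書表す"))
print(matches("行う", "行なう"))
```

A lookup table is used rather than a rule because, as the variants of kakiarawasu show, okurigana may be both omitted and added, so the variants cannot be derived reliably from the standard form alone.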

2.3 Cross-script Searching

As explained in Section 1, Japanese is written in a mixture of four different scripts. Orthographic variation across scripts is a common occurrence, so that the same word can be written in hiragana, katakana, kanji, or even in the Latin alphabet. To compound these difficulties, the same word is sometimes written in two scripts, such as hiragana and kanji.

The table below shows all the major cross-script variation patterns that occur in Japanese:

Cross-Script Orthographic Variation
No.	English	Reading	Kanji	Hiragana	Katakana	Latin	Hybrid
1	many people	oozei	大勢	おおぜい
2	say	iu (yuu)	言う	いう
3	sulfur	ioo	硫黄		イオウ
4	cat	neko	猫	ねこ	ネコ
5	kilo	kiroguramu			キログラム	kg
6	shirt	waishatsu			ワイシャツ		Ｙシャツ
7	skin	hifu	皮膚		ヒフ		皮フ
8	comet	suisei	彗星				すい星
9	glittering	pikapika		ぴかぴか	ピカピカ
10	open	oopun			オープン	open OPEN

The above table shows that almost any combination of scripts can occur: kanji vs. hiragana, hiragana vs. katakana, Latin vs. katakana, etc. Cross-script variation is as common as it is unpredictable. From an information retrieval point of view, what is particularly irksome is the recent tendency to write many common kun words (like #2 above), and even on words (like #1 above), in hiragana instead of kanji, based on the widespread misconception that hiragana is "easier" to read.

When inputting a keyword, the user cannot be expected to be aware of these script differences. It goes without saying that the intelligent search engine should be capable of performing cross-script searching; that is, should be able to retrieve all such variants, regardless of the script the keyword is provided in.
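The purely mechanical part of cross-script searching, hiragana versus katakana and full-width versus half-width letters, can be sketched as a folding step applied to both queries and indexed text. The function name `fold_script` is an assumption for illustration; the folding relies on the fact that each katakana letter from ァ to ヶ sits exactly 0x60 code points above its hiragana counterpart in Unicode.

```python
import unicodedata

KATAKANA_START, KATAKANA_END = 0x30A1, 0x30F6  # ァ .. ヶ
KANA_OFFSET = 0x60  # katakana code point minus its hiragana counterpart


def fold_script(text: str) -> str:
    """Fold width variants (NFKC) and katakana to hiragana for comparison."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        chr(ord(ch) - KANA_OFFSET)
        if KATAKANA_START <= ord(ch) <= KATAKANA_END
        else ch
        for ch in text
    )


print(fold_script("ネコ"))        # same key as ねこ
print(fold_script("ピカピカ"))    # same key as ぴかぴか
print(fold_script("Ｙシャツ"))    # full-width Ｙ becomes ASCII Y
```

Folding of this kind covers only kana-to-kana and width differences. Relating 猫 to ねこ, kg to キログラム, or Ｙシャツ to ワイシャツ cannot be done by character arithmetic and requires a lexical database that records the readings and variants of each word.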

2.4 Kanji Variants

Though written Japanese underwent major reforms in the postwar period, resulting in the simplification and standardization of character forms, there is nevertheless a significant number of character form variants in common use, especially in proper names. Classical Japanese literature and religious texts such as the Buddhist scriptures are written almost exclusively in the traditional old forms.

Kanji Variants and Traditional Forms
English	Reading	Standard	Variant	Comment
largely	oohaba ni	大幅に	大巾に	abbreviated form
10 years old	jussai	十才	十歳	variant form
proper name	Nakajima	中島	中嶋	variant form
development	hattatsu	発達	發達	traditional form

Since the use of variant forms is not uncommon, the intelligent search engine should be able to retrieve all such forms. For searching classical texts and religious scriptures, retrieving the traditional forms based on the simplified forms is especially important.
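A minimal sketch of variant folding, using only the pairs in the table above, might distinguish character-level from word-level mappings. Which pairs are safe to fold at the character level is an assumption made here for illustration: 才 and 巾, for example, are variants of 歳 and 幅 only in particular words, so they are listed per word rather than per character.

```python
# Character-level folding: pairs treated here as variants in any context
# (an illustrative assumption based on the table above).
TRADITIONAL_TO_STANDARD = str.maketrans({"發": "発", "嶋": "島"})

# Word-level folding: pairs that are interchangeable only in specific words.
# The direction (variant -> standard) follows the table above.
WORD_VARIANTS = {"大巾に": "大幅に", "十歳": "十才"}


def fold_kanji(term: str) -> str:
    """Map a term to its standard form using word-level, then character-level, rules."""
    term = WORD_VARIANTS.get(term, term)
    return term.translate(TRADITIONAL_TO_STANDARD)


print(fold_kanji("發達"))
print(fold_kanji("中嶋"))
print(fold_kanji("大巾に"))
```

Applying the same folding to both the index and the query lets a search for the simplified form retrieve texts written in the traditional form, and vice versa.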

2.5 Phonetic Substitutes

Japanese has numerous orthographic variants based on the principle of phonetic substitution. For example, 盲 is interchangeable with 妄 in such compounds as 妄想 (=盲想) moosoo 'wild idea', but not in 盲従 moojuu 'blind obedience'. One such variant, in this case 妄, is a phonetically replaced character, and the other, in this case 盲, is a phonetic replacement character. Such characters have the same reading and are often similar in meaning.

Phonetic Substitutes
English	Reading	Phonetic Replacement	Phonetically Replaced
fermentation	hakkoo	発酵	醗酵
satire	fuushi	風刺	諷刺
linking	renkei	連係	連繋
wild idea	moosoo	盲想	妄想
abuse	ranyoo	乱用	濫用

Though the older, phonetically replaced characters are gradually going out of use, their frequency of occurrence is sufficiently high to warrant support by search engines.

2.6 Katakana Variants

The katakana syllabary is used mostly to write Western loanwords, onomatopoeic words, names of plants and animals, non-Japanese personal and place names, for emphasis, and for slang. Recent years have seen an enormous increase in katakana use, especially in technical terminology.

Unfortunately, katakana orthography is often irregular, so that the same word may be written in multiple ways. Basically, the katakana transliteration of a loanword is an attempt to approximate the pronunciation of its etymon (the foreign word from which it is derived). Although there are general guidelines for loanword orthography, in practice there is considerable variation.

Katakana variation can be classified into the following types:

The presence or absence of a macron (a dash-like symbol that indicates long vowels).
The presence or absence of nakaguro (a middle dot between katakana words).
Replacing macrons with actual vowels to indicate long vowels.
A single foreign sound may be transcribed by multiple kana characters.
Miscellaneous katakana variants.

Typology of Katakana Variation
Variation Type	English	Reading	Standard Form	Variants
1. Macron	computer	konpyuuta konpyuutaa	コンピュータ	コンピューター
	user	yuuza yuuzaa	ユーザー	ユーザ
2. Nakaguro	online	onrain	オンライン	オン・ライン
	ice cube	aisukyuubu	アイスキューブ	アイス・キューブ
3. Long vowels	eye shadow	aishadoo	アイシャドー	アイシャドウ
	maid	meedo	メード	メイド
4. Multiple kana	Diesel	diizeru jiizeru	ディーゼル	ジーゼル, ヂーゼル
	team	chiimu tiimu	チーム	ティーム
	violin	baiorin vaiorin	バイオリン	ヴァイオリン
5. Others	quota	kuootaa kwootaa	クオーター	クォーター
	Jerusalem	ierusaremu	エルサレム	イェルサレム

The above is only a brief introduction to the complexities of katakana variation, which is as common as it is unpredictable. To relieve the user from the burden of guessing the correct variant, the intelligent Japanese search engine should be capable of retrieving all katakana variants of the search term.
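The first two variation types are regular enough to be handled by a simple conflation key, sketched below under the assumption that dropping the nakaguro and a single word-final macron is acceptable for matching purposes. The function name `katakana_key` is hypothetical, and the heuristic may over-conflate in some cases; types 3 to 5 are not covered and would need a variant lexicon.

```python
NAKAGURO = "・"
MACRON = "ー"


def katakana_key(word: str) -> str:
    """Build a conflation key: drop nakaguro and one word-final macron."""
    key = word.replace(NAKAGURO, "")
    if len(key) > 1 and key.endswith(MACRON):
        key = key[:-1]
    return key


pairs = [
    ("コンピュータ", "コンピューター"),
    ("ユーザー", "ユーザ"),
    ("オンライン", "オン・ライン"),
    ("メード", "メイド"),  # type 3: not handled by this heuristic
]
for standard, variant in pairs:
    print(standard, variant, katakana_key(standard) == katakana_key(variant))
```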

2.7 Hiragana Variants

As explained in Section 1, the hiragana syllabary is used mostly to write grammatical elements and some native Japanese words, such as adverbs and particles. In recent years there has been a considerable increase in the use of hiragana, both for stylistic effects and because of the popular belief that hiragana is easier to read than kanji.

Unfortunately, both for humans and computers, hiragana strings are considerably more difficult to segment than the equivalent kanji-kana texts. Since there are no delimiters between words, identifying the lexemes in a hiragana string is often a futile exercise in disambiguation.

If we discount the okurigana irregularities and cross-script variations explained in Section 2.2 and Section 2.3, the hiragana orthography is, in itself, quite regular. Nevertheless, there are a certain number of hiragana irregularities, as explained below:

The use of traditional kana orthography for the case particles は, へ and を, instead of わ, え and お.
The use of traditional お instead of う to indicate long o in certain words.
The use of voiced ぢ and づ in place of じ and ず when the former are (1) preceded by ち and つ or (2) appear in voiced compounds derived from ち and つ. But for some words, as shown below, じ and ず are preferred. Actual usage is unpredictable.
The use of historical kana orthography in prewar texts, classical literature and Buddhist scriptures.
Miscellaneous hiragana variants, such as the use of the kana repetition symbol ゝ.

Typology of Hiragana Variation
Variation Type	English	Reading	Standard Form	Variants
1. Particles	Hello	konnichi wa	こんにちは	こんにちわ
2. Traditional	way	toori	とおり	とうり
	big	ookii	おおきい	おうきい
3. ぢ and づ	to shrink	chijimu	ちぢむ	ちじむ
	to continue	tsuzuku	つづく	つずく
	nosebleed	hanaji	はなぢ	はなじ
	to nod	unazuku	うなずく	うなづく
4. Historical	to use	mochiiru	用いる	用ゐる
	long oo	koo	こう	かう, かふ
	smell	nioi	におい (匂い)	にほひ
5. Others	here	koko	ここ	こゝ

Although hiragana variation is a relatively minor issue, the intelligent Japanese search engine should be capable of retrieving hiragana variants, regardless of how the keyword is input.
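As a rough sketch, hiragana variants can be normalized in two steps: expanding the repetition symbol ゝ mechanically, then looking up the remaining variants in a table. The table below contains only the modern variants listed above, and the names `HIRAGANA_VARIANTS` and `normalize_hiragana` are assumptions for illustration; historical kana orthography is not handled.

```python
ITERATION_MARK = "ゝ"  # repeats the preceding kana

# Variant -> standard form, taken from the table above (illustrative subset).
HIRAGANA_VARIANTS = {
    "こんにちわ": "こんにちは",
    "とうり": "とおり",
    "おうきい": "おおきい",
    "ちじむ": "ちぢむ",
    "つずく": "つづく",
    "はなじ": "はなぢ",
    "うなづく": "うなずく",
}


def expand_iteration_marks(text: str) -> str:
    """Replace each ゝ with a copy of the character before it."""
    out: list[str] = []
    for ch in text:
        if ch == ITERATION_MARK and out:
            out.append(out[-1])
        else:
            out.append(ch)
    return "".join(out)


def normalize_hiragana(word: str) -> str:
    """Expand repetition symbols, then map known variants to the standard form."""
    word = expand_iteration_marks(word)
    return HIRAGANA_VARIANTS.get(word, word)


print(normalize_hiragana("こゝ"))
print(normalize_hiragana("こんにちわ"))
```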

3. Cross-Homophone Searching

This section presents a brief overview of Japanese homophony. Our aim is to demonstrate that even professional writers and sophisticated users are confused by the subtle distinctions between the numerous homophones in Japanese, and to assert that the intelligent search engine should be capable of optionally retrieving the various homophonic variants of the search term; in other words, of performing cross-homophone searching.

3.1 Some Definitions

A plethora of abstruse terms are used to describe the orthographic relations between words, including homograph, heteronym, homologue, heterograph and homonym, to name a few. Much confusion prevails, since these terms are often used inconsistently, even by professional linguists. This topic deserves a full paper in its own right. Here, we will keep it simple and define the most important terms.

Homophone: One of two or more words that are pronounced the same but differ in writing and usually in meaning (e.g. principal and principle).
Homograph: One of two or more words that are written the same but differ in pronunciation and (usually) in meaning (misleadingly also called heteronyms) (e.g. minute "60 seconds" and minute "very small").
Homonym: One of two or more words that are identical in writing and/or pronunciation but differ in meaning (sometimes called homologues) (e.g. light "not heavy" and light "not dark").
Orthographic Variant: One of two or more words that are written differently but are identical in pronunciation and meaning (sometimes called heterographs) (e.g. judgement and judgment).

3.2 Overview of Japanese Homophony

An important factor that contributes to the complexity of the Japanese writing system is the existence of a large number of homophones. Kooki and kikoo, for instance, each represent about a dozen words in common use, and the only way to distinguish between such compounds as 機構 kikoo 'mechanism' and 帰港 kikoo 'returning to the harbor' is through the characters. Although on (Chinese derived) homophones like the above may occasionally cause confusion in the spoken language, they are easily distinguished in the written language.

On the other hand, the abundance of kun (native Japanese) homophones is a source of confusion even to professional writers and editors. Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. In extreme cases, such as the word sasu, a kun word can be written in dozens of ways, though only several of these are in common use. Unlike on homophones, the majority of kun homophones are often close or even identical in meaning and thus easily confused, as shown in the table below:

Kun Homophones
Easily Distinguished		Easily Confused
hashi		noboru
橋	bridge	上る	go up (steps, a hill)
端	end, edge	登る	climb, scale
箸	chopsticks	昇る	ascend, rise (up to the sky)

Another problem with kun homophones is their variable orthography. Two or more characters are often partially or completely interchangeable in some senses but not in others. For example, 解ける tokeru and 溶ける tokeru are interchangeable in the sense of 'melt, thaw' but not in the sense of 'come loose', which is written 解ける. On the other hand, the meanings of some homophones are identical or nearly identical. For example, yawarakai 'soft, subdued; gentle' is written 柔らかい or 軟らかい with exactly the same meaning.

To make matters worse, the distinctions between some homophones are so subtle that many authors don't even try to select the most appropriate kanji and resort to the "easy solution" of using hiragana instead, making the meaning fuzzy and searching more difficult (see 2.3 Cross-script Searching for details).

3.3 Intelligent Homophone Searching

Cross-homophone searching requires a semantically classified database of homophones and a homophone expansion algorithm. The process of searching for Japanese homophones is not, in itself, any more difficult than searching for such English homophones as right and write. In English, however, cross-homophone searching is clearly undesirable. A user searching for right is most certainly not interested in finding write.

Not so in Japanese. From an information retrieval point of view, the major issue is that for many kun homophones, a universally-accepted orthography does not exist. Theoretically, the choice of character should be based on meaning, but in fact it is often unpredictable and governed by personal preferences.

This means that, when the user enters a query that involves homophones, she can never be sure which particular one to select, since often there is no one right answer. We have already seen in Section 2.1 above how hopelessly difficult it is for the user to select the appropriate homophones when searching for the book title Hi no sasanai yashiki. The table below demonstrates why this is so by showing the complex semantic interrelations between the homophones for sasu.

Kun Homophones for sasu
No.	English	"Standard" Form	Sometimes also	Often also
1	to offer	差す		さす
2	to hold up	差す		さす
3	to pour into	差す	注す	さす
4	to color	差す	注す	さす
5	to shine on	差す	射す	さす
6	to aim at	指す	差す
6	to point to	指す	さす
7	to stab	刺す	さす
8	to leave unfinished	さす	止す

To sum up, Japanese homophones have certain characteristics that exacerbate the difficulties of retrieving them:

Since many kun homophones are nearly synonymous or even identical in meaning, they are easily confused. As a result, there is no way to predict which particular homophone will appear in a text.
The distinction between some homophones is so subtle that many authors sidestep the irksome task of selecting the appropriate kanji and resort to hiragana.
Since Japanese has only a small stock of phonemes, the number of homophones is very large.

As things stand now, the entire burden of homophone searching falls upon the user. The intelligent search engine, by performing cross-homophone searching at the user's request, will relieve the user of this burden by retrieving all the homophones in the relevant homophone group.

Implementing such technology requires a comprehensive database of semantically and etymologically classified homophones. Merely retrieving all homophones will do far more harm than good since it will match numerous irrelevant homophones, such as 変える kaeru 'to change' for 帰る kaeru 'to return'.
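The difference between retrieving all homophones and retrieving only the relevant homophone group can be illustrated with a small sketch. The data structure below, a reading mapped to a list of semantic groups, is an assumed design and not a description of any existing database; the entries are taken from the sasu table and the kaeru example above.

```python
# reading -> semantic groups; the forms within one group are interchangeable
# spellings of the same sense (illustrative subset).
HOMOPHONE_GROUPS = {
    "さす": [
        {"差す", "射す", "さす"},  # to shine on
        {"刺す", "さす"},          # to stab
    ],
    "かえる": [
        {"変える"},                # to change
        {"帰る"},                  # to return
    ],
}


def expand_homophones(form: str, reading: str) -> set[str]:
    """Return the forms that share a semantic group with the given form."""
    expanded: set[str] = set()
    for group in HOMOPHONE_GROUPS.get(reading, []):
        if form in group:
            expanded |= group
    return expanded or {form}


print(expand_homophones("射す", "さす"))    # stays within 'to shine on'
print(expand_homophones("帰る", "かえる"))  # does not pull in 変える
print(expand_homophones("さす", "さす"))    # hiragana query matches both groups
```

The last call shows why hiragana queries are the hardest case: the form さす belongs to several groups at once, so the engine must either return all of them or ask the user to choose a sense.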

4. Homograph Disambiguation

4.1 Overview of Japanese Homography

Below is a brief overview of Japanese homography. A homograph is one of two or more words that are written the same but differ in pronunciation and (usually) in meaning, e.g. minute "60 seconds" and minute "very small".

Japanese has numerous kanji that have multiple on and kun readings, which gives rise to a large number of homographs. The table below lists some typical examples:

Japanese Homographs
Num.	Homograph	Reading	English
1	一時	ichiji	one o'clock; temporarily
	一時	hitotoki	a while
	一時	ittoki	a moment; 12th part of day
2	一章	isshoo	one chapter
	一章	kazuaki	a first name
3	仮名	kana	kana syllabary
	仮名	kamei	fictitious name, pseudonym
	仮名	karina	alias, assumed name
	仮名	kemyoo	fictitious name, pseudonym
4	化学	kagaku	chemistry
	化学	bakegaku	chemistry

Unlike English homographs, which differ in meaning, the meanings of Japanese homographs could be identical (化学 above), totally different (一章 above), or partially synonymous (一時 above).

4.2 Intelligent Homograph Searching

Since the number of homographs in Japanese is very large (we found over 20,000 in our databases), it follows that a failure to identify specific homographs will often lead to irrelevant results. However, since homographs are written in exactly the same way, retrieving the one semantically relevant to the search term is a very difficult task. This task is called homograph disambiguation, and it is also an important issue in text-to-speech synthesis. It is similar to searching for a polysemous word used in a specific sense, such as 'table' in the sense of "article of furniture," as opposed to 'table' in the sense of "rectangular array of data."

The truly intelligent search engine should be capable of performing homograph disambiguation.

5. Cross-Synonym Searching

The words of a language form a closely-linked network of interdependent units. The meaning of a word or expression cannot really be understood unless its relationships with other closely related words are taken into account. For example, such words as kill, murder, and execute share the meaning of 'put to death', but they differ in usage and connotation.

The Japanese language has an extraordinarily rich stock of synonyms and synonymous expressions. This section presents a brief overview of Japanese synonymy, and demonstrates that the user can greatly benefit from an intelligent search engine capable of retrieving synonyms of the search term; that is, of performing cross-synonym searching or synonym expansion.

5.1 Overview of Japanese Synonymy

From the point of view of building an intelligent search engine, the abundance of Japanese synonyms poses some interesting challenges. Below is a brief introduction to this complex subject, with focus on the different types of sense relations between synonyms and other kinds of semantically related words.

Synonymy
A relation between a set of words that are similar (near-synonyms) or identical (absolute-synonyms) in meaning.

Relation	English	Reading	Japanese
Shared concept	money	kane	金
Synonyms	currency	tsuuka	通貨
	cash	genkin	現金
	bank note	shihei	紙幣

Hyponymy and Hyperonymy
A relation between a set of specific (subordinate) terms, called hyponyms, and a generic (superordinate) term, called the hyperonym. The hyperonym is more general and includes the senses of the hyponyms.

Relation	English	Reading	Japanese
Hyperonym	sound	oto	音
Hyponyms	voice	koe	声
	echo	hankyoo	反響
	noise	sooon	騒音

Meronomy
A relation between a set of subordinate words, called meronyms, whose meanings are in a partitive (part-of) relation to a more comprehensive concept, called a holonym.

Relation	English	Reading	Japanese
Holonym	city	shi	市
Meronyms	ward	ku	区
	town section	choo	町
	town subsection	choome	丁目

Complementarity
A relation between a set of words that contrast with each other and are mutually exclusive:

Relation	English	Reading	Japanese
Shared concept	siblings	kyoodai	兄弟
Complementary terms	older brother	ani	兄
	younger brother	otooto	弟
	older sister	ane	姉
	younger sister	imooto	妹

Antonymy
A relation between words, called antonyms, of opposite meanings, such as 清潔な seiketsu na 'clean' and 汚い kitanai 'dirty'. Antonyms are probably not of interest in information retrieval.

5.2 The Semasiological Approach

Normally, the user of a dictionary starts out with a word or phrase and expects to find lexical information, such as a definition or a target language equivalent. Similarly, the user of a search engine starts out with a search term (keyword, phrase or Boolean expression) and expects to find cyberinformation, such as webpages, online databases and newsgroups relevant to the search term.

It is important to note that such a search operation has a well-defined direction: word-to-concept (lexeme-to-sense) or, in a search engine environment, keyword-to-cyberinformation. In lexicography, this way of searching is referred to as the semasiological approach. Clearly, this approach is based on the assumption that all the user wants is information including the specific search term provided in the search box.

As any search engine user knows, this is often not the case. Let us assume that a user wants to search for information on Kennedy's assassination. In Alta Vista, she might enter the string "+Kennedy +assassination." But surely this query will not retrieve such phrases as:

"Kennedy was killed on ..."
"The murder of Kennedy was ..."
"JFK had to be eliminated because ..."

To locate such phrases with conventional search engines, the user must resort to the laborious task of building advanced Boolean queries, then spend much time wading through often irrelevant results.

5.3 The Onomasiological Approach

From the point of view of the user interested in the semantic content of the search results, rather than in their orthographic representation, the semasiological approach is clearly inadequate. When such a user searches for the keyword "Kennedy," surely she is interested in the referent represented by "John Kennedy," "JFK," or "President Kennedy," not just in the lexicalized manifestation of any particular synonym. Similarly, when searching for "assassination," surely the user is interested in finding information on the concept [cause to die], not just in finding a particular phrase such as "the murder of," "was killed by," or "the killing of."

The opposite of the semasiological approach is the onomasiological approach, which reverses the normal semantic paradigm (also known as the onomantic perspective). There is a long tradition of lexicographic works based on this approach, the best known examples of which are thesauri and synonym dictionaries. These works make it possible to reverse the normal search direction; that is, instead of searching from word to concept, the user can search from concept to word.

5.4 Intelligent Synonym Searching

Though the usefulness of the onomasiological approach to dictionary consultation is indisputable, it has not yet become established in search engine information retrieval. The search strategy proposed here, based on the onomasiological approach, is called synonym expansion or cross-synonym searching. In a sense, the thematic search and topic search technologies currently implemented in web subject directories are also based on the onomasiological approach. But, as any search engine user knows, wading through multilevel hierarchies of subject directories is a time-consuming strategy that is too inefficient to be practical.

How does cross-synonym searching work? The user still has to enter a search term consisting of keywords, but with an important difference: the user need not be overly concerned with the specific wording of the query. A query consisting of any expression like "+kill +Kennedy", "JFK's assassination", or "The murder of John Kennedy" is expanded into the full set of synonyms and leads to the same or very similar search results.

To implement such technology, a comprehensive database of synonyms is required. A typical (partial) entry in such a database might look like this:

Concept: [to cause to die]
English	Reading	Japanese
to kill	korosu	殺す
to commit murder	satsujin o okasu	殺人を犯す
to execute	shokei suru	処刑する
to murder	satsugai suru	殺害する
to shoot to death	shasatsu suru	射殺する
to assassinate	ansatsu suru	暗殺する
to bump off	yaru	やる, 殺る
to butcher	barasu	ばらす

Semantically-classified databases like the above are useful not only for cross-synonym searching, but also in such increasingly important web technologies as the automated categorization of web resources and automatic query expansion (AQE). For cross-synonym searching to be truly effective, it should be combined with cross-orthographic searching and some of the other retrieval technologies described in this paper, as well as such technologies as query expansion with relevance feedback.
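As a rough illustration of synonym expansion, the sketch below stores the [to cause to die] entry as a concept-keyed group and expands a query term into every member of the groups it belongs to. The names SYNONYM_GROUPS, build_index, and expand_query are hypothetical, and the data is limited to the sample entry shown earlier; a production system would draw on a full synonym database and would first reduce inflected forms to their dictionary forms.

```python
# Hypothetical concept-keyed synonym groups (illustrative data only,
# copied from the sample database entry in this section).
SYNONYM_GROUPS = {
    "to cause to die": [
        "殺す", "殺人を犯す", "処刑する", "殺害する",
        "射殺する", "暗殺する", "やる", "殺る", "ばらす",
    ],
}


def build_index(groups):
    """Map every term to the set of concepts it belongs to."""
    index = {}
    for concept, terms in groups.items():
        for term in terms:
            index.setdefault(term, set()).add(concept)
    return index


def expand_query(term, groups, index):
    """Return the term followed by all synonyms that share a concept with it."""
    expanded = [term]
    for concept in sorted(index.get(term, ())):
        for synonym in groups[concept]:
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


TERM_INDEX = build_index(SYNONYM_GROUPS)
```

Calling expand_query("暗殺する", SYNONYM_GROUPS, TERM_INDEX) returns the original term followed by the other members of its group, which can then be submitted as alternatives in a single query. A term that is not in the database is returned unchanged.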

6. Cross-Language Searching

Non-Japanese users, such as learners, and even native speakers, can greatly benefit from English-Japanese cross-language searching; that is, inputting an English query to retrieve webpages that include the equivalent word(s) in Japanese, as shown in the table below:

Cross-Language Searching
Search Term	Search Results	Reading
Japanese economy	日本(の)経済	Nihon (no) keizai
Tokyo	東京	Tookyoo
happy	幸福, 幸せ	koofuku, shiawase
NEC	日本電気, ＮＥＣ	Nihon Denki, en-ii-shii

Cross-language searching has the additional benefit of enabling users without a Japanese input method editor (IME) to retrieve Japanese webpages. This is especially useful when searching for katakana words from the corresponding English words. Since Japanese has countless katakana loanwords derived from English, many of which are of variable orthography, even users with a Japanese IME and native speakers may find it more convenient to input English keywords and have the search engine retrieve all katakana and Latin alphabet variants, as shown in the table below:

English to Katakana Conversion
Search keyword	Search results
computer	コンピュータ, コンピューター
WWW	ワールドワイドウェブ, ウェブ, ＷＷＷ
Diesel	ディーゼル, ジーゼル

Cross-language searching, also known as cross-language information retrieval (CLIR), is a new research area that is becoming increasingly important as the World Wide Web undergoes rapid internationalization. The technical details are discussed in an article by Douglas W. Oard. Here, we will only mention that such technology requires access to a comprehensive English-Japanese lexicon designed to meet the needs of the search engine environment.
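The lookup step of English-to-Japanese query expansion might be sketched as follows. The lexicon EN_JA_LEXICON and the functions cross_language_terms and to_boolean_or are hypothetical names introduced for illustration, the entries are limited to examples from this section, and the OR syntax is an assumption about the target search engine's query language.

```python
# Hypothetical English-Japanese lexicon (illustrative entries only,
# taken from the examples in this section).
EN_JA_LEXICON = {
    "japanese economy": ["日本経済", "日本の経済"],
    "tokyo": ["東京"],
    "happy": ["幸福", "幸せ"],
    "nec": ["日本電気", "ＮＥＣ"],
    "computer": ["コンピュータ", "コンピューター"],
    "diesel": ["ディーゼル", "ジーゼル"],
}


def cross_language_terms(query):
    """Look up an English query, ignoring case and extra whitespace."""
    key = " ".join(query.lower().split())
    return EN_JA_LEXICON.get(key, [])


def to_boolean_or(terms):
    """Join the Japanese equivalents into one OR query (assumed syntax)."""
    return " OR ".join(f'"{term}"' for term in terms)
```

For example, to_boolean_or(cross_language_terms("Diesel")) yields a query that matches both katakana spellings, so a user without a Japanese IME can retrieve pages containing either variant.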

7. Morphological Analysis

This section briefly describes morphological analysis, an essential component of any Japanese search engine, and miscellaneous search technologies not covered in other sections.

7.1. Morphological and Lexemic Analysis

To perform query processing and search and retrieval operations, a Japanese search engine must be capable of processing a Japanese text on two levels: morphological and lexemic. Morphological analysis refers to computational procedures such as stemming and conflation that operate on the morphemic level (described below). The more difficult lexemic analysis refers to identifying word boundaries by segmenting a text stream into meaningful semantic units (such as lexemes) for dictionary lookup and indexing purposes.

Segmentation and morphological analysis are central to Japanese search engine technology, and each deserves a paper in its own right. Below, we will briefly describe some of the issues.
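To make the segmentation problem concrete, here is a minimal sketch of greedy longest-match segmentation against a lexicon. This is only one simple technique chosen for illustration, not a description of any particular analyzer; practical morphological analyzers typically weigh competing segmentations rather than always taking the longest match. The lexicon contents and the function name segment are assumptions.

```python
# Tiny hypothetical lexicon; a real system would load a large lexical database.
LEXICON = {"日本", "経済", "日本経済", "の", "研究", "を", "する"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Greedy longest-match segmentation.

    At each position, try the longest candidate first. A character that
    starts no known word is emitted as a single-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens
```

With this lexicon, segment("日本経済の研究をする") produces the units 日本経済, の, 研究, を, and する. The quality of the result depends almost entirely on the coverage of the lexicon, which is one reason lexical databases matter for this task.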

7.2. Conflation and Stemming

The Japanese language is agglutinative; that is, it forms words by putting together basic elements, called morphemes, that retain their original forms and meanings with little change during the combination process. Inflection in Japanese typically consists of adding conjugational endings to a stem to indicate various grammatical functions, such as tense. The resulting word is another word-form of the underlying lexeme, not a new word in itself, as shown in the table below (only basic forms are given):

Conjugation Paradigm for 書く kaku 'to write'
Category	Affirmative	Reading	Negative	Reading
Non-past	書く	kaku	書かない	kakanai
Non-past polite	書きます	kakimasu	書きません	kakimasen
Past	書いた	kaita	書かなかった	kakanakatta
Past polite	書きました	kakimashita	書きませんでした	kakimasen deshita
Gerund	書いて	kaite	書かないで	kakanaide
Continuative	書き	kaki
Conditional	書けば	kakeba	書かなければ	kakanakereba
Imperative	書け	kake	書くな	kaku na
Tentative	書こう	kakoo
Tentative polite	書きましょう	kakimashoo

A major issue in the indexing and retrieval of Japanese texts is the extensive morphological variation in word-forms. A Japanese search engine must not only be capable of segmenting the search term into meaningful semantic units (such as lexemes and multi-word terms), but must also be capable of ignoring morphological variants like those shown in the above table.

A computational procedure designed to match morphological variants by reducing them to a single form for retrieval purposes is called a conflation algorithm. A procedure for processing a word, often by removing the inflectional endings to find its stem, is called stemming. Conflation and stemming make it possible to retrieve any inflectional form from any of the others, ensuring that potentially relevant documents are not lost.

Because of the morphological complexity of Japanese, an intelligent Japanese search engine must be capable of both conflation and stemming, since both are prerequisites for implementing the various search technologies described in this paper, such as cross-orthographic and cross-language searching.
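A dictionary-driven form of conflation can be sketched by generating the word-forms of known verbs and mapping each one back to its dictionary form. The endings below are those of the 書く paradigm shown earlier and apply to regular verbs of the same class; the names KU_VERB_ENDINGS, build_conflation_table, and conflate are hypothetical, and the sketch makes no attempt to cover other verb classes or irregular verbs.

```python
# Endings of a regular verb in -ku, taken from the 書く paradigm above.
# Illustrative only: not a complete description of Japanese conjugation.
KU_VERB_ENDINGS = [
    "く", "かない", "きます", "きません", "いた", "かなかった",
    "きました", "きませんでした", "いて", "かないで", "き",
    "けば", "かなければ", "け", "くな", "こう", "きましょう",
]


def build_conflation_table(stems):
    """Map every generated word-form to its dictionary form (stem + く)."""
    table = {}
    for stem in stems:
        lemma = stem + "く"
        for ending in KU_VERB_ENDINGS:
            table[stem + ending] = lemma
    return table


def conflate(word_form, table):
    """Return the dictionary form, or the input unchanged if it is unknown."""
    return table.get(word_form, word_form)


# Stems of two regular verbs of this class: 書く 'to write' and 歩く 'to walk'.
CONFLATION_TABLE = build_conflation_table(["書", "歩"])
```

Indexing documents and queries by the conflated form lets a search for 書く retrieve pages containing 書きました or 書かなかった. Because the table is generated from a list of known stems, its coverage again depends on the underlying lexicon.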

7.3 Regular Expressions

Adding double-byte-enabled regular expression functionality to Japanese search engines will provide users with a tool for highly flexible searching far more powerful than Boolean expressions. A detailed treatment of regular expressions is outside the scope of this paper. A full analysis can be found in an excellent book on the subject, Mastering Regular Expressions by Jeffrey Friedl.

Below are a few examples of metacharacters used in regular expressions (based on the POSIX standard):

Some Metacharacters in Regular Expressions
Metacharacter	Description
.	any single character
[ ]	any one of the characters in the brackets
[^ ]	any one character except those listed after the caret
^	beginning of line
$	end of line
\<	beginning of word
\>	end of word
\t	tab character
\n	newline character
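The following sketch applies a few of these metacharacters to Japanese text. It uses Python's standard re module, which is an assumption made for illustration: its syntax is close to, but not identical with, POSIX (for instance, it uses \b rather than \< and \> for word boundaries), and it treats each kanji or kana as a single character.

```python
import re

text = "コンピュータとコンピューターは同じ語です。"

# "?" makes the final long-vowel mark optional, so both spellings match.
variants = re.findall(r"コンピューター?", text)

# A bracket expression matches any one of the listed kanji.
kaku = re.findall(r"[書描]く", "絵を描く。手紙を書く。")

# "." matches any single character, including a kanji.
wildcard = re.findall(r"東.都", "東京都と東北")

print(variants)
print(kaku)
print(wildcard)
```

The first pattern shows how a single expression can cover an orthographic variant pair that would otherwise require two separate Boolean terms.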

7.4 Miscellaneous Search Techniques

This paper covers the major issues in building an intelligent Japanese search engine, but is by no means exhaustive. There are various other possibilities, such as:

Loanword conversion. Converting katakana loanwords to native words; for example, the search term ドリンク dorinku 'drink' can be used to retrieve the corresponding kun and on equivalents 飲み物 nomimono and 飲料 inryoo. This is a limited implementation of cross-synonym searching.
Lexeme-based retrieval. Perform lexeme-based, rather than word-based, retrieval. For example, in searching for the keyword "high school", exclude webpages in which "high" is separated from "school" since these are unrelated to the lexeme 'high school'.
Character normalization. The relatively trivial normalization of character types, such as ignoring half-width/full-width differences (e.g. ＯＰＥＮ/OPEN) and various symbols and punctuation marks.
Syntactic phrases. Retrieving syntactic phrases, such as 研究をする kenkyuu o suru 'to conduct research', from their lexemic equivalents (研究する), and vice versa.
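Of the techniques listed above, character normalization is the simplest to illustrate. One possible approach, assumed here rather than prescribed, is Unicode NFKC normalization, which folds full-width Latin letters to ASCII and half-width katakana to full-width katakana. The function name normalize_width is hypothetical.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width Latin letters and half-width katakana using NFKC."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("ＯＰＥＮ"))
print(normalize_width("ｺﾝﾋﾟｭｰﾀ"))
```

Applying the same function to both the indexed text and the query makes ＯＰＥＮ and OPEN match. Note that NFKC also folds other compatibility characters, such as circled digits, so it may normalize more than a given application wants.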

8. Lexical Databases

Because of the morphological complexity and highly irregular orthography of Japanese, developing the advanced retrieval technologies required for intelligent Japanese searching cannot be based on algorithmic and statistical methods alone. To be effective, such methods must be supplemented by large-scale, up-to-date lexical databases designed to meet the specific needs of search engine applications.

The CJK Dictionary Institute (CJKI), which specializes in CJK computational lexicography, is engaged in the continuous expansion of a comprehensive CJK lexical database called DESK (more information below). Currently, DESK has over two million Japanese and one million Chinese entries, and includes a rich set of grammatical and semantic attributes required for developing information retrieval applications, input method editors, and electronic dictionaries.

Below is a brief description of the principal database components useful for developing intelligent Japanese search engines:

General Vocabulary. A comprehensive database of about 450,000 entries covering general vocabulary. The rich set of grammatical attributes is fine-tuned to support search engine applications, especially morphological analyzers and word segmenters.
Technical Terminology. A comprehensive Japanese-English-Japanese database of over 320,000 entries covering a broad spectrum of fields ranging from computer science to biotechnology.
Katakana Loanwords. About 50,000 loanwords and other Japanese words written in katakana, with special focus on computer and Internet terminology.
Japanese Names. About 600,000 Japanese (and Chinese) personal and place names semantically classified and ranked by frequency.
Western Names. An English-Japanese database of about 60,000 non-Japanese personal and place names, semantically classified and accompanied by English equivalents.
Japanese Companies. About 600,000 Japanese company and organization names ranked by frequency, with English equivalents when appropriate.
Orthographic Variants. A database of about 60,000 orthographic variants, with full coverage of okurigana, kanji, and kana variants, designed to support cross-script and cross-orthographic searching.
Homophone Groups. A database of semantically classified homophone groups designed to support cross-homophone searching.
Homograph Groups. A database of about 34,000 homographs designed to support homograph disambiguation.
Synonym Groups. A database of semantically classified synonym groups consisting of kanji synonyms, homonyms and meronyms serving as a basis for a Japanese thesaurus designed to support cross-synonym searching.
English-Japanese Dictionary. An English-Japanese lexical database of over 100,000 entries covering general vocabulary and important proper names. This can be expanded to cover Western names and technical terms.
Kanji-English Dictionary. A Kanji-English database that includes the comprehensive features of New Japanese-English Character Dictionary, which has become a standard reference work in Japanese education circles.
Kanji Database. A single-character database that covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes.
Orthographic Variation Rules. A comprehensive collection of rules of katakana, hiragana, and kanji orthographic variation that can be used to generate variants not listed in the database.

9. Conclusions

As we have seen, cross-orthographic searching is essential for intelligent Japanese information retrieval and query processing. However, the current lineup of first- and second-generation Japanese search engines is incapable of cross-orthographic searching, not to speak of the other, more advanced, retrieval technologies discussed in this paper. We have also seen that, because of the complexities and irregularities of the Japanese writing system, the implementation of intelligent retrieval technologies requires not only computational linguistic tools such as morphological analyzers, but also lexical databases fine-tuned to the needs of Japanese search engines.

Below is an outline of our vision for the future directions of third-generation Japanese search engine technology. The minimum requirements for what we shall refer to as a Level 1 Intelligent Japanese Search Engine (IJSE) are as follows:

A linguistically sophisticated Japanese Morphological Analyzer capable of segmenting a Japanese text stream into meaningful units such as lexemes.
Stemming and conflation technology for canonicalization and dictionary lookup.
A comprehensive database and algorithms to support cross-orthographic searching. This could include some or all of the following variation types:

The requirements for what we shall refer to as a Level 2 IJSE, in addition to those for a Level 1 IJSE, are as follows:

Support for regular expressions.
Databases and algorithms to support cross-homophone searching.
Databases and algorithms to support cross-synonym searching.
Databases and algorithms to support cross-language searching.

Going one step further in the quest for the ideal Japanese search engine, here are some suggested directions for what we shall refer to as a Level 3 IJSE:

A Voice User Interface (VUI) is especially useful for native Japanese users, the majority of whom have little or no keyboard experience.
Databases and algorithms to support homograph disambiguation.
Japanese to/from English web translation tools seamlessly integrated into the user interface.

The lack of sophisticated tools to cope with the complexities of the Japanese script places users of Japanese search engines and major portals such as e-commerce sites at a distinct disadvantage. The information retrieval industry in general, and the search engine industry in particular, is in urgent need of third-generation retrieval technology capable of meeting the challenges of intelligent Japanese searching.

The CJK Dictionary Institute finds itself in a unique position to provide the comprehensive, high-quality lexical resources and the software infrastructure required for building intelligent Japanese information retrieval technology.

About the Author

JACK HALPERN 春遍雀來 (ハルペン・ジャック)

President, The CJK Dictionary Institute
Editor-in-Chief, Kanji Dictionary Publishing Society
Research Fellow, Showa Women’s University

Born in Germany in 1946, Jack Halpern lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he compiled the New Japanese-English Character Dictionary for sixteen years. He is a professional lexicographer/writer and lectures widely on Japanese culture, is winner of first prize in the International Speech Contest in Japanese, and is founder of the International Unicycling Federation.

Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK). He has also compiled the world’s first Unicode dictionary of CJK characters.

List of Publications

Following is a list of the author’s principal publications in the field of CJK lexicography.

Halpern, Jack (1982): “Linguistic Analysis of the Function of Kanji in Modern Japanese,” 27th International Conference of Orientalists in Tokyo.
Halpern, Jack (1985): “Function of Kanji in Modern Japanese,” Transactions of the International Conference of Orientalists in Japan. The Tōhō Gakkai (The Institute of Eastern Culture). 27th International Conference of Orientalists in Japan in Tokyo.
Halpern, Jack (1985): “Kenkyusha’s New Japanese-English Character Dictionary,” Calico Journal, December 1985.
Halpern, Jack (1987): 漢字の再発見 Kanji no Saihakken ‘Rediscovering Chinese Characters’. Tokyo: Shodensha.
Halpern, Jack (1990): New Japanese-English Character Dictionary (Sixth Printing). Tokyo: Kenkyusha.
Halpern, Jack (1990): “New Japanese-English Character Dictionary: A Semantic Approach to Kanji Lexicography,” Euralex ’90 Proceedings. Actas del IV Congreso Internacional, 157-166. Benalmádena (Málaga): Bibliograf.
Halpern, Jack (1993): NTC’s New Japanese-English Character Dictionary. Chicago: National Textbook Company.
Halpern, Jack, Nomura Masaaki, and Fukada Atsushi (1994): “Building a Comprehensive Chinese Character Database,” Euralex ’94 Proceedings. International Congress on Lexicography in Amsterdam.
Halpern, Jack (1995): New Japanese-English Character Dictionary, Electronic Book Edition. Tokyo: Nichigai Associates.
Halpern, Jack (1998): “Building A Comprehensive Database for the Compilation of Integrated Kanji Dictionaries and Tools,” 43rd International Conference of Orientalists in Tokyo.
Halpern, Jack (1999): The Kodansha Kanji Learner’s Dictionary. Tokyo: Kodansha International.
Halpern, Jack and Kerman, Jouni (1999): “The Pitfalls and Complexities of Chinese to Chinese Conversion,” Fourteenth International Unicode Conference in Cambridge, Massachusetts.
Halpern, Jack (2000): “The Challenges of Intelligent Japanese Searching,” Tokyo (forthcoming).
Halpern, Jack: Dictionary of Unified CJK Characters -- for the Unicode Standard. Forthcoming.

The CJK Dictionary Institute

The CJK Dictionary Institute (CJKI) consists of a small group of researchers who specialize in CJK lexicography. The institute is headed by Jack Halpern, editor-in-chief of the New Japanese-English Character Dictionary, which has become a standard reference work for studying Japanese.

The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:

Dozens of lexicographic works, including electronic dictionaries.
Search engine applications, such as morphological analyzers and simplified to/from traditional Chinese conversion systems.
CJK input method editors (IME) and front-end processors (FEP).
Machine translation, online translation tools and speech technology software.
Pedagogical, linguistic and computational lexicography research.

DESK currently has over two million Japanese and about one million Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.

The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.

President
Jack Halpern
The CJK Dictionary Institute, Inc.
日中韓辭典研究所

34-14, 2-chome, Tohoku
Niiza-shi, Saitama 352-0001 JAPAN
Phone: +81-48-473-3508
Fax: +81-48-486-5032
http://www.cjk.org
