The Japanese language, which is written in a mixture of four scripts, is said to have the most complex writing system in the world. Such factors as the lack of a standard orthography, the presence of numerous orthographic variants, and the morphological complexity of the language pose formidable challenges to the building of an intelligent Japanese search engine. This paper describes the linguistic issues that need to be addressed by advanced information retrieval technologies such as cross-language, cross-script, cross-orthographic, and cross-synonym searching, and demonstrates that lexical databases must play a central role in their implementation.
It is often said that Japanese has the most complex writing system in the world. As we shall see below, this claim is fully justified. Contemporary Japanese is written in a mixture of four scripts, each of which has a distinct function.
- Thousands of logographic characters, called 漢字 kanji, derived from Chinese.
- A native syllabic script called 平仮名 hiragana.
- Another native syllabic script called 片仮名 katakana.
- Recently the Latin alphabet, called ローマ字 roomaji, has become increasingly common.
Kanji are used to write the core of the Japanese vocabulary. This includes words of Chinese origin, words coined in Japan on the Chinese model, such as 山脈 sanmyaku 'mountain range', as well as native Japanese words, such as 山 yama 'mountain'. Kanji have three basic properties: form, sound, and meaning. Each character may be pronounced according to several distinct pronunciations, called readings. A character may have one or several Chinese derived on readings, or one or several (sometimes dozens) native Japanese kun readings, and each reading may have numerous meanings associated with it.
Hiragana is used mostly to write grammatical elements, such as inflectional verb endings, and sometimes for writing native Japanese words. For example, in 見た mita the kanji 見 represents the stem of the verb 見る miru 'see' and た ta is a verb ending for forming the past tense. The hiragana endings attached to a kanji stem are called 送り仮名 okurigana.
Katakana is used mostly to write Western loanwords, such as プリンター purintaa 'printer', and onomatopoeic words, such as カチッと kachitto 'with a click'. The Latin alphabet is used for writing acronyms, for some loanwords instead of katakana, and for stylistic effects, especially in the names of shops and magazines.
A running Japanese text normally consists of a mixture of kanji and kana, as shown below:
Kanji o kumiawaseru koto ni yotte tasuu no jukugo ga tsukuridasemasu.
Numerous compound words can be formed by combining Chinese characters.
In the above sentence, case particles such as を o (object marker), as well as verb endings (-わせる -waseru in 組み合わせる kumiawaseru 'combine'), are written in hiragana, whereas nouns, such as 熟語 jukugo 'compound word', are written in kanji.
A fuller description of the Japanese writing system can be found in the front matter of the author's New Japanese-English Character Dictionary.
Several factors contribute to the difficulties of Japanese information retrieval and query processing. To build a truly sophisticated, "intelligent" Japanese search engine, various challenges must be overcome. Here are some of the major issues:
- The lack of a standard, universally accepted, orthography; that is, the presence of a large number of orthographic variants and easily confused homophones.
- The morphological structure of the language, which poses a formidable challenge to the development of accurate segmentation and conflation technologies.
- The need to support advanced linguistic technologies such as cross-orthographic searching.
- Miscellaneous technical requirements such as transcoding between multiple character sets and encodings, support for Unicode, and input method editors.
Each of the above is a major issue that deserves a paper in its own right. In this paper, we will focus on one of the central linguistic issues; that is,
The lack of a standard, universally accepted, orthography.
To our knowledge, few if any search engines have addressed orthographic variation or any of the other linguistic issues described in this paper. Let us take a very quick look at the current state of search engine technology.
The earliest search engines, such as AltaVista, Yahoo!, and Excite, are often referred to as first-generation search engines. Such an engine searches its index for the search term entered by the user, then generates a page with a list of relevant, and often irrelevant, links ranked by frequency of occurrence of the search term.
Second-generation search engines, such as Direct Hit and Northern Light, take a more intelligent approach to ensure relevancy by ordering the search results by various criteria such as popularity, semantic categories, link frequency, and page ranks. An excellent example of this is Google, which ranks results by the number of links from pages which themselves have a high rank.
None of these engines, including the very few that claim to be third generation, supports more than a bare minimum of computational linguistic features. Here, we will define the direction of third-generation search engines by focusing on the future of Japanese search and retrieval technology, especially such advanced linguistic technologies as cross-language, cross-script, cross-orthographic, and cross-synonym searching (described below).
Superficially, it would seem as if search engines need only search for the actual keywords provided by the user. In fact, from personal discussions with the executives of several leading search engine companies, it is clear that they deliberately follow a policy of "not to cast a wide net", so that searching for "travelled to Britain" will not match "traveled to Britain", not to speak of "travel to the U.K."
Ostensibly, the justification for such a policy is to prevent flooding the user with irrelevant results. The real reason, no doubt, is that they do not possess the technology for linguistically sophisticated searching. What such a policy often does achieve is the proverbial "throwing the baby out with the bathwater", since many relevant results are indiscriminately ignored along with the irrelevant ones.
Now let us step back and have a closer look at the big picture, strictly from the user's point of view. That is, let us pose the most relevant question of all:
What, exactly, is it that a search engine user really wants?
In this paper we will demonstrate that, especially in the case of Japanese, the user is far more interested in the inherent meaning (semantic content) represented by the search term, rather than in the accidental form (written representation) of any of its orthographic variants.
In many of the major languages of the world, orthographic variation is not a major issue, since their orthographies tend to be stable. Though English is notorious for its spelling irregularities, spelling variants (such as 'judgement' vs. 'judgment', 'wordprocessor' vs. 'word processor') are more of an annoyance than a major obstacle. For the most part, users can expect the orthographic representation of the search term to have little or no variation.
Not so with Japanese. Japanese orthography is so highly irregular that it can be considered, without the slightest fear of being accused of hyperbole, to be a couple of orders of magnitude more complex and more irregular than any other major language, Chinese included (Simplified Chinese has a remarkably stable orthography).
Should the user of a Japanese search engine be required to be intimately familiar with these complexities? Obviously not. Ideally, users should only be concerned with quickly finding the information they are seeking, not with the intricacies of the Japanese writing system.
That, in a nutshell, is where the real power of an "intelligent" Japanese search engine comes in. It relieves the user of the burden of dealing with the details of how the search term should be written, and lets her focus on the real issue at hand: defining the content of the information to be retrieved.
This section presents a brief overview of Japanese orthographic variation, focusing on those issues most relevant to information retrieval. The highly irregular Japanese orthography is a major obstacle to efficient searching. Our aim is to describe the complexities of the Japanese orthography, and to demonstrate that the intelligent search engine should be capable of retrieving all the orthographic variants of the search term; in other words, of performing cross-orthographic searching.
The Japanese orthography is highly unstable, bordering on the chaotic. A major factor that contributes to this state of affairs is the complex interaction of the four scripts used to write Japanese, resulting in countless words that can be written in a variety of often unpredictable ways.
Study the table below, which shows the orthographic variants of the words 取り扱い toriatsukai 'handling' and 当たり外れ atarihazure 'hit or miss'.
|toriatsukai||atarihazure||Type of variant|
|とり扱い||当たりはずれ||replace kanji with hiragana|
|取りあつかい||あたり外れ||replace kanji with hiragana|
It is important to note that these variants are not contrived examples for the sake of illustration. All the above forms do occur in contemporary Japanese, though some are less frequent than others. Even in carefully edited publications, not to speak of sloppily written webpages, there is no reliable way to predict which particular variant will occur, as this often depends on the whim of the author or editorial policy.
How does such orthographic variation affect the search engine user? Let us examine this issue a bit more in depth, looking at it entirely from the user's point of view. Let us say that a user is looking for a novel called Hi no sasanai yashiki (A Mansion with no Sunshine). Here are twelve legitimate ways (some more likely than others) of how to write this phrase.
We did a quick survey of six native Japanese speakers, some of whom are professional translators and writers, asking them how they would write the above phrase. Surprisingly (or not surprisingly), we received six different answers, none of which matched the "standard" form found in dictionaries (#1 above). Clearly, users of Japanese search engines cannot possibly be expected to know which specific variant is used in the official title of the book.
Now let us say that the user is searching for the Japanese equivalent of the proverbial "A hen that lays golden eggs." Theoretically, the "standard" way of writing this is:
In reality, three of the keywords above have the following common variants:
|Variant 1||Variant 2||Variant 3|
Combining the permutations for the above three words yields 24 possible ways of writing the original phrase, as shown in the table below:
|1. 金の卵を産む鶏||9. 金の卵を産むにわとり||17. 金の卵を産むニワトリ
Again, rest assured that this is not a contrived example designed to make a point. Not only is each of the variants for tamago, niwatori and umu of frequent occurrence in written Japanese -- they actually occur within the phrase in question, as can be easily verified by searching with your favorite search engine. Clearly, the user has no hope of finding all these variants unless the search engine can perform sophisticated cross-orthographic searching, as discussed below.
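The arithmetic behind the figure of 24 can be sketched in a few lines of Python. The variant lists below are illustrative assumptions (four forms of tamago, two of umu, three of niwatori); only the way they multiply is the point, not the specific forms chosen.

```python
from itertools import product

# Assumed variant lists, for illustration only.
TAMAGO = ["卵", "玉子", "たまご", "タマゴ"]   # 'egg'
UMU = ["産む", "生む"]                        # 'to lay'
NIWATORI = ["鶏", "にわとり", "ニワトリ"]      # 'hen'

def phrase_variants():
    """Build every written form of 'a hen that lays golden eggs'."""
    return [
        f"金の{egg}を{lay}{hen}"
        for hen, egg, lay in product(NIWATORI, TAMAGO, UMU)
    ]

variants = phrase_variants()
print(len(variants))  # 4 x 2 x 3 combinations
```

A search engine that matches only the literal query string would retrieve just one of these forms, which is why each keyword needs to be expanded independently before the query is run.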
A major factor that exacerbates the difficulties of Japanese information retrieval is the orthographic variation that occurs in kana endings, called 送り仮名 okurigana, that are attached to a kanji base or stem, as shown in the table below:
|English||Reading||"Standard" Form||Variant 1||Variant 2||Variant 3|
As explained in Section 1, Japanese is written in a mixture of four different scripts. Orthographic variation across scripts is a common occurrence, so that the same word can be written in hiragana, katakana, kanji, or even in the Latin alphabet. To compound these difficulties, the same word is sometimes written in two scripts, such as hiragana and kanji.
Study the table below carefully; it shows all the major cross-script variation patterns that occur in Japanese:
The above table shows that almost any combination of scripts can occur: kanji vs. hiragana, hiragana vs. katakana, Latin vs. katakana, etc. Cross-script variation is as common as it is unpredictable. From an information retrieval point of view, what is particularly irksome is the recent tendency to write many common kun words (like #2 above), and even on words (like #1 above), in hiragana instead of kanji, based on the widespread misconception that hiragana is "easier" to read.
When inputting a keyword, the user cannot be expected to be aware of these script differences. It goes without saying that the intelligent search engine should be capable of performing cross-script searching; that is, should be able to retrieve all such variants, regardless of the script the keyword is provided in.
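One possible way to approach cross-script matching is to reduce every keyword to a script-independent key before indexing and querying. The sketch below is a minimal illustration, not a complete solution: `READINGS` is a hypothetical two-entry lexicon standing in for a full kanji-to-reading database, and the kana folding relies on the fixed offset of 0x60 between the hiragana and katakana blocks in Unicode.

```python
# Hypothetical reading lexicon: kanji form -> hiragana reading.
READINGS = {"鶏": "にわとり", "卵": "たまご"}

def katakana_to_hiragana(text: str) -> str:
    """Shift katakana letters (U+30A1..U+30F6) down to their hiragana twins."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )

def script_key(term: str) -> str:
    """Return a script-independent key for a single keyword."""
    # Kanji forms are mapped through the lexicon; kana forms are folded.
    return katakana_to_hiragana(READINGS.get(term, term))

for form in ("鶏", "にわとり", "ニワトリ"):
    print(form, "->", script_key(form))
```

The kana folding is purely mechanical, but the kanji step cannot be done without a lexicon, since a kanji form gives no algorithmic clue to its reading.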
Though written Japanese underwent major reforms in the postwar period, resulting in the simplification and standardization of character forms, there is nevertheless a significant number of character form variants in common use, especially in proper names. Classical Japanese literature and religious texts such as the Buddhist scriptures are written almost exclusively in the traditional old forms.
|largely||oohaba ni||大幅に||大巾に||abbreviated form|
|10 years old||jussai||十才||十歳||variant form|
|proper name||Nakajima||中島||中嶋||variant form|
Since the use of variant forms is not uncommon, the intelligent search engine should be able to retrieve all such forms. For searching classical texts and religious scriptures, retrieving the traditional forms based on the simplified forms is especially important.
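A simple data structure for this kind of retrieval is a set of variant groups with an inverted map from each written form to its group. The sketch below uses only the three pairs from the table above; the names `VARIANT_GROUPS` and `expand_forms` are hypothetical, and the groups are deliberately stored at the word level, because a character such as 才 is not interchangeable with 歳 in every word.

```python
# Word-level variant groups taken from the table above.
VARIANT_GROUPS = [
    ("大幅に", "大巾に"),
    ("十才", "十歳"),
    ("中島", "中嶋"),
]

# Inverted map: any written form -> every form in its group.
FORM_TO_GROUP = {form: group for group in VARIANT_GROUPS for form in group}

def expand_forms(term):
    """Return all character-form variants of a term (or the term itself)."""
    return list(FORM_TO_GROUP.get(term, (term,)))

print(expand_forms("中嶋"))
```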
Japanese has numerous orthographic variants based on the principle of phonetic substitution. For example, 盲 is interchangeable with 妄 in such compounds as 妄想 (=盲想) moosoo 'wild idea', but not in 盲従 moojuu 'blind obedience'. One such variant, in this case 妄, is a phonetically replaced character, and the other, in this case 盲, is a phonetic replacement character. Such characters have the same reading and are often similar in meaning.
Though the older, phonetically replaced characters are gradually going out of use, their frequency of occurrence is sufficiently high to warrant support by search engines.
The katakana syllabary is used mostly to write Western loanwords, onomatopoeic words, names of plants and animals, non-Japanese personal and place names, for emphasis, and for slang. Recent years have seen an enormous increase in katakana use, especially in technical terminology.
Unfortunately, katakana orthography is often irregular, so that the same word may be written in multiple ways. Basically, the katakana transliteration of a loanword is an attempt to approximate the pronunciation of its etymon (the foreign word from which it is derived). Although there are general guidelines for loanword orthography, in practice there is considerable variation.
Katakana variation can be classified into the following types:
- The presence or absence of a macron (a dash-like symbol that indicates long vowels).
- The presence or absence of nakaguro (a middle dot between katakana words).
- Replacing macrons with actual vowels to indicate long vowels.
- A single foreign sound may be transcribed by multiple kana characters.
- Miscellaneous katakana variants.
|3. Long vowels||eye shadow||aishadoo||アイシャドー||アイシャドウ|
|4. Multiple kana||Diesel||diizeru|
|5. Others||quota||kuootaa |
The above is only a brief introduction to the complexities of katakana variation, which is as common as it is unpredictable. To relieve the user from the burden of guessing the correct variant, the intelligent Japanese search engine should be capable of retrieving all katakana variants of the search term.
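The first three variation types lend themselves to a mechanical normalization key. The following sketch is one possible approach, stated under assumptions: it drops the middle dot and the long-vowel mark, and treats a vowel kana that merely lengthens the preceding vowel as equivalent to the mark. The words コンピュータ and ウェブ・サイト are illustrative examples chosen here. The key is deliberately coarse (it also conflates ビル and ビール), so candidates would still need to be checked against a lexicon.

```python
import unicodedata

NAKAGURO = "・"   # middle dot between katakana words
CHOON = "ー"      # long-vowel mark
VOWEL_KANA = {"ア": "A", "イ": "I", "ウ": "U", "エ": "E", "オ": "O"}

def kana_vowel(ch):
    """Vowel of a katakana letter, read off its Unicode name, or None."""
    name = unicodedata.name(ch, "")
    if name.startswith("KATAKANA LETTER") and name[-1] in "AIUEO":
        return name[-1]
    return None

def katakana_key(word):
    """Coarse matching key that ignores long-vowel and middle-dot variation."""
    out = []
    prev_vowel = None
    for ch in word:
        if ch in (NAKAGURO, CHOON):
            continue
        vowel = VOWEL_KANA.get(ch)
        if vowel is not None and prev_vowel is not None:
            # A vowel kana used only to lengthen the previous vowel
            # (ドウ for ドー, メイ for メー) is dropped like the mark itself.
            if vowel == prev_vowel or (prev_vowel, vowel) in (("O", "U"), ("E", "I")):
                continue
        out.append(ch)
        prev_vowel = kana_vowel(ch)
    return "".join(out)

print(katakana_key("アイシャドー") == katakana_key("アイシャドウ"))
```

Variation of the fourth and fifth types (multiple kana for one foreign sound, and miscellaneous variants) is not regular enough for a rule of this kind and is better handled by listing the variants in a database.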
As explained in Section 1, the hiragana syllabary is used mostly to write grammatical elements and some native Japanese words, such as adverbs and particles. In recent years there has been a considerable increase in the use of hiragana, both for stylistic effects and because of the popular belief that hiragana is easier to read than kanji.
Unfortunately, both for humans and computers, hiragana strings are considerably more difficult to segment than the equivalent kanji-kana texts. Since there are no delimiters between words, identifying the lexemes in a hiragana string is often a futile exercise in disambiguation.
If we discount the okurigana irregularities and cross-script variations explained in Section 2.1 and Section 2.2, the hiragana orthography is, in itself, quite regular. Nevertheless, there are a number of hiragana irregularities, as explained below:
|1. Particles||Hello||konnichi wa||こんにちは||こんにちわ|
|3. ぢ and づ||to shrink||chijimu||ちぢむ||ちじむ|
|4. Historical||to use||mochiiru||用いる||用ゐる|
|long oo||koo||こう||かう, かふ|
Although hiragana variation is a relatively minor issue, the intelligent Japanese search engine should be capable of retrieving hiragana variants, regardless of how the keyword is input.
This section presents a brief overview of Japanese homophony. Our aim is to demonstrate that even professional writers and sophisticated users are confused by the subtle distinctions between the numerous homophones in Japanese, and to assert that the intelligent search engine should be capable of optionally retrieving the various homophonic variants of the search term; in other words, of performing cross-homophone searching.
A plethora of abstruse terms are used to describe the orthographic relations between words, including homograph, heteronym, homologue, heterograph and homonym, to name a few. Much confusion prevails, since these terms are often used inconsistently, even by professional linguists. This topic deserves a full paper in its own right. Here, we will keep it simple and define the most important terms.
An important factor that contributes to the complexity of the Japanese writing system is the existence of a large number of homophones. Kooki and kikoo, for instance, each represent about a dozen words in common use, and the only way to distinguish between such compounds as 機構 kikoo 'mechanism' and 帰港 kikoo 'returning to the harbor' is through the characters. Although on (Chinese derived) homophones like the above may occasionally cause confusion in the spoken language, they are easily distinguished in the written language.
On the other hand, the abundance of kun (native Japanese) homophones is a source of confusion even to professional writers and editors. Not only can each kanji have many kun readings, but many kun words can be written in a bewildering variety of ways. In extreme cases, such as the word sasu, a kun word can be written in dozens of ways, though only several of these are in common use. Unlike on homophones, the majority of kun homophones are often close or even identical in meaning and thus easily confused, as shown in the table below:
|Easily Distinguished||Easily Confused|
|go up (steps, a hill)|
ascend, rise (up to the sky)
Another problem with kun homophones is their variable orthography. Two or more characters are often partially or completely interchangeable in some senses but not in others. For example, 解ける tokeru and 溶ける tokeru are interchangeable in the sense of 'melt, thaw' but not in the sense of 'come loose', which is written 解ける. On the other hand, the meanings of some homophones are identical or nearly identical. For example, yawarakai 'soft, subdued; gentle' is written 柔らかい or 軟らかい with exactly the same meaning.
To make matters worse, the distinctions between some homophones are so subtle that many authors don't even try to select the most appropriate kanji and resort to the "easy solution" of using hiragana instead, making the meaning fuzzy and searching more difficult (see 2.3 Cross-script Searching for details).
Cross-homophone searching requires a semantically classified database of homophones and a homophone expansion algorithm. The process of searching for Japanese homophones is not, in itself, any more difficult than searching for such English homophones as right and write. In English, however, cross-homophone searching is clearly undesirable. A user searching for right is most certainly not interested in finding write.
Not so in Japanese. From an information retrieval point of view, the major issue is that for many kun homophones, a universally-accepted orthography does not exist. Theoretically, the choice of character should be based on meaning, but in fact it is often unpredictable and governed by personal preferences.
This means that, when the user enters a query that involves homophones, she can never be sure which particular one to select, since often there is no one right answer. We have already seen in Section 2.2 above how hopelessly difficult it is for the user to select the appropriate homophones when searching for the book title Hi no sasanai yashiki. The table below demonstrates why this is so by showing the complex semantic interrelations between the homophones for sasu.
|2||to hold up||差す||さす|
|3||to pour into||差す||注す||さす|
|5||to shine on||差す||射す||さす|
|6||to aim at||指す||差す|
|6||to point to||指す||さす|
|8||to leave unfinished||さす||止す|
To sum up, Japanese homophones have certain characteristics that exacerbate the difficulties of retrieving them:
As things stand now, the entire burden of homophone searching falls upon the user. The intelligent search engine, by performing cross-homophone searching at the user's request, will relieve the user of this burden by retrieving all the homophones in the relevant homophone group.
Implementing such technology requires a comprehensive database of semantically and etymologically classified homophones. Merely retrieving all homophones will do far more harm than good since it will match numerous irrelevant homophones, such as 変える kaeru 'to change' for 帰る kaeru 'to return'.
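The difference between semantically classified expansion and naive expansion by reading can be illustrated with a small sketch. The data layout (a dictionary keyed by reading and sense) and the function name are assumptions made for this example; the sasu entries are taken from the table above, and the kaeru entries from the example in the preceding paragraph.

```python
# (reading, sense) -> written forms that are interchangeable in that sense.
HOMOPHONES = {
    ("sasu", "to hold up"): ["差す", "さす"],
    ("sasu", "to pour into"): ["差す", "注す", "さす"],
    ("sasu", "to shine on"): ["差す", "射す", "さす"],
    ("sasu", "to point to"): ["指す", "さす"],
    ("kaeru", "to change"): ["変える"],
    ("kaeru", "to return"): ["帰る"],
}

def expand_homophones(form):
    """Return every form that shares at least one sense group with `form`."""
    result = []
    for forms in HOMOPHONES.values():
        if form in forms:
            for candidate in forms:
                if candidate not in result:
                    result.append(candidate)
    return result or [form]

print(expand_homophones("射す"))
print(expand_homophones("帰る"))
```

Because expansion follows sense groups rather than readings, 射す is expanded to 差す and さす, whereas 帰る is never expanded to 変える even though both are read kaeru.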
Below is a brief overview of Japanese homography. A homograph is one of two or more words that are written the same but differ in pronunciation and (usually) in meaning, e.g., minute "60 seconds" and minute "very small".
Japanese has numerous kanji that have multiple on and kun readings, which gives rise to a large number of homographs. The table below lists some typical examples:
|1||一時||ichiji||one o'clock; temporarily|
|一時||ittoki||a moment; 12th part of day|
|一章||kazuaki||a first name|
|仮名||kamei||fictitious name, pseudonym|
|仮名||karina||alias, assumed name|
|仮名||kemyoo||fictitious name, pseudonym|
Unlike English homographs, which differ in meaning, the meanings of Japanese homographs could be identical (化学 above), totally different (一章 above), or partially synonymous (一時 above).
Since the number of homographs in Japanese is very large (we found over 20,000 in our databases), it follows that a failure to identify specific homographs will often lead to irrelevant results. However, it is self-evident that, since homographs are written in exactly the same way, retrieving the one semantically relevant to the search term is a very difficult task. This is called homograph disambiguation, and is an important issue in text-to-speech synthesis. This is similar to searching for a polysemous word used in a specific sense, such as for 'table' in the sense of "article of furniture," as opposed to 'table' in the sense of "rectangular array of data."
The truly intelligent search engine should be capable of performing homograph disambiguation.
The words of a language form a closely-linked network of interdependent units. The meaning of a word or expression cannot really be understood unless its relationships with other closely related words are taken into account. For example, such words as kill, murder, and execute share the meaning of 'put to death', but they differ in usage and connotation.
The Japanese language has an extraordinarily rich stock of synonyms and synonymous expressions. This section presents a brief overview of Japanese synonymy, and demonstrates that the user can greatly benefit from an intelligent search engine capable of retrieving synonyms of the search term; that is, of performing cross-synonym searching or synonym expansion.
From the point of view of building an intelligent search engine, the abundance of Japanese synonyms poses some interesting challenges. Below is a brief introduction to this complex subject, with focus on the different types of sense relations between synonyms and other kinds of semantically related words.
Normally, the user of a dictionary starts out with a word or phrase and expects to find lexical information, such as a definition or a target language equivalent. Similarly, the user of a search engine starts out with a search term (keyword, phrase or Boolean expression) and expects to find cyberinformation, such as webpages, online databases and newsgroups relevant to the search term.
It is important to note that such a search operation has a well-defined direction: word-to-concept (lexeme-to-sense) or, in a search engine environment, keyword-to-cyberinformation. In lexicography, this way of searching is referred to as the semasiological approach. Clearly, this approach is based on the assumption that all the user wants is information including the specific search term provided in the search box.
As any search engine user knows, this is often not the case. Let us assume that a user wants to search for information on Kennedy's assassination. In AltaVista, she might enter the string "+Kennedy +assassination." But surely this query will not retrieve such phrases as:
- "Kennedy was killed on ..."
- "The murder of Kennedy was ..."
- "JFK had to be eliminated because ..."
To locate such phrases with conventional search engines, the user must resort to the laborious task of building advanced Boolean queries, then spend much time on wading through often irrelevant results.
From the point of view of the user interested in the semantic content of the search results, rather than in their orthographic representation, the semasiological approach is clearly inadequate. When such a user searches for the keyword "Kennedy," surely she is interested in the referent represented by "John Kennedy", "JFK," or "President Kennedy", not just in the lexicalized manifestation of any particular synonym. Similarly, when searching for "assassination," surely the user is interested in finding information on the concept [cause to die], not just in finding any particular phrase such as "the murder of", "was killed by" and "The killing of."
The opposite of the semasiological approach is the onomasiological approach, which reverses the normal semantic paradigm (also known as the onomantic perspective). There is a long tradition of lexicographic works based on this approach, the best known examples of which are thesauri and synonym dictionaries. These works make it possible to reverse the normal search direction; that is, instead of from word-to-concept, the user can search from concept-to-word.
Though the usefulness of the onomasiological approach to dictionary consultation is indisputable, it has not yet become established in search engine information retrieval. The search strategy proposed here, based on the onomasiological approach, is called synonym expansion or cross-synonym searching. In a sense, the thematic search and topic search technologies currently implemented in web subject directories are also based on the onomasiological approach. But, as any search engine user knows, wading through multilevel hierarchies of subject directories is a time-consuming strategy that is too inefficient to be practical.
How does cross-synonym searching work? Obviously, the user still has to enter a search term, consisting of keywords, but with an important difference. That is, the user need not be overly concerned with the specific wording of the query. A query consisting of any expression like "+kill +Kennedy", "JFK's assassination", or "The murder of John Kennedy" is expanded into the full set of synonyms and leads to the same or very similar search results.
To implement such technology, a comprehensive database of synonyms is required. A typical (partial) entry in such a database might look like this:
|to commit murder||satsujin o okasu||殺人を犯す|
|to execute||shokei suru||処刑する|
|to murder||satsugai suru||殺害する|
|to shoot to death||shasatsu suru||射殺する|
|to assassinate||ansatsu suru||暗殺する|
|to bump off||yaru||やる, 殺る|
Semantically-classified databases like the above are useful not only for cross-synonym searching, but also in such increasingly important web technologies as the automated categorization of web resources and automatic query expansion (AQE). For cross-synonym searching to be truly effective, it should be combined with cross-orthographic searching and some of the other retrieval technologies described in this paper, as well as such technologies as query expansion with relevance feedback.
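As a rough illustration of how a synonym entry of this kind could drive query expansion, the sketch below stores the entry under a concept label and builds an inverted index from each written form to its concept. The concept label "cause to die" is borrowed from the earlier discussion; the data layout and function names are assumptions for this example, not a description of any existing database.

```python
# One concept entry, populated from the table above.
SYNONYM_DB = {
    "cause to die": {
        "to commit murder": ["殺人を犯す"],
        "to execute": ["処刑する"],
        "to murder": ["殺害する"],
        "to shoot to death": ["射殺する"],
        "to assassinate": ["暗殺する"],
        "to bump off": ["やる", "殺る"],
    }
}

def build_index(db):
    """Map every written form to the concepts it belongs to."""
    index = {}
    for concept, entries in db.items():
        for forms in entries.values():
            for form in forms:
                index.setdefault(form, set()).add(concept)
    return index

TERM_TO_CONCEPTS = build_index(SYNONYM_DB)

def expand_synonyms(term, db=SYNONYM_DB, index=TERM_TO_CONCEPTS):
    """Return the term followed by every other member of its concepts."""
    expanded = [term]
    for concept in sorted(index.get(term, ())):
        for forms in db[concept].values():
            for form in forms:
                if form not in expanded:
                    expanded.append(form)
    return expanded

print(" OR ".join(expand_synonyms("暗殺する")))
```

The expanded list can then be submitted as a disjunctive query, so that the user who enters any one member of the group retrieves documents containing the others.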
Non-Japanese users, such as learners, and even native speakers, can greatly benefit from English-Japanese cross-language searching; that is, inputting an English query to retrieve webpages that include the equivalent word(s) in Japanese, as shown in the table below:
|Search Term||Search Results||Reading|
|Japanese economy||日本(の)経済||Nihon (no) keizai|
Cross-language searching has the additional benefit of enabling users without a Japanese input method editor (IME) to retrieve Japanese webpages. This is especially useful when searching for katakana words from the corresponding English words. Since Japanese has countless katakana loanwords derived from English, many of which are of variable orthography, even users with a Japanese IME and native speakers may find it more convenient to input English keywords and have the search engine retrieve all katakana and Latin alphabet variants, as shown in the table below:
|Search keyword||Search results|
Cross-language searching, also known as cross-language information retrieval (CLIR), is a new research area that is becoming increasingly important as the World Wide Web undergoes rapid internationalization. The technical details of this are discussed in an article by Douglas W. Oard. Here, we will only mention that such technology requires access to a comprehensive English-Japanese lexicon designed to meet the needs of the search engine environment.
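A minimal sketch of lexicon-driven cross-language expansion follows, using the "Japanese economy" example from the table above. The one-entry lexicon `EN_JA` and the convention of marking optional elements with parentheses are assumptions made for illustration; a usable lexicon would of course be far larger.

```python
import re

# Hypothetical English-Japanese lexicon; "(の)" marks an optional particle.
EN_JA = {"japanese economy": "日本(の)経済"}

def expand_optional(pattern):
    """Expand non-nested optional '(...)' segments into concrete strings."""
    # re.split with a capturing group alternates literal and optional parts.
    parts = re.split(r"\(([^()]*)\)", pattern)
    results = [""]
    for i, part in enumerate(parts):
        if i % 2 == 0:
            results = [r + part for r in results]
        else:
            results = [r + opt for r in results for opt in ("", part)]
    return results

def translate_query(query):
    """Return the Japanese search strings for an English query, if known."""
    entry = EN_JA.get(query.strip().lower())
    return expand_optional(entry) if entry else []

print(translate_query("Japanese economy"))
```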
This section briefly describes morphological analysis, an essential component of any Japanese search engine, and miscellaneous search technologies not covered in other sections.
To perform query processing and search and retrieval operations, a Japanese search engine must be capable of processing a Japanese text on two levels: morphological and lexemic. Morphological analysis refers to computational procedures such as stemming and conflation that operate on the morphemic level (described below). The more difficult lexemic analysis refers to identifying word boundaries by segmenting a text stream into meaningful semantic units (such as lexemes) for dictionary lookup and indexing purposes.
Segmentation and morphological analysis are central to Japanese search engine technology, and each deserves a paper in its own right. Below, we will briefly describe some of the issues.
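To make the segmentation problem concrete, here is a sketch of forward maximum matching, a common dictionary-based baseline (named here as an illustration, not as the method advocated in this paper). The six-entry `LEXICON` is hypothetical; the approach stands or falls with the coverage of the lexicon.

```python
# Hypothetical lexicon; a real one would hold hundreds of thousands of entries.
LEXICON = {"金", "の", "卵", "を", "産む", "鶏"}

def segment(text, lexicon=LEXICON):
    """Greedy longest-match segmentation against a lexicon."""
    max_len = max(map(len, lexicon))
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                break
        else:
            n = 1  # unknown character: emit it as a single-character token
        tokens.append(text[i:i + n])
        i += n
    return tokens

print(segment("金の卵を産む鶏"))
```

Greedy matching of this kind works tolerably on kanji-kana text, where script changes hint at word boundaries, but it degrades quickly on long hiragana strings, where many short lexicon entries compete.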
The Japanese language is agglutinative; that is, it forms words by putting together basic elements, called morphemes, that retain their original forms and meanings with little change during the combination process. Inflection in Japanese typically consists of adding conjugational endings to a stem to indicate various grammatical functions, such as tense. The resulting word is another word-form of the underlying lexeme, not a new word in itself, as shown in the table below (only basic forms are given):
|Past polite||書きました||kakimashita||書きませんでした||kakimasen deshita|
A major issue in the indexing and retrieval of Japanese texts is the extensive morphological variation in word-forms. A Japanese search engine must not only be capable of segmenting the search term into meaningful semantic units (such as lexemes and multi-word terms), but must also be capable of ignoring morphological variants like those shown in the above table.
A computational procedure designed to match morphological variants by reducing them to a single form for retrieval purposes is called a conflation algorithm. A procedure for processing a word, often by removing the inflectional endings to find its stem, is called stemming. Conflation and stemming make it possible to retrieve any inflectional form from any of the others, ensuring that potentially relevant documents are not lost.
Because of the morphological complexity of Japanese, an intelligent Japanese search engine must be capable of both conflation and stemming, since they are prerequisites for implementing the various search technologies described in this paper, such as cross-orthographic and cross-language searching.
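The effect of conflation on retrieval can be sketched as follows. This is a simplified, table-driven illustration, not a description of any particular engine: the `WORD_FORMS` table and the helper names are hypothetical, and the documents are assumed to be already segmented. A real system would derive the mapping from a morphological analyzer backed by a lexical database.

```python
# Hypothetical toy table mapping each word-form to the dictionary form of its lexeme.
WORD_FORMS = {
    "書く": "書く",
    "書いた": "書く",
    "書きます": "書く",
    "書きました": "書く",
    "書きません": "書く",
    "書きませんでした": "書く",
    "見る": "見る",
    "見た": "見る",
}


def conflate(word_form):
    """Reduce a word-form to a single representative form; unknown forms pass through."""
    return WORD_FORMS.get(word_form, word_form)


def build_index(documents):
    """Index pre-segmented documents under the conflated form of each token."""
    index = {}
    for doc_id, tokens in documents.items():
        for token in tokens:
            index.setdefault(conflate(token), set()).add(doc_id)
    return index


def search(index, query):
    """Conflate the query the same way, so any inflected form retrieves the others."""
    return sorted(index.get(conflate(query), set()))


docs = {
    1: ["手紙", "を", "書きました"],
    2: ["手紙", "を", "書きませんでした"],
    3: ["映画", "を", "見た"],
}
index = build_index(docs)
print(search(index, "書く"))      # [1, 2]
print(search(index, "書きます"))  # [1, 2]
```

Because both the index and the query pass through the same `conflate` step, a query in any inflected form retrieves documents containing any other form of the same lexeme.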
Adding double-byte-enabled regular expression functionality to Japanese search engines will provide users with a tool for highly flexible searching far more powerful than Boolean expressions. A detailed treatment of regular expressions is outside the scope of this paper. A full analysis can be found in an excellent book on the subject, Mastering Regular Expressions by Jeffrey Friedl.
Below are a few examples of metacharacters used in regular expressions (based on the POSIX standard):
|.|any single character|
|[ ]|any one of the characters in the brackets|
|[^ ]|any character except for those after the caret|
|^|beginning of line|
|$|end of line|
|\<|beginning of word|
|\>|end of word|
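As a brief illustration of how such metacharacters behave on Japanese text, here is a sketch using Python's standard `re` module, which operates on Unicode strings and therefore needs no special double-byte handling. The sample strings are made up for the example. Python writes word anchors as `\b` rather than with separate beginning-of-word and end-of-word operators.

```python
import re

text = "山脈と山道と山"

# "." matches any single character: 山 followed by exactly one character.
any_char = re.findall(r"山.", text)                      # ['山脈', '山道']

# "[...]" matches any one of the listed characters.
one_of = re.findall(r"[見来]た", "映画を見た。友達が来た。")  # ['見た', '来た']

# "[^...]" matches any character except those listed after the caret.
none_of = re.findall(r"山[^脈]", "山脈と山道")             # ['山道']

# "^" and "$" anchor to the beginning and end of each line in MULTILINE mode.
lines = "山脈\n火山\n山道"
starts_with = re.findall(r"^山.*$", lines, re.MULTILINE)  # ['山脈', '山道']
ends_with = re.findall(r"^.*山$", lines, re.MULTILINE)    # ['火山']

# Word anchors are of limited use on unsegmented Japanese: kanji and kana all
# count as word characters, so only the start of the string is a boundary here.
word_start = re.findall(r"\b山", text)                    # ['山']

print(any_char, one_of, none_of, starts_with, ends_with, word_start)
```

The last pattern shows why regular expressions complement rather than replace segmentation: without spaces between words, a word anchor cannot locate Japanese word boundaries on its own.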
This paper covers the major issues in building an intelligent Japanese search engine, but is by no means exhaustive. There are various other possibilities, such as:
Because of the morphological complexity and highly irregular orthography of Japanese, developing the advanced retrieval technologies required for intelligent Japanese searching cannot be based on algorithmic and statistical methods alone. To be effective, such methods must be supplemented by large-scale, up-to-date lexical databases designed to meet the specific needs of search engine applications.
The CJK Dictionary Institute (CJKI), which specializes in CJK computational lexicography, is engaged in the continuous expansion of a comprehensive CJK lexical database called DESK (more information below). Currently, DESK has over two million Japanese and one million Chinese entries, and includes a rich set of grammatical and semantic attributes required for developing information retrieval applications, input method editors, and electronic dictionaries.
Below is a brief description of the principal database components useful for developing intelligent Japanese search engines:
As we have seen, cross-orthographic searching is essential for intelligent Japanese information retrieval and query processing. However, the current lineup of first- and second-generation Japanese search engines is incapable of cross-orthographic searching, not to speak of the other, more advanced, retrieval technologies discussed in this paper. We have also seen that, because of the complexities and irregularities of the Japanese writing system, the implementation of intelligent retrieval technologies requires not only computational linguistic tools such as morphological analyzers, but also lexical databases fine-tuned to the needs of Japanese search engines.
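The role a lexical database plays in cross-orthographic searching can be sketched as a query expansion step. The following is a minimal illustration under stated assumptions: `VARIANT_GROUPS` is a hypothetical two-entry stand-in for a real database of orthographic variants, the function names are invented for the example, and matching is done by plain substring search rather than against a segmented index.

```python
# Hypothetical toy variant database: each set lists spellings of one word.
VARIANT_GROUPS = [
    {"取り扱い", "取扱い", "取扱", "とりあつかい"},  # okurigana and kana variants
    {"猫", "ねこ", "ネコ"},                          # kanji, hiragana, katakana
]

# Map every variant to its full group for constant-time lookup.
VARIANT_INDEX = {form: group for group in VARIANT_GROUPS for form in group}


def expand_query(term):
    """Return all known orthographic variants of a term (the term itself if unknown)."""
    return sorted(VARIANT_INDEX.get(term, {term}))


def search(documents, term):
    """Return IDs of documents containing any orthographic variant of the term."""
    variants = expand_query(term)
    return [doc_id for doc_id, text in sorted(documents.items())
            if any(variant in text for variant in variants)]


DOCS = {
    1: "商品の取扱いについて",
    2: "取り扱い説明書",
    3: "ネコの写真",
}

print(search(DOCS, "とりあつかい"))  # [1, 2]
print(search(DOCS, "猫"))            # [3]
```

None of these variants can be reliably derived from one another by rule, which is why the expansion step depends on a database rather than on an algorithm alone.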
Below is an outline of our vision for the future directions of third-generation Japanese search engine technology. The minimum requirements for what we shall refer to as a Level 1 Intelligent Japanese Search Engine (IJSE) are as follows:
The requirements for what we shall refer to as a Level 2 IJSE are, in addition to those for a Level 1 IJSE, as follows:
- Support for regular expressions.
- Databases and algorithms to support cross-homophone searching.
- Databases and algorithms to support cross-synonym searching.
- Databases and algorithms to support cross-language searching.
Going one step further in the quest for the ideal Japanese search engine, here are some suggested directions for what we shall refer to as a Level 3 IJSE:
- A Voice User Interface (VUI) is especially useful for native Japanese users, the majority of whom have little or no keyboard experience.
- Databases and algorithms to support homograph disambiguation.
- Japanese to/from English web translation tools seamlessly integrated into the user interface.
The lack of sophisticated tools to cope with the complexities of the Japanese script places users of Japanese search engines and major portals such as e-commerce sites at a distinct disadvantage. The information retrieval industry in general, and the search engine industry in particular, is in urgent need of third-generation retrieval technology capable of meeting the challenges of intelligent Japanese searching.
The CJK Dictionary Institute finds itself in a unique position to provide the comprehensive, high-quality lexical resources and the software infrastructure required for building intelligent Japanese information retrieval technology.
About the Author
JACK HALPERN 春遍雀來 (ハルペン・ジャック)
President, The CJK Dictionary Institute
Editor-in-Chief, Kanji Dictionary Publishing Society
Research Fellow, Showa Women’s University
Born in Germany in 1946, Jack Halpern has lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he spent sixteen years compiling the New Japanese-English Character Dictionary. He is a professional lexicographer and writer who lectures widely on Japanese culture, a winner of first prize in the International Speech Contest in Japanese, and the founder of the International Unicycling Federation.
Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK). He has also compiled the world’s first Unicode dictionary of CJK characters.
List of Publications
Following is a list of the author’s principal publications in the field of CJK lexicography.
The CJK Dictionary Institute (CJKI) consists of a small group of researchers who specialize in CJK lexicography. The institute is headed by Jack Halpern, editor-in-chief of the New Japanese-English Character Dictionary, which has become a standard reference work for studying Japanese.
The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:
- Dozens of lexicographic works, including electronic dictionaries.
- Search engine applications, such as morphological analyzers and simplified to/from traditional Chinese conversion systems.
- CJK input method editors (IME) and front-end processors (FEP).
- Machine translation, online translation tools and speech technology software.
- Pedagogical, linguistic and computational lexicography research.
DESK currently has over two million Japanese and about one million Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.
The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.
The CJK Dictionary Institute, Inc.
34-14, 2-chome, Tohoku
Niiza-shi, Saitama 352-0001 JAPAN