The Japanese language, which is written in a mixture of four scripts, is said to have the most complex writing system in the world. Such factors as the lack of a standard orthography, the presence of numerous orthographic variants, and the morphological complexity of the language pose formidable challenges to the building of an intelligent Japanese search engine. This paper describes the linguistic issues that need to be addressed by advanced information retrieval technologies, focusing on cross-language and cross-synonym searching. Such important areas as cross-script and cross-orthographic searching and homophone expansion are dealt with in separate papers.
The words of a language form a closely-linked network of interdependent units. The meaning of a word or expression cannot really be understood unless its relationships with other closely related words are taken into account. For example, such words as kill, murder, and execute share the meaning of 'put to death', but they differ in usage and connotation.
The Japanese language has an extraordinarily rich stock of synonyms and synonymous expressions. This section presents a brief overview of Japanese synonymy, and demonstrates that the user can greatly benefit from an intelligent search engine capable of retrieving synonyms of the search term; that is, of performing cross-synonym searching or synonym expansion.
From the point of view of building an intelligent search engine, the abundance of Japanese synonyms poses some interesting challenges. Below is a brief introduction to this complex subject, with focus on the different types of sense relations between synonyms and other kinds of semantically related words.
Normally, the user of a dictionary starts out with a word or phrase and expects to find lexical information, such as a definition or a target language equivalent. Similarly, the user of a search engine starts out with a search term (keyword, phrase or Boolean expression) and expects to find cyberinformation, such as webpages, online databases and newsgroups relevant to the search term.
It is important to note that such a search operation has a well-defined direction: word-to-concept (lexeme-to-sense) or, in a search engine environment, keyword-to-cyberinformation. In lexicography, this way of searching is referred to as the semasiological approach. Clearly, this approach is based on the assumption that all the user wants is information including the specific search term provided in the search box.
As any search engine user knows, this is often not the case. Let us assume that a user wants to search for information on Kennedy's assassination. In Alta Vista, she might enter the string "+Kennedy +assassination." But surely this query will not retrieve such phrases as:
- "Kennedy was killed on ..."
- "The murder of Kennedy was ..."
- "JFK had to be eliminated because ..."
To locate such phrases with conventional search engines, the user must resort to the laborious task of building advanced Boolean queries, and then spend much time wading through results that are often irrelevant.
From the point of view of the user interested in the semantic content of the search results, rather than in their orthographic representation, the semasiological approach is clearly inadequate. When such a user searches for the keyword "Kennedy," surely she is interested in the referent represented by "John Kennedy", "JFK," or "President Kennedy", not just in the lexicalized manifestation of any particular synonym. Similarly, when searching for "assassination," surely the user is interested in finding information on the concept [cause to die], not just in finding any particular phrase such as "the murder of", "was killed by" and "The killing of."
The opposite of the semasiological approach is the onomasiological approach (also known as the onomantic perspective), which reverses the normal semantic paradigm. There is a long tradition of lexicographic works based on this approach, the best-known examples of which are thesauri and synonym dictionaries. These works make it possible to reverse the normal search direction; that is, instead of searching from word to concept, the user can search from concept to word.
Though the usefulness of the onomasiological approach to dictionary consultation is indisputable, it has not yet become established in search engine information retrieval. The search strategy proposed here, based on the onomasiological approach, is called synonym expansion or cross-synonym searching. In a sense, the thematic search and topic search technologies currently implemented in web subject directories are also based on the onomasiological approach. But, as any search engine user knows, wading through multilevel hierarchies of subject directories is a time-consuming strategy that is too inefficient to be practical.
How does cross-synonym searching work? Obviously, the user still has to enter a search term consisting of keywords, but with an important difference: the user need not be overly concerned with the specific wording of the query. A query consisting of any expression like "+kill +Kennedy", "JFK's assassination", or "The murder of John Kennedy" is expanded into the full set of synonyms and leads to the same or very similar search results.
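The idea can be illustrated with a minimal Python sketch. It is not the implementation of any actual search engine: the concept labels, the tiny word lists (taken from the Kennedy examples above), and the function names `query_concepts` and `expand_query` are all assumptions made for illustration. Each query is reduced to the concepts it mentions and then rewritten as a Boolean expression that covers every listed synonym.

```python
import re

# Hypothetical concept table; the wordings come from the Kennedy examples above.
CONCEPTS = {
    "KENNEDY": ["kennedy", "jfk"],
    "CAUSE_TO_DIE": ["assassination", "murder", "kill", "killed",
                     "killing", "eliminated"],
}

# Reverse index: surface word -> concept label.
WORD_TO_CONCEPT = {
    word: concept for concept, words in CONCEPTS.items() for word in words
}


def query_concepts(query):
    """Map a free-form query to the set of concepts it mentions."""
    tokens = re.findall(r"[a-z]+", query.lower())
    return {WORD_TO_CONCEPT[t] for t in tokens if t in WORD_TO_CONCEPT}


def expand_query(query):
    """Rewrite a query as an AND of OR-groups, one group per concept."""
    found = query_concepts(query)
    groups = [
        "(" + " OR ".join(words) + ")"
        for concept, words in CONCEPTS.items()  # table order keeps output stable
        if concept in found
    ]
    return " AND ".join(groups)


queries = ["+kill +Kennedy", "JFK's assassination", "The murder of John Kennedy"]
for q in queries:
    print(q, "->", expand_query(q))
```

All three queries yield the same expanded expression, because each mentions the same two concepts. A realistic system would need a far larger lexicon and some way of handling ambiguous words, which this sketch ignores.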
To implement such technology, a comprehensive database of synonyms is required. A typical (partial) entry in such a database might look like this:
|to commit murder||satsujin o okasu||殺人を犯す|
|to execute||shokei suru||処刑する|
|to murder||satsugai suru||殺害する|
|to shoot to death||shasatsu suru||射殺する|
|to assassinate||ansatsu suru||暗殺する|
|to bump off||yaru||やる, 殺る|
Semantically-classified databases like the above are useful not only for cross-synonym searching, but also in such increasingly important web technologies as the automated categorization of web resources and automatic query expansion (AQE). For cross-synonym searching to be truly effective, it should be combined with cross-orthographic searching and some of the other retrieval technologies described in this paper, as well as such technologies as query expansion with relevance feedback.
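As a rough sketch of how such a semantically classified entry might be stored and queried, the Python fragment below loads the sample rows shown earlier under a single concept label. The `Entry` record, the label `CAUSE_TO_DIE`, and the functions `build_index` and `expand` are hypothetical names chosen for illustration; they do not describe the structure of any actual database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    english: str
    romaji: str
    japanese: tuple  # one or more orthographic variants


# Hypothetical concept label; the rows are the sample entry shown earlier.
SYNONYM_DB = {
    "CAUSE_TO_DIE": [
        Entry("to commit murder", "satsujin o okasu", ("殺人を犯す",)),
        Entry("to execute", "shokei suru", ("処刑する",)),
        Entry("to murder", "satsugai suru", ("殺害する",)),
        Entry("to shoot to death", "shasatsu suru", ("射殺する",)),
        Entry("to assassinate", "ansatsu suru", ("暗殺する",)),
        Entry("to bump off", "yaru", ("やる", "殺る")),
    ],
}


def build_index(db):
    """Index every English, romanized, and Japanese form by its concept."""
    index = {}
    for concept, entries in db.items():
        for entry in entries:
            for form in (entry.english, entry.romaji, *entry.japanese):
                index.setdefault(form, set()).add(concept)
    return index


INDEX = build_index(SYNONYM_DB)


def expand(term):
    """Return all Japanese forms that share a concept with the search term."""
    concepts = INDEX.get(term, set())
    forms = []
    for concept, entries in SYNONYM_DB.items():
        if concept not in concepts:
            continue
        for entry in entries:
            for form in entry.japanese:
                if form not in forms:
                    forms.append(form)
    return forms or [term]  # unknown terms are passed through unchanged


print(expand("暗殺する"))
```

Because every form is indexed under its concept, the same expansion is returned whether the user enters 暗殺する, "ansatsu suru", or "to assassinate". Keeping the orthographic variants (やる, 殺る) inside one entry is one simple way of combining synonym expansion with cross-orthographic searching.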
Non-Japanese users, such as learners, and even native speakers, can greatly benefit from English-Japanese cross-language searching; that is, inputting an English query to retrieve webpages that include the equivalent word(s) in Japanese, as shown in the table below:
|Search Term||Search Results||Reading|
|Japanese economy||日本(の)経済||Nihon (no) keizai|
Cross-language searching has the additional benefit of enabling users without a Japanese input method editor (IME) to retrieve Japanese webpages. This is especially useful when searching for katakana words from the corresponding English words. Since Japanese has countless katakana loanwords derived from English, many of which are of variable orthography, even users with a Japanese IME and native speakers may find it more convenient to input English keywords and have the search engine retrieve all katakana and Latin alphabet variants, as shown in the table below:
|Search keyword||Search results|
Cross-language searching, also known as cross-language information retrieval (CLIR), is a new research area that is becoming increasingly important as the World Wide Web undergoes rapid internationalization. The technical details are discussed in an article by Douglas W. Oard. Here, we will only mention that such technology requires access to a comprehensive English-Japanese lexicon designed to meet the needs of the search engine environment.
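A minimal sketch of the lexicon lookup step might look as follows. It assumes a hypothetical lexicon format in which optional elements are written in parentheses, as in the 日本(の)経済 example above; the names `LEXICON`, `expand_optional`, and `to_japanese_terms` are illustrative only, and the single lexicon entry is the one given in the table.

```python
import itertools
import re

# Hypothetical lexicon: lowercase English term -> Japanese pattern,
# with optional elements in parentheses (entry taken from the table above).
LEXICON = {
    "japanese economy": "日本(の)経済",
}


def expand_optional(pattern):
    """Expand a pattern such as 'A(B)C' into ['AC', 'ABC']."""
    # With one capturing group, re.split alternates between literal text
    # (even positions) and the optional parts (odd positions).
    parts = re.split(r"\(([^()]*)\)", pattern)
    choices = [
        [part] if i % 2 == 0 else ["", part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


def to_japanese_terms(english_query):
    """Translate an English query into the Japanese strings to search for."""
    pattern = LEXICON.get(english_query.strip().lower())
    return expand_optional(pattern) if pattern else []


print(to_japanese_terms("Japanese economy"))
```

The returned strings could then be submitted to the index as alternatives. The same mechanism could in principle hold several katakana spellings for one English keyword, although no such entries are included here.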
Because of the morphological complexity and highly irregular orthography of Japanese, developing the advanced retrieval technologies required for intelligent Japanese searching cannot be based on algorithmic and statistical methods alone. To be effective, such methods must be supplemented by large-scale, up-to-date lexical databases designed to meet the specific needs of search engine applications.
The CJK Dictionary Institute (CJKI), which specializes in CJK computational lexicography, is engaged in the continuous expansion of a comprehensive CJK lexical database called DESK (more information below). Currently, DESK has over two million Japanese and one million Chinese entries, and includes a rich set of grammatical and semantic attributes required for developing information retrieval applications, input method editors, and electronic dictionaries.
Below is a brief description of the principal database components useful for developing intelligent Japanese search engines:
JACK HALPERN 春遍雀來 (ハルペン・ジャック)
President, The CJK Dictionary Institute
Editor-in-Chief, Kanji Dictionary Publishing Society
Research Fellow, Showa Women’s University
Born in Germany in 1946, Jack Halpern lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he compiled the New Japanese-English Character Dictionary for sixteen years. He is a professional lexicographer/writer and lectures widely on Japanese culture, is winner of first prize in the International Speech Contest in Japanese, and is founder of the International Unicycling Federation.
Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK). He has also compiled the world’s first Unicode dictionary of CJK characters.
List of Publications
Following is a list of the author’s principal publications in the field of CJK lexicography.
The CJK Dictionary Institute (CJKI) consists of a small group of researchers who specialize in CJK lexicography. The institute is headed by Jack Halpern, editor-in-chief of the New Japanese-English Character Dictionary, which has become a standard reference work for studying Japanese.
The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:
- Dozens of lexicographic works, including electronic dictionaries.
- Search engine applications, such as morphological analyzers and simplified to/from traditional Chinese conversion systems.
- CJK input method editors (IME) and front-end processors (FEP).
- Machine translation, online translation tools and speech technology software.
- Pedagogical, linguistic and computational lexicography research.
DESK currently has over two million Japanese and about 2.5 million Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.
The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.
The CJK Dictionary Institute, Inc.
34-14, 2-chome, Tohoku
Niiza-shi, Saitama 352-0001 JAPAN