Cross-Synonym and Cross-Language Searching in Japanese


Jack Halpern
President
©2001-2008 The CJK Dictionary Institute, Inc.



Index to This Document
  Abstract
  1. Cross-synonym searching
  2. Cross-language searching
  3. Lexical databases

Abstract

The Japanese language, which is written in a mixture of four scripts, is said to have the most complex writing system in the world. Such factors as the lack of a standard orthography, the presence of numerous orthographic variants, and the morphological complexity of the language pose formidable challenges to the building of an intelligent Japanese search engine. This paper describes the linguistic issues that need to be addressed by advanced information retrieval technologies, focusing on cross-language and cross-synonym searching. Such important areas as cross-script and cross-orthographic searching and homophone expansion are dealt with in separate papers.

1. Cross-Synonym Searching

The words of a language form a closely-linked network of interdependent units. The meaning of a word or expression cannot really be understood unless its relationships with other closely related words are taken into account. For example, such words as kill, murder, and execute share the meaning of 'put to death', but they differ in usage and connotation.

The Japanese language has an extraordinarily rich stock of synonyms and synonymous expressions. This section presents a brief overview of Japanese synonymy, and demonstrates that the user can greatly benefit from an intelligent search engine capable of retrieving synonyms of the search term; that is, of performing cross-synonym searching or synonym expansion.


1.1 Overview of Japanese Synonymy

From the point of view of building an intelligent search engine, the abundance of Japanese synonyms poses some interesting challenges. Below is a brief introduction to this complex subject, with focus on the different types of sense relations between synonyms and other kinds of semantically related words.

  1. Synonymy
    A relation between a set of words that are similar (near-synonyms) or identical (absolute-synonyms) in meaning.

    Relation        English     Reading   Japanese
    Shared concept  money       kane      金
    Synonyms        currency    tsuuka    通貨
                    cash        genkin    現金
                    bank note   shihei    紙幣

  2. Hyponymy and Hyperonymy
    A relation between a set of specific (subordinate) terms, called hyponyms, and a generic (superordinate) term, called the hyperonym. The hyperonym is more general and includes the senses of the hyponyms.

    Relation   English  Reading   Japanese
    Hyperonym  sound    oto       音
    Hyponyms   voice    koe       声
               echo     hankyoo   反響
               noise    sooon     騒音


  3. Meronymy
    A relation between a set of subordinate words, called meronyms, whose meanings are in a partitive (part-of) relation to a more comprehensive concept, called a holonym.

    Relation  English          Reading   Japanese
    Holonym   city             shi       市
    Meronyms  ward             ku        区
              town section     choo      町
              town subsection  choome    丁目

  4. Complementarity
    A relation between a set of words that contrast with each other and are mutually exclusive:

    Relation             English          Reading   Japanese
    Shared concept       siblings         kyoodai   兄弟
    Complementary terms  older brother    ani       兄
                         younger brother  otooto    弟
                         older sister     ane       姉
                         younger sister   imooto    妹

  5. Antonymy
    A relation between words, called antonyms, of opposite meanings, such as 清潔な seiketsu na 'clean' and 汚い kitanai 'dirty'. Antonyms are probably not of interest in information retrieval.
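
For retrieval purposes, sense relations like these can be stored as labeled links between terms. The following Python sketch is an illustration only, not a description of any actual database: the `RELATIONS` list, the `build_index` and `related_terms` functions, and the decision about which relations to expand are all assumptions. Terms are given by the romanized readings used in the examples above.

```python
from collections import defaultdict

# (relation, head term, related term), using romanized readings.
RELATIONS = [
    ("synonym", "kane", "tsuuka"),          # money -> currency
    ("synonym", "kane", "genkin"),          # money -> cash
    ("synonym", "kane", "shihei"),          # money -> bank note
    ("hyponym", "oto", "koe"),              # sound -> voice
    ("hyponym", "oto", "hankyoo"),          # sound -> echo
    ("hyponym", "oto", "sooon"),            # sound -> noise
    ("meronym", "shi", "ku"),               # city -> ward
    ("meronym", "shi", "choo"),             # city -> town section
    ("antonym", "seiketsu na", "kitanai"),  # clean -> dirty
]

# Assumption: only these relations are worth expanding in a search.
# Antonyms are left out, following the remark in the text.
EXPANDABLE = {"synonym", "hyponym", "meronym"}


def build_index(relations):
    """Map each head term to its related terms, grouped by relation."""
    index = defaultdict(lambda: defaultdict(list))
    for relation, head, related in relations:
        index[head][relation].append(related)
    return index


def related_terms(term, index, expandable=EXPANDABLE):
    """Return the term followed by every related term worth searching for."""
    result = [term]
    for relation, members in index.get(term, {}).items():
        if relation in expandable:
            result.extend(members)
    return result


index = build_index(RELATIONS)
print(related_terms("oto", index))
print(related_terms("seiketsu na", index))
```

In this sketch a search for the hyperonym also retrieves its hyponyms, while a search for 'clean' is left unexpanded because only an antonym is recorded for it.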

1.2 The Semasiological Approach

Normally, the user of a dictionary starts out with a word or phrase and expects to find lexical information, such as a definition or a target language equivalent. Similarly, the user of a search engine starts out with a search term (keyword, phrase or Boolean expression) and expects to find cyberinformation, such as webpages, online databases and newsgroups relevant to the search term.

It is important to note that such a search operation has a well-defined direction: word-to-concept (lexeme-to-sense) or, in a search engine environment, keyword-to-cyberinformation. In lexicography, this way of searching is referred to as the semasiological approach. Clearly, this approach is based on the assumption that all the user wants is information including the specific search term provided in the search box.

As any search engine user knows, this is often not the case. Let us assume that a user wants to search for information on Kennedy's assassination. In AltaVista, she might enter the string "+Kennedy +assassination". But surely this query will not retrieve such phrases as:

  1. "Kennedy was killed on ..."
  2. "The murder of Kennedy was ..."
  3. "JFK had to be eliminated because ..."

To locate such phrases with conventional search engines, the user must resort to the laborious task of building advanced Boolean queries, then spend much time wading through results that are often irrelevant.


1.3 The Onomasiological Approach

From the point of view of the user interested in the semantic content of the search results, rather than in their orthographic representation, the semasiological approach is clearly inadequate. When such a user searches for the keyword "Kennedy," surely she is interested in the referent represented by "John Kennedy", "JFK," or "President Kennedy", not just in the lexicalized manifestation of any particular synonym. Similarly, when searching for "assassination," surely the user is interested in finding information on the concept [cause to die], not just in finding any particular phrase such as "the murder of", "was killed by" and "The killing of."

The opposite of the semasiological approach is the onomasiological approach, which reverses the normal semantic paradigm (also known as the onomantic perspective). There is a long tradition of lexicographic works based on this approach, the best-known examples of which are thesauri and synonym dictionaries. These works make it possible to reverse the normal search direction; that is, instead of searching from word to concept, the user can search from concept to word.
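
The two search directions can be contrasted with a pair of lookup tables. This is a minimal Python sketch under stated assumptions: the concept label `CAUSE_TO_DIE` and the names `WORD_TO_CONCEPTS`, `invert`, and `words_sharing_a_concept` are hypothetical, and the word list is limited to a few English expressions used as examples in this paper.

```python
from collections import defaultdict

# Semasiological direction: word -> concept(s).
WORD_TO_CONCEPTS = {
    "kill": ["CAUSE_TO_DIE"],
    "murder": ["CAUSE_TO_DIE"],
    "execute": ["CAUSE_TO_DIE"],
    "assassinate": ["CAUSE_TO_DIE"],
}


def invert(word_to_concepts):
    """Build the onomasiological direction: concept -> words."""
    concept_to_words = defaultdict(list)
    for word, concepts in word_to_concepts.items():
        for concept in concepts:
            concept_to_words[concept].append(word)
    return dict(concept_to_words)


CONCEPT_TO_WORDS = invert(WORD_TO_CONCEPTS)


def words_sharing_a_concept(word):
    """Go word -> concept -> words, as a thesaurus user would."""
    found = []
    for concept in WORD_TO_CONCEPTS.get(word, []):
        for other in CONCEPT_TO_WORDS[concept]:
            if other not in found:
                found.append(other)
    return found


print(words_sharing_a_concept("murder"))
```

The first table answers the question a dictionary answers; the inverted table answers the question a thesaurus answers.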


1.4 Intelligent Synonym Searching

Though the usefulness of the onomasiological approach to dictionary consultation is indisputable, it has not yet become established in search engine information retrieval. The search strategy proposed here, based on the onomasiological approach, is called synonym expansion or cross-synonym searching. In a sense, the thematic search and topic search technologies currently implemented in web subject directories are also based on the onomasiological approach. But, as any search engine user knows, wading through multilevel hierarchies of subject directories is a time-consuming strategy that is too inefficient to be practical.

How does cross-synonym searching work? Obviously, the user still has to enter a search term consisting of keywords, but with an important difference: the user need not be overly concerned with the specific wording of the query. A query consisting of any expression like "+kill +Kennedy", "JFK's assassination", or "The murder of John Kennedy" is expanded into the full set of synonyms and leads to the same or very similar search results.

To implement such technology, a comprehensive database of synonyms is required. A typical (partial) entry in such a database might look like this:


Concept: [to cause to die]
English             Reading           Japanese
to kill             korosu            殺す
to commit murder    satsujin o okasu  殺人を犯す
to execute          shokei suru       処刑する
to murder           satsugai suru     殺害する
to shoot to death   shasatsu suru     射殺する
to assassinate      ansatsu suru      暗殺する
to bump off         yaru              やる, 殺る
to butcher          barasu            ばらす
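
The expansion step can be sketched in Python as follows. This is an illustration under stated assumptions rather than an actual implementation: the two lookup tables, the concept labels, and the `query_concepts` and `expand` functions are hypothetical, and the handful of English entries stands in for a full synonym database. A Japanese engine would use the Japanese column of an entry like the one above in the same way.

```python
import re

# Hypothetical lookup from surface expressions to concept labels.
EXPRESSION_TO_CONCEPT = {
    "kill": "CAUSE_TO_DIE",
    "killed": "CAUSE_TO_DIE",
    "murder": "CAUSE_TO_DIE",
    "assassination": "CAUSE_TO_DIE",
    "kennedy": "JOHN_F_KENNEDY",
    "john kennedy": "JOHN_F_KENNEDY",
    "jfk": "JOHN_F_KENNEDY",
}

# Members of each synonym group, used as the expanded search terms.
CONCEPT_TO_TERMS = {
    "CAUSE_TO_DIE": ["kill", "murder", "execute", "assassinate"],
    "JOHN_F_KENNEDY": ["Kennedy", "JFK", "President Kennedy"],
}

MAX_PHRASE_LEN = 2  # longest expression in the lookup, counted in words


def query_concepts(query):
    """Map a free-form query to the set of concepts it mentions."""
    text = re.sub(r"'s\b", "", query.lower())  # drop the possessive 's
    tokens = re.findall(r"[a-z]+", text)
    concepts, i = set(), 0
    while i < len(tokens):
        # Prefer the longest known expression starting at position i.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in EXPRESSION_TO_CONCEPT:
                concepts.add(EXPRESSION_TO_CONCEPT[phrase])
                i += n
                break
        else:
            i += 1  # unknown word such as "the" or "of": ignored here
    return concepts


def expand(query):
    """Build a Boolean query: OR within a concept, AND across concepts."""
    groups = []
    for concept in sorted(query_concepts(query)):
        terms = " OR ".join(f'"{t}"' for t in CONCEPT_TO_TERMS[concept])
        groups.append(f"({terms})")
    return " AND ".join(groups)


for q in ["+kill +Kennedy", "JFK's assassination", "The murder of John Kennedy"]:
    print(expand(q))
```

All three queries reduce to the same pair of concepts, so the sketch prints the same expanded Boolean query three times. Ignoring unknown words is a simplification; a real system would more likely keep them as literal search terms.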

Semantically-classified databases like the above are useful not only for cross-synonym searching, but also in such increasingly important web technologies as the automated categorization of web resources and automatic query expansion (AQE). For cross-synonym searching to be truly effective, it should be combined with cross-orthographic searching and some of the other retrieval technologies described in this paper, as well as such technologies as query expansion with relevance feedback.


2. Cross-Language Searching

Non-Japanese users, such as learners, and even native speakers, can greatly benefit from English-Japanese cross-language searching; that is, inputting an English query to retrieve webpages that include the equivalent word(s) in Japanese, as shown in the table below:


Cross-Language Searching
Search Term        Search Results   Reading
Japanese economy   日本(の)経済     Nihon (no) keizai
Tokyo              東京             Tookyoo
happy              幸福             koofuku
                   幸せ             shiawase
NEC                日本電気         Nihon Denki
                   NEC              en-ii-shii
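
A lookup of this kind can be sketched in Python. The `EN_JA` dictionary and the `expand_optional` and `japanese_terms` functions are hypothetical names introduced for illustration, and the convention that parentheses mark an optional element, as in 日本(の)経済, is an assumption about how such a lexicon might encode the data in the table above.

```python
import itertools
import re

# Hypothetical English-to-Japanese lexicon based on the table above.
# Parentheses mark an optional element, as in 日本(の)経済.
EN_JA = {
    "japanese economy": ["日本(の)経済"],
    "tokyo": ["東京"],
    "happy": ["幸福", "幸せ"],
    "nec": ["日本電気", "NEC"],
}


def expand_optional(pattern):
    """Return every string obtained by keeping or dropping each (...) part."""
    parts = re.split(r"\(([^()]*)\)", pattern)
    # Even indexes hold fixed text, odd indexes hold optional text.
    choices = [[p] if i % 2 == 0 else [p, ""] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in itertools.product(*choices)]


def japanese_terms(english_query):
    """Look up an English query and list the Japanese strings to search for."""
    terms = []
    for pattern in EN_JA.get(english_query.strip().lower(), []):
        for term in expand_optional(pattern):
            if term not in terms:
                terms.append(term)
    return terms


print(japanese_terms("Japanese economy"))
print(japanese_terms("NEC"))
```

The English query is normalized to lowercase before lookup, and each Japanese equivalent is expanded into the concrete strings a search engine would match, so "Japanese economy" yields both the form with the particle and the form without it.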

Cross-language searching has the additional benefit of enabling users without a Japanese input method editor (IME) to retrieve Japanese webpages. This is especially useful when searching for katakana words from the corresponding English words. Since Japanese has countless katakana loanwords derived from English, many of which are of variable orthography, even users with a Japanese IME and native speakers may find it more convenient to input English keywords and have the search engine retrieve all katakana and Latin alphabet variants, as shown in the table below:


English to Katakana Conversion
Search keyword   Search results
computer         コンピュータ
                 コンピューター
WWW              ワールドワイドウェブ
                 ウェブ
                 WWW
Diesel           ディーゼル
                 ジーゼル
Cross-language searching, also known as cross-language information retrieval (CLIR), is a new research area that is becoming increasingly important as the World Wide Web undergoes rapid internationalization. The technical details are discussed in an article by Douglas W. Oard. Here, we will only mention that such technology requires access to a comprehensive English-Japanese lexicon designed to meet the needs of the search engine environment.

3. Lexical Databases

Because of the morphological complexity and highly irregular orthography of Japanese, developing the advanced retrieval technologies required for intelligent Japanese searching cannot be based on algorithmic and statistical methods alone. To be effective, such methods must be supplemented by large-scale, up-to-date lexical databases designed to meet the specific needs of search engine applications.
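
One simple way in which a lexicon can drive an algorithm is dictionary-based word segmentation. The Python sketch below uses forward maximum matching, a standard textbook technique chosen here for illustration and not necessarily the method used in any particular product. The `LEXICON` set and the `segment` function are hypothetical, and the tiny word list stands in for a large lexical database.

```python
# A tiny hypothetical lexicon; a production system would load a
# large-scale lexical database instead.
LEXICON = {"日本", "経済", "日本電気", "東京", "の", "現金", "通貨"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Forward maximum matching: take the longest lexicon word at each step."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            # A single character is always accepted, so unknown
            # characters pass through as one-character words.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


print(segment("日本の経済"))
print(segment("日本電気の現金"))
```

The quality of the segmentation depends entirely on the coverage of the lexicon: with 日本電気 listed, the company name is kept whole, and without it the same string would be split into 日本 followed by unknown single characters.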

The CJK Dictionary Institute (CJKI), which specializes in CJK computational lexicography, is engaged in the continuous expansion of a comprehensive CJK lexical database called DESK (more information below). Currently, DESK has over two million Japanese and one million Chinese entries, and includes a rich set of grammatical and semantic attributes required for developing information retrieval applications, input method editors, and electronic dictionaries.

Below is a brief description of the principal database components useful for developing intelligent Japanese search engines:

  1. General Vocabulary. A comprehensive database of about 450,000 entries covering general vocabulary. The rich set of grammatical attributes is fine-tuned to support search engine applications, especially morphological analyzers and word segmenters.
  2. Technical Terminology. A comprehensive Japanese-English-Japanese database of over 320,000 entries covering a broad spectrum of fields ranging from computer science to biotechnology.
  3. Katakana Loanwords. About 50,000 loanwords and other Japanese words written in katakana, with special focus on computer and Internet terminology.
  4. Japanese Names. About 600,000 Japanese (and Chinese) personal and place names semantically classified and ranked by frequency.
  5. Western Names. An English-Japanese database of about 60,000 non-Japanese personal and place names, semantically classified and accompanied by English equivalents.
  6. Japanese Companies. About 600,000 Japanese company and organization names ranked by frequency with English equivalents when appropriate.
  7. Orthographic Variants. A database of about 60,000 orthographic variants, with full coverage of okurigana, kanji, and kana variants, designed to support cross-script and cross-orthographic searching.
  8. Homophone Groups. A database of semantically classified homophone groups designed to support cross-homophone searching.
  9. Homograph Groups. A database of about 34,000 homographs designed to support homograph disambiguation.
  10. Synonym Groups. A database of semantically classified synonym groups consisting of kanji synonyms, homonyms and meronyms serving as a basis for a Japanese thesaurus designed to support cross-synonym searching.
  11. English-Japanese Dictionary. An English-Japanese lexical database of over 100,000 entries covering general vocabulary and important proper names. This can be expanded to cover Western names and technical terms.
  12. Kanji-English Dictionary. A Kanji-English database that includes the comprehensive features of New Japanese-English Character Dictionary, which has become a standard reference work in Japanese education circles.
  13. Kanji Database. A single-character database that covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes.
  14. Orthographic Variation Rules. A comprehensive collection of rules of katakana, hiragana, and kanji orthographic variation that can be used to generate variants not listed in the database.
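
Rule-based variant generation, as mentioned in the last item, can be sketched in Python. The two rules below are illustrative assumptions modeled on the katakana pairs コンピュータ/コンピューター and ディーゼル/ジーゼル shown earlier. They are not an actual rule set, and the function names are hypothetical.

```python
# Two illustrative rewrite rules (assumptions, not an actual rule set):
#   1. a word-final long-vowel mark may be written or omitted;
#   2. ディ and ジ may alternate in loanword spellings.
LONG_VOWEL = "ー"
A_ROW = set("アカサタナハマヤラワガザダバパ")  # kana ending in the vowel a


def toggle_final_long_vowel(word):
    """Drop a final long-vowel mark, or add one after an a-row kana."""
    if word.endswith(LONG_VOWEL):
        return {word[:-1]}
    if word and word[-1] in A_ROW:
        return {word + LONG_VOWEL}
    return set()


def swap_di_ji(word):
    """Replace ディ with ジ and ジ with ディ."""
    variants = set()
    if "ディ" in word:
        variants.add(word.replace("ディ", "ジ"))
    if "ジ" in word:
        variants.add(word.replace("ジ", "ディ"))
    return variants


RULES = [toggle_final_long_vowel, swap_di_ji]


def generate_variants(word, rules=RULES):
    """Apply every rule repeatedly until no new candidate appears."""
    seen = {word}
    queue = [word]
    while queue:
        current = queue.pop()
        for rule in rules:
            for candidate in rule(current):
                if candidate not in seen:
                    seen.add(candidate)
                    queue.append(candidate)
    return seen


print(sorted(generate_variants("コンピュータ")))
print(sorted(generate_variants("ディーゼル")))
```

Rules this simple overgenerate: applied to a word such as コーヒー, the first rule would produce a form that is not in use. In practice the candidates would have to be checked against attested forms, for example the variants listed in a database or the terms actually found in the search index.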

About the Author

JACK HALPERN     春遍雀來     (ハルペン・ジャック)

President, The CJK Dictionary Institute
Editor-in-Chief, Kanji Dictionary Publishing Society
Research Fellow, Showa Women’s University

Born in Germany in 1946, Jack Halpern has lived in six countries and knows twelve languages. Fascinated by kanji while living in an Israeli kibbutz, he came to Japan in 1973, where he spent sixteen years compiling the New Japanese-English Character Dictionary. He is a professional lexicographer and writer who lectures widely on Japanese culture, the winner of first prize in the International Speech Contest in Japanese, and the founder of the International Unicycling Federation.

Jack Halpern is currently the editor-in-chief of the Kanji Dictionary Publishing Society (KDPS), a non-profit organization that specializes in compiling kanji dictionaries, and the head of The CJK Dictionary Institute (CJKI), which specializes in CJK lexicography and the development of a comprehensive CJK database (DESK). He has also compiled the world’s first Unicode dictionary of CJK characters.

The CJK Dictionary Institute

The CJK Dictionary Institute (CJKI) consists of a small group of researchers who specialize in CJK lexicography. The institute is headed by Jack Halpern, editor-in-chief of the New Japanese-English Character Dictionary, which has become a standard reference work for studying Japanese.

The principal activity of the CJKI is the development and continuous expansion of a comprehensive database that covers every aspect of how Chinese characters are used in CJK languages, including Cantonese. Advanced computational lexicography methodology has been used to compile and maintain a Unicode-based database that is serving as a source of data for:

  1. Dozens of lexicographic works, including electronic dictionaries.
  2. Search engine applications, such as morphological analyzers and simplified to/from traditional Chinese conversion systems.
  3. CJK input method editors (IME) and front-end processors (FEP).
  4. Machine translation, online translation tools and speech technology software.
  5. Pedagogical, linguistic and computational lexicography research.

DESK currently has over two million Japanese and about 2.5 million Chinese items, including detailed grammatical, phonological and semantic attributes for general vocabulary, technical terms, and hundreds of thousands of proper nouns. The single-character database covers every aspect of CJK characters, including frequency, phonology, radicals, character codes, and other attributes. See http://www.cjk.org/cjk/samples/ for a list of data resources.

The CJKI has become one of the world’s prime resources for CJK dictionary data, and is contributing to CJK information processing technology by providing software developers with high-quality lexical resources, as well as through its ongoing research activities and consulting services.


President
Jack Halpern
The CJK Dictionary Institute, Inc.
日中韓辭典研究所

34-14, 2-chome, Tohoku
Niiza-shi, Saitama 352-0001 JAPAN
Phone: +81-48-473-3508
Fax: +81-48-486-5032
http://www.cjk.org