The CJK Dictionary Institute, Inc. - English Segmentation Not Trivial

Is English Segmentation Trivial?

Jack Halpern
The CJK Dictionary Institute, Inc.
http://www.cjk.org

Index to This Document

What is the problem?
English Morphology
Principal Morphological Processes
Some Issues in English Segmentation
Some Conclusions

1. What is the problem?

Articles on morphological analysis and segmentation often begin with the caveat that Western European languages, unlike Chinese and Japanese, are easy to segment (break into meaningful semantic units), since "words" (a vague concept at best) are delimited by spaces, natural boundaries that are absent in Chinese and Japanese. Some even go so far as to call segmentation a "trivial task."
As we shall see below, this is a gross oversimplification. Though it is true that the challenges presented by Japanese and Chinese are far greater, segmentation of a language like English is hardly trivial. (For information on Japanese word formation, please see http://www.kanji.org/kanji/japanese/writing/wordform.htm.)

Below I will briefly explain some of the issues that must be addressed in developing an English Morphological Analyzer (EMA) designed to segment English texts.

2. English Morphology

Languages differ in the processes by which they form new words. An important unit of word formation is the morpheme. A morpheme is a minimal unit of meaning -- a distinctive linguistic unit of relatively stable meaning that cannot be divided into smaller meaningful parts.

First, let us take a quick look at the principal morphological processes in English:

Compounding: combining two or more base forms, e.g. headwaiter

Inflection: adding endings to indicate grammatical functions, e.g. walk + ed gives walked

Affixation: adding affixes to a base form, e.g. great + ness gives greatness

Conversion: a shift of one word class to another, e.g. the noun chemical is derived from the adjective chemical

Blending: merging two words, e.g. breakfast + lunch gives brunch

Clipping: truncating a word to a shorter form, often a single syllable, e.g. laboratory into lab

Acronyms: combining initial letters, e.g. UNESCO

3. Principal Morphological Processes

Of particular importance are the three morphological processes described below:

Compounding consists of combining two or more words having their own lexical meaning (having a substantial meaning of their own) to produce a new unit that functions as a single word. Traditionally, a compound word is considered to be a combination of two or more free words, such as headwaiter, which consists of head and waiter, though in practice the elements are not necessarily free.

Affixation (also called derivation) refers to creating a new word by adding an affix (prefix, infix or suffix) to a base form; the affix expresses grammatical meaning but has no lexical meaning of its own. For example, in English the adverb-forming suffix -ly is attached to the base form great to form the adverb greatly. It is important to note that units formed through affixation are distinct words in their own right, not merely variants of the word from which they are derived.

Inflection consists of adding word endings or modifying the form of a word in order to indicate various grammatical functions, such as tense (called conjugation) or number and case (called declension). The resulting word is another form of the original word, not a new word in itself. For example, worked is the past tense of work and books is the plural form of book.

4. Some Issues in English Segmentation

Here are a few reasons why breaking an English text stream into meaningful semantic units is difficult. It is of utmost importance to keep in mind that the traditional notion of a space-delimited "word" is often not a useful unit.

For a search engine to be effective, the most important units that must be identified are lexemes (also called lexical units), which can be roughly defined as basic semantic units of vocabulary. Some lexemes, like run, can have several variants (inflected forms) like ran and running. Others, which we will refer to as multi-word lexemes, can consist of several traditional words, e.g. information retrieval.

Space-Delimited Compound Words

"Take off your jacket!"

This is an instance of a verbal compound, a single lexeme, take off. An intelligent search engine should identify it as a single unit, and not break it up. The status of take off as a lexeme in English is equivalent to single-word lexemes like remove and take.
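
One common way to keep such a unit together is greedy longest-match lookup against a lexicon. The following is a minimal sketch under stated assumptions, not a description of any particular product: the LEXICON contents and the segment function are hypothetical names invented for this illustration, and a real system would load a far larger dictionary.

```python
# Hypothetical mini-lexicon of multi-word lexemes (an assumption for this sketch).
LEXICON = {
    ("take", "off"),
    ("information", "retrieval"),
    ("high", "school"),
}
MAX_LEN = max(len(entry) for entry in LEXICON)


def segment(tokens):
    """Greedy longest-match grouping of space-delimited tokens into lexemes."""
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, falling back to a single token.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if n == 1 or candidate in LEXICON:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
    return units


print(segment("Take off your jacket".split()))
# prints ['Take off', 'your', 'jacket']
```

Longest match is only a heuristic: it groups any adjacent take + off, whether or not the pair is really functioning as a single verb in that sentence.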

Compound Verbs with Inserted Elements

"Take your jacket off!"
"Send that letter off before he comes"

Here, words are inserted between the verb and off, so that even an intelligent segmenter that can handle adjacent compounds such as take off in the previous example will fail to identify the lexemes take off and send off.
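
A rough sketch of how a segmenter might tolerate inserted elements is to search for the particle within a small window after the verb. Everything named here is an assumption made for illustration: the PHRASAL_VERBS set, the MAX_GAP limit of three inserted words, and the find_phrasal_verbs function.

```python
# Hypothetical lexicon of verb + particle lexemes; illustrative only.
PHRASAL_VERBS = {("take", "off"), ("send", "off")}
MAX_GAP = 3  # assumed limit on the number of inserted words


def find_phrasal_verbs(sentence):
    """Return (lexeme, verb_index, particle_index) for each match found."""
    tokens = [t.strip('.,!?"').lower() for t in sentence.split()]
    matches = []
    for i, verb in enumerate(tokens):
        # Accept the particle right after the verb or up to MAX_GAP words later.
        for j in range(i + 1, min(i + 2 + MAX_GAP, len(tokens))):
            if (verb, tokens[j]) in PHRASAL_VERBS:
                matches.append((f"{verb} {tokens[j]}", i, j))
                break
    return matches


print(find_phrasal_verbs("Take your jacket off!"))
# prints [('take off', 0, 3)]
```

A purely window-based match like this will overgenerate, since an off that follows take within a few words need not belong to it. A production analyzer would presumably need syntactic information to confirm the link.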

Stemming
An intelligent segmenter must be able to identify inflected forms like went, took and took off and determine their stem or root form. This is referred to as stemming or sometimes deinflecting.
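
As a hedged illustration, deinflection can be sketched as a lookup table for irregular forms followed by a few crude suffix rules. The IRREGULAR table, the SUFFIX_RULES list and both function names are hypothetical and far from complete. For instance, the rules do not undo consonant doubling, so running would come out as runn.

```python
# Hypothetical table of irregular forms; illustrative only, far from complete.
IRREGULAR = {"went": "go", "took": "take", "ran": "run"}

# (suffix, replacement) pairs tried in order; deliberately crude.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def base_form(word):
    """Return a guessed base form: irregular lookup first, then suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least three letters to avoid mangling short words.
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[:len(w) - len(suffix)] + replacement
    return w


def base_form_of_unit(unit):
    """Deinflect only the first word of a multi-word unit such as 'took off'."""
    head, *rest = unit.split()
    return " ".join([base_form(head), *rest])


print(base_form("went"), base_form("walked"), base_form_of_unit("took off"))
# prints: go walk take off
```

The irregular table matters more than the rules: no suffix rule can recover go from went, which is one reason a lexicon is hard to avoid even for English.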

Combining the above three problems, a sophisticated segmenter should be able to identify take off in the sentence below:

"He took his shirt off yesterday"

Hyphenated/Solid/Open
Take the following as an example:

(1) word processor
(2) word-processor
(3) wordprocessor

A compound word often first appears in English as free forms in juxtaposition, like (1) above. As such forms become established ("petrified"), they are considered to be noun phrases and begin to appear in glossaries and terminology dictionaries. As they evolve further they become "real" words (become lexicalized) and make their way into ordinary dictionaries and the layperson's vocabulary.

Some compounds go through a stage of hyphenation, like (2), and then eventually may become solid, like (3). Some lexemes, like school bus and high school, always remain open.

What is very important to understand is that the degree of solidity often has nothing to do with the status of a string as a lexeme. School bus is just as legitimate a lexeme as is headwaiter or word-processor. The presence or absence of spaces or hyphens, that is, the orthography, does not determine the lexemic status of a string.
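
One simple way to make lookup independent of orthography is to index the lexicon under a key from which spaces and hyphens have been removed. This is a minimal sketch assuming a tiny hand-built word list; variant_key, CANONICAL and canonical_form are names invented for the example.

```python
import re


def variant_key(term):
    """Collapse spaces and hyphens so open, hyphenated and solid forms match."""
    return re.sub(r"[\s\-]+", "", term.lower())


# Hypothetical lexicon keyed by the collapsed form, mapping to a canonical form.
CANONICAL = {
    variant_key(t): t
    for t in ["word processor", "school bus", "headwaiter"]
}


def canonical_form(term):
    """Return the canonical lexeme for a variant spelling, or None if unknown."""
    return CANONICAL.get(variant_key(term))


for spelling in ["word processor", "word-processor", "wordprocessor"]:
    print(spelling, "->", canonical_form(spelling))
```

All three spellings resolve to the same entry. The trade-off is that collapsing can also conflate strings that are distinct lexemes (green house and greenhouse, for example), so the key is best treated as a candidate generator, not a final decision.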

Proper Nouns
Proper nouns, like AT&T and New York, must be treated as single units. Republic of China is just as legitimate a proper noun as China is, and there is not the slightest justification for treating them differently.

5. Some Conclusions

The description above is far from complete. Many more issues need to be investigated in depth in order to design an effective EMA that will significantly increase search and indexing accuracy.

What is important to keep in mind is that it is easy to get carried away by the simplicity of breaking on white space and hyphens. As the examples above demonstrate, this is a mistake, and sophisticated segmenters and search engines must not subscribe to this fallacy.

The sure way to identify multi-word lexemes accurately is a lexicon-based approach, which has proven its effectiveness in the Chinese and Japanese morphological analyzers used by major IT companies (for which we supplied the data) and other major portals. Methods based on statistical analysis (such as collocation-extraction algorithms, the mutual information statistic, and automatic phrase indexing) often give satisfactory results, but they also frequently produce erroneous ones.
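
For readers unfamiliar with the mutual information statistic, its pointwise form for an adjacent word pair is PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The sketch below estimates it from raw counts. The pmi_scores function is a hypothetical name, and the probability estimates are the simplest possible ones, with no smoothing or frequency threshold.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Score each adjacent word pair by pointwise mutual information (base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


corpus = "the school bus stopped near the high school".split()
for pair, score in sorted(pmi_scores(corpus).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

With unsmoothed estimates like these, pairs of rare words that happen to co-occur once receive very high scores. This is one concrete way in which a purely statistical method can rank an accidental pair above a genuine lexeme.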

For reference, check http://www.comp.lancs.ac.uk/computing/users/eiamjw/claws/claws7.html