CHALLENGES IN DEVELOPING WORD ANALYZERS FOR INDIAN LANGUAGES

Akshar Bharati, Amba P. Kulkarni, Vineet Chaitanya
Satyam School of Applied Information Systems
Indian Institute of Information Technology, Hyderabad
{vineet,amba,sangal}@iiit.net

[Presented at Workshop on Morphology, CIEFL, Hyderabad, July 1996.]



1. Machine translation (MT)

Machine translation is throwing up many challenges and opening up many opportunities for work. Some of the problems relate to grammars; others pertain to word analysis, bilingual dictionaries, language generation, etc. The concept of information is central to building MT systems. The question to be asked at every level of analysis of a given language string is: what is the information content, how is it coded, and how can it be extracted? For example, at the word level we try to identify the stem or root and the affixes, and the information contained in them. At the level of local word grouping, we try to identify what information is contained in the proximity of words in a word group. Similarly, at the sentential level, the relationships among the word groups can be identified.

While generating sentences, the information obtained from such an analysis is expressed in the target language. The information obtained at one level in the source language, say by word analysis, might be expressed at another level in the target language, say at the word group or sentence level. At times, there might be no way to express the information without making a total change to the text. The information-centric view brings about a major change in the way we look at linguistic phenomena. This affects the grammars, the frameworks in which we write grammars, etc. It might also serve to redefine the current subdivisions between submodules such as morphology, syntax, etc. The information theoretic view has been discussed in our book (Bharati et al., 1995). It has been shown, for example, why the existing mainstream linguistic theories have not turned out to be very useful in NLP related work even for English.

When we are dealing with Indian languages, there is the additional factor that our languages are free word order, while most of the Western theories seem to be designed for languages in which word order plays an important role. Our own traditional theories, such as Panini's (and possibly Tolkappiyam), fit our languages very well. Happily, they are also designed from the information theoretic viewpoint. All this has already been demonstrated (see Bharati et al., 1995, Chap. 13).

In this paper, therefore, we turn to some other problems in MT that have turned out to be more difficult than earlier believed. Most of these problems pertain to word analysis in a practical system. They pose new challenges which require the immediate attention of linguists and computer scientists, and they have to be solved first, even before computational grammars at the sentence level can be used.



2. Word analyzer

Building a Telugu word analyzer for a practical MT system has turned out to be more difficult than anticipated. A system built using standard rules available from linguists gives a coverage of about 50%. A practical system requires that the coverage be wide: the reader should not have to deal with more than 1% or 2% unknown words in the text, otherwise he will have difficulty in understanding the output. (This requirement might seem very stringent, but unknown words are only one of many problems a reader has to face.) This level of coverage has to be achieved for actual written language, given that the language might contain foreign words (say, from English), words borrowed from other Indian languages, a lack of standard conventions, etc. One factor that makes the task easier is that the reader of the target language text is likely to know the commonly occurring English and Sanskrit words. Therefore, even if the machine does not have the stems or roots of such words in the bilingual dictionary, the reader should be able to understand them. However, as we shall see shortly, there are some problems that require research attention and work.

We now describe some major causes of difficulty for the word analyzer. It should be kept in mind that a word here means a sequence of characters bounded by white space or punctuation marks on both sides.
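This notion of a word can be made concrete as a small tokenizer sketch. The regular expression and the sample string are our own illustration, not part of the system described here:

```python
import re

# A word, in the sense used in this paper: a maximal run of characters
# bounded by white space or punctuation on both sides.
def tokenize(text):
    return re.findall(r"[^\s\.,;:\"'!?()]+", text)

print(tokenize('"vana vumana So" nani'))
# -> ['vana', 'vumana', 'So', 'nani']
```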



2.1. Sandhi

In sandhi two adjacent words are written together without intervening space, possibly undergoing a change in the process. Sandhi poses many problems. First, it might introduce ambiguity, as in AvidakI:

  Avidaku + I ('this', yaha)    -->  AvidakI
  Avidaki + emphatic marker     -->  AvidakI

The above sandhi can be broken in two ways, but how does the machine determine which one is correct? Another problem with sandhi is that the rules typically apply to native words of the language, and not to borrowed or foreign words. Therefore, when actual text is given and the machine is unable to recognise a word, it applies the sandhi breaking rules. If the word happens to be non-native, the application of the rules produces wrong results. As the machine has no way of recognizing native versus non-native words, it has no way of deciding whether the rules are applicable. One research problem that might be of interest is developing rules or an algorithm by which the machine can recognize which words are native and which are non-native.

There is also a performance problem namely that sandhi breaking rules slow down the system (particularly in languages like Sanskrit). This could be addressed by computer scientists.

Research on the general sandhi problem is continuing; however, a solution is unlikely to appear in the near future. Human beings seem to be able to solve this problem by the use of their world knowledge, something the machine does not have.

A practical solution is to pre-edit the given text manually to break the sandhi everywhere so that the presently available system becomes useful. (Here, the machine should be viewed as a child who is trying to learn to read.)
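The ambiguity problem can be sketched in code: inverse sandhi rules propose splits, and a lexicon filters them. The rule encoding and the toy lexicon below are our own; only the word forms come from the AvidakI example above. Note that both readings survive the lexicon check, which is exactly the machine's dilemma:

```python
# Inverse sandhi rules: a surface ending may arise from several
# (underlying ending of left word, right word) combinations.
RULES = [("I", [("u", "I"), ("i", "I")])]

def split_sandhi(word, lexicon):
    """Return every split licensed by the rules and attested in the lexicon."""
    analyses = []
    for surface, sources in RULES:
        if word.endswith(surface):
            stem = word[:-len(surface)]
            for ending, right in sources:
                left = stem + ending
                if left in lexicon and right in lexicon:
                    analyses.append((left, right))
    return analyses

lexicon = {"Avidaku", "Avidaki", "I"}
print(split_sandhi("AvidakI", lexicon))
# -> [('Avidaku', 'I'), ('Avidaki', 'I')]  -- both splits survive
```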



2.2. Wrong breaking of words

In Telugu, sometimes after a sandhi is made, the result is broken at a wrong place. A part of one word then goes with the adjacent word, resulting in two unknown words. Following are two examples:

  Avida kAsta SAMtaMgA AlociMci natlayite tana Barta saMpAdiMcinadAniki
                       ^^^^^^^^^^^^^^^^^^
                       AlociMcinatlu ayite

  iMko enimidi tommideLalopala pilla peLLi kedugutuMdi
                                     ^^^^^^^^^^^^^^^^^
                                     peLLiki edugutuMdi

Sometimes compounds are broken and sandhi is made between parts of two different compounds. For example, 'noka' and 'dAMtoM', which are parts of two different compounds (the former as part of 'jiBa noka', and the latter as part of 'dAMtoM maDya'), are grouped together, thereby creating a difficulty for the reader of the translated text.

  T  : nAlika  monapaLLa    maDya   bigiMci
  @H : jIBa    ^^^^^^^^^^^  maDya   kasa_kara

  T  : nAlika  mona   paLLa   maDya   bigiMci
  @H : jIBa    noka   dAMtoM  maDya   kasa_kara

These examples suggest that there are no standard conventions regarding the placement of spaces. Solutions to this problem are needed; they might take the form of rules that deal with the incorrect breaking of words, or might mean standardization of writing conventions.
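One such rule-based repair can be sketched as follows: when two adjacent tokens are both unknown, rejoin them and try every other break point, allowing an inverse sandhi rule to fire at the new boundary. The single toy rule and the toy lexicon are our own illustration built around the AlociMcinatlu example; a real system would need the full rule set:

```python
# Toy inverse sandhi rule: a surface 'a' at a boundary may come from
# an underlying final 'u' on the left plus initial 'a' on the right.
INV_SANDHI = [("a", "u", "a")]

def rebreak(w1, w2, known):
    """Rejoin two unknown tokens and search for a break point at which
    both halves (after undoing boundary sandhi) are known words."""
    joined = w1 + w2
    out = []
    for i in range(1, len(joined)):
        a, b = joined[:i], joined[i:]
        if a in known and b in known:          # plain re-break
            out.append((a, b))
        for surf, left, right in INV_SANDHI:   # re-break undoing sandhi
            if b.startswith(surf):
                a2, b2 = a + left, right + b[len(surf):]
                if a2 in known and b2 in known:
                    out.append((a2, b2))
    return out

known = {"AlociMcinatlu", "ayite"}
print(rebreak("AlociMci", "natlayite", known))
# -> [('AlociMcinatlu', 'ayite')]
```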



2.3. Grammar for Written Language

The written language has its own grammar, which usually differs from that of the spoken language. Many usages which violate that grammar have crept into the written language. For example, instead of writing a quotation from English as follows:

  "vana vumana So"  ani
  "One woman show"  as

it is written as

  "vana vumana So" nani

where "nani" reflects sandhi between "So" and "ani". However, that sandhi still has an intervening quote mark.



2.4. Spelling Variation

There are many variant spellings for the same word. Some of the spellings reflect dialectal variation; others, however, exist simply because of the lack of standard conventions. For example, all of the following are found in printed texts:

  taruvAta, taravAta, tarvAta, taruAta

Similarly, "cUsAru" has many variants, of which only the last reflects dialectal variation:

  cUsAru, cUSAru, cUcAru, cUsinAru


2.5. Spelling Variation due to Aspiration

Spelling variation also occurs because Telugu does not make a sharp distinction between aspirated and unaspirated sounds and their corresponding written characters:

  t        th
  d        dh

For example, 'artha' is spelt as 'ardha', and 'vIthi' is spelt as 'vIdhi'. This creates unnecessary ambiguities. For example,

  kadhannavAdu
can be a spelling variation of any one of the following:
  kathannavAdu (katha + annavAdu)
  kadannavAdu  (kada + annavAdu)

This ambiguity can only be resolved by looking at the context, which is beyond the capabilities of the present machines.

Such problems suggest that there is a need to standardize the spelling conventions in the language.
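In the meantime, an analyzer can at least enumerate the readings. The sketch below expands every aspiration-confusable dental (t, th, d, dh) into the whole confusion class and intersects the result with a dictionary of attested forms; the confusion class and the two attested forms are taken from the kadhannavAdu example, the rest of the encoding is ours:

```python
from itertools import product

# Confusion class: these characters are freely interchanged in print.
DENTALS = ["t", "th", "d", "dh"]

def variants(word):
    """All spellings reachable by swapping aspirated/unaspirated dentals."""
    segs, i = [], 0
    while i < len(word):
        if word[i:i + 2] in ("th", "dh"):
            segs.append(DENTALS); i += 2
        elif word[i] in ("t", "d"):
            segs.append(DENTALS); i += 1
        else:
            segs.append([word[i]]); i += 1
    return {"".join(p) for p in product(*segs)}

attested = {"kathannavAdu", "kadannavAdu"}
print(sorted(variants("kadhannavAdu") & attested))
# -> ['kadannavAdu', 'kathannavAdu']  -- both readings remain
```

The intersection still contains both candidates, confirming that the ambiguity cannot be resolved at the word level alone.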



3. Algorithm

It has been discussed earlier (Bharati et al., 1995, Chap. 3) how given rules for generating words can be used for analysis by inverting them using a reverse suffix table, etc. However, that method does not handle derivational morphology, spelling variations, etc. The algorithm given below incorporates the earlier method as a part, along with heuristics of various kinds. The new algorithm can give either all possible answers or only the one answer which has the higher probability.

The algorithm has a procedure which takes a given word and obtains the root together with a sequence of possible operations (rules) by which the word can be derived. The procedure uses the reverse suffix tables (of the earlier algorithm) as well as heuristics to do the task quickly. It is applied recursively to take care of derivational morphology. Its results are checked by the main program by applying the sequence of possible operations and verifying that the given word is generated.

The advantage is that heuristics can be incorporated in the morphological rules. For example, rules can be given which apply only if the word is a foreign word, while heuristics are supplied to identify foreign words. The heuristics need not work perfectly; any error in their application is eliminated by the check done by the main procedure.

procedure main
   read word;
   (root, operation_sequence) = get_pattern (word);
   gen_word = generate (root, operation_sequence);
   if (word eq gen_word) { output (root, operation_sequence); }

procedure get_pattern (word)
   for each applicable rule {
      apply rule to word and get root and operation_sequence;
           # Operation sequences correspond to features
           #   such as TAM, gnp, vibhakti, etc.
      if (root is in the root_dictionary) {
        return (root, operation_sequence);
      }
      else { return get_pattern (root); }  # Possibly a derived root.
   }
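The pseudocode can be rendered as a runnable sketch in Python. The rule table, the toy generator, and the single root entry below are illustrative stand-ins (the real Telugu paradigms are far richer, and the recursion for derived roots is omitted); what is faithful to the algorithm is its shape, analyse by inverse rules, then verify by regenerating the word:

```python
# Toy root dictionary and inverse rule table (illustrative only):
# a surface suffix maps to (underlying root ending, feature label).
ROOTS = {"cUDu"}
RULES = [("sAru", ("Du", "past+3pl+hon"))]

def generate(root, ops):
    """Toy generator: re-apply the rule named in the operation sequence."""
    for surf, (ending, label) in RULES:
        if label in ops and root.endswith(ending):
            return root[:-len(ending)] + surf
    return root

def get_pattern(word):
    """Propose a root and operation sequence via the inverse rules."""
    for surf, (ending, label) in RULES:
        if word.endswith(surf):
            root = word[:-len(surf)] + ending
            if root in ROOTS:
                return root, [label]
            # A real system would recurse here for a derived root.
    return None, []

def analyse(word):
    root, ops = get_pattern(word)
    if root and generate(root, ops) == word:   # the check step of 'main'
        return root, ops
    return None

print(analyse("cUsAru"))
# -> ('cUDu', ['past+3pl+hon'])
```

Because get_pattern may use imperfect heuristics, the final generate-and-compare step is what guarantees that only correct analyses are output.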


4. Conclusions

We have tried to share our experience in building a practical morphological analyzer which can process actual published text. It shows that using the normal rules provided by linguists we achieve only about 50% recognition on such text. Part of the problem is that the rules tend to cover native words, while actual text has many words borrowed from other Indian languages, and foreign words from English. There is no easy way to identify whether a given word is native or non-native, and hence it is not clear when the rules should be applied and when not. While more work is needed on this by linguists, an algorithm has been developed which can incorporate heuristics for combining such rules.

There is also a need to standardize writing conventions, eliminate unnecessary spelling variations, and evolve a proper concept of the word.



5. References

Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal. Natural Language Processing: A Paninian Perspective. Prentice-Hall of India, 1995.

V. N. Narayana. Anusaraka: A Device to Overcome the Language Barrier. Ph.D. thesis, Dept. of CSE, I.I.T. Kanpur, 1994.

