1. Automatic Translation or machine translation (MT)
Machine Translation is throwing up many challenges and opening up many opportunities for work. Some of the problems relate to grammars; others pertain to word analysis, bilingual dictionaries, language generation, etc.

The concept of information is central to building MT systems. The question to be asked at every level of language analysis, while analyzing a given language string, is: what is the information content, how is it coded, and how can it be extracted? For example, at the word level we try to identify the stem or the root and the affixes, and the information contained in them. At the level of local word grouping, we try to identify what information is contained in the proximity of words in a word group. Similarly, at the sentential level, the relationships among the word groups, etc., can be identified.
While generating sentences, the information obtained from such an analysis is expressed in the target language. The information obtained at one level in the source language, say by word analysis, might be expressed at another level in the target language, say at the word group or sentence level. At times, there might be no way to express the information without making a total change to the text.

The information-centric view brings about a major change in the way we look at linguistic phenomena. This affects the grammars, the frameworks in which we write grammars, etc. It might also serve to redefine the current subdivisions between various submodules such as morphology, syntax, etc. The information theoretic view has been discussed in our book, where it has been shown, for example, why the existing linguistic theories have not turned out to be very useful for NLP work on English or other languages.
When we are dealing with Indian languages, there is the additional factor that our languages are free word order, while most of the Western theories seem to be designed for languages in which word order plays an important role. Our own traditional theories, such as Panini's (and possibly Tolkappiyam), fit very well for our languages. Happily, they are also designed from the information theoretic viewpoint. All this has already been demonstrated (see Bharati et al., 1995, Chap. 13).
In this paper, therefore, we turn to some other problems in MT that have turned out to be more difficult than earlier believed. Most of these problems pertain to word analysis in a practical system. They pose new challenges which require the immediate attention of linguists and computer scientists, and have to be solved first, even before computational grammars at the sentence level can be used.
2. Word Analyzer Related Problems

Building a Telugu word analyzer for a practical MT system has turned out to be more difficult than anticipated. A practical system requires that the coverage should be wide. The reader should not have to deal with more than 1% or 2% unknown words in the text, otherwise he will have difficulty in understanding the output (because there might be other difficulties he has to face). This level of coverage has to be achieved for actual written language, given that the language might have foreign words (from, say, English), borrowed words from other Indian languages, lack of standard conventions, etc. One of the factors that makes the task easier is that the reader of the target language text is expected to know the commonly occurring English and Sanskrit words. Therefore, even if the machine does not have the stems or roots of such words in the bilingual dictionary, the reader should be able to understand them. However, as we shall see shortly, there are some problems that require research attention and work.

We now describe some major causes of difficulty for the word analyzer. It should be kept in mind that a word here means a sequence of characters separated by white space on both sides.
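To make the coverage requirement concrete, the following sketch (purely illustrative; the function and the 'lexicon' set are hypothetical and not part of any described system) measures the fraction of whitespace-separated words that a stem/root dictionary cannot account for; a practical system would aim to keep this figure at 1% or 2%.

    def unknown_word_rate(text, lexicon):
        """Fraction of words (whitespace-separated tokens) not covered by the lexicon."""
        tokens = text.split()        # a word = a character sequence between spaces
        if not tokens:
            return 0.0
        unknown = [t for t in tokens if t not in lexicon]
        return len(unknown) / len(tokens)

In reality coverage is decided by the morphological analyzer rather than by direct lookup, but the same 1-2% target applies.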
2.1 Sandhi

In sandhi, two adjacent words are written together without an intervening space, possibly undergoing a change at the junction. Sandhi poses many problems. First, it might introduce ambiguity, as in AvidakI:

    Avidaku + I                --> AvidakI
    Avidaku + emphatic marker  --> AvidakI
Sandhi can be broken in two ways, but how does the machine determine which one is correct? Another problem with sandhi is that the rules typically apply to native words of the language, and not to borrowed or foreign words. Therefore, when actual text is given and the machine is unable to recognise a word, it applies the sandhi breaking rules. If the word happens to be non-native, the application of the rules produces wrong results. As the machine has no way of recognizing native versus non-native words, it has no way of deciding whether the rules are applicable. One research problem that might be of interest is developing rules or an algorithm by which the machine can recognize which words are native and which are non-native.
There is also a performance problem, namely that sandhi breaking rules slow down the system (particularly in languages like Sanskrit). This could be addressed by computer scientists. Research on the general sandhi problem is continuing; however, a solution is unlikely to appear in the near future. Human beings seem to be able to solve this problem by the use of their world knowledge, something that the machine does not have. A practical solution is to pre-edit the given text manually to break the sandhi everywhere, so that the presently available system becomes useful. (Here, the machine should be viewed as a child who is trying to learn to read.)
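The following is a minimal sketch of dictionary-validated sandhi splitting, assuming an inverse-rule table and a lexicon of word forms; the rule shown merely restates the AvidakI example above, and the emphatic-marker entry is a stand-in, not an attested rule.

    # Each surface sequence maps to the underlying (end of first word, start of
    # second word) pairs it could have come from; ambiguity shows up as more than
    # one validated split.  Hypothetical rules and lexicon, for illustration only.
    INVERSE_SANDHI_RULES = {
        "kI": [("ku", "I"),        # Avidaku + I --> AvidakI
               ("ku", "E")],       # stand-in for Avidaku + emphatic marker
    }

    def split_sandhi(word, lexicon):
        """Return every lexicon-validated way of breaking `word` into two words."""
        splits = []
        for i in range(len(word)):
            for surface, pairs in INVERSE_SANDHI_RULES.items():
                if word[i:i + len(surface)] == surface:
                    for left_end, right_start in pairs:
                        left = word[:i] + left_end
                        right = right_start + word[i + len(surface):]
                        if left in lexicon and right in lexicon:
                            splits.append((left, right))
        return splits

Such lexicon validation helps, but it cannot by itself decide between genuinely ambiguous splits such as AvidakI, nor does it tell the machine whether a word is native and the rules are applicable at all.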
2.2. Wrong breaking of words
In Telugu, sometimes after sandhi is made, the resulting string is broken at the wrong place. Therefore, part of one word goes with the adjacent word, resulting in two unknown words. For example:
    Avida kAswa SAMwaMgA AlociMci natlayiwe wana Barwa saMpAxiMcinaxAniki iMcumiMcu
        (here "AlociMci natlayiwe" should have been written "AlociMcinatlu ayiwe")

    iMko eVnimixi woVmmixelYalopala pilla peVlYli keVxuguwuMxi
        (here "peVlYli keVxuguwuMxi" should have been written "peVlYliki eVxuguwuMxi")
Sometimes compounds are broken and sandhi is made between parts of the compounds, as with 'noka' and 'dAMtoM', which are parts of two different compounds. These things suggest that there are no standard conventions regarding the placement of spaces. Solutions to this problem are needed; they might take the form of rules that deal with the incorrect breaking of words, or might also mean standardization of writing conventions.
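One repair strategy, sketched below purely as an illustration (the 'lexicon' set and the function name are hypothetical), is to take two adjacent tokens that the analyzer does not recognize, join them, and try the space at every other position, keeping the repositionings in which both halves are known words.

    def repair_split(left, right, lexicon):
        """If left/right are not both known words, try moving the space elsewhere."""
        if left in lexicon and right in lexicon:
            return [(left, right)]              # nothing to repair
        joined = left + right
        candidates = []
        for i in range(1, len(joined)):
            a, b = joined[:i], joined[i:]
            if a in lexicon and b in lexicon:
                candidates.append((a, b))
        return candidates

For an example like "AlociMci natlayiwe" this alone is not enough, since recovering "AlociMcinatlu ayiwe" also requires undoing the sandhi at the new boundary; a fuller solution would combine such respacing with the sandhi breaking rules discussed earlier.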
2.3. Grammar for Written Language
The written language has its own grammar, which usually differs from that of the spoken language. Many usages have crept into the written language which violate that grammar. For example, instead of writing a quotation from English as follows:

    "Van vuman so" ani
    "One woman show" like this

it is written as

    "Van vuman so" nani

where "nani" reflects sandhi between "so" and "ani". Note that the sandhi is made even though a quotation mark intervenes between the two words.
2.4 Spelling Variation
There are many variant spellings for the same word. Some of the spellings reflect dialectal variation; others, however, exist simply because of the lack of standard conventions. For example, all the following are found in printed texts:
taruvAta, taravAwa, taruAwa
Similarly, for "cUsAru" there are many variations where only the
last one reflects dialectal variation:
cUsAru, CUSAru, cUcAru, cUsinAru
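One stop-gap, sketched below purely for illustration, is to fold attested variants onto a canonical spelling before dictionary lookup; the table simply restates the variants cited above and would have to be compiled on a much larger scale (and kept separate from genuinely dialectal forms) to be useful.

    # Hypothetical variant table; canonical forms chosen arbitrarily for the sketch.
    SPELLING_VARIANTS = {
        "taravAwa": "taruvAta",
        "taruAwa":  "taruvAta",
        "CUSAru":   "cUsAru",
        "cUcAru":   "cUsAru",
    }

    def normalize_spelling(word):
        """Map an attested variant spelling to its canonical form, if known."""
        return SPELLING_VARIANTS.get(word, word)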
2.5 Spelling Variation in Telugu
Spelling variation also arises because Telugu does not make a sharp distinction between aspirated and unaspirated sounds and their corresponding written characters:

    t   T
    d   D

Thus, "stAnaM" can also be written as "sTanaM". This, together with typing errors, can mislead readers, as in the following example:

    hahu kaDvinavivALLu

Here a mistyping of "D" instead of "T" was misunderstood by a native speaker as a case of mistyping of "K" instead of "C". The former reading means 'people having not heard the hAhu story' instead of 'people having read the hahu story'. Such issues have to be thought about, and suitable solutions worked out.
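One common device for making lookup robust to this particular confusion, sketched here as an assumption rather than as the system's actual method, is to fold the aspirated characters onto their unaspirated counterparts (in the WX-style notation used above) both in the input word and in the lexicon keys.

    # Fold aspirated characters onto their unaspirated counterparts before lookup.
    ASPIRATION_FOLD = str.maketrans({"T": "t", "D": "d"})

    def fold_aspiration(word):
        return word.translate(ASPIRATION_FOLD)

    # e.g. fold_aspiration("sTanaM") == "stanaM", so a 'T' typed for 't'
    # no longer prevents the word from being found.

Of course, folding throws away a real phonemic distinction in other words, so it is at best a fallback applied when lookup of the unfolded form fails.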
3. Dictionary Related Problems
Indian languages share a large part of their vocabulary as many
words have been derived from Sanskrit. Similarly, many English
words are in currency in all Indian languages. This makes the
task of building an MT system among Indian languages easier.
Even when there is a word in the input text which is not known to the system, the machine can reproduce the same root/stem in the target language, and the reader might still be able to follow the meaning. However, there are certain pitfalls.
3.1. Divergent Meanings
Sometimes the same Sanskrit word is present in Telugu and Hindi but with opposite meanings. For example, "lAMCana", roughly meaning "to mark", is taken in a positive sense in Telugu and in a negative sense in Hindi. Thus, in Telugu it can be used for an "award", whereas in Hindi it can only be used as a "black mark". The original Sanskrit meaning seems closer to the Telugu one. Contrastive studies are needed which list such divergences, so that they can be incorporated in the MT system or in the training material for the MT system.
3.2 English Words
There are different conventions for writing English words in different Indian languages. Whenever an English word occurs in the source text (and is not found in the dictionary of the MT system), presenting it or its stem as it is in the target language might still not help, because the reader of the target text will not be able to follow it owing to spelling variation. For example, the word "cat" would be written as follows:

    Telugu: kyAta
    Hindi:  keta

Work is needed to solve this problem. One solution is to come up with rules which can change the spellings from one language to another. However, the difficulty lies in the machine recognising foreign words in the first place. Other solutions relate to standardization in spelling, development of training material, etc.
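As a very rough sketch of the respelling-rule idea, the fragment below rewrites a Telugu-convention rendering into a Hindi-style one by ordered substring substitution; the single rule is invented purely to mirror the kyAta/keta example, real rules would have to be derived from contrastive data, and the harder problem of first detecting that the word is English remains open.

    # Hypothetical respelling rules (applied in order); not an attested rule set.
    TELUGU_TO_HINDI_RULES = [
        ("yA", "e"),     # illustrative: 'kyAta' -> 'keta'
    ]

    def respell(word, rules=TELUGU_TO_HINDI_RULES):
        for src, tgt in rules:
            word = word.replace(src, tgt)
        return word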
References

Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal, Natural Language Processing: A Paninian Perspective, Prentice-Hall of India, 1995.

V.N. Narayana, Anusaraka: A Device to Overcome the Language Barrier, Ph.D. thesis, Dept. of CSE, I.I.T. Kanpur, 1994.