1. Automatic Translation or machine translation (MT)
Machine Translation is throwing up many challenges and opening up many opportunities for work. Some of the problems relate to grammars; others pertain to word analysis, bilingual dictionaries, language generation, etc.

The concept of information is central to building MT systems. The question to be asked at every level of language analysis, while analyzing a given language string, is: what is the information content, how is it coded, and how can it be extracted? For example, at the word level we try to identify the stem or the root and the affixes, and the information contained in them. At the level of local word grouping, we try to identify what information is contained in the proximity of words in a word group. Similarly, at the sentential level, the relationships among the word groups, etc., can be identified.
While generating sentences, the information obtained from such an analysis is expressed in the target language. The information obtained at one level in the source language, say by word analysis, might be expressed at another level in the target language, say at the word group or sentence level. At times, there might be no way to express the information without making a total change to the text.

The information-centric view brings about a major change in the way we look at linguistic phenomena. This affects the grammars, the frameworks in which we write grammars, etc. It might also serve to redefine the current subdivisions between various submodules such as morphology, syntax, etc. The information theoretic view has been discussed in our book, where it has been shown, for example, why the existing linguistic theories have not turned out to be very useful for NLP work on English or other languages.
When we are dealing with Indian languages, there is the additional factor that our languages are free word order, while most of the Western theories seem to be designed for languages in which word order plays an important role. Our own traditional theories, such as Panini's (and possibly Tolkappiyam), fit very well for our languages. Happily, they are also designed from the information theoretic viewpoint. All this has already been demonstrated (see Bharati et al., 1995, Chap. 13).
In this paper, therefore, we turn to some other problems in MT that have turned out to be more difficult than earlier believed. Most of these problems pertain to word analysis in a practical system. They pose new challenges which require the immediate attention of linguists and computer scientists, and have to be solved first, even before computational grammars at the sentence level can be used.
2. Word Analyzer Related Problems

Building a Telugu word analyzer for a practical MT system has turned out to be more difficult than anticipated. A practical system requires that the coverage should be wide. The reader should not have to deal with more than 1% or 2% unknown words in the text, otherwise he will have difficulty in understanding the output (because there might be other difficulties he has to face). This level of coverage has to be achieved for actual written language, given that the language might have foreign words (from, say, English), borrowed words from other Indian languages, lack of standard conventions, etc. One of the factors that makes the task easier is that the reader of the target language text is expected to know the commonly occurring English and Sanskrit words. Therefore, even if the machine does not have the stems or roots of such words in the bilingual dictionary, the reader should be able to understand them. However, as we shall see shortly, there are some problems that require research attention and work.

We now describe some major causes of difficulty for the word analyzer. It should be kept in mind that a word here means a sequence of characters separated by white space on both sides.
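To make the coverage requirement concrete, the following sketch (purely illustrative; the function and the 'lexicon' set are hypothetical and not part of any described system) measures the fraction of whitespace-separated words that a stem/root dictionary cannot account for; a practical system would aim to keep this figure at 1% or 2%.

    def unknown_word_rate(text, lexicon):
        """Fraction of words (whitespace-separated tokens) not covered by the lexicon."""
        tokens = text.split()        # a word = a character sequence between spaces
        if not tokens:
            return 0.0
        unknown = [t for t in tokens if t not in lexicon]
        return len(unknown) / len(tokens)

In reality coverage is decided by the morphological analyzer rather than by direct lookup, but the same 1-2% target applies.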
2.1 Sandhi

In sandhi, two adjacent words are written together without an intervening space, possibly undergoing a change at the junction. Sandhi poses many problems. First, it might introduce ambiguity, as in AvidakI:

    Avidaku + I                --> AvidakI
    Avidaku + emphatic marker  --> AvidakI
Sandhi can be broken in two ways, but how does the machine determine which one is correct? Another problem with sandhi is that the rules typically apply to native words of the language, and not to borrowed or foreign words. Therefore, when actual text is given and the machine is unable to recognise a word, it applies the sandhi breaking rules. If the word happens to be non-native, the application of the rules produces wrong results. As the machine has no way of recognizing native versus non-native words, it has no way of deciding whether the rules are applicable. One research problem that might be of interest is developing rules or an algorithm by which the machine can recognize which words are native and which are non-native.
There is also a performance problem, namely that sandhi breaking rules slow down the system (particularly in languages like Sanskrit). This could be addressed by computer scientists. Research on the general sandhi problem is continuing; however, a solution is unlikely to appear in the near future. Human beings seem to be able to solve this problem by the use of their world knowledge, something that the machine does not have. A practical solution is to pre-edit the given text manually to break the sandhi everywhere, so that the presently available system becomes useful. (Here, the machine should be viewed as a child who is trying to learn to read.)
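The following is a minimal sketch of dictionary-validated sandhi splitting, assuming an inverse-rule table and a lexicon of word forms; the rule shown merely restates the AvidakI example above, and the emphatic-marker entry is a stand-in, not an attested rule.

    # Each surface sequence maps to the underlying (end of first word, start of
    # second word) pairs it could have come from; ambiguity shows up as more than
    # one validated split.  Hypothetical rules and lexicon, for illustration only.
    INVERSE_SANDHI_RULES = {
        "kI": [("ku", "I"),        # Avidaku + I --> AvidakI
               ("ku", "E")],       # stand-in for Avidaku + emphatic marker
    }

    def split_sandhi(word, lexicon):
        """Return every lexicon-validated way of breaking `word` into two words."""
        splits = []
        for i in range(len(word)):
            for surface, pairs in INVERSE_SANDHI_RULES.items():
                if word[i:i + len(surface)] == surface:
                    for left_end, right_start in pairs:
                        left = word[:i] + left_end
                        right = right_start + word[i + len(surface):]
                        if left in lexicon and right in lexicon:
                            splits.append((left, right))
        return splits

Such lexicon validation helps, but it cannot by itself decide between genuinely ambiguous splits such as AvidakI, nor does it tell the machine whether a word is native and the rules are applicable at all.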
2.2. Wrong breaking of words
In Telugu, sometimes after sandhi is made, the resulting string is broken at the wrong place. Therefore, part of one word goes with the adjacent word, resulting in two unknown words. For example:
    Avida kAswa SAMwaMgA AlociMci natlayiwe wana Barwa saMpAxiMcinaxAniki iMcumiMcu
        (here "AlociMci natlayiwe" should have been written "AlociMcinatlu ayiwe")

    iMko eVnimixi woVmmixelYalopala pilla peVlYli keVxuguwuMxi
        (here "peVlYli keVxuguwuMxi" should have been written "peVlYliki eVxuguwuMxi")
Sometimes compounds are broken and sandhi is made between parts of the compounds, as with 'noka' and 'dAMtoM', which are parts of two different compounds. These things suggest that there are no standard conventions regarding the placement of spaces. Solutions to this problem are needed; they might take the form of rules that deal with the incorrect breaking of words, or might also mean standardization of writing conventions.
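One repair strategy, sketched below purely as an illustration (the 'lexicon' set and the function name are hypothetical), is to take two adjacent tokens that the analyzer does not recognize, join them, and try the space at every other position, keeping the repositionings in which both halves are known words.

    def repair_split(left, right, lexicon):
        """If left/right are not both known words, try moving the space elsewhere."""
        if left in lexicon and right in lexicon:
            return [(left, right)]              # nothing to repair
        joined = left + right
        candidates = []
        for i in range(1, len(joined)):
            a, b = joined[:i], joined[i:]
            if a in lexicon and b in lexicon:
                candidates.append((a, b))
        return candidates

For an example like "AlociMci natlayiwe" this alone is not enough, since recovering "AlociMcinatlu ayiwe" also requires undoing the sandhi at the new boundary; a fuller solution would combine such respacing with the sandhi breaking rules discussed earlier.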
2.3. Grammar for Written Language
The written language has its own grammar, which usually differs from that of the spoken language. Many usages have crept into the written language which violate that grammar. For example, instead of writing a quotation from English as follows:

    "Van vuman so" ani
    "One woman show" like this

it is written as

    "Van vuman so" nani

where "nani" reflects sandhi between "so" and "ani". Note that the sandhi is made even though a quotation mark intervenes between the two words.
2.4 Spelling Variation
There are many variant spellings for the same word. Some of the spellings reflect dialectal variation; others, however, exist simply because of the lack of standard conventions. For example, all the following are found in printed texts:
taruvAta, taravAwa, taruAwa
Similarly, for "cUsAru" there are many variations where only the
last one reflects dialectal variation:
cUsAru, CUSAru, cUcAru, cUsinAru
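One stop-gap, sketched below purely for illustration, is to fold attested variants onto a canonical spelling before dictionary lookup; the table simply restates the variants cited above and would have to be compiled on a much larger scale (and kept separate from genuinely dialectal forms) to be useful.

    # Hypothetical variant table; canonical forms chosen arbitrarily for the sketch.
    SPELLING_VARIANTS = {
        "taravAwa": "taruvAta",
        "taruAwa":  "taruvAta",
        "CUSAru":   "cUsAru",
        "cUcAru":   "cUsAru",
    }

    def normalize_spelling(word):
        """Map an attested variant spelling to its canonical form, if known."""
        return SPELLING_VARIANTS.get(word, word)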
2.5 Spelling Variation in Telugu
Spelling variation also arises because Telugu does not make a sharp distinction between aspirated and unaspirated sounds and their corresponding written characters:

    t   T
    d   D

Thus, "stAnaM" can also be written as "sTanaM". This, together with typing errors, can mislead readers, as in the following example:

    hahu kaDvinavivALLu

Here a mistyping of "D" instead of "T" was misunderstood by a native speaker as a case of mistyping of "K" instead of "C". The former reading means 'people having not heard the hAhu story' instead of 'people having read the hahu story'. Such issues have to be thought about, and suitable solutions worked out.
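One common device for making lookup robust to this particular confusion, sketched here as an assumption rather than as the system's actual method, is to fold the aspirated characters onto their unaspirated counterparts (in the WX-style notation used above) both in the input word and in the lexicon keys.

    # Fold aspirated characters onto their unaspirated counterparts before lookup.
    ASPIRATION_FOLD = str.maketrans({"T": "t", "D": "d"})

    def fold_aspiration(word):
        return word.translate(ASPIRATION_FOLD)

    # e.g. fold_aspiration("sTanaM") == "stanaM", so a 'T' typed for 't'
    # no longer prevents the word from being found.

Of course, folding throws away a real phonemic distinction in other words, so it is at best a fallback applied when lookup of the unfolded form fails.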
3. Dictionary Related Problems
Indian languages share a large part of their vocabulary as many
words have been derived from Sanskrit. Similarly, many English
words are in currency in all Indian languages. This makes the
task of building an MT system among Indian languages easier.
Even when there is a word in the input text which is not known to the system, the machine can reproduce the same root/stem in the target language, and the reader might still be able to follow the meaning. However, there are certain pitfalls.
3.1. Divergent Meanings
Sometimes the same Sanskrit word is present in Telugu and Hindi but with opposite meanings. For example, "lAMCana", roughly meaning "to mark", is taken in a positive sense in Telugu and in a negative sense in Hindi. Thus, in Telugu it can be used for an "award", whereas in Hindi it can only be used as a "black mark". The original Sanskrit meaning seems closer to the Telugu one. Contrastive studies are needed which list such divergences, so that they can be incorporated in the MT system or in the training material for the MT system.
3.2 English Words
There are different conventions for writing English words in different Indian languages. Whenever an English word occurs in the source text (and is not found in the dictionary of the MT system), presenting it or its stem as it is in the target language might still not help, because the reader of the target text will not be able to follow it owing to spelling variation. For example, the word "cat" would be written as follows:

    Telugu: kyAta
    Hindi:  keta

Work is needed to solve this problem. One solution is to come up with rules which can change the spellings from one language to another. However, the difficulty lies in the machine recognising foreign words in the first place. Other solutions relate to standardization in spelling, development of training material, etc.
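As a very rough sketch of the respelling-rule idea, the fragment below rewrites a Telugu-convention rendering into a Hindi-style one by ordered substring substitution; the single rule is invented purely to mirror the kyAta/keta example, real rules would have to be derived from contrastive data, and the harder problem of first detecting that the word is English remains open.

    # Hypothetical respelling rules (applied in order); not an attested rule set.
    TELUGU_TO_HINDI_RULES = [
        ("yA", "e"),     # illustrative: 'kyAta' -> 'keta'
    ]

    def respell(word, rules=TELUGU_TO_HINDI_RULES):
        for src, tgt in rules:
            word = word.replace(src, tgt)
        return word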
References

Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal, Natural Language Processing: A Paninian Perspective, Prentice-Hall of India, 1995.

V.N. Narayana, Anusaraka: A Device to Overcome the Language Barrier, Ph.D. thesis, Dept. of CSE, I.I.T. Kanpur, 1994.