1. Machine translation (MT)
Machine translation is throwing up many challenges and opening up
many opportunities for research. Some of the problems relate to
grammars; others pertain to word analysis, bilingual dictionaries,
language generation, etc. The concept of information is central to
building MT systems. The question to be asked at every level of
analysis of a given language string is: what is the information
content, how is it coded, and how can it be extracted? For example,
at the word level we try to identify the stem or the root and the
affixes, and the information contained in them. At the level of local
word grouping, we try to identify what information is contained in
the proximity of words in a word group. Similarly, at the sentential
level, the relationships among the word groups, etc., can be
identified.
While generating sentences, the information obtained from such an
analysis is expressed in the target language. The information
obtained at one level in the source language, say by word analysis,
might be expressed at another level in the target language, say at
the word group level or the sentence level. At times, there might be
no way to express the information without making a total change to
the text. The information-centric view brings about a major change in
the way we look at linguistic phenomena. This affects the grammars,
the frameworks in which we write grammars, etc. It might also serve
to redefine the current subdivisions between various submodules such
as morphology, syntax, etc. The information-theoretic view has been
discussed in our book (Bharati et al., 1995). It has been shown, for
example, why the existing mainstream linguistic theories have not
turned out to be very useful in NLP-related work, even for English.
When we are dealing with Indian languages, there is the additional
factor that our languages have free word order, while most of the
Western theories seem to be designed for languages in which word
order plays an important role. Our own traditional theories, such as
Panini's (and possibly Tolkappiyam's), fit our languages very well.
Happily, they are also designed from the information-theoretic
viewpoint. All this has already been demonstrated (see Bharati et
al. (1995, Chap. 13)).
In this paper, therefore, we turn to some other problems in MT that
have turned out to be more difficult than earlier believed. Most of
these problems pertain to word analysis in a practical system. They
pose new challenges which require the immediate attention of
linguists and computer scientists, and they have to be solved first,
even before computational grammars at the sentence level can be used.
2. Word Analysis in Telugu
Building a Telugu word analyzer for a practical MT system has turned
out to be more difficult than anticipated. A system built using the
standard rules available from linguists gives a coverage of about
50%. A practical system requires that the coverage be wide: the
reader should not have to deal with more than 1% or 2% unknown words
in the text, otherwise he will have difficulty in understanding the
output. (This requirement might seem very stringent, but unknown
words are only one of many problems a reader has to face.) This level
of coverage has to be achieved for actual written language, given
that the language might contain foreign words (say, from English),
words borrowed from other Indian languages, lack of standard
conventions, etc. One factor that makes the task easier is that the
reader of the target language text is likely to know the commonly
occurring English and Sanskrit words. Therefore, even if the machine
does not have the stems or roots of such words in its bilingual
dictionary, the reader should be able to understand them. However, as
we shall see shortly, there are some problems that still require
research attention.
We now describe some major causes of difficulty for the word
analyzer. It should be kept in mind that a word here means a sequence
of characters separated by white space or punctuation marks on both
sides.

2.1. Sandhi

In sandhi, two adjacent words are written together without
intervening space, possibly undergoing a change in the process.
Sandhi poses many problems. First, it might introduce ambiguity, as
in 'AvidakI':

    Avidaku + I ('this'; Hindi 'yaha')  --> AvidakI
    Avidaki + emphatic marker           --> AvidakI
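
To make the problem concrete, the following sketch shows how a
machine might enumerate candidate sandhi splits and validate them
against a dictionary. The rule set, the dictionary entries, and the
function names are toy assumptions for illustration, not the actual
system.

    # Each rule: (surface ending, replacement for the left part, right part)
    SANDHI_RULES = [
        ("kI", "ku", "I"),       # Avidaku + I        --> AvidakI
        ("kI", "ki", "EMPH"),    # Avidaki + emphatic --> AvidakI
    ]

    DICTIONARY = {"Avidaku", "Avidaki"}   # toy word list

    def split_candidates(word):
        """Return all dictionary-validated ways of undoing sandhi."""
        candidates = []
        for surface, left_repl, right in SANDHI_RULES:
            if word.endswith(surface):
                left = word[:-len(surface)] + left_repl
                if left in DICTIONARY:       # validate against dictionary
                    candidates.append((left, right))
        return candidates

    print(split_candidates("AvidakI"))
    # [('Avidaku', 'I'), ('Avidaki', 'EMPH')]

Both candidates pass the dictionary check, so the dictionary alone
cannot decide between them.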
The above sandhi can be broken in two ways, but how does the machine
determine which one is correct? Another problem with sandhi is that
the rules typically apply to native words of the language, and not to
borrowed or foreign words. Therefore, when actual text is given and
the machine is unable to recognise a word, it applies the
sandhi-breaking rules. If the word happens to be non-native, the
application of the rules produces wrong results. As the machine has
no way of recognizing native versus non-native words, it has no way
of deciding whether the rules are applicable. One research problem
that might be of interest is developing rules or an algorithm by
which the machine can recognize which words are native and which are
non-native.
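
As a rough illustration of what such a heuristic might look like, the
sketch below scores a word by how familiar its character bigrams are
relative to a small seed list of known native words. The seed list,
the threshold, and the test words are toy placeholders; a usable
classifier would need a large lexicon and careful tuning.

    from collections import Counter

    def bigrams(word):
        return [word[i:i+2] for i in range(len(word) - 1)]

    def build_bigram_counts(seed_words):
        counts = Counter()
        for w in seed_words:
            counts.update(bigrams(w))
        return counts

    def looks_native(word, counts, threshold=0.5):
        """Fraction of the word's bigrams attested in the seed lexicon."""
        grams = bigrams(word)
        if not grams:
            return True
        seen = sum(1 for g in grams if counts[g] > 0)
        return seen / len(grams) >= threshold

    # Apply sandhi-breaking rules only to words that look native.
    counts = build_bigram_counts(["Avidaku", "Avidaki", "taruvAta"])
    print(looks_native("AvidakI", counts))    # True: familiar bigrams
    print(looks_native("kampyUtar", counts))  # False: hypothetical
                                              # English borrowing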
There is also a performance problem, namely that sandhi-breaking
rules slow down the system (particularly in languages like Sanskrit).
This could be addressed by computer scientists.
Research on the general sandhi problem is continuing; however, a
solution is unlikely to appear in the near future. Human beings seem
to be able to solve this problem by the use of their world knowledge,
something that the machine does not have. A practical solution is to
pre-edit the given text manually, breaking the sandhi everywhere, so
that the presently available system becomes useful. (Here, the
machine should be viewed as a child who is trying to learn to read.)
2.2. Wrong breaking of words
In Telugu, sometimes after a sandhi is made, it is broken at a wrong
place. As a result, a part of one word goes with the adjacent word,
producing two unknown words. Following are two examples:

    Avida kAsta SAMtaMgA AlociMci natlayite tana Barta saMpAdiMcinadAniki
                         ^^^^^^^^^^^^^^^^^^
                         AlociMcinatlu ayite

    iMko enimidi tommideLalopala pilla peLLi kedugutuMdi
                                       ^^^^^^^^^^^^^^^^^
                                       peLLiki edugutuMdi
Sometimes compounds are broken and sandhi is made between parts of
two different compounds. For example, 'noka' and 'dAMtoM' are parts
of two different compounds (the former of 'jIBa noka', and the latter
of 'dAMtoM maDya'); when the corresponding Telugu parts are written
together, they cannot be analyzed, thereby creating a difficulty for
the reader of the translated text.

    T  : nAlika monapaLLa maDya bigiMci
    @H : jIBa ^^^^^^^^^^^ maDya kasa_kara

    T  : nAlika mona paLLa maDya bigiMci
    @H : jIBa noka dAMtoM maDya kasa_kara
These examples suggest that there are no standard conventions
regarding the placement of spaces. Solutions to this problem are
needed; they might take the form of rules that deal with the
incorrect breaking of words, or of standardization of writing
conventions. One possible rule-based repair is sketched below.
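
Because the fused form may also involve a sandhi change, each
candidate boundary is tried both as-is and with a small set of
inverse-sandhi substitutions. The rules, the dictionary, and the
names below are toy assumptions built only from the two examples
above.

    KNOWN = {"peLLiki", "edugutuMdi", "AlociMcinatlu", "ayite"}

    # (surface at boundary, left ending, right beginning): toy inverse rules
    INVERSE_SANDHI = [
        ("ke", "ki", "e"),   # ...ki + e... may surface as ...ke...
        ("la", "lu", "a"),   # ...lu + a... may surface as ...la...
    ]

    def resegment(left, right):
        """Re-split the concatenation of two unknown adjacent tokens."""
        merged = left + right
        for i in range(1, len(merged)):
            a, b = merged[:i], merged[i:]
            if a in KNOWN and b in KNOWN:              # plain re-split
                return a, b
            for surf, le, rb in INVERSE_SANDHI:        # re-split + sandhi
                if merged[i-1:i+1] == surf:
                    a2, b2 = merged[:i-1] + le, rb + merged[i+1:]
                    if a2 in KNOWN and b2 in KNOWN:
                        return a2, b2
        return None

    print(resegment("peLLi", "kedugutuMdi"))   # ('peLLiki', 'edugutuMdi')
    print(resegment("AlociMci", "natlayite"))  # ('AlociMcinatlu', 'ayite')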
2.3. Grammar for Written Language
The written language has its own grammar, which usually differs from
that of the spoken language. Many usages have crept into the written
language which violate that grammar. For example, instead of writing
a quotation from English as follows:

    "vana vumana So" ani
    "One woman show" as

it is written as

    "vana vumana So" nani

where 'nani' reflects sandhi between 'So' and 'ani'. However, that
sandhi still has an intervening quote mark.
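
One pragmatic way to cope, sketched below, is to treat a quote mark
between a word and a fused particle as transparent: when an unknown
token immediately follows a closing quote, try to undo the sandhi
with the last quoted word. The glide rule, the particle list, and the
function name are invented assumptions for illustration only.

    PARTICLES = {"ani"}   # quotative particle

    def split_quote_sandhi(prev_word, token):
        """E.g. prev 'So' + token 'nani' -> particle 'ani' (glide 'n')."""
        # assumed rule: a vowel-final word takes an 'n' glide before 'ani'
        if prev_word[-1] in "aeiouAEIOU" and token.startswith("n"):
            if token[1:] in PARTICLES:
                return token[1:]
        return None

    print(split_quote_sandhi("So", "nani"))   # 'ani'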
2.4. Spelling Variation
There are many variant spellings for the same word. Some of the
spellings reflect dialectal variation; others, however, exist simply
because of the lack of standard conventions. For example, all of the
following are found in printed texts:

    taruvAta, taravAta, tarvAta, taruAta

Similarly, for 'cUsAru' there are many variants, of which only the
last reflects dialectal variation:

    cUsAru, cUSAru, cUcAru, cUsinAru
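
A common engineering response is to normalize variants to a canonical
spelling before dictionary lookup. The sketch below does this with a
variant table built only from the examples above; a real table would
have to be compiled by linguists.

    CANONICAL = {
        "taravAta": "taruvAta",
        "tarvAta":  "taruvAta",
        "taruAta":  "taruvAta",
        "cUSAru":   "cUsAru",
        "cUcAru":   "cUsAru",
    }

    def normalize(word):
        return CANONICAL.get(word, word)   # unknown words pass through

    print(normalize("tarvAta"))   # taruvAta
    print(normalize("cUcAru"))    # cUsAru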
2.5. Aspirated and Unaspirated Sounds

Spelling variation also occurs because Telugu does not make a sharp
distinction between aspirated and unaspirated sounds and their
corresponding written characters:

    t   th
    d   dh
For example, 'artha' is spelt as 'ardha', and 'vIthi' is spelt as
'vIdhi'. This creates unnecessary ambiguities. For example,

    kadhannavAdu

can be a spelling variant of either of the following:

    kathannavAdu (katha + annavAdu)
    kadannavAdu  (kada + annavAdu)

This ambiguity can only be resolved by looking at the context, which
is beyond the capabilities of present machines. Such problems suggest
that there is a need to standardize the spelling conventions of the
language.
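
Since contextual resolution is out of reach, an analyzer can at least
enumerate the possibilities. The sketch below expands confusable
dental consonants into all candidate spellings and keeps those found
in the dictionary. Following the examples above (artha/ardha, and
kadha read as katha or kada), t, th, d, and dh are treated here as
one confusion class; this over-generates, and a real system would be
more restrictive. The dictionary entries come from the example above.

    CONFUSABLE = ("th", "dh", "t", "d")   # longest first for matching

    DICTIONARY = {"kathannavAdu", "kadannavAdu"}

    def variants(word):
        """All spellings obtained by swapping the confusable consonants."""
        results = [""]
        i = 0
        while i < len(word):
            for c in CONFUSABLE:
                if word.startswith(c, i):
                    opts, i = CONFUSABLE, i + len(c)
                    break
            else:
                opts, i = (word[i],), i + 1
            results = [r + o for r in results for o in opts]
        return results

    def analyses(word):
        return [v for v in variants(word) if v in DICTIONARY]

    print(analyses("kadhannavAdu"))
    # ['kathannavAdu', 'kadannavAdu'] -- both survive; context must decide.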
3. An Algorithm for Word Analysis
It has been discussed earlier (Bharati et al., 1995, Chap. 3) how
given rules for generating words can be used for analysis, by
inverting them using a reverse suffix table, etc. However, that
method does not handle derivational morphology, spelling variations,
etc. The algorithm given below incorporates the earlier method as a
component, along with heuristics of various kinds. The new algorithm
can give either all possible answers, or only the one answer which
has the highest probability.
The algorithm has a procedure which takes a given word and obtains
the root, together with the sequence of possible operations (rules)
using which the word can be derived. The procedure uses the reverse
suffix tables (of the earlier algorithm) as well as heuristics to do
the task quickly. It is applied recursively to take care of
derivational morphology. The results of this procedure are checked
by the main program, which applies the sequence of possible
operations and verifies that the given word is generated.
The advantage is that heuristics can be incorporated into the
morphological rules. For example, rules can be given which apply only
if the word is a foreign word, with heuristics supplied to identify
such foreign words. The heuristics need not work perfectly; any error
in their application is eliminated by the checking done by the main
procedure.
procedure main
    read word;
    (root, operation_sequence) = get_pattern (word);
    gen_word = generate (root, operation_sequence);
    # Accept the analysis only if it regenerates the given word.
    if (word eq gen_word) { output (root, operation_sequence); }

procedure get_pattern (word)
    for each applicable rule {
        apply rule to word, giving root and operation_sequence;
        # Operation sequences correspond to features
        # such as TAM, gnp, vibhakti, etc.
        if (root is in the root_dictionary) {
            return (root, operation_sequence);
        }
        else {
            # Possibly a derived root: analyze it recursively and
            # prepend its operations to the current sequence.
            (base, base_ops) = get_pattern (root);
            if (base was found) {
                return (base, base_ops + operation_sequence);
            }
        }
    }
    return failure;
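
For concreteness, here is a small runnable rendering of the same
analyze-then-verify idea in Python. The rules, roots, and feature
names are toy assumptions for illustration, not the actual system's
data.

    # Reverse-suffix rules: (surface suffix, replacement, features)
    RULES = [
        ("sAru", "yu", {"tense": "past", "gnp": "3pl-hon"}),
        ("iMci", "iMcu", {"form": "adverbial"}),
    ]

    ROOTS = {"cUyu", "AlociMcu"}   # toy root dictionary

    def get_pattern(word):
        """Return (root, operations) or None, applying rules recursively."""
        for suffix, repl, feats in RULES:
            if word.endswith(suffix):
                stem = word[:-len(suffix)] + repl
                if stem in ROOTS:
                    return stem, [feats]
                deeper = get_pattern(stem)      # possibly a derived stem
                if deeper:
                    root, ops = deeper
                    return root, ops + [feats]
        return None

    def generate(root, ops):
        """Re-apply the operations to regenerate the surface word."""
        word = root
        for feats in ops:
            for suffix, repl, f in RULES:
                if f == feats and word.endswith(repl):
                    word = word[:-len(repl)] + suffix
                    break
        return word

    def analyze(word):
        result = get_pattern(word)
        if result and generate(*result) == word:  # verify by regeneration
            return result
        return None

    print(analyze("cUsAru"))
    # ('cUyu', [{'tense': 'past', 'gnp': '3pl-hon'}])

The heuristics mentioned above would slot into get_pattern as extra
guards on the rules; the regeneration check in analyze is what makes
imperfect heuristics safe.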
4. Conclusions
We have tried to share our experience in building a practical
morphological analyzer which can process actual published text. It
shows that by using the normal rules provided by linguists, we
achieve only about 50% recognition on such text. Part of the problem
is that the rules tend to cover native words, while actual text
contains many words borrowed from other Indian languages, as well as
foreign words from English. There is no easy way to identify whether
a given word is native or non-native, and hence it is not clear when
the rules should be applied and when not. While more work is needed
on this by linguists, an algorithm has been developed which can
incorporate heuristics for combining such rules.

There is also a need to standardize writing conventions, to eliminate
unnecessary spelling variations, and to evolve a proper concept of
the word.
References
Akshar Bharati, Vineet Chaitanya, and Rajeev Sangal. Natural Language
Processing: A Paninian Perspective. Prentice-Hall of India, 1995.

V.N. Narayana. Anusaraka: A Device to Overcome the Language Barrier.
Ph.D. thesis, Dept. of CSE, I.I.T. Kanpur, 1994.