It hardly needs to be stressed that it is important to provide
knowledge and information in electronic form in South Asian
(SA) languages. This task requires the development of software
for searching texts, script conversion, dictionaries, spelling
checkers, multi-lingual access software, etc., and of course,
a rich collection of texts in electronic form. All this can be
called the infrastructure for our languages.
There are a number of problems which need to be addressed.
- Very few word processors follow any standard coding scheme when
entering texts in SA languages. This renders the texts unusable across
platforms. Even if another user has the right platform, usually all he
can do is view the text; normally he cannot even annotate it using the
keyboard. While the long-term solution is for everybody to follow the
ACII standard, in the short term there is a need to develop code
converters rapidly. This task has been automated to a large extent
for Devanagari; the same should be done for other scripts.
- The technical feasibility of multi-lingual access software has been
demonstrated (though machine translation technology is still far away).
Anusaaraka systems for accessing texts in five SA languages are under
development, and alpha versions of some have been released. This task
can be taken up on a wider scale covering all SA languages. The systems
already built can also be refined further.
- Electronic texts and resources such as dictionaries, thesauri, and
lexical databases are urgently needed. These can be prepared by the
collective effort of a large number of people.
In this paper, we argue that the SA language infrastructure can
best be developed through a large cooperative effort. The GPL
"free" software model is best suited for this development because
the source code is open, and a license is given to all to refine and
redistribute it. All the anusaaraka systems (developed under funding
from DOE) are available as GPL free software with source code, for
everyone to use and contribute to. As another cooperative effort,
a free version of an online English to Hindi dictionary is expected
to be available shortly.
1. INTRODUCTION
The internet and other means of distributing and accessing electronic
texts are growing, and considerable resources have been built up for
English. There is a need to build similar resources related to South
Asian (SA) languages. A number of basic functionalities are needed:
for example, a free font with keyboard support for every SA script,
the ability to perform search over the net, a script conversion
facility (which is particularly useful for similar languages written in
different scripts), dictionaries for various language pairs, spelling
checkers, machine translation software, etc. Of course, there can also
be organized efforts to put large amounts of text into electronic form.
In this paper, we argue that computer software for the above should
be developed and made available as "free" software. Similarly, there
can be a voluntary effort for language-related resources such as texts,
dictionaries, etc. All this can form the infrastructure for our
languages. It would be available to users and developers alike, and
some developers can also provide value-added services at a price.
2. NEED FOR STANDARDS
Texts in English language are widely available over the web. These
texts can be viewed by anyone anywhere, independent of the computer
hardware, the operating system, or the word processor being used.
They can also be searched, or passed through an English processing
program such as a parser or machine translation software. In other
words, they are stored, viewed, and processed as texts. This is so
because everybody uses a common standard, namely ASCII.
There are a number of web sites storing texts in Indian languages,
such as Indian language newspapers and magazines; however, the texts
are available only for viewing. They cannot be processed as text. This
means that the following kinds of things can either not be done easily
or not at all: search for words or phrases, script conversion (in
case someone wants to read a text in a language whose script he does
not know, e.g., Urdu, Punjabi, or Bengali for a Hindi-knowing person),
dictionary lookup while reading the text electronically, running
of machine translation software to access text in another language,
etc. In fact, one cannot even use the keyboard to make annotations
to the text.
The simple answer here is to use a standard coding scheme for
SA languages. If everybody keeps their texts in the standard, the
problems outlined above would disappear. One would be able to search
or perform other operations on the texts. This is certainly the most
desirable solution. One such standard already exists: ACII (Alphabetic
Code for Information Interchange), and it can be followed.
For a short discussion on scripts for our languages, see Appendix A.
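As a minimal, purely illustrative sketch (the byte values below are invented and do not come from any actual font), the following shows why glyph-coded text defeats ordinary search while text in a common standard does not:

    # Purely illustrative: the byte values are invented, not from any real font.
    word_in_font_A = bytes([0xB3, 0xCC, 0xD1])   # a word in font A's glyph codes
    word_in_font_B = bytes([0x47, 0x92, 0x65])   # the same word in font B's codes
    query          = bytes([0xB3, 0xCC, 0xD1])   # a query typed with font A's codes

    print(query in word_in_font_A)   # True  -- matches only within the same font
    print(query in word_in_font_B)   # False -- same word, different private codes

    # With one common character standard, every text and every query uses the
    # same code for the same character, so search works across documents.
    text_1, text_2 = "kamalA khilA", "kamalA"
    print("kamalA" in text_1 and "kamalA" in text_2)   # True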
3. CONVERTING AMONG CODING SCHEMES
Clearly, while the long-term solution is to use the standard coding
scheme for storing texts, it is difficult to get users to change
over to it immediately. There is a need, therefore, to also come up
with alternative short-term solutions. One effective answer is to make
a conversion facility available.
If a text is not in the ACII coding scheme but is stored as a
sequence of glyph codes, then one way to use it (for all the purposes
mentioned earlier) is to first convert it to the ACII coding scheme.
Therefore, one immediate answer to be able to use texts in various
coding schemes is to develop technology for making converters rapidly.
Development of a new converter depends on what information is
available regarding glyph codes. Sometimes a picture is available
(on a printout or on the screen) for each glyph code (called the
glyph table). By looking at a picture, one can frequently identify
what part of a character it would be used for. But there are cases
where such judgements are not very straightforward: (1) Sometimes, a
glyph can form a part of many different characters, and it is not easy
to anticipate all these uses. (2) At times, it may not be clear at all
what characters the glyph is part of. For example, for a glyph
standing for "a white space with smaller width", it may not be clear
what characters it is used with.
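As a sketch of the idea (the glyph byte values and target letters below are hypothetical; a real converter needs a much larger table and a fuller glyph grammar), a table-driven converter reduces to a mapping plus a few reordering rules:

    # Minimal sketch of a table-driven glyph-to-character-code converter.
    # All glyph byte values below are hypothetical; every font defines its own.

    GLYPH_TO_CHAR = {
        0x91: "p",    # full consonant glyph -> p (with inherent 'a')
        0x92: "p_",   # half consonant glyph -> p + halant
        0x93: "r",    # full consonant glyph -> r
        0x21: "i",    # i-matra glyph (printed BEFORE the consonant it belongs to)
    }

    def convert(glyph_bytes):
        chars = [GLYPH_TO_CHAR.get(b, "?") for b in glyph_bytes]
        # One simple "glyph grammar" rule: on paper the i-matra glyph comes
        # before its consonant, but in the character code it follows it.
        out, i = [], 0
        while i < len(chars):
            if chars[i] == "i" and i + 1 < len(chars):
                out.extend([chars[i + 1], "i"])   # consonant first, then vowel
                i += 2
            else:
                out.append(chars[i])
                i += 1
        return " ".join(out)

    print(convert(bytes([0x92, 0x93])))   # p_ r
    print(convert(bytes([0x21, 0x91])))   # p i   (reordered)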
Sometimes, the glyph table is not easily available (particularly for
word processors based on DOS), but a printout of a given text is. In
this situation, the first few pages of the text can be separately typed
in ACII, and the resulting file compared with the unknown glyph file.
A learning program (Bharati, 1998b) can then be used which, by looking
at the two files, works out the codes of the different glyphs, and then
with some user help generates converters between the glyphs and ACII
(ISCII). Such a program has possible glyph grammars for the given
script built into it, to facilitate its learning task.
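The following toy sketch illustrates only the underlying idea of such a learning program, under the strong assumption that the two files contain the same words in the same order with a position-by-position correspondence; it is not the actual program of (Bharati, 1998b):

    # Toy sketch of learning glyph codes from a small parallel sample: a few
    # words typed once in the standard coding and once left in the unknown
    # glyph coding.  A real tool needs glyph grammars and user help; here we
    # only count co-occurrences and keep the most frequent pairing.
    from collections import Counter, defaultdict

    def learn_glyph_map(glyph_words, standard_words):
        votes = defaultdict(Counter)
        for g_word, s_word in zip(glyph_words, standard_words):
            if len(g_word) != len(s_word):
                continue          # naive: assume position-by-position match
            for g, s in zip(g_word, s_word):
                votes[g][s] += 1
        return {g: c.most_common(1)[0][0] for g, c in votes.items()}

    # Hypothetical data: glyph codes as bytes, standard text as characters.
    glyph_words    = [bytes([0x91, 0x93]), bytes([0x93, 0x91])]
    standard_words = ["pr", "rp"]
    print(learn_glyph_map(glyph_words, standard_words))   # {145: 'p', 147: 'r'}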
If the glyph table is available, the grammar is useful even when
no training text in ACII is available. The codes for glyphs can
be specified manually by looking at the glyph table, and partial
converters can still be built using the grammar. They can be manually
refined over a period of time.
4. INDIAN SCRIPTS UNDER X-WINDOWS
There has been no proper support for Indian scripts under X-windows
(in Unix). Now that X-windows (xterm) provides support for TrueType
fonts, and such fonts for Indian scripts are available, the display
part of the problem is effectively solved. To handle keyboard support
and interactivity, Expect can be used. Expect is an extremely powerful
piece of free software that sits between the user and a program, say
X-windows in this case. It allows user interaction to be handled (and
what the user has typed to be manipulated appropriately) independent
of the program at the other end. Earlier, manipulation of such
interaction required special knowledge of OS programming. An
implementation based on Expect, supporting Indian scripts, is described
briefly below.
The basic principle is straightforward. Two programs sit respectively
between (i) the keyboard and the xterm, and (ii) the xterm and the
display monitor. This means that any program running under the xterm
is taken care of. Pictorially, this can be shown as:
            .------------.
            |  X-Windows |
            |  (xterm)   |
            .------------.
              ^        \
             /          v
        .------.      .------.
        |  E1  |      |  E2  |
        .------.      .------.
           ^              \
          /                v
      Keyboard          Display
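The original implementation uses Expect. As a rough analogue, Python's pty module allows the same kind of interposition: one hook filters keystrokes (the role of E1) and another filters what the program writes to the screen (the role of E2). The key-to-code mapping below is a placeholder, not the actual ACII table:

    # Rough analogue of the E1/E2 arrangement using Python's pty module
    # (Unix only).  stdin_read plays the role of E1 (keystrokes -> codes);
    # master_read plays the role of E2 (codes -> display form).
    import os, pty

    KEY_TO_CODE = {ord("k"): 0xB3, ord("a"): 0xA4}   # hypothetical codes

    def e1_stdin_read(fd):
        data = os.read(fd, 1024)
        return bytes(KEY_TO_CODE.get(b, b) for b in data)

    def e2_master_read(fd):
        data = os.read(fd, 1024)
        # A real E2 would reorder the i-matra, form conjuncts, etc., here.
        return data

    if __name__ == "__main__":
        # Any program started from this shell is automatically covered,
        # as in the picture above.
        pty.spawn(["/bin/sh"], e2_master_read, e1_stdin_read)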
E1 is a program written in Expect which generates ACII codes based on
the keys pressed. It is straightforward to set it up for different
keyboard layouts. Different layouts for Indian characters might be
preferred by different people: those who know the Roman keyboard might
prefer the wx-layout (Bharati, 1995; Appendix B), while others might
prefer the Inscript layout, etc.
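Supporting another layout is then just a matter of swapping the key-to-code table. The fragment below shows the idea with two invented layouts; the actual wx and Inscript assignments are not reproduced here:

    # Two invented layouts mapping to the same internal codes; the real wx
    # and Inscript tables are not reproduced here.
    LAYOUTS = {
        "wx":       {"k": 0xB3, "a": 0xA4, "m": 0xCC},
        "inscript": {"d": 0xB3, "k": 0xA4, "c": 0xCC},
    }

    def make_e1(layout_name):
        table = LAYOUTS[layout_name]
        return lambda keys: [table.get(k, ord(k)) for k in keys]

    e1_wx = make_e1("wx")
    print(e1_wx("ka"))   # [179, 164] -- the codes do not depend on the layout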
E2 takes ACII codes and displays characters, correctly changing
the order of the i-matra (ि), handling conjunct clusters, etc. One of
the problems that needed to be solved was how to handle the display
when the entry of a new character alters the display of the previous
few characters. For example, in the case of consonant clusters such as
'p_r' (प्र), when 'r' (र) is typed after 'p_' (प्), it changes the shape
of the character already on display, namely 'p' (प). This was handled
by keeping a temporary buffer for the current akshara, which gets
updated while the user is typing and is written out after the
word/syllable is over. A number of alternative solutions are being
tried to find the one users like most.
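A minimal sketch of this akshara buffer is given below, using Unicode Devanagari for the display form (the actual system works with ACII codes and its own fonts): characters accumulate until the syllable is complete, and the whole akshara is redrawn, so typing 'r' after 'p_' replaces the half form already shown:

    # Sketch of buffering the current akshara so that a newly typed character
    # can change the shape of what is already on display (p_ + r -> pra).
    VIRAMA = "\u094d"                                  # halant
    CONSONANTS = {"p": "\u092a", "r": "\u0930"}        # प, र
    VOWELS = {"a": "", "A": "\u093e"}                  # inherent a, aa-matra

    def type_word(keys):
        akshara, display = [], []
        for k in keys:
            if k == "_":
                akshara.append(VIRAMA)                 # consonant is a half form
            elif k in CONSONANTS:
                akshara.append(CONSONANTS[k])
            elif k in VOWELS:
                akshara.append(VOWELS[k])
                display.append("".join(akshara))       # a vowel ends the akshara
                akshara = []
            # in the real system the current akshara is redrawn on every key
        if akshara:
            display.append("".join(akshara))
        return "".join(display)

    print(type_word("p_ra"))   # प्र -- 'r' changed the shape of the half 'p'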
Similarly, another problem pertains to cursor movement and cursor
positioning, which causes extra white space to appear in the display
after every character is typed. The current answer is to refresh the
display line repeatedly. (Suggestions regarding why this happens in
xterm under X-windows, and how to overcome it, will be appreciated.)
5. MULTI-LINGUAL ACCESS IN SOUTH ASIA
South Asia (SA) is rich in languages. The internet and other electronic
technologies can be used to distribute and access electronic texts,
containing knowledge and information. The problem is that unless
the reader knows the language, he cannot use this knowledge and
information. Full-fledged machine translation systems are beyond the
reach of current technology. Therefore, software tools which allow a
user to access and understand texts in other languages, and which are
within reach, should be developed. Fortunately, languages in the SA
region are close to each other in grammar and vocabulary, so it is
comparatively easier to build such tools.
Anusaaraka systems (Bharati, 1998), (Narayana, 1994) allow a user to
access texts in another language with some effort. They produce an
'image' of the source text, preserving information in it. The output is
not grammatical in the target language because the output follows the
grammar of the source language. (As the grammars of the two languages
are similar for most of the constructions, this is noticed only when
a construction is different from those in the target language.)
Therefore, a certain amount of training is needed to understand the
output. This training would include the notation used in anusaaraka
output, as well as differences between the grammars of the two
languages. The advantage
is that the system is not specific to any subject domain. It is also
robust. With a small amount of training, the user can learn to read
any Indian language text through anusaaraka.
Anusaaraka systems are under development from Telugu, Kannada, Marathi,
Bengali and Punjabi to Hindi. An alpha version of the Telugu-Hindi
system has been released for experimentation and development. These systems
have been built by IIT Kanpur and University of Hyderabad under
funding from Department of Electronics, Government of India. Now,
Satyam Computers is also contributing to this activity in a major way.
As part of the anusaaraka systems, many resources have been developed
which can be used as by-products. These include morphological analyzers
for each of the five languages, bilingual electronic dictionaries
from these languages to Hindi, a generator for Hindi, etc. Further work
is continuing in refining each one of these. (See Appendix B for a
brief description.)
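To give a flavour of what such a resource does, a toy morphological analyzer can split a word into root and suffix by matching against a small suffix table; the roots and suffixes below are invented placeholders, not taken from the actual analyzers:

    # Toy morphological analyzer: split a word into root + suffix by matching
    # against a small suffix table.  Roots and suffixes are invented
    # placeholders, not taken from the actual anusaaraka analyzers.
    SUFFIXES = ["gaLu", "annu", "ige", ""]    # hypothetical, longest first
    ROOTS = {"mane", "huDuga"}                # hypothetical root lexicon

    def analyze(word):
        for suf in SUFFIXES:
            if not word.endswith(suf):
                continue
            root = word[: len(word) - len(suf)] if suf else word
            if root in ROOTS:
                return root, suf
        return None

    print(analyze("manegaLu"))   # ('mane', 'gaLu')
    print(analyze("maneige"))    # ('mane', 'ige')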
6. FREE SOFTWARE MODEL
Development and refinement of many of these resources requires
sustained effort by many people for a long time. There is a great
potential need for these resources, but the paying capacity of users is
limited. Many of these resources are part of the infrastructure for
our languages which is needed in the long run. Therefore, it is best
to adopt the co-operative model for developing these resources. The
free software model of GNU (with the General Public License, or GPL)
fits best. It insists on the source code being given with the software,
together with the license that the recipient is allowed to modify and
redistribute it. The only condition is that the redistribution must
carry the same license with the new modified source code, so that the
system remains free.
In the case of language software, source code includes not only the
computer programs but also language data. Thus, the GNU model ensures
that the computer programs and language data continue to be open,
and keep getting refined in a co-operative mode.
The anusaaraka systems have been made "free" with the permission of
the funding agency, and are available off the internet for free
download under GPL. Users are invited to join in the development
activity. Similarly, an online English to Hindi dictionary is being
prepared with the cooperative effort of tens of people. This will also
be available freely under GPL for use by users and developers.
7. CONCLUSIONS
In this paper, we have discussed some issues that need to be
addressed so that South Asian (SA) languages become more easily
available for mass use. These issues pertain to the use of a standard
coding scheme (ACII/ISCII) for texts, the development of
anusaaraka-like multi-lingual access software, and the availability of
language databases and texts. These can be called the infrastructure
for SA languages. This infrastructure can develop only with the
cooperative effort of a large number of people, and the best method for
developing it and putting it to mass use is the GPL free software
model. Anusaaraka systems themselves are available under the free
software license (GPL), and some other work has been started under the
same license. Everybody is invited to join in the cooperative
enterprise of building the infrastructure for SA languages.
ACKNOWLEDGEMENTS
The research reported here has been supported by the Department of
Electronics, Government of India, as part of the Anusaaraka project
under the Technology Development for Indian Languages programme.
Earlier, the authors (VC, APK, RS) were at the IIT Kanpur Centre for
NLP at Hyderabad. The anusaaraka activity is being carried out jointly
by IIT Kanpur and the University of Hyderabad. Satyam Computers has now
joined this activity and is also providing support for it.
REFERENCES
- Bharati, Akshar, Vineet Chaitanya, and Rajeev Sangal, Natural
Language Processing: A Paninian Perspective, Prentice-Hall of
India, New Delhi, 1995.
- Bharati, Akshar, Vineet Chaitanya, Amba P. Kulkarni, Rajeev Sangal,
and G Uma Maheshwar Rao, Anusaaraka: Overcoming the Language Barrier
in India, In Anuvad: Approaches to Translation, Rukmini Bhaya Nair
(editor), Katha, New Delhi, 1998 (forthcoming).
- Bharati, Akshar, Nisha Sangal, Vineet Chaitanya, Amba P Kulkarni,
and Rajeev Sangal, Generating Converters between Fonts Semi-
automatically, In Proc. of SAARC conference on Multi-lingual and
Multi-media Information Technology, CDAC, Pune, 1-4 Sept. 1998b.
- Narayana, V.N., Anusaraka: A Device to Overcome the Language Barrier
in India, Ph.D. thesis, Dept. of Computer Sc. and Engg., I.I.T. Kanpur,
1994.
Appendix A: SYLLABIC VS. ALPHABETIC NOTATION
While addressing questions related to the display, keyboarding, and
processing of text, it is important to understand that even though our
scripts are syllabic in nature, the syllables (aksharas) are
constructed out of basic vowels and consonants (varnas). Thus, the
notation should be called compositional syllabic, to differentiate it
from some other scripts of the world, say Chinese, where each word is
written as a different picture (not composed of anything more basic).
For example:
    (syllabic notation)              (alphabetic notation)
    क  म  ला              =          क् + अ + म् + अ + ल् + आ
    ka ma lA                         k    a   m    a   l    A

                  (alphabetic)                 (mixed)          (syllabic)
    सिद्ध:    स् + इ + द् + ध् + अ     =     स + ि + द्ध     =     सि + द्ध
    siddha    s    i   d    dh   a           s   i   ddh         si   ddh
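With Unicode Devanagari, the same composition can be carried out mechanically: the alphabetic sequence of varnas is collapsed into aksharas by dropping the explicit 'a' after a consonant and inserting the halant between conjunct consonants. The sketch below covers only the letters needed for the two example words; it is not a full input method:

    # Sketch: compose Devanagari aksharas from an alphabetic varna sequence.
    # Only the letters needed for the two example words are covered.
    VIRAMA = "\u094d"
    CONS = {"k": "\u0915", "m": "\u092e", "l": "\u0932",
            "s": "\u0938", "d": "\u0926", "dh": "\u0927"}
    MATRA = {"a": "", "A": "\u093e", "i": "\u093f"}    # inherent a has no sign

    def compose(varnas):
        out, pending = [], None        # pending consonant awaiting its vowel
        for v in varnas:
            if v in CONS:
                if pending is not None:
                    out.append(pending + VIRAMA)       # conjunct: add halant
                pending = CONS[v]
            else:
                out.append((pending or "") + MATRA[v])
                pending = None
        if pending is not None:
            out.append(pending)
        return "".join(out)

    print(compose(["k", "a", "m", "a", "l", "A"]))   # कमला
    print(compose(["s", "i", "d", "dh", "a"]))       # सिद्ध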
Thus, on the one hand there are all the advantages of the alphabetic
notation (such as linearity, a small alphabet, and easy learnability);
on the other hand, the syllabic notation is more compact to write and
easier to read (after slightly longer training). In fact, the
difficulty with a mechanical typewriter arises only because we set our
goal higher: syllabic printing. The alphabetic notation was always
available and posed no problem as far as typewriting was concerned.
Why it was never adopted for typewriting is something about which one
can only speculate. (One reason perhaps was the general environment
prevailing in India, which was not conducive to flexibility,
experiments and new ideas. Another reason might be that the best is at
times the enemy of the good: the "best" (the syllabic script) could not
be produced satisfactorily with mechanical technology, and the "good"
(the alphabetic notation) was not accepted because it was seen as
inferior to the former, to which everyone was accustomed from
handwritten text. These are speculative thoughts.)
The breakthrough with electronic technology is that it allows one to
combine the two: input or keyboarding can be in pure alphabetic
notation, or in mixed syllabic notation where the conjunct consonants
are naturally typed separately by the use of the halant mark, while the
display can be in the customary syllabic script. Thus, it is possible
to get the advantages of both.
The question of the form in which the machine should store the text
internally (in syllabic or alphabetic notation) for Indian languages
has been reasonably answered through the ISCII standard. While there is
always some scope for improvement, the difficulty we are facing today
is entirely different: people are storing texts neither alphabetically
nor syllabically, but in the form of glyphs! And there, too, each
person has a coding scheme of his own! Clearly, the situation is
unacceptable and needs strong corrective action.
Appendix B: ANUSAARAKA SYSTEM
B.1 ANUSAARAKA APPROACH
Machine translation systems are extremely difficult to build.
Translation is a creative process in which the translator has to
interpret the text, something which is very hard for the machine to
do. In spite of the difficulty of MT, the anusaaraka can be used to
overcome the language barrier in India today. Anusaaraka systems among
Indian languages are designed by noting the following two key
features:
1. In the anusaaraka approach, the load between the reader and the
machine is divided in such a way that the aspects which are difficult
for the reader are handled by the machine, and aspects which are easy
for the reader are left to him. Specifically, the reader would have
difficulty learning the vocabulary of the language, while he would be
good at using the general background knowledge needed to interpret any
text. The machine, on the other hand, is good at "memorising" an entire
dictionary, grammar rules, etc., but poor at using background
knowledge. Thus, the work is divided, in which the language-based
analysis of the text is carried out by the machine, and
knowledge-based analysis or interpretation is left to the reader.
2. Among Indian languages, which share vocabulary, grammar, pragmatics,
etc., the task is easier. For example, words in a language are, in
general, ambiguous, but if the languages are close to each other, one
is likely to find a one-to-one correspondence between words, where the
meaning is carried across from the source language to the target
language. For example, for 80 percent of the Kannada words in the
current anusaaraka dictionary of 30,000 root words, there is a single
equivalent Hindi word which covers the senses of the original Kannada
word.
In the anusaaraka approach, the reader is given an image of the source
text in the target language by faithfully representing whatever is
actually contained in the source language text. So the task boils
down to presenting the information to the user in an appropriate form.
We relax the requirement that the output in the target language
should be grammatical. The emphasis shifts to comprehensibility. The
answer is to deviate from the target language in a systematic manner.
First, new notation is invented and incorporated. For example, Hindi
has the post-position marker 'ko', which functions both as an
accusative marker and as a dative marker. We distinguish between the
two by putting a diacritic mark (a backquote). Thus, existing words in
the target language may be given wider or narrower meaning.
Second, we may relax some of the conditions in the target language. For
example, we give up agreement in our "dialect" of the target language.
The principle behind the systematic deviations is simple: the output
follows the grammar of the source language. In the case of agreement,
to state it more precisely, the output follows the agreement rules
of the source language, therefore, the output in the target language
appears to be without agreement. Some of the constructions of the source
language may also get introduced in the target language. (Actually,
as the constructions are largely common across the two languages,
a new construction is noticed only when the source language has a
construction which is somewhat different from the target language.)
Sometimes, language bridges might be built for constructions in the
source language which are not there in the target language: a different
construction which can express the same information in the target
language is chosen, with some additional notation if necessary. For
example, adjectival participial phrases in the South Indian languages
are mapped to relative clauses with the 'jo*' notation.
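Putting these pieces together, a toy word-for-word rendering in the anusaaraka spirit might look like the sketch below: the source word order is preserved, each root is replaced by its dictionary equivalent, and special items carry notation such as the backquote and '*' marks mentioned above. The vocabulary and suffixes are invented for illustration and are not from the actual system:

    # Toy illustration of the anusaaraka idea: keep the source word order,
    # substitute dictionary equivalents, and mark special items with notation
    # rather than forcing target-language grammar.  Vocabulary is invented.
    BILINGUAL = {"mane": "ghara", "ravi": "ravi", "ide": "hE"}
    SUFFIX_NOTATION = {"alli": "meM", "annu": "ko`"}   # backquote diacritic

    def render(words):
        out = []
        for w in words:
            for suf, note in SUFFIX_NOTATION.items():
                if w.endswith(suf) and w[: -len(suf)] in BILINGUAL:
                    out.append(BILINGUAL[w[: -len(suf)]] + "_" + note)
                    break
            else:
                out.append(BILINGUAL.get(w, w + "*"))  # '*' marks unknown words
        return " ".join(out)

    # Source word order is preserved; the output follows the source grammar.
    print(render(["ravi", "manealli", "ide"]))   # ravi ghara_meM hE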
For the reasons mentioned above, some amount of training will be needed
on the part of the reader to read and understand the output. This
training will include the notation and some salient features of the
source language, and is likely to take about 10% of the time needed to
learn a new language. For example, among Indian languages it could be
of a few weeks' duration, depending on the proficiency desired. It
could also occur informally as the reader uses the system and reads its
output, so the formal training could be small.
B.2 USES OF ANUSAARAKA
Anusaaraka can be used in a variety of situations. Here we give some
examples:
- A reader wants to read an e-mail message or a document quickly, to
find out its gross contents.
The reader can run anusaaraka on the source and read the output
directly. He might not be proficient in the use of anusaaraka, but
since his motivation is high, he might be willing to put in the effort
using the online help.
- A publisher wants to translate a literary work and publish it.
The anusaaraka output will have to be post-edited by a person, to make
it grammatically correct, stylistically proper, etc. The post-edited
output can be published. (In fact, the anusaaraka group is planning
to bring out two books by well-known Kannada authors which have
already been translated into Hindi with the help of the anusaaraka.)
- A scholar wants to find out about what an original work or epic
actually says, where the original is in a language which he does not
know.
A translation is available, but he wants to see for himself what the
epic says and what the translator has interpreted. He can read the
epic directly through the anusaaraka. As the machine does not
interpret, but presents an image of the contents, he is able to see
the original without the translator's interpretation.