BUILDING A "FREE" INFRA-STRUCTURE FOR SOUTH-ASIAN LANGUAGES

Akshar Bharati, Vineet Chaitanya, Amba P. Kulkarni, Rajeev Sangal
Satyam School of Applied Information Systems
Indian Institute of Information Technology, Hyderabad
{vineet,amba,sangal}@iiit.net

[A keynote lecture. In Proc. of SAARC Conf. on Multilingual and Multimedia Information Technology, CDAC Pune, 1-4 Sept. 1998.] (This text contains some Indian script characters in ISCII-8 coding standard.)

ABSTRACT

It does not need to be stressed that it is important to provide knowledge and information in electronic form in South Asian (SA) languages. This task requires the development of software for searching texts, script conversion, dictionaries, spelling checkers, multi-lingual access software, etc., and of course, a rich collection of texts in electronic form. All this can be called the infra-structure for language.

There are a number of problems which need to be addressed.
  1. Very few word processors follow any standard coding scheme when entering texts in SA languages. This renders the texts unusable across platforms. Even if another user has the right platform, usually all he can do is view the text; normally he cannot even annotate it using the keyboard. While the long-term solution is for everybody to follow the ACII standard, in the short term there is a need to develop code converters rapidly. This task has been automated to a large extent for Devanagari; the same should be done for other scripts.
  2. The technical feasibility of multi-lingual access software has been demonstrated (though full machine translation technology is still far away). Anusaaraka systems for accessing texts in five SA languages are under development, and alpha-versions of some have been released. This task can be taken up on a wider scale covering all SA languages. The systems already built can also be refined further.
  3. Electronic texts and resources such as dictionaries, thesauri, and lexical databases are urgently needed. These can be prepared through the collective effort of a large number of people.
In this paper, we argue that the SA language infra-structure can best be developed through a large cooperative effort. The GPL "free" software model is best suited for this development because the source code is open, and a license is given to all to refine and redistribute it. All the anusaaraka systems (developed under funding from DOE) are available as GPL free software with source code, for everyone to use and contribute to. As another cooperative effort, a free version of an online English-to-Hindi dictionary is expected to be available shortly.


1. BACKGROUND

The internet and other means of distributing and accessing electronic texts are growing, and considerable resources have been built up for English. There is a need to build similar resources for South Asian (SA) languages. A number of basic functionalities are needed: for example, a free font with keyboard support for every SA-language script, the ability to perform searches over the net, script conversion facilities (particularly useful for similar languages written in different scripts), dictionaries among various language pairs, spelling checkers, machine translation software, etc. Of course, there can also be organized efforts to put large amounts of text into electronic form.

In this paper, we argue that computer software for the above should be developed and made available as "free" software. Similarly, there can be voluntary efforts to build language-related resources such as texts, dictionaries, etc. All this can form the infra-structure for our languages, available to users and developers alike. Some developers can also provide value-added services at a price.



2. NEED FOR STANDARDS

Texts in the English language are widely available over the web. These texts can be viewed by anyone anywhere, independent of the computer hardware, the operating system, or the word processor being used. They can also be searched, or passed through an English processing program such as a parser or machine translation software. In other words, they are stored, viewed, and processed as text. This is so because everybody uses a common standard, namely ASCII.

There are a number of web sites storing texts in Indian languages, such as Indian-language newspapers and magazines; however, the texts are available only for viewing. They cannot be processed as text. This means that the following kinds of things either cannot be done easily or cannot be done at all: searching for words or phrases, script conversion (in case someone wants to read a text in a language whose script he does not know, e.g., Urdu, Punjabi, or Bengali for a Hindi-knowing person), dictionary lookup while reading the text electronically, running machine translation software to access a text in another language, etc. In fact, one cannot even use the keyboard to make annotations to the text.

The simple answer here is to use a standard coding scheme for SA languages. If everybody keeps their texts in the standard, the problems outlined above would disappear. One would be able to search or perform other operations on the texts. This is certainly the most desirable solution. There already exists one such standard, ACII (alphabetic code for information interchange), which can be followed. For a short discussion on scripts for our languages, see Appendix A.



3. CONVERTING AMONG CODING SCHEMES

Clearly, while the long-term solution is to use the standard coding scheme for storing texts, it is difficult to get users to change over to it immediately. There is a need, therefore, to also come up with alternative short-term solutions. One effective answer is to make conversion facilities available.

If a text is not in the ACII coding scheme but is stored as a sequence of glyph codes, then one way to use it (for all the purposes mentioned earlier) is to first convert it to the ACII coding scheme. Therefore, one immediate answer for making texts in various coding schemes usable is to develop technology for building converters rapidly.

Development of a new converter depends on what information is available regarding the glyph codes. Sometimes a picture is available (on a printout or on the screen) for each glyph code (this is called the glyph table). By looking at a picture, one can frequently identify what part of a character it would be used for. But there are cases where such judgements are not very straightforward: (1) Sometimes a glyph can form a part of many different characters, and it is not easy to anticipate all these uses. (2) At times, it may not be clear at all what characters the glyph is part of. For example, for a glyph standing for "a white space of smaller width", it may not be clear what characters it is used with.

Sometimes the glyph table is not easily available (particularly for word processors based on DOS), but a printout of a given text is. In this situation, the first few pages of the text can be typed separately in ACII, and the resulting file compared with the unknown glyph file. A learning program (Bharati, 1998b) can then be used which, by looking at the two files, learns the codes of the different glyphs and, with some user help, generates converters between the glyphs and ACII (ISCII). Such a program has possible glyph grammars for the given script built into it, to facilitate its learning task.
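
To make the idea concrete, the following is a minimal sketch in Python, purely for illustration; it is not the learning program of Bharati (1998b). The ACII/ISCII sample typed by the user and the unknown glyph file are compared line by line, and co-occurrence counts suggest candidate glyph-to-character mappings, which the user and the glyph grammar must then confirm or correct. The file handling and the simple counting scheme are assumptions of the sketch.

    # Illustrative sketch only: suggest glyph -> ISCII mappings by counting
    # which ISCII bytes co-occur with which glyph codes on the same line.
    from collections import defaultdict

    def cooccurrence_counts(iscii_file, glyph_file):
        """Count, for every glyph code, how often each ISCII byte appears
        on the corresponding line of the hand-typed sample."""
        counts = defaultdict(lambda: defaultdict(int))
        with open(iscii_file, "rb") as f1, open(glyph_file, "rb") as f2:
            for iscii_line, glyph_line in zip(f1, f2):
                for g in set(glyph_line.strip()):
                    for c in set(iscii_line.strip()):
                        counts[g][c] += 1
        return counts

    def best_guesses(counts):
        """Propose, for each glyph code, the ISCII byte it most often
        co-occurs with; the proposals are then refined with user help."""
        return {g: max(cands, key=cands.get) for g, cands in counts.items()}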

If the glyph table is available, the grammar is useful even when no training text in ACII is available. The codes for the glyphs can be specified manually by looking at the glyph table, and partial converters can still be built using the grammar. These can then be refined manually over a period of time.
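
The following is a small illustrative sketch of such a partial converter: a hand-specified glyph table is applied code by code, together with one glyph-grammar rule (the i-matra glyph appears before its consonant in the glyph stream but after it in the coding scheme). The glyph codes and table entries below are made up, and Roman transliteration stands in for actual ISCII byte values.

    # Illustrative sketch of a partial glyph-to-ISCII converter built from a
    # manually specified glyph table.  Glyph codes and entries are hypothetical.
    GLYPH_TO_CHAR = {
        0xC1: "k",    # glyph of the full consonant ka (code is made up)
        0xC2: "p",    # glyph of the full consonant pa
        0xD6: "i",    # pre-base glyph of the short i-matra
    }
    I_MATRA_GLYPH = 0xD6

    def convert(glyph_codes):
        """Map glyph codes to characters, applying one glyph-grammar rule:
        the i-matra glyph precedes the consonant visually but follows it
        in the (logical) coding scheme."""
        out, pending_i = [], False
        for g in glyph_codes:
            if g == I_MATRA_GLYPH:          # remember the pre-base matra
                pending_i = True
                continue
            out.append(GLYPH_TO_CHAR.get(g, "?"))   # '?' marks unknown glyphs
            if pending_i:                   # re-attach the matra logically
                out.append(GLYPH_TO_CHAR[I_MATRA_GLYPH])
                pending_i = False
        return "".join(out)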



4. INDIAN SCRIPTS UNDER X-WINDOWS

There has been no proper support for Indian scripts under X-windows (in Unix). Now that X-windows (xterm) provides support for TrueType fonts, and such fonts for Indian scripts are available, the display part of the problem is effectively solved. To handle keyboard support and interactivity, Expect can be used. Expect is an extremely powerful piece of free software that sits between the user and a program, in this case X-windows. It allows user interaction to be handled (and what the user has typed to be manipulated appropriately) independently of the program at the other end. Earlier, manipulating such interaction required special knowledge of OS programming. An implementation based on Expect, supporting Indian scripts, is described briefly below.

The basic principle is straightforward. There are two programs sitting respectively between (i) the keyboard and the xterm, and (ii) the xterm and the display monitor. This means that any program running under the xterm is taken care of. Pictorially, this can be shown as:
              .------------.
              | X-Windows  |
              |  (xterm)   |
              .------------.
               ^           \
              /              v 
       .------.             .------.
       |  E1  |             |  E2  |              
       .------.             .------.
          ^                      \
         /                        v
    Keyboard                     display
E1 is a program written in Expect which generates ACII codes based on the keys pressed. It is straightforward to set it up for different keyboard layouts, since different layouts for Indian characters might be preferred by different people: those who know the Roman keyboard might prefer the wx-layout (Bharati, 1995; Appendix B), while others might prefer the Inscript layout, etc.
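
The fragment below illustrates the idea of E1 in Python (the actual E1 is an Expect script): keystrokes are looked up in a layout table and replaced by the corresponding character codes. Only a few wx-layout entries are shown, Devanagari letters stand in for ISCII byte values, and matters such as matra placement and conjunct formation are left out of this sketch.

    # Illustrative sketch of E1: a table-driven keyboard-layout mapping.
    WX_LAYOUT = {
        "k": "क", "K": "ख", "g": "ग",
        "w": "त", "x": "द",            # the mappings after which the wx notation is named
        "a": "अ", "A": "आ", "i": "इ",
    }

    def e1(keystrokes, layout=WX_LAYOUT):
        """Translate a stream of key presses into character codes; keys not
        in the layout (digits, punctuation, ...) pass through unchanged.
        A real E1 would also handle matras, halant, and conjuncts."""
        return "".join(layout.get(k, k) for k in keystrokes)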

E2 takes ACII codes and displays characters, correctly changing the order of the i-matra (ि), handling conjunct clusters, etc. One of the problems that needed to be solved was how to handle the display when the entry of a new character alters the display of the previous few characters. For example, in the case of consonant clusters such as 'p_r' (प्र), when 'r' (र) is typed after 'p_' (प्), it changes the shape of the character already on display, namely 'p' (प). This was handled by keeping a temporary buffer for the current akshara, which gets updated while the user is typing and is written out after the word/syllable is over. A number of alternative solutions are being tried to find the one which users like the most.
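
A simplified sketch of this buffering idea is given below, again in Python for illustration (the actual E2 is an Expect script). Characters of the current akshara are accumulated in a buffer, and every new character re-renders that buffer, so the displayed shape of 'प्' changes to 'प्र' when 'र' arrives. The rendering callback and the partial matra list are assumptions of the sketch.

    # Illustrative sketch of the E2 buffering idea.
    HALANT = "्"
    MATRAS = set("ािीुूेैोौ")   # dependent vowel signs (partial list)

    class AksharaBuffer:
        def __init__(self, render):
            self.render = render      # callback that redraws one akshara
            self.buf = ""             # characters of the current akshara

        def feed(self, ch):
            if ch == HALANT or ch in MATRAS:
                self.buf += ch        # stays within the current akshara
            else:
                if self.buf and not self.buf.endswith(HALANT):
                    self.flush()      # previous akshara is complete
                self.buf += ch
            self.render(self.buf)     # redraw the (possibly changed) akshara

        def flush(self):
            self.buf = ""             # akshara written out; start a new one

    # Typing प, halant, र re-renders the same akshara: प -> प् -> प्र
    buf = AksharaBuffer(render=lambda s: print("current akshara:", s))
    for ch in "प्र":
        buf.feed(ch)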

Similarly, another problem pertains to cursor movement and positioning, which causes extra white space to appear in the display after every character is typed. The current answer is to refresh the display line repeatedly. (Suggestions regarding why this happens in xterm under X-windows, and how to overcome it, will be appreciated.)



5. MULTI-LINGUAL ACCESS IN SOUTH ASIA

South Asia (SA) is rich in languages. The internet and other electronic technologies can be used to distribute and access electronic texts containing knowledge and information. The problem is that unless the reader knows the language, he cannot use this knowledge and information. Full-fledged machine translation systems are beyond the reach of current technology. Therefore, software tools which allow a user to access and understand texts in other languages, and which are within reach, should be developed. Fortunately, languages in the SA region are close to each other in grammar and vocabulary, so it is comparatively easier to build such tools.

Anusaaraka systems (Bharati, 1998; Narayana, 1994) allow a user to access texts in another language with some effort. They produce an 'image' of the source text, preserving the information in it. The output is not grammatical in the target language because it follows the grammar of the source language. (As the grammars of the two languages are similar for most constructions, this is noticed only when a construction differs from those of the target language.) Therefore, a certain amount of training is needed to understand the output. This training covers the notation used in anusaaraka output, as well as differences between the grammars of the two languages. The advantage is that the system is not specific to any subject domain. It is also robust. With a small amount of training, the user can learn to read any Indian language text through anusaaraka.

Anusaaraka systems are under development from Telugu, Kannada, Marathi, Bengali and Punjabi to Hindi. An alpha-version of the Telugu-Hindi system has been released for experimentation and development. These systems have been built by IIT Kanpur and the University of Hyderabad under funding from the Department of Electronics, Government of India. Now, Satyam Computers is also contributing to this activity in a major way.

As part of the anusaaraka systems, many resources have been developed which can be used as by-products. These include morphological analyzers for each of the five languages, bilingual electronic dictionaries from these languages to Hindi, generator for Hindi, etc. Further work is continuing in refining each one of these. (See Appendix B for a brief description.)



6. FREE SOFTWARE

Development and refinement of many of these resources requires sustained effort by many people over a long time. There is a great potential need for these resources, but the paying capacity of users is limited. Many of these resources are a part of the infra-structure for our languages which is needed in the long run. Therefore, it is best to adopt a co-operative model for developing these resources. The free software model of GNU (with the General Public License, or GPL) fits best. It insists on the source code being given with the software, together with a license that allows the recipient to modify and redistribute it. The only condition is that the redistribution must carry the same license with the new, modified source code, so that the system remains free.

In the case of language software, source code includes not only the computer programs but also language data. Thus, GNU model ensures that the computer programs and language data continue to be open, and keep getting refined in a co-operative mode.

The anusaaraka systems have been made "free" with the permission of the funding agency, and are available off the internet for free download under GPL. Users are invited to join in the development activity. Similarly, an online English to Hindi dictionary is being prepared with the cooperative effort of tens of people. This will also be available freely under GPL for use by users and developers.



7. CONCLUSION

In this paper, we have discussed some issues that need to be addressed so that South-Asian (SA) languages become more easily available for mass use. These issues pertain to the use of a standard coding scheme (ACII/ISCII) for texts, the development of anusaaraka-like multi-lingual access software, and the availability of language databases and texts. These can be called the infra-structure for SA languages. This infra-structure can develop only with the cooperative effort of a large number of people. The best method for developing this infra-structure and putting it to mass use is the GPL free software model.

Anusaaraka systems themselves are available under the free software license (GPL). Some other work has been started under the same license. Everybody is invited to join the cooperative enterprise of building the infra-structure for SA languages.



8. ACKNOWLEDGEMENT

Research reported here has been supported by Department of Electronics, Government of India as part of Anusaaraka project under Technology Development for Indian Languages programme. Earlier the authors (VC, APK, RS) were at IIT Kanpur Centre for NLP at Hyderabad. The anusaaraka activity is being jointly carried out by IIT Kanpur and University of Hyderabad. Satyam Computers has now joined this activity and is also providing support for it.


9. REFERENCES

  1. Bharati, Akshar, Vineet Chaitanya, and Rajeev Sangal, Natural Language Processing: A Paninian Perspective, Prentice-Hall of India, New Delhi, 1995.
  2. Bharati, Akshar, Vineet Chaitanya, Amba P. Kulkarni, Rajeev Sangal, and G Uma Maheshwar Rao, Anusaaraka: Overcoming the Language Barrier in India, In Anuvad: Approaches to Translation, Rukmini Bhaya Nair (editor), Katha, New Delhi, 1998 (forthcoming).
  3. Bharati, Akshar, Nisha Sangal, Vineet Chaitanya, Amba P. Kulkarni, and Rajeev Sangal, Generating Converters between Fonts Semi-automatically, In Proc. of SAARC Conference on Multi-lingual and Multi-media Information Technology, CDAC, Pune, 1-4 Sept. 1998b.
  4. Narayana, V.N., Anusaraka: A Device to Overcome the Language Barrier in India, Ph.D. thesis, Dept. of Computer Sc. and Engg., I.I.T. Kanpur, 1994.



Appendix A: SYLLABIC VS. ALPHABETIC NOTATION

While addressing the questions related to display, keyboarding, and processing of text, it is important to understand that even though our script is syllabic in nature, the syllables (aksharas) are constructed out of basic vowels and consonants (varnas). Thus, it should be called a compositional syllabic notation, to differentiate it from some other scripts of the world, say Chinese, where each word is written as a different picture (not composed of anything more basic). For example:

     (syllabic notation)           (alphabetic notation)
   क म ला        =   क् + अ + म् + अ + ल् + आ
   ka ma lA           k    a    m    a    l    A

            (alphabetic)                 (mixed)          (syllabic)
   सिद्ध:   स् + इ + द् + ध् + अ    =   स + ि + द्ध    =   सि + द्ध
   siddha   s    i    d    dh   a        s   i   ddh       si   ddh
Thus, on the one hand there are all the advantages of the alphabetic notation (such as linearity, a small alphabet, and easy learnability); on the other hand, the syllabic notation is more compact to write and easier to read (after a slightly longer training). In fact, the difficulty with the mechanical typewriter appears only because we set our goals higher: syllabic printing. The alphabetic notation was always available and posed no problem as far as typewriting was concerned. Why it was never adopted in typewriting is something about which one can only speculate. (One reason perhaps was the general environment prevailing in India, which was not conducive to flexibility, experiments and new ideas. Another reason might be that the best is at times the enemy of the good: the "best" (represented by the syllabic script) was not possible to produce satisfactorily with mechanical technology, and the "good" (alphabetic notation) was not accepted because it is inferior to the former, to which everyone was used from handwritten characters. These are speculative thoughts.)

The breakthrough with electronic technology is that it allows one to combine the two: input or keyboarding can be in pure alphabetic notation, or in mixed syllabic notation where the conjunct consonants are typed separately using the halant mark, but the display can be in the customary syllabic script. Thus, it is possible to get the advantages of both.
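
A small sketch of this composition is given below, using Unicode Devanagari instead of ISCII bytes for readability. The vowel-to-matra table is partial and the rules are simplified; a full implementation would follow the ISCII composition rules. It reproduces the grouping of the 'siddha' example above.

    # Illustrative sketch: group an alphabetic (varna) sequence into aksharas.
    HALANT = "्"
    VOWEL_TO_MATRA = {"अ": "", "आ": "ा", "इ": "ि", "ई": "ी", "उ": "ु"}

    def to_aksharas(varnas):
        """Group varnas (pure consonants written with halant, and independent
        vowels) into the aksharas that would be displayed."""
        aksharas, current = [], ""
        for v in varnas:
            if v in VOWEL_TO_MATRA:                      # a vowel closes the akshara
                if current:
                    current = current[:-1] + VOWEL_TO_MATRA[v]  # drop halant, add matra
                else:
                    current = v                          # word-initial vowel
                aksharas.append(current)
                current = ""
            else:                                        # pure consonant (with halant)
                current += v
        if current:
            aksharas.append(current)
        return aksharas

    # The appendix example: स् + इ + द् + ध् + अ  ->  ['सि', 'द्ध']
    print(to_aksharas(["स्", "इ", "द्", "ध्", "अ"]))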

The question of the form in which the machine should store the text internally (in syllabic or alphabetic notation) for Indian languages has been reasonably answered through the ISCII standard. While there is always some scope for improvement, the difficulty we face today is entirely different: people are storing texts neither alphabetically nor syllabically, but in the form of glyphs! And there, too, each person has a coding scheme of his own! Clearly the situation is unacceptable and needs strong corrective action.



Appendix B: ANUSAARAKA SYSTEM



B.1 ANUSAARAKA APPROACH

Machine translation systems are extremely difficult to build. Translation is a creative process in which the translator has to interpret the text, something which is very hard for the machine to do. In spite of the difficulty of MT, the anusaaraka can be used to overcome the language barrier in India today. Anusaaraka systems among Indian languages are designed by noting the following two key features:

1. In the anusaaraka approach, the load between the reader and the machine is divided in such a way that the aspects which are difficult for the reader are handled by the machine, and the aspects which are easy for the reader are left to him. Specifically, the reader would have difficulty learning the vocabulary of the language, while he would be good at using the general background knowledge needed to interpret any text. On the other hand, the machine is good at "memorising" an entire dictionary, grammar rules, etc., but poor at using background knowledge. Thus, the work is divided so that the language-based analysis of the text is carried out by the machine, and the knowledge-based analysis or interpretation is left to the reader.

2. Among Indian languages, which share vocabulary, grammar, pragmatics, etc., the task is easier. In general, the words of a language are ambiguous, but if the languages are close to each other, one is likely to find a one-to-one correspondence between words, where the meaning is carried across from the source language to the target language. For example, for 80 percent of the Kannada words in the current anusaaraka dictionary of 30,000 root words, there is a single equivalent Hindi word which covers the senses of the original Kannada word (a toy lookup sketch is given below).
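
The following toy sketch illustrates this word-level substitution: a root with a single equivalent is replaced directly, while an ambiguous root has all its equivalents shown so that the reader can choose. The dictionary entries and the slash-separated presentation of alternatives are illustrative assumptions, not the actual anusaaraka dictionaries or output notation.

    # Illustrative sketch of word-level substitution with a bilingual dictionary.
    # Made-up Kannada-root -> Hindi-equivalent entries, for illustration only.
    BILINGUAL_DICT = {
        "mane": ["ghara"],               # single equivalent: emitted directly
        "hattu": ["dasa", "chaDhanA"],   # ambiguous root: all senses are shown
    }

    def gloss(source_roots, dictionary=BILINGUAL_DICT):
        """Replace each source root by its target equivalent; where several
        equivalents exist, show them all and let the reader disambiguate."""
        out = []
        for root in source_roots:
            senses = dictionary.get(root, [root])   # unknown words pass through
            out.append(senses[0] if len(senses) == 1 else "/".join(senses))
        return " ".join(out)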

In the anusaaraka approach, the reader is given an image of the source text in the target language by faithfully representing whatever is actually contained in the source language text. So the task boils down to presenting the information to the user in an appropriate form. We relax the requirement that the output in the target language should be grammatical. The emphasis shifts to comprehensibility. The answer is to deviate from the target language in a systematic manner.

First, new notation is invented and incorporated. For example, Hindi has the post-position marker 'ko', which functions both as an accusative marker and as a dative marker. We distinguish between them by putting a diacritic mark (a backquote). Thus, existing words in the target language may be given a wider or narrower meaning.

Second, we may relax some of the conditions of the target language. For example, we give up agreement in our "dialect" of the target language. The principle behind the systematic deviations is simple: the output follows the grammar of the source language. In the case of agreement, to state it more precisely, the output follows the agreement rules of the source language, and therefore the output in the target language appears to be without agreement. Some constructions of the source language may also get introduced into the target language. (Actually, as the constructions are largely common across the two languages, a new construction is noticed only when the source language has a construction which is somewhat different from the target language's.)

Sometimes, language bridges might be built for constructions in the source language which are not there in the target language: a different construction which can express the same information in the target language is chosen, with some additional notation if necessary. For example, adjectival participial phrases in the South Indian languages are mapped to relative clauses with the 'jo*' notation.

Because of the reasons mentioned above, some amount of training is needed on the part of the reader to read and understand the output. This training includes teaching the notation and some salient features of the source language, and is likely to take about 10% of the time needed to learn a new language. For example, among Indian languages it could be of a few weeks' duration, depending on the proficiency desired. It could also occur informally as the reader uses the system and reads its output, so the formal training could be small.



B.2 APPLICATIONS

Anusaaraka can be used in a variety of situations. Here we give some examples:
  1. A reader wants to read an e-mail message or a document quickly, to find out its gross contents.

    The reader can run anusaaraka on the source and read the output directly. He might not be proficient in the use of anusaaraka, but since his motivation is high, he might be willing to put in the effort, using the online help.
  2. A publisher wants to translate a literary work and publish it.

    The anusaaraka output will have to be post-edited by a person to make it grammatically correct, stylistically proper, etc. The post-edited output can then be published. (In fact, the anusaaraka group is planning to bring out two books by well-known Kannada authors, which have already been translated into Hindi with the help of the anusaaraka.)
  3. A scholar wants to find out what an original work or epic actually says, where the original is in a language which he does not know.

    A translation is available, but he wants to see for himself what the epic says and what the translator has interpreted. He can read the epic directly through the anusaaraka. As the machine does not interpret, but presents an image of the contents, he is able to see the original without the translator's interpretation.

