ISCII PLUGIN FOR DISPLAYING INDIAN LANGUAGE WEB PAGES


Akshar Bharati
Vineet Chaitanya
Amba P. Kulkarni
Language Technologies Research Center,
Indian Institute of Information Technology Hyderabad
{vc,amba}@iiit.net


ABSTRACT

Indian-language text in electronic form is in a state of chaos. One cannot exchange notes in Indian languages as conveniently as in English, nor can one search Indian-language texts available on the web. This is because the texts are stored in font-dependent glyph codes.

Though UNICODE appears on the face of it to be an ideal solution to this chaos, a closer look shows that with UNICODE in UTF-8 the transmission cost for Indian languages will be three times that of English!

Our suggestion, therefore, is to stay with the ISCII standard until a satisfactory solution to this transmission inefficiency is found. ISCII has the additional advantage of a uniform code for all the Indian languages, which makes conversion among Indian languages as simple as selecting a font.

An ISCII plug-in has been developed to enable the display of ISCII web pages on client machines using locally available fonts. Storing web pages in ISCII also solves the search problem.

Since the plug-in is available "free" under the GPL, computer dealers can provide it as pre-installed software on PCs. Experts can also improve it by adding new features.

The same plug-in can also be used for UNICODE with minor modifications.

CURRENT STATUS :
--------------

The last two decades have seen a mushrooming of software packages for Indian languages, but the people using this software face many problems. Unlike in English, users cannot exchange their e-notes in Indian languages unless they use the same software package and fonts for editing the texts!

Indian-language texts on the web are in even worse condition. Some sites store the information as bitmap images, thereby carrying absolutely no linguistic information directly in electronic form. The majority of Indian-language web sites, however, store the text in the form of font glyphs. This gives rise to two major problems.

a) The glyph coding scheme typically differs from font to font. To view the content of these sites, one therefore needs the same fonts on the local machine. Dynamic fonts are an attempt to solve this problem, but they involve an additional transmission cost.

b) The second problem with storage in glyph codes is that such texts are difficult to process by machine. Processing a text by machine means searching for words or phrases, script conversion, dictionary lookup while reading the text electronically, running parsers and machine translation software to access text in another language, etc. Worst of all, in many cases one cannot even edit a text without the special editor for that font!

None of these problems exists for English, since English sticks to the ASCII standard. It is difficult to reach a consensus on standardizing the glyph codes, since issues such as aesthetics and ease of typing, which govern font design, are subjective. ISCII, a character-level standard for Indian languages, has existed for more than a decade, but unfortunately only a few companies follow it.

Many people hope that Unicode will solve the standardization problem. However, there is an issue of transmission efficiency: the transmission cost for Indian languages will be three times that of English! The real culprit is UTF-8, which encodes each Unicode character as a sequence of one to four bytes while ensuring that the ASCII part of Unicode is transmitted as a single byte. So for a language like English, which uses only the 0-127 part of the code, there is no overhead. European languages use only a few character codes in the region 128-255 in addition to the 0-127 part, so for the European languages the transmission of this portion may incur some overhead, say of the order of 10%.

In contrast to the above cases, the characters of Indian scripts use no part of the code in the region 0-127. Moreover, each Indian language needs fewer than 128 character codes, so each character could have been transmitted in one byte with a one-byte code such as ISCII. In UTF-8, however, every character in the Indian script blocks of Unicode is transmitted as a sequence of three bytes. This amounts to an extra overhead of 200%!
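The overhead figures can be checked with a short calculation. The sketch below is only an illustration: the function name utf8_overhead_percent and the sample words are our own choices, and it assumes that a one-byte code would spend exactly one byte per Unicode code point, which is an approximation of ISCII rather than an exact model of it.

```python
def utf8_overhead_percent(text: str) -> float:
    """Extra bytes needed by UTF-8, relative to one byte per character.

    Assumption: a one-byte code (ASCII, or an ISCII-like 8-bit code)
    would spend exactly one byte per code point of `text`.
    """
    one_byte_size = len(text)               # one byte per code point
    utf8_size = len(text.encode("utf-8"))   # actual UTF-8 size in bytes
    return (utf8_size - one_byte_size) / one_byte_size * 100


english = "hello"                                   # ASCII only
french = "caf\u00e9"                                # one character above 127
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"      # "namaste", 6 Devanagari code points

for label, word in [("English", english), ("French", french), ("Hindi", hindi)]:
    print(label, len(word.encode("utf-8")), "bytes,",
          utf8_overhead_percent(word), "% overhead")
```

For the Hindi word, six code points become eighteen UTF-8 bytes, which is the 200% overhead discussed above. The overhead for a European language depends on how often accented characters occur in running text.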

SUGGESTION:
----------

Our suggestion is that, until the issue of transmission efficiency is resolved, we should not switch over to UNICODE but should follow ISCII. ISCII has the additional advantage of a common code for all the Indian languages, which makes conversion among Indian languages as simple as selecting a font.
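The idea that one code serves all scripts can be sketched as follows. This is a minimal illustration, not the plug-in's code: the three byte values are our assumption and should be checked against the ISCII standard, the two tables are tiny stand-ins for complete font tables, and Unicode characters are used here only as a convenient substitute for font glyphs.

```python
# Assumed ISCII-style byte values for three consonants (illustrative only;
# verify against the ISCII standard before relying on them).
KA, MA, LA = 0xB3, 0xCC, 0xD1

# One small table per script. Unicode characters stand in for font glyphs.
SCRIPT_TABLES = {
    "devanagari": {KA: "\u0915", MA: "\u092e", LA: "\u0932"},
    "bengali":    {KA: "\u0995", MA: "\u09ae", LA: "\u09b2"},
}


def render(codes: bytes, script: str) -> str:
    """Display the same stored bytes in the chosen script."""
    table = SCRIPT_TABLES[script]
    out = []
    for b in codes:
        if b < 0x80:
            out.append(chr(b))                   # ASCII passes through unchanged
        else:
            out.append(table.get(b, "\ufffd"))   # unmapped code: replacement mark
    return "".join(out)


stored = bytes([KA, MA, LA])   # one stored form of the word "kamal"
print(render(stored, "devanagari"))
print(render(stored, "bengali"))
```

The stored bytes never change; only the table selected at display time does, which is what makes script conversion a matter of font selection.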

Storing the text in ISCII format is only half the story. One also needs a utility that imports the ISCII text into the locally available fonts before displaying it, and a utility that exports the edited text to ISCII before saving it.

The question then is how to enable the display of ISCII web pages on client machines using locally available fonts?

Netscape's plug-in technology has been used to provide an import utility for ISCII texts for both Netscape and Internet Explorer under Windows and Linux. The text browser lynx allows the user to define a viewer application for a given MIME type, and hence an import facility can easily be provided for lynx as well.

ISCII Plug-in:
-------------

The ISCII plug-in has been developed with the goal of giving users full freedom to use any script, any font, any platform, and any browser to view Indian-language content on the web, without sacrificing efficiency or incurring additional cost.

An ISCII plug-in is a special program that is invoked when the browser encounters content with a particular MIME type. When the client-end browser sends a request to the server for an ISCII file (extension .isc), the server sends the data along with the MIME type (text/iscii).

The client-end browser then invokes the ISCII plug-in to handle this input stream. The plug-in converts the incoming ISCII stream into glyph sequences for the user-defined font at the client end.
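The conversion step can be pictured as a table lookup that prefers the longest matching byte sequence, so that a consonant cluster can map to a single conjunct glyph. The following is a minimal sketch under stated assumptions and not the plug-in's actual code: the function iscii_to_glyphs, the table GLYPH_TABLE, the input byte values, and all glyph codes are hypothetical placeholders for one imaginary font.

```python
# Hypothetical glyph table for one imaginary font.
# Keys are ISCII-style byte sequences, values are that font's glyph codes.
# All byte values here are placeholders, not taken from any real font.
GLYPH_TABLE = {
    b"\xb3": b"\x64",            # a consonant -> its glyph
    b"\xcc": b"\x6d",            # another consonant -> its glyph
    b"\xb3\xe8\xcc": b"\xa5",    # consonant + halant + consonant -> one conjunct glyph
}


def iscii_to_glyphs(data: bytes, table: dict) -> bytes:
    """Convert a byte stream to glyph codes using longest-match lookup."""
    max_len = max(len(key) for key in table)
    out = bytearray()
    i = 0
    while i < len(data):
        # Try the longest candidate sequence first, then shorter ones.
        for n in range(min(max_len, len(data) - i), 0, -1):
            glyph = table.get(data[i:i + n])
            if glyph is not None:
                out += glyph
                i += n
                break
        else:
            out.append(data[i])   # no entry: pass the byte through unchanged
            i += 1
    return bytes(out)


print(iscii_to_glyphs(b"\xb3\xcc", GLYPH_TABLE))          # two separate glyphs
print(iscii_to_glyphs(b"\xb3\xe8\xcc A", GLYPH_TABLE))    # one conjunct glyph, then ASCII
```

Supporting another font in this scheme means supplying another table. A real converter is likely to need more than a lookup, for example reordering of vowel signs, which this sketch leaves out.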

In the case of forms, the ISCII plug-in also adds a hidden field containing the user-defined font name, so that when the form is submitted the server learns the font in which the field values have been encoded.
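The hidden-field mechanism might look roughly like this. The sketch is an assumption about the general shape only: the field name iscii_font, the font name ExampleFont, and both helper functions are hypothetical, and the real plug-in may add the field in a different way.

```python
import html
from typing import Optional
from urllib.parse import parse_qs

FONT_FIELD = "iscii_font"   # hypothetical field name, not from the plug-in


def add_font_field(form_html: str, font_name: str) -> str:
    """Insert a hidden input just before the last closing </form> tag."""
    hidden = '<input type="hidden" name="{}" value="{}">'.format(
        FONT_FIELD, html.escape(font_name, quote=True))
    head, closing_tag, tail = form_html.rpartition("</form>")
    if not closing_tag:
        return form_html        # no form found: leave the page untouched
    return head + hidden + closing_tag + tail


def font_of_submission(query_string: str) -> Optional[str]:
    """Server side: read back the font name from a submitted form."""
    values = parse_qs(query_string).get(FONT_FIELD)
    return values[0] if values else None


page = '<form action="/search"><input name="q"></form>'
print(add_font_field(page, "ExampleFont"))
print(font_of_submission("q=abc&iscii_font=ExampleFont"))
```

Knowing the font name, the server can pick the matching table and convert the submitted glyph codes back to ISCII.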

Settings are required on the server side as well as the client side. On the server side, one needs to set the MIME type corresponding to the filename extension. This is a one-time setting. At the client end, one has to install software that converts ISCII content into glyphs of the user-selected font. This, too, is a one-time setting.
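How the server-side setting is made depends on the web server in use. As one illustration, a server written in Python could register the mapping with the standard mimetypes module. The helper content_type_for and the file names are our own; only the extension .isc and the type text/iscii come from the description above.

```python
import mimetypes

# One-time registration: files ending in .isc are served as text/iscii.
mimetypes.add_type("text/iscii", ".isc")


def content_type_for(filename: str) -> str:
    """Return the Content-Type value a server would send for this file."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type or "application/octet-stream"


print(content_type_for("index.isc"))
print(content_type_for("index.html"))
```

Once the mapping is registered, every .isc file is labelled text/iscii, and that label is what triggers the plug-in in the browser.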

FURTHER WORK:
------------

The ISCII plug-in is available under the GPL for free download at http://www.iiit.net/amba/iscii_plugin/index.html

The ISCII plug-in currently supports most of the major fonts for Devanagari, and the C-DAC fonts for all Indian languages. It is very easy to add support for any other Indian-language font.

Since the plug-in is available "free" under the GPL, computer dealers can provide it as pre-installed software.

The availability of the plug-in under the GPL also makes it easy to improve it further by adding new features, developing user-friendly installation packages, etc. The same concept can be adapted for ActiveX, so that all the existing office suites can be used with Indian-language support. We invite others to join in this activity, since we do not have any expertise in ActiveX.

Unlike ISCII, Unicode has separate codes for the different Indian languages. So when one switches over to Unicode, the same plug-in can be used with minor changes in the data.
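In Unicode the script is part of the character code itself: each major Indian script has its own block of 128 code points, and the blocks are laid out in parallel. The sketch below illustrates this by shifting text from one block to another. It is a simplification for illustration only, since not every position is assigned in every block and some punctuation is shared between scripts. The function shift_script and the table SCRIPT_BASE are our own names.

```python
# Starting code point of each script block in Unicode (128 positions each).
SCRIPT_BASE = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}
BLOCK_SIZE = 0x80
FIRST, LAST = 0x0900, 0x0D7F    # range covered by the blocks above


def shift_script(text: str, target: str) -> str:
    """Move every Indian-script character to the same offset in `target`."""
    base = SCRIPT_BASE[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if FIRST <= cp <= LAST:
            out.append(chr(base + cp % BLOCK_SIZE))   # keep offset, change block
        else:
            out.append(ch)                            # ASCII etc. unchanged
    return "".join(out)


word = "\u0915\u092e\u0932"     # "kamal" in Devanagari
print(shift_script(word, "bengali"))
print(shift_script(word, "telugu"))
```

With ISCII the stored bytes stay the same for every script, whereas here the code points themselves change, so a Unicode version of the converter needs a script-specific base or table in its data.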

Acknowledgement:
---------------

The initial prototype of this plug-in was developed by Mr. P. Ganesh, Mr. Prakash Daga, and Mr. Pranav Dharma, students of the Regional College of Engineering, Trichy, during the winter vacation in Dec 1999 at the LTRC, IIIT, Hyderabad. Further enhancement has been done at the center, and the plug-in is being maintained by the Akshara Bharati Group at LTRC, IIIT, Hyderabad.

Satyam Computers Services Ltd. is providing financial support to the Language Technologies Research Center, and the systems developed under the "free" wing of the center are available as "free" open source under the GPL.

References:
ISCII: Indian Script Code for Information Interchange. Bureau of Indian Standards, New Delhi, 1991.