GENERATING CONVERTERS BETWEEN FONTS SEMI-AUTOMATICALLY

Akshar Bharati Nisha Sangal Vineet Chaitanya Amba P. Kulkarni Rajeev Sangal Satyam School of Applied Information Systems Indian Institute of Information Technology Hyderabad {vineet,amba,sangal}@uohyd.ernet.in

[In Proc. of SAARC Conf. on Multilingual and Multimedia information Technology, CDAC Pune, 1-4 Sept. 1998.] (This text contains some Indian script characters in ISCII-8 coding standard.) ABSTRACT

1. MOTIVATION
2. CONCRETE PROBLEM
3. OUTLINE OF THE SOLUTION
4. A PLAUSIBLE GLYPH-GRAMMAR FOR DEVANAGARI
4.1 AKSHARAS WITH 'к' ON THE RIGHT
4.2 AKSHARAS WITH 'к' IN THE MIDDLE
4.3 AKSHARAS WITHOUT 'к'
4.4 OTHER SPECIAL CASES

5. PRE AND POST FILTERS
5.1 PHENOMENON OF I-MATRA (л)
5.2 PHENOMENON OF REF-MATRA ( #R)

6. GENERATING THE CONVERTERS - ALGORITHM:
7. CONCLUSION
8 ACKNOWLEDGEMENT
9 REFERENCES

It is important for us to be able to view as well perform search and other operations on texts in Indian languages available over the world wide web (or floppies or CDs), independent of the hardware or software platform. There are problems in doing this because currently most of the sites on Indian language texts are not following any coding standards. While the long term answer might be for everyone to switch over to a standard alphabetic coding scheme (ACII), some tools have been developed for immediate use that allow texts in glyph coding to be converted to the standard, automatically or semi-automatically.

In this paper, a system is described which takes: (i) a text in an unknown coding scheme, and (ii) the same text in the ACII coding scheme, and generates a converter between the given unknown coding scheme and ACII. The converter can be used to convert a text from the non-standard coding scheme to ACII, and back. They can be progressively refined, manually.

For generating the converter, a glyph-grammar for the script of the language is also needed, which specifies what possible glyph sequences make up an akshara. The grammar is independent of the coding schemes, and is structured. It needs to be developed only once, for a script.

A sample glyph grammar for Devanagari has been described. A system has been implemented for Devanagari script, using which converters for at least two sample scripts have been generated semi-automatically.

1. MOTIVATION (TOP)

As has been discussed elsewhere, texts in English language are accessible over the web independent of the computer hardware, the operating system, or the word processor being used. These texts can be viewed by anyone anywhere, and they can also be searched, passed through an English processing program such as a parser or a machine translation software. In other words, they are stored, viewed, and processed as texts using the common ASCII standard.

There are a number of web sites storing texts in Indian languages, such as Indian language newspapers and magazines; however, the texts are available only for viewing on a specific platform. Usually, the proprietary font must be downloaded with the text (and the font can be used for viewing only under a specific operating system, say, Windows-95). The texts cannot be processed by machine. This means that the following kinds of things can either not be done easily or not at all: search for words or phrases, script conversion (in case someone wants to read a text in a language whose script he does not know, e.g., Urdu, Punjabi, Bengali for a Hindi knowing person), dictionary lookup while reading the text electronically, running of machine translation software to access text in another language, etc. In fact, usually one cannot even use the keyboard to make annotations to the text.

While the long term answer might be for everyone to switch over to a standard, some tools have been developed for immediate use that allow texts in one coding/fonts to be converted to the standard coding scheme automatically or semi-automatically. In this paper, we describe such a tool.

2. CONCRETE PROBLEM (TOP)

The problem is to convert a given file in an unknown coding scheme to the standard ACII scheme (alphabetic code for information interchange) given the following:

(1) A file of Indian language text from an unknown word processor. (The file typically consists of a sequence of byte codes (for glyphs) such that the Indian script (in the given font) is visible when glyph are drawn for those codes using that word processor.)
(2) A print out of say the first few pages corresponding to the given file above. (The same text can be manually typed to produce ACII codes for the initial part of the given glyph file.)

Optionally, the glyph table might also be available. (The glyph table shows a picture of each glyph together with its code.)

We wish to be able to convert the given file in particular, and similar such files from the same word processor in general, to the standard ACII coding scheme (namely, ISCII). For this, we have to develop a converter. This would allow us full functionality: to use keyboard, to search for strings, to do script conversion, etc. In summary, we want to be able to go from glyphs to aksharas (syllables) for any glyph file produced using the same word processor for the same font.

3. OUTLINE OF THE SOLUTION (TOP)

The given ACII file is compared with the corresponding part of the given glyph file using a matching program. The matching program learns the equivalences between the ACII codes and codes for glyphs. However, for this learning to take place with effectiveness, a grammar for the script is needed that gives the plausible alternative combination of shapes (glyphs) that make up the aksharas. Such a grammar is glyph-based and remains fixed once it becomes reasonably comprehensive for a script. (It does not depend on either the specific codes used for glyphs or on the different fonts.)

Generation of the converter from glyph codes (in the given glyph file) to ACII codes, takes place in stages, each stage broadly requiring two steps:

(1) First, the matching program using the plausible grammar generates a map table between glyphs and the codes used.
(2) Second, the grammar and the map table are used to generate a converter.

The above two steps are repeated a number of times. In the first stage, only those rules are selected from the grammar which are certain (e.g., no alternatives). In the second stage, grammar rules having alternatives are given. Besides finding the map table, the matcher also selects those rules (out of the alternatives) which are the best. Subsequently, another convertor module is produced. Finally, the same process might be repeated a third time to select those rules out of alternative rules for which the evidence is weak. Again a new converter is produced.

The overall converter, it is expected, will be able to convert 90 to 95% of the glyphs with a page or two of sample ACII text. Some glyphs might remain unconverted because the learning data (namely, the given ACII file and the corresponding initial part of glyph file) might not be comprehensive, and the program might not have seen some glyphs as they do not occur in this learning data. It is hoped that once a large part of the file is converted, the user will be able to supply the missing code map and the rules manually, (by looking at the remaining part of the glyph file and making intelligent guesses) using which the converter would become complete.

4. A PLAUSIBLE GLYPH-GRAMMAR FOR DEVANAGARI (TOP)

The glyph-grammar can have at least three types of rules: Type 1 rules are applicable across all the Indian languages. For example, the rule regarding absorbing of 'a' (Є) vowel in the akshara, and its subtraction by means of halant symbol (ш) holds across all the scripts of Indian languages. Type 2 rules hold for Devanagari, and perhaps for some other scripts. For example, how the vertical-bar (к), can be neatly removed from many of the consonents to show absence of vowel would come under this type. Many of the rules dealing with whole classes of consonents given below are of this type. Type 3 rules pertain to idio-syncratic combination of glyphs to achieve beauty. An actual grammar for Devanagari is given below.

The glyph grammar is written by giving rules for converting aksharas to glyphs. The left hand side gives the actual ACII codes used for the aksharas whereas the right hand side is a description for combining glyphs. The codes need not be specified for glyphs (the program will learn these automatically). For example, usually there is a single glyph for the akshara 'и' (h). This would be written as:

и -> &и

In the rule above, the left side gives the actual ACII code for 'и' whereas &и is a name for a constant one byte code for the glyph. Any arbitrary name could have been chosen, for example, the above rule could have been written as

и --> &h (1)

as long as the chosen name is used consistently in the entire grammar.

4.1 AKSHARAS WITH 'к' ON THE RIGHT > (TOP)

A very important glyph in Devanagari is the vertical bar 'к' which occurs very frequently. For example, the akshara 'Е' is always made up of two glyphs: #'Еш' and 'к'. This is shown as:
  Е  --> &g  &A
where &g = #'Еш' and &A = 'к'. The fact that 'Ь' is also made up of
two glyphs can be shown as:
  Ь --> &m &A
Note that '&A' is simply a name used to refer to the second glyph for
these aksharas. The above rules can be made more readable by chosing
more suggestive names:
  Е   -->  &Е  &к             (2)
  Ь   -->  &Ь  &к
It does not alter their meanings.
For half-aksharas (or pure consonents) for 'Е' and 'Ь', there are two
ACII characters whereas there is a single glyph. This is written as:
  Е ш   -->   &Е
  Ь ш   -->   &Ь
at this point we can refine the rules in (2) to:
  Е/[^ш]  -->  &Е             (2')
  Ь/[^ш]  -->  &Ь
(where &Е is a name for the first glyph namely #Еш, and so on for &Ь). Expression '/[^ш]' means that the character ('Е' or 'Ь') is not followed by halant character 'ш' (in the usual Unix syntax for regular expressions where '[..]' specifies a list, and '[^..]' specifies the negation of the list).

Rules similar to (2') and (3') apply not just to 'Е' and 'Ь' but all aksharas that have a vertical bar (к) at the end. It would be cumbersome to write these separately each time. Therefore, symbols denoting a class can be used:

  $Е/[^ш]   -->  %Е &к            (2")
  $Е ш       -->  %Е            (3")
  $Е  = '[ДЕЖИКМТУХЦШЪЫЬЭбдз]'         (4)

Here, '$Е' denotes a class. The actual members of the class (namely, the actual byte codes in ACII) are enumerated in (4). '%Е' is a class variable for glyphs associated with the members of the same class $Е. The fact that the same symbol is used with '$' and '%' gives it a special meaning. Class %Е has the same cardinality as $Е and there is a one to one correspondence between the elements of the two. In other words, for every element of $Е there is a unique element of %Е and vice versa. This is another way of saying that for every ACII code in $Е, there is a unique associated glyph code which satisfies the rules (2") and (3").

Most matras (vowel marks on the aksharas) follow normally after
the above:
Ех -> &Е &к &х
Ек -> &Е &к &к

There is no need, therefore, to give these repeatedly with each akshara or class of aksharas. They can be given once, for the matra itself. There are some alternative glyph combinations one of which would be used in making up the matra. This is expressed as follows:

х -> &х || &к &с

The above rule says that ACII code 'х' can be made up of one or two glyphs. (Another alternative where the two glyphs can come in any order is ignored here to keep the grammar simple.)

It is important to note that the above is not a grammar with two possibilities, but represents two different grammars. Only one of the two will apply to a given font. Once a rule is selected, the other rule will cease to exist for that font.

We illustrate the power and flexibility of this grammar system by showing a case where the glyphs for matras combine in "unusual" ways to yield aksharas and different matras. This case was actually found to occur in many different fonts:

Е с --> #Еш х

'Е' followed by 'с' matra was broken up into two glyphs: half akshar '#Еш' and a different matra 'х'. The following grammar rules cover this case:   $Е с   -->  %Е &х || %Е &к &с      (5)
  $Е т   -->  %Е &ц || %Е &к &т      (6)
Similarly, rule (2") must be modified to exclude the above case:
  $Е/[^шст] ->   %Е &к
  $Е/[^шст] ->   %Е &к

4.2 AKSHARAS WITH 'к' IN THE MIDDLE > (TOP)

Rules are also needed for 'Г' class in which 'к' occurs in the middle:

$Г = '[ГЩ]'

$Г/[^шстно] о] --> %k &к &{k2} (7)

Note that &{k2} is the name for a glyph code which represents the right part of akshara 'Г'. (Curly brackets are used to indicate the start and end of the name.) The glyphs for matras are intermixed with glyphs for 'Г' class given by the following rules:
  $Г с --> %Г &х &{k2} || %Г &к &с &{k2}      (8)
  $Г т --> %Г &ц &{k2} || %Г &к &т &{k2}
  $Г н -->   %Г &к &н &{k2}
  $Г о -->   %Г &к &о &{k2}
we also need:
  $Г ш -->   %Г &к &-                    (9)

4.3 AKSHARAS WITHOUT 'к' > (TOP)

The 'Н' class (t-class) consists of those aksharas which do not have a vertical bar at all. All these aksharas are accounted for by the following grammar:

$Н = '[НОПРЙФ]'

$Н/[^ш] --> %Н

Half akshara of 'Н'class is written using halant as follows:

$Н ш/[^$Н] --> %Н &ш

$Н1ш$Н2 --> %Н1 &ш %Н2

If there are more than one occurrences of a variable ($Н above), they are numbered, if necessary, to refer to the occurrences ($Н1 and $Н2 above). (This feature is currently not implemented).

In the above example, it can be also be written as:

$Н ш/[$Н] --> %Н &ш

In the above, the half akshara of t-class is written using halant mark. The conjoint clusters for Н-class (t-class), however, are also written one below the other in many fonts:

НшО as #НшО

These would have to be enumerated in the grammar:   Н ш О   -->  &{t_T}
  О ш О   -->  &{T_T}
  П ш П   -->  &{d_d}

4.4 OTHER SPECIAL CASES > (TOP)

Similarly, there are other cases. Let us discuss the case of Я (r) following a half-akshar. In case of 'Ь' (m) followed by halant and 'Я' (r), the resulting shape can be constructed in several different ways:   Ь ш Я /[^ш]   --> &Ь &к &Я || &Ь &AR || &{mR} &к
               #Ьш к  #R     #Ьш  #AR   #mR к
Which of these three actually is used by a font, will have to be selected after the matching process. To cover the whole Е-class (g-class) we can write:   $Е ш Я /[^ш] --> %Е &к &Я || %Е &q || %1Е &к
               #Еш к #R    #Еш #AR   #gR к
Similarly, the t-class (Н-class) is handled as follows:   $Н ш Я  --> %Н &{rr}
         Н  #^
To cover the k-class (Г-class) we have:
  $Г ш Я  --> %Г &к &r &{k2} || %Г &AR &{k2}
  #ГшЯ         #  к   #/ #        #   #   #
and (9) can be rewritten as:
  $Г ш/[^Я] -->   %Г &к &-             (9')

5. PRE AND POST FILTERS (TOP)

There are some additional features in the scripts, to handle which the grammar notation can become very complex.

They are explained below using a grammar like notation except that '*' operator is also used.

5.1 PHENOMENON OF I-MATRA (л) > (TOP)

The glyph for i-matra (л) is put on the left side of akshara it follows:   akshara codes        glyph codes
     Г л      -->       л Г
The above case is not difficult to handle in the grammar notation. However, the full phenomenon is more complex: the i-matra moves to the left of all the half-consonents before the akshara (written below in a pseudo grammar notation):   akshara codes       glyph codes
   ($c1ш)* $c2 л     -->   &л (%c1ш)* %c2

5.2 PHENOMENON OF REF-MATRA ( #R) > (TOP)

The half akshara of r (Я) becomes a matra (called ref) and is put on
the full akshara that follows it:
  Я ш д/[^ш] --> Яшд
(In the above, the ref-matra is put on top of the akshara: д.)
A rule can be written for all consonents ($c):
  Я ш $c/[^ш]   --> %c &R
More generally, it jumps over half aksharas to the first full akshara
that follows it (written below in a pseudo notation):
  Я ш ($c1ш)* $c2/[^ш]        --> (%c1 &ш)* %c2 &R
Both of the above phenomena require a movement over "long" distance. Rather than introduce the '*' operator in the grammar, and make the grammar and the implementation more complex, we introduce the notion of pre- and post- filters. These are programs, that take an input file and change it in pre-defined ways.

A pre-filter can be defined on the akshara file. It moves the i-matra (л) and ref-matra (#ref) to a place where the corresponding glyph would be located. For example, the following two pre-filters given here using perl code would be applied.

  $c = '[...]';         # All the aksharas for consonents
  s/(($cш)*)($c)л/л$1$3/g;         # Filter(1)
  s/Яш(($cш)*)($c)([^ш])/$1$3#ref$4/g;      # Filter(2)
Note that the ref-matra is introduced as a special code (as if a special matra).

The pre-filter is applied on the akshara file, because trying to move the glyph for i-matra (&л) or ref-matra (&{ref}) would not be possible on the glyph file, because the code map is not known.

Having made this change to the akshara file, the matching grammar rule for the filtered akshara file and glyph file can be written as follows:
л --> &к &i || &л
#r --> &R

Once we define a pre-filter for the akshara file, the generated converter when applied to the glyph file would generate such an akshara file. A post-filter would be needed to get a proper akshara file.

The following would be the post filter (in Perl code):
s/л(($cш)*)($c)([^ш])/$1$3л$4/g;
s/(($cш)*)($c)$ref/Я ш$1$3/g;

(In earlier work on writing converters rapidly by VC and APK, only the filters were used for the script conversion. The converter simply consisted of a series of filters, each filter handling something roughly what a grammar rule covers. There was no separate grammar. That implementation is also available.)

6. GENERATING THE CONVERTERS - ALGORITHM: (TOP)

There are three tasks in building the converters:

(1). Find the code map for glyphs by comparing the sample "unknown" glyph file with its corresponding "known" ACII file.
(2). Select the grammar out of the given possible grammars.
(3). Generate the converters using the selected grammar and the code map.

The above tasks are repeated in stages. In the first stage, only that part of the code map is generated about which the system is "reasonably" sure. To find the code map in stage 1, only those grammar rules are used for which there are no alternataives, and then only those codes are put in the code map for which there is an "overwhelming" evidence (Figure 1).                          ACII                        Input
                         file                        glyph file
                           |                          |
                           V                          V
     glyph-           |-----------------------------------|
     grammar  ------> |   find possible codes             |
                      |-----------------------------------|
                                  |
                                  |
                             possible codes
                                  |
                                  |
                                  V
                         |---------------------|
                         |    weigh evidence   |
                         |---------------------|
                                  |
                                  V
                              code map       (after stage 1)
  Figure 1: find-code-map: finding code map for a grammar (stage 1)
At the end of stage 1, usually only a partial code map is obtained. Using this code map (and the grammar), a converter is generated from glyphs to ACII (Fig 2).
                   code-map                   Input glyph file
                      |                              |
                      V                              V
  glyph-        |-----------------|        |----------------------|
  grammar ----->|  gen-converter  | =====> | glyph-ACII converter |
  (stage 1)     |                 |        |  (after stage 1)     |
                |-----------------|        |----------------------|
                                                     |
                                                     |
                                                     V
                                        (partially) converted file
              Figure 2: Generating converter (stage 1)
This partial converter is applied to the given sample glyph file and the resulting partially converted file is displayed along with the corresponding ACII, so that the user can verify and assist the process, if he so desires.

In the second stage, whenever a grammar rule has alternatives, the correct one needs to be chosen, wherever possible. To do this, the "unknown" glyph file is converted using the partial converter as mentioned above. This is compared with the "known" ACII file, and the matched parts are marked in both. Now the corresponding unmatched parts are extracted to give two new files: unmatched parts from ACII file, and corresponding unmatched parts from glyph file. These become the new input files, against which the alternatives of the rules are tried one by one. For each rule, that alternative is selected which gives the best match without violating the existing code map. The selected rules are added to the rules from the previous stage, and the new codes are added to the code map from the previous stage.

Now with the augmented code map and augmented grammar rules, a new converter is generated, the effect of which is again shown to the user. At the end of this stage, 70% to 80% conversion is likely to be reached with a page or two of sample text.

Further stages can be tried by repeating exactly the same steps as in stage 2. Only a relaxation will be made in allowing the program to make guesses regarding rules and code map on lesser evidence. The purpose of delaying the guesses is that hypotheses with stronger evidence are established first, before trying the tentative ones. The process from stage 2 onwards is shown in Fig 3:-

            Unmatched                  Unmatched
          parts of ACII file           parts of glyph file
          (after stage n-1)            (after stage n-1)
                  |                        |
                  V                        V
glyph-       |--------------------------------|
grammar ---> |  find-code-map                 |
(rules with  |--------------------------------|
alternatives         |
for stage n)         V
              code maps (for alter-        code map
              native rules in grammar      (from stage n-1)
                      |                      |
                      V                      V
             |----------------------------------------|
             |        select alternatives from        |
             | grammar rule and corresponding code map|
             |----------------------------------------|
                      |                      |
             Selected grammar rule       additional code map
                      |                      |
                      V                      V
                    |---|                  |---|
       grammar ---> | + |                  | + | <-- code map
  (from stage n-1)  |---|                  |---|    (from stage n-1)
                      |                      |
                      V                      V
             augmented grammar          augmented code map
              (after stage n)           (after stage n)
                      |                      |
                      |                      |      Input glyph file
                      |                      V               |
                      |       |----------------|      |----------------|
                      .-----> | gen-converter  |=====>| glyph-ACII     |
                              |                |      | Converter      |
                              |----------------|      | (after stage n)|
                                                      |----------------|
                                                             |
                                original ACII file    Converted file
                                         |            (after stage n)
                                         V                   |
                                  |---------------------------------|
                                  |        comparator               |
                                  |---------------------------------|
                                          |                  |
                                          V                  V
                      unmatched parts of ACII      unmatched parts of glyph
                      file (after stage n)         file (aftger stage n)
  Figure 3:  nth stage of Converter generation (for n > = 2)
The algorithm given in Fig. 3, has been implemented in Perl. The actual converter is generated using Flex, and is extremely fast.

7. CONCLUSION (TOP)

In this paper, it has been shown how converters between an unknown coding scheme and the standard ACII coding scheme can be generated semi-automatically. A glyph-grammar for Devanagari has been presented. It is discussed how such a grammar has been used along with sample parallel text files (ACII file and given unknown glyph file with the same text content) to generate the converters. The user can help the generation process by providing information wherever the information about conjunct-clusters etc. is not available in the sample files. Glyph table can also be used if available.

The glyph-grammar for a script does not depend on the coding schemes, and therefore, needs to be written only once for a script. We invite people to write glyph-grammar for the script of their languages. After that it is straight-forward to generate converters for the unknown word processors or fonts.

The system described here for generating code converters, is available as GPL free software (from anu.uohyd.ernet.in using anonymous ftp). People are invited to download it, try it out, as well as augment the implementation.

8 ACKNOWLEDGEMENT (TOP)

Research reported here has been supported by Department of Electronics, Government of India as part of Anusaaraka project under Technology Development for Indian Languages programme. It was carried out while the authors (VC, APK, RS) were at IIT Kanpur Centre for NLP at Hyderabad. The anusaaraka activity is being carried out by IIT Kanpur and University of Hyderabad. Satyam Computers has now joined this activity and is also providing support for it. NS has been contributing her effort voluntarily and has implemented the system described here.

9 REFERENCES (TOP)

1. Bharati, Akshar, Vineet Chaitanya, Amba P. Kulkarni, and Rajeev Sangal, Anuvada ke upakarana: samganaka aur bhashaem (Tools of translation: computer and languages), Course material for PG Diploma in Translation Studies (Hindi), Distance Education, University of Hyderabad, 1998c. (Chapter 2).

2. Bharati, Akshar, Vineet Chaitanya, and Rajeev Sangal, Natural Language Processing: A Paninian Perspective, Prentice-Hall of India, New Delhi, 1995.

3. Bharati, Akshar, Vineet Chaitanya, Amba P. Kulkarni, Rajeev Sangal, and G Uma Maheshwar Rao, Anusaaraka: Overcoming the Language Barrier in India, In Anuvad: Approaches to Translation, Rukmini Bhaya Nair (editor), Katha, New Delhi, 1998 (forthcoming).

Anusaaraka Home Page