It is important for us to be able to view as well perform search
and other operations on texts in Indian languages available over the
world wide web (or floppies or CDs), independent of the hardware or
software platform. There are problems in doing this because currently
most of the sites on Indian language texts are not following any
coding standards. While the long term answer might be for everyone
to switch over to a standard alphabetic coding scheme (ACII),
some tools have been developed for immediate use that allow texts
in glyph coding to be converted to the standard, automatically or
semi-automatically.
In this paper, a system is described which takes: (i) a text in an
unknown coding scheme, and (ii) the same text in the ACII coding
scheme, and generates a converter between the given unknown coding
scheme and ACII. The converter can be used to convert a text
from the non-standard coding scheme to ACII, and back. They can
be progressively refined, manually.
For generating the converter, a glyph-grammar for the script of
the language is also needed, which specifies what possible glyph
sequences make up an akshara. The grammar is independent of the
coding schemes, and is structured. It needs to be developed only
once, for a script.
A sample glyph grammar for Devanagari has been described. A
system has been implemented for Devanagari script, using which
converters for at least two sample scripts have been generated
semi-automatically.
|
As has been discussed elsewhere, texts in English language are
accessible over the web independent of the computer hardware, the
operating system, or the word processor being used. These texts can
be viewed by anyone anywhere, and they can also be searched, passed
through an English processing program such as a parser or a machine
translation software. In other words, they are stored, viewed, and
processed as texts using the common ASCII standard.
There are a number of web sites storing texts in Indian languages,
such as Indian language newspapers and magazines; however, the texts
are available only for viewing on a specific platform. Usually,
the proprietary font must be downloaded with the text (and the font
can be used for viewing only under a specific operating system, say,
Windows-95). The texts cannot be processed by machine. This means that
the following kinds of things can either not be done easily or not at
all: search for words or phrases, script conversion (in case someone
wants to read a text in a language whose script he does not know, e.g.,
Urdu, Punjabi, Bengali for a Hindi knowing person), dictionary lookup
while reading the text electronically, running of machine translation
software to access text in another language, etc. In fact, usually
one cannot even use the keyboard to make annotations to the text.
While the long term answer might be for everyone to switch over to
a standard, some tools have been developed for immediate use that
allow texts in one coding/fonts to be converted to the standard
coding scheme automatically or semi-automatically. In this paper, we
describe such a tool.
|
2. CONCRETE PROBLEM
(TOP)
The problem is to convert a given file in an unknown coding scheme to
the standard ACII scheme (alphabetic code for information interchange)
given the following:
- (1) A file of Indian language text from an unknown word processor. (The
file typically consists of a sequence of byte codes (for glyphs) such
that the Indian script (in the given font) is visible when glyph are
drawn for those codes using that word processor.)
- (2) A print out of say the first few pages corresponding to the given
file above. (The same text can be manually typed to produce ACII codes
for the initial part of the given glyph file.)
Optionally, the glyph table might also be available. (The glyph table
shows a picture of each glyph together with its code.)
We wish to be able to convert the given file in particular, and
similar such files from the same word processor in general, to
the standard ACII coding scheme (namely, ISCII). For this, we have
to develop a converter. This would allow us full functionality:
to use keyboard, to search for strings, to do script conversion,
etc. In summary, we want to be able to go from glyphs to aksharas
(syllables) for any glyph file produced using the same word processor
for the same font.
|
3. OUTLINE OF THE SOLUTION
(TOP)
The given ACII file is compared with the corresponding part of the
given glyph file using a matching program. The matching program learns
the equivalences between the ACII codes and codes for glyphs. However,
for this learning to take place with effectiveness, a grammar for the
script is needed that gives the plausible alternative combination
of shapes (glyphs) that make up the aksharas. Such a grammar is
glyph-based and remains fixed once it becomes reasonably comprehensive
for a script. (It does not depend on either the specific codes used
for glyphs or on the different fonts.)
Generation of the converter from glyph codes (in the given glyph file)
to ACII codes, takes place in stages, each stage broadly requiring
two steps:
- (1) First, the matching program using the plausible grammar generates
a map table between glyphs and the codes used.
- (2) Second, the grammar and the map table are used to generate a
converter.
The above two steps are repeated a number of times. In the first stage,
only those rules are selected from the grammar which are certain
(e.g., no alternatives). In the second stage, grammar rules having
alternatives are given. Besides finding the map table, the matcher
also selects those rules (out of the alternatives) which are the
best. Subsequently, another convertor module is produced. Finally,
the same process might be repeated a third time to select those rules
out of alternative rules for which the evidence is weak. Again a new
converter is produced.
The overall converter, it is expected, will be able to convert 90
to 95% of the glyphs with a page or two of sample ACII text. Some
glyphs might remain unconverted because the learning data (namely, the
given ACII file and the corresponding initial part of glyph file) might
not be comprehensive, and the program might not have seen some glyphs
as they do not occur in this learning data. It is hoped that once
a large part of the file is converted, the user will be able to
supply the missing code map and the rules manually, (by looking at
the remaining part of the glyph file and making intelligent guesses)
using which the converter would become complete.
|
4. A PLAUSIBLE GLYPH-GRAMMAR FOR DEVANAGARI
(TOP)
The glyph-grammar can have at least three types of rules: Type 1
rules are applicable across all the Indian languages. For example,
the rule regarding absorbing of 'a' (¤) vowel in the akshara, and
its subtraction by means of halant symbol (è) holds across all the
scripts of Indian languages. Type 2 rules hold for Devanagari, and
perhaps for some other scripts. For example, how the vertical-bar (Ú),
can be neatly removed from many of the consonents to show absence
of vowel would come under this type. Many of the rules dealing
with whole classes of consonents given below are of this type. Type
3 rules pertain to idio-syncratic combination of glyphs to achieve
beauty. An actual grammar for Devanagari is given below.
The glyph grammar is written by giving rules for converting aksharas
to glyphs. The left hand side gives the actual ACII codes used for
the aksharas whereas the right hand side is a description for combining
glyphs. The codes need not be specified for glyphs (the program
will learn these automatically). For example, usually there is a single
glyph for the akshara 'Ø' (h). This would be written as:
Ø -> &Ø
In the rule above, the left side gives the actual ACII code for 'Ø'
whereas &Ø is a name for a constant one byte code for the glyph.
Any arbitrary name could have been chosen, for example, the above
rule could have been written as
Ø --> &h (1)
as long as the chosen name is used consistently in the entire grammar.
|
4.1 AKSHARAS WITH 'Ú' ON THE RIGHT
>
(TOP)
A very important glyph in Devanagari is the vertical bar 'Ú' which
occurs very frequently. For example, the akshara 'µ' is always made up
of two glyphs: #'µè' and 'Ú'.
This is shown as:
µ --> &g &A
where &g = #'µè' and &A = 'Ú'. The fact that 'Ì' is also made up of
two glyphs can be shown as:
Ì --> &m &A
Note that '&A' is simply a name used to refer to the second glyph for
these aksharas. The above rules can be made more readable by chosing
more suggestive names:
µ --> &µ &Ú (2)
Ì --> &Ì &Ú
It does not alter their meanings.
For half-aksharas (or pure consonents) for 'µ' and 'Ì', there are two
ACII characters whereas there is a single glyph. This is written as:
µ è --> &µ
Ì è --> &Ì
at this point we can refine the rules in (2) to:
µ/[^è] --> &µ (2')
Ì/[^è] --> &Ì
(where &µ is a name for the first glyph namely #µè, and so on for &Ì).
Expression '/[^è]' means that the character ('µ' or 'Ì') is not
followed by halant character 'è' (in the usual Unix syntax for regular
expressions where '[..]' specifies a list, and '[^..]' specifies the
negation of the list).
Rules similar to (2') and (3') apply not just to 'µ' and 'Ì' but
all aksharas that have a vertical bar (Ú) at the end. It would be
cumbersome to write these separately each time. Therefore, symbols
denoting a class can be used:
$µ/[^è] --> %µ &Ú (2")
$µ è --> %µ (3")
$µ = '[´µ¶¸º¼ÂÃÅÆÈÊËÌÍÑÔ×]' (4)
Here, '$µ' denotes a class. The actual members of the class (namely,
the actual byte codes in ACII) are enumerated in (4). '%µ' is a class
variable for glyphs associated with the members of the same class
$µ. The fact that the same symbol is used with '$' and '%' gives it
a special meaning. Class %µ has the same cardinality as $µ and there
is a one to one correspondence between the elements of the two. In
other words, for every element of $µ there is a unique element of %µ
and vice versa. This is another way of saying that for every ACII
code in $µ, there is a unique associated glyph code which satisfies
the rules (2") and (3").
Most matras (vowel marks on the aksharas) follow normally after
the above:
µå -> &µ &Ú &å
µÚ -> &µ &Ú &Ú
There is no need, therefore, to give these repeatedly with each akshara
or class of aksharas. They can be given once, for the matra itself.
There are some alternative glyph combinations one of which would be used
in making up the matra. This is expressed as follows:
å -> &å || &Ú &á
The above rule says that ACII code 'å' can be made up of one or two
glyphs. (Another alternative where the two glyphs can come in any
order is ignored here to keep the grammar simple.)
It is important to note that the above is not a grammar with two
possibilities, but represents two different grammars. Only one
of the two will apply to a given font. Once a rule is selected,
the other rule will cease to exist for that font.
We illustrate the power and flexibility of this grammar system by
showing a case where the glyphs for matras combine in "unusual" ways
to yield aksharas and different matras. This case was actually found
to occur in many different fonts:
µ á --> #µè å
'µ' followed by 'á' matra was broken up into two glyphs: half akshar
'#µè' and a different matra 'å'. The following grammar rules cover
this case:
$µ á --> %µ &å || %µ &Ú &á (5)
$µ â --> %µ &æ || %µ &Ú &â (6)
Similarly, rule (2") must be modified to exclude the above case:
$µ/[^èáâ] -> %µ &Ú
$µ/[^èáâ] -> %µ &Ú
|
4.2 AKSHARAS WITH 'Ú' IN THE MIDDLE
>
(TOP)
Rules are also needed for '³' class in which 'Ú' occurs in the middle:
$³ = '[³É]'
$³/[^èáâÝÞ] Þ] --> %k &Ú &{k2} (7)
Note that &{k2} is the name for a glyph code which represents the
right part of akshara '³'. (Curly brackets are used to indicate the
start and end of the name.)
The glyphs for matras are intermixed with glyphs for '³' class given by the following rules:
$³ á --> %³ &å &{k2} || %³ &Ú &á &{k2} (8)
$³ â --> %³ &æ &{k2} || %³ &Ú &â &{k2}
$³ Ý --> %³ &Ú &Ý &{k2}
$³ Þ --> %³ &Ú &Þ &{k2}
we also need:
$³ è --> %³ &Ú &- (9)
|
4.3 AKSHARAS WITHOUT 'Ú'
>
(TOP)
The '½' class (t-class) consists of those aksharas which do not have
a vertical bar at all. All these aksharas are accounted for by the
following grammar:
$½ = '[½¾¿À¹Ä]'
$½/[^è] --> %½
Half akshara of '½'class is written using halant as follows:
$½ è/[^$½] --> %½ &è
$½1è$½2 --> %½1 &è %½2
If there are more than one occurrences of a variable ($½ above),
they are numbered, if necessary, to refer to the occurrences ($½1
and $½2 above). (This feature is currently not implemented).
In the above example, it can be also be written as:
$½ è/[$½] --> %½ &è
In the above, the half akshara of t-class is written using halant mark.
The conjoint clusters for ½-class (t-class), however, are also written
one below the other in many fonts:
½è¾ as #½è¾
These would have to be enumerated in the grammar:
½ è ¾ --> &{t_T}
¾ è ¾ --> &{T_T}
¿ è ¿ --> &{d_d}
|
4.4 OTHER SPECIAL CASES
>
(TOP)
Similarly, there are other cases. Let us discuss the case of Ï (r)
following a half-akshar. In case of 'Ì' (m) followed by halant and 'Ï'
(r), the resulting shape can be constructed in several different ways:
Ì è Ï /[^è] --> &Ì &Ú &Ï || &Ì &AR || &{mR} &Ú
#Ìè Ú #R #Ìè #AR #mR Ú
Which of these three actually is used by a font, will have to be
selected after the matching process. To cover the whole µ-class
(g-class) we can write:
$µ è Ï /[^è] --> %µ &Ú &Ï || %µ &q || %1µ &Ú
#µè Ú #R #µè #AR #gR Ú
Similarly, the t-class (½-class) is handled as follows:
$½ è Ï --> %½ &{rr}
½ #^
To cover the k-class (³-class) we have:
$³ è Ï --> %³ &Ú &r &{k2} || %³ &AR &{k2}
#³èÏ # Ú #/ # # # #
and (9) can be rewritten as:
$³ è/[^Ï] --> %³ &Ú &- (9')
|
5. PRE AND POST FILTERS
(TOP)
There are some additional features in the scripts, to handle which
the grammar notation can become very complex.
They are explained below using a grammar like notation except that
'*' operator is also used.
5.1 PHENOMENON OF I-MATRA (Û)
>
(TOP)
The glyph for i-matra (Û) is put on the left side of akshara it
follows:
akshara codes glyph codes
³ Û --> Û ³
The above case is not difficult to handle in the grammar
notation. However, the full phenomenon is more complex: the i-matra
moves to the left of all the half-consonents before the akshara
(written below in a pseudo grammar notation):
akshara codes glyph codes
($c1è)* $c2 Û --> &Û (%c1è)* %c2
|
5.2 PHENOMENON OF REF-MATRA ( #R)
>
(TOP)
The half akshara of r (Ï) becomes a matra (called ref) and is put on
the full akshara that follows it:
Ï è Ô/[^è] --> ÏèÔ
(In the above, the ref-matra is put on top of the akshara: Ô.)
A rule can be written for all consonents ($c):
Ï è $c/[^è] --> %c &R
More generally, it jumps over half aksharas to the first full akshara
that follows it (written below in a pseudo notation):
Ï è ($c1è)* $c2/[^è] --> (%c1 &è)* %c2 &R
Both of the above phenomena require a movement over "long" distance.
Rather than introduce the '*' operator in the grammar, and make the
grammar and the implementation more complex, we introduce the notion
of pre- and post- filters. These are programs, that take an input
file and change it in pre-defined ways.
A pre-filter can be defined on the akshara file. It moves the i-matra
(Û) and ref-matra (#ref) to a place where the corresponding glyph
would be located. For example, the following two pre-filters given
here using perl code would be applied.
$c = '[...]'; # All the aksharas for consonents
s/(($cè)*)($c)Û/Û$1$3/g; # Filter(1)
s/Ïè(($cè)*)($c)([^è])/$1$3#ref$4/g; # Filter(2)
Note that the ref-matra is introduced as a special code (as if a
special matra).
The pre-filter is applied on the akshara file, because trying to move
the glyph for i-matra (&Û) or ref-matra (&{ref}) would not be possible
on the glyph file, because the code map is not known.
Having made this change to the akshara file, the matching grammar rule for the filtered akshara file and glyph file can be written as follows:
Û --> &Ú &i || &Û
#r --> &R
Once we define a pre-filter for the akshara file, the generated
converter when applied to the glyph file would generate such an akshara
file. A post-filter would be needed to get a proper akshara file.
The following would be the post filter (in Perl code):
s/Û(($cè)*)($c)([^è])/$1$3Û$4/g;
s/(($cè)*)($c)$ref/Ï è$1$3/g;
(In earlier work on writing converters rapidly by VC and APK, only
the filters were used for the script conversion. The converter simply
consisted of a series of filters, each filter handling something
roughly what a grammar rule covers. There was no separate grammar.
That implementation is also available.)
|
6. GENERATING THE CONVERTERS - ALGORITHM:
(TOP)
There are three tasks in building the converters:
- (1). Find the code map for glyphs by comparing the sample "unknown"
glyph file with its corresponding "known" ACII file.
- (2). Select the grammar out of the given possible grammars.
- (3). Generate the converters using the selected grammar and the
code map.
The above tasks are repeated in stages. In the first stage, only
that part of the code map is generated about which the system is
"reasonably" sure. To find the code map in stage 1, only those
grammar rules are used for which there are no alternataives, and
then only those codes are put in the code map for which there is an
"overwhelming" evidence (Figure 1).
ACII Input
file glyph file
| |
V V
glyph- |-----------------------------------|
grammar ------> | find possible codes |
|-----------------------------------|
|
|
possible codes
|
|
V
|---------------------|
| weigh evidence |
|---------------------|
|
V
code map (after stage 1)
Figure 1: find-code-map: finding code map for a grammar (stage 1)
At the end of stage 1, usually only a partial code map is obtained. Using this code map (and the grammar), a converter is generated from glyphs to ACII (Fig 2).
code-map Input glyph file
| |
V V
glyph- |-----------------| |----------------------|
grammar ----->| gen-converter | =====> | glyph-ACII converter |
(stage 1) | | | (after stage 1) |
|-----------------| |----------------------|
|
|
V
(partially) converted file
Figure 2: Generating converter (stage 1)
This partial converter is applied to the given sample glyph file
and the resulting partially converted file is displayed along with
the corresponding ACII, so that the user can verify and assist the
process, if he so desires.
In the second stage, whenever a grammar rule has alternatives, the
correct one needs to be chosen, wherever possible. To do this, the
"unknown" glyph file is converted using the partial converter as
mentioned above. This is compared with the "known" ACII file, and
the matched parts are marked in both. Now the corresponding unmatched
parts are extracted to give two new files: unmatched parts from ACII
file, and corresponding unmatched parts from glyph file. These become
the new input files, against which the alternatives of the rules
are tried one by one. For each rule, that alternative is selected
which gives the best match without violating the existing code map.
The selected rules are added to the rules from the previous stage,
and the new codes are added to the code map from the previous stage.
Now with the augmented code map and augmented grammar rules, a new
converter is generated, the effect of which is again shown to the user.
At the end of this stage, 70% to 80% conversion is likely to be reached
with a page or two of sample text.
Further stages can be tried by repeating exactly the same steps as
in stage 2. Only a relaxation will be made in allowing the program
to make guesses regarding rules and code map on lesser evidence.
The purpose of delaying the guesses is that hypotheses with stronger
evidence are established first, before trying the tentative ones.
The process from stage 2 onwards is shown in Fig 3:-
Unmatched Unmatched
parts of ACII file parts of glyph file
(after stage n-1) (after stage n-1)
| |
V V
glyph- |--------------------------------|
grammar ---> | find-code-map |
(rules with |--------------------------------|
alternatives |
for stage n) V
code maps (for alter- code map
native rules in grammar (from stage n-1)
| |
V V
|----------------------------------------|
| select alternatives from |
| grammar rule and corresponding code map|
|----------------------------------------|
| |
Selected grammar rule additional code map
| |
V V
|---| |---|
grammar ---> | + | | + | <-- code map
(from stage n-1) |---| |---| (from stage n-1)
| |
V V
augmented grammar augmented code map
(after stage n) (after stage n)
| |
| | Input glyph file
| V |
| |----------------| |----------------|
.-----> | gen-converter |=====>| glyph-ACII |
| | | Converter |
|----------------| | (after stage n)|
|----------------|
|
original ACII file Converted file
| (after stage n)
V |
|---------------------------------|
| comparator |
|---------------------------------|
| |
V V
unmatched parts of ACII unmatched parts of glyph
file (after stage n) file (aftger stage n)
Figure 3: nth stage of Converter generation (for n > = 2)
The algorithm given in Fig. 3, has been implemented in Perl. The
actual converter is generated using Flex, and is extremely fast.
|
In this paper, it has been shown how converters between an unknown
coding scheme and the standard ACII coding scheme can be generated
semi-automatically. A glyph-grammar for Devanagari has been presented.
It is discussed how such a grammar has been used along with sample
parallel text files (ACII file and given unknown glyph file with the
same text content) to generate the converters. The user can help the
generation process by providing information wherever the information
about conjunct-clusters etc. is not available in the sample files.
Glyph table can also be used if available.
The glyph-grammar for a script does not depend on the coding schemes,
and therefore, needs to be written only once for a script. We invite
people to write glyph-grammar for the script of their languages. After
that it is straight-forward to generate converters for the unknown
word processors or fonts.
The system described here for generating code converters, is available
as GPL free software (from anu.uohyd.ernet.in using anonymous
ftp). People are invited to download it, try it out, as well as
augment the implementation.
|
Research reported here has been supported by Department of Electronics,
Government of India as part of Anusaaraka project under Technology
Development for Indian Languages programme. It was carried out while the
authors (VC, APK, RS) were at IIT Kanpur Centre for NLP at Hyderabad.
The anusaaraka activity is being carried out by IIT Kanpur and
University of Hyderabad. Satyam Computers has now joined this activity
and is also providing support for it. NS has been contributing her
effort voluntarily and has implemented the system described here.
|
1. Bharati, Akshar, Vineet Chaitanya, Amba P. Kulkarni, and Rajeev
Sangal, Anuvada ke upakarana: samganaka aur bhashaem (Tools of
translation: computer and languages), Course material for PG Diploma
in Translation Studies (Hindi), Distance Education, University of
Hyderabad, 1998c. (Chapter 2).
2. Bharati, Akshar, Vineet Chaitanya, and Rajeev Sangal, Natural
Language Processing: A Paninian Perspective, Prentice-Hall of
India, New Delhi, 1995.
3. Bharati, Akshar, Vineet Chaitanya, Amba P. Kulkarni, Rajeev Sangal,
and G Uma Maheshwar Rao, Anusaaraka: Overcoming the Language Barrier
in India, In Anuvad: Approaches to Translation, Rukmini Bhaya Nair
(editor), Katha, New Delhi, 1998 (forthcoming).
Anusaaraka Home Page
|
|
|