Keynote Lectures

 

 

 

 

Anusaaraka: An Approach to Machine Translation

 

Akshar Bharati, Vineet Chaitanya¹, Amba Kulkarni²

¹ Rashtriya Sanskrit Vidyapeetha, Tirupati, India

² Department of Sanskrit Studies, University of Hyderabad, Hyderabad, India

{vc9999999@gmail.com, apksh@uohyd.ernet.in}

 

 

Rule-based Machine Translation (MT) needs a large effort, of the order of "100 man years" or so. On the other hand, statistical methods need an aligned bilingual corpus of substantial size (1.5 million words or more). Can we not benefit from the large bilingual population that India has? Wikipedia and ConceptNet are examples of what people in general can do. So, in the case of MT, what is needed is the right kind of environment and tools to enable people to contribute to it.

 

We describe Anusaaraka, an approach to Machine Translation:

 

  A) which takes advantage of

 

 1. existing software tools for analyzing the English language,

 2. an existing bilingual English-Hindi dictionary, and

 3. an existing architecture for MT,

           all of which are available under the General Public License.

 

B) which has the following features:

  1. It reduces the complexity involved in MT by projecting different components, such as the parser, the Word Sense Disambiguation module, and the target-language word-order module, orthogonally.

     Anusaaraka thus serves as a reading aid, promising 100% comprehension with a little effort.

  2. It has "Human Understandable" interfaces to the existing software tools, so that even a layman can provide the necessary inputs without much formal training.
  3. It may be used as a tool for the automatic generation of the "Parallel Corpus" necessary for Machine Learning techniques. This "Parallel Corpus" has controlled complexity, since the "Word Sense Disambiguating" module is separated from the "Word Ordering" module. This should make the "machine learnt" part tractable for human beings as well.
  4. It has a proper blend of "hand-crafted knowledge" and "automatic learning".

 

C) which also serves as a workbench for NLP students

  1. To run and test various components, and to combine them in creative ways by writing simple glue programs (a minimal sketch follows this list).
  2. To manipulate the knowledge base and evaluate the results.
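
Below is a minimal, purely illustrative glue program of the kind meant here; the component names (parse, choose_senses, reorder) and the toy dictionary are invented stand-ins for this sketch, not Anusaaraka's actual modules.

```python
# Illustrative glue program: chain three orthogonal components into a pipeline.
# All module names and the toy dictionary are hypothetical placeholders.

def parse(sentence):
    # Placeholder for an English parser producing (word, POS) pairs.
    return [(w, "UNK") for w in sentence.split()]

def choose_senses(tagged, dictionary):
    # Placeholder word-sense module: pick the first Hindi equivalent listed.
    return [dictionary.get(w, [w])[0] for w, _ in tagged]

def reorder(words):
    # Placeholder target-language word-order module (identity here).
    return words

def translate(sentence, dictionary):
    """Run the parser, sense-selection, and word-order modules in sequence."""
    return " ".join(reorder(choose_senses(parse(sentence), dictionary)))

if __name__ == "__main__":
    toy_dict = {"give": ["do"], "me": ["mujhe"], "water": ["paanii"]}
    print(translate("give me water", toy_dict))
```

Because each stage only consumes the previous stage's output, a student can swap in a different parser, sense-selection rule, or reordering scheme without touching the rest of the pipeline.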

 

On Estimating Probability Mass Functions from Small Samples

 

Sanjeev P. Khudanpur

 

The Johns Hopkins University, USA

{khudanpur@jhu.edu}

 

 

Probabilistic models, with parameters estimated from sample data, are pervasive in natural language processing, as is the concomitant and age-old problem of estimating the necessary probabilities from data. A novel and insightful view of a recurring problem in this context will be presented, namely the problem of estimating a probability mass function (pmf) for a discrete random variable from a small sample. Formally, a pmf will be deemed admissible as an estimate if it merely assigns a higher likelihood to the observed value of a sufficient statistic than to any other value possible for the same sample size. The standard maximum likelihood estimate is trivially admissible by this definition, but so are many other pmfs. It will be shown that the principled selection of an estimate from this admissible family via criteria such as minimum divergence leads to inherently smooth estimates that make no prior assumptions about the unknown probability while still providing a way to incorporate prior domain knowledge when available. Widely prevalent practices such as discounting the probability of seen events, and ad hoc procedures such as back-off estimates of conditional pmfs, will be shown to be natural consequences of this viewpoint. Some newly developed theoretical guarantees on the accuracy of the estimates will be provided, and empirical results in statistical language modeling will be presented to demonstrate the computational feasibility of the proposed methods.
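
For concreteness, one way to write down the admissibility criterion in the simplest (multinomial) case is sketched below; the notation (alphabet size K, count vector T, candidate estimate q) is introduced purely for illustration and is not taken from the talk.

```latex
% Illustrative notation (not from the talk): X_1,...,X_n are i.i.d. draws from
% an unknown pmf p on the alphabet {1,...,K}; the count vector
% T = (n_1,...,n_K), with n_1 + ... + n_K = n, is a sufficient statistic.
%
% A candidate estimate q is admissible if the observed counts are at least as
% likely under q as any other count vector attainable from n draws:
\[
  \Pr_q\!\bigl(T=(n_1,\dots,n_K)\bigr) \;\ge\; \Pr_q\!\bigl(T=(m_1,\dots,m_K)\bigr)
  \qquad \text{for all } m_k \ge 0 \text{ with } \textstyle\sum_{k} m_k = n,
\]
% where, under the multinomial model,
\[
  \Pr_q\!\bigl(T=(m_1,\dots,m_K)\bigr)
    = \frac{n!}{m_1!\,\cdots\,m_K!}\;\prod_{k=1}^{K} q_k^{\,m_k}.
\]
% The maximum likelihood estimate \hat{q}_k = n_k / n satisfies this trivially,
% but so do many smoother pmfs; the talk's proposal is to select among them by
% a criterion such as minimum divergence.
```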

 

 

Towards Word Sense Disambiguation in the Large

 

Hwee Tou Ng

National University of Singapore

{nght@comp.nus.edu.sg}

 

 

Word sense disambiguation (WSD) is the task of determining the correct meaning or sense of a word in context. A critical problem faced by current supervised WSD systems is the lack of manually annotated training data. Tackling this data acquisition bottleneck is crucial in order to build WSD systems with broad coverage of words. In this talk, I will present results of our attempt to scale up WSD by exploiting large quantities of Chinese-English parallel text. Our evaluation indicates that our implemented approach of gathering training examples from parallel text is promising when tested on the nouns and adjectives of the SENSEVAL-2 and SENSEVAL-3 English all-words tasks. This is joint work with Yee Seng Chan.
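
A rough sketch of the general recipe is given below; the data structures, alignment format, and the manually specified mapping from Chinese translations to sense labels are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: harvest sense-tagged training examples from
# word-aligned Chinese-English parallel text, using the aligned Chinese
# translation of an English word as a proxy for its sense.

# Chinese translations grouped by sense of the English word
# (illustrative entries and sense labels only).
SENSE_OF_TRANSLATION = {
    "bank": {"银行": "bank#financial-institution",
             "河岸": "bank#river-bank"},
}

def harvest(parallel_corpus):
    """Yield (english_sentence, target_word, sense) training examples.

    parallel_corpus: iterable of (english_tokens, chinese_tokens, alignment),
    where alignment is a list of (en_index, zh_index) word-alignment pairs.
    """
    for en_tokens, zh_tokens, alignment in parallel_corpus:
        for en_i, zh_i in alignment:
            word = en_tokens[en_i].lower()
            senses = SENSE_OF_TRANSLATION.get(word)
            if senses and zh_tokens[zh_i] in senses:
                yield " ".join(en_tokens), word, senses[zh_tokens[zh_i]]

# Example usage with a single toy sentence pair:
corpus = [(["He", "sat", "on", "the", "river", "bank"],
           ["他", "坐", "在", "河岸", "上"],
           [(5, 3)])]
for example in harvest(corpus):
    print(example)
```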

 

 

The Semantic Quilt: Contexts, Co-occurrences, Kernels, and Ontologies

 

Ted Pedersen

University of Minnesota, Duluth

{tpederse@d.umn.edu}

 

 

Determining the meaning of words and phrases in text has been a central problem in Natural Language Processing for many years. As a result, there is a wealth of approaches available, including knowledge-based methods, unsupervised clustering approaches, and supervised learning techniques. At present these methods are generally used independently, to good but limited effect. In this talk I will provide an overview of these approaches, and show how they can be combined into a single framework that expands their coverage and effectiveness well beyond their individual capabilities.
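
As one hypothetical illustration of such a combination (not taken from the talk), a system might apply a supervised classifier wherever annotated training data exist and fall back on a knowledge-based method elsewhere, widening coverage beyond what either approach achieves alone.

```python
# Hypothetical back-off combination of two WSD strategies; the interfaces
# (predict method, knowledge_based_wsd function) are assumptions for this
# sketch, not components described in the talk.

def disambiguate(word, context, supervised_models, knowledge_based_wsd):
    """Return a sense label for `word` in `context`.

    supervised_models: dict mapping a word to a trained classifier exposing
    predict(context) -> sense (assumed interface).
    knowledge_based_wsd: function(word, context) -> sense, e.g. a gloss- or
    ontology-based similarity method.
    """
    model = supervised_models.get(word)
    if model is not None:
        return model.predict(context)          # high precision where data exist
    return knowledge_based_wsd(word, context)  # broad coverage elsewhere
```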