|Invited talk||:||"Querying Linguistic Databases"|
Department of Computer Science, University of Melbourne, Australia &
Linguistic Data Consortium, University of Pennsylvania, USA.
Linguistic research and language technology depend on annotated
databases of text and speech. These databases are growing in size and
complexity to the point where many curation and analysis tasks have
become extremely onerous. Special-purpose scripts and query systems
clearly do not scale. Relational database technologies offer the
required scalability, but their data model is ill-suited to
linguistic representations. Accordingly, we turn to XML, whose ordered
tree model and associated language, XPath, are natural choices for
storing and querying linguistic data. We augment XPath with several
expressive features required for linguistic queries, and show how the
extended language can be translated into well-understood database
languages, permitting us to derive expressiveness and efficiency
properties. Experiments demonstrate that the query system is
significantly faster than other linguistic query systems for a wide
range of queries. We conclude with some case studies which show how
we can replace existing special-purpose scripts and query systems with
our new system, for expressive and efficient linguistic database
curation and analysis.
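The ordered-tree model described in the abstract can be sketched with Python's standard xml.etree module, whose findall method supports a subset of XPath. The parse tree and tag names below are invented for illustration and are not drawn from the talk's query language:

```python
import xml.etree.ElementTree as ET

# A toy parse tree for "the dog saw a cat", encoded as ordered XML
# with one element per syntactic node (hypothetical encoding).
sent = ET.fromstring(
    "<S>"
    "<NP><DT>the</DT><NN>dog</NN></NP>"
    "<VP><VBD>saw</VBD>"
    "<NP><DT>a</DT><NN>cat</NN></NP>"
    "</VP>"
    "</S>"
)

# XPath-style query: every NP that immediately dominates a determiner.
nps = sent.findall(".//NP[DT]")
print([" ".join(w.text for w in np) for np in nps])
# ['the dog', 'a cat']
```

A full linguistic query language needs more than this XPath subset (e.g. immediate precedence and subtree scoping), which is the kind of extension the abstract refers to.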
For speaker details, please visit http://www.ldc.upenn.edu/sb
|Invited talk||:||"The Empirical Revolution in Natural Language Processing"|
|Speakers||:||Lucy Vanderwende and Arul Menezes, Microsoft Research, Microsoft Corporation, USA|
The past decade has witnessed a revolutionary shift in the methods that are used in Natural Language Processing (NLP)
from heuristic to empirical algorithms. While heuristic algorithms encode our introspective knowledge of the language
phenomena, the scope and complexity of language is such that, given data, learning algorithms have the potential to
capture a much more robust and complete picture of language than is feasible for a human. In this talk we examine common
pitfalls in the transition from the heuristic to the empirical approach, illustrated by our own experience at Microsoft.
Data is paramount to an empirical program, and the quality and quantity of the data will determine the success of the NLP system. When creating corpora, the annotation level and guidelines should be carefully considered, to ensure that they embody the linguistic analysis that best serves the community, and that internal consistency, i.e. a high degree of inter-annotator agreement, can be achieved. A central question before this community is what type of analysis best suits natural language processing of Indian languages: a constituency analysis as for English, a dependency analysis as for Japanese, or a different structure altogether.
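Inter-annotator agreement, mentioned above as the test of internal consistency, is commonly quantified with Cohen's kappa, which discounts agreement expected by chance. A minimal sketch; the annotators, labels, and values are made up for illustration:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' label sequences."""
    assert len(a) == len(b)
    n = len(a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label distribution.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators POS-tagging six tokens.
ann1 = ["N", "V", "N", "N", "V", "N"]
ann2 = ["N", "V", "N", "V", "V", "N"]
print(round(cohens_kappa(ann1, ann2), 3))
# 0.667
```

A kappa near 1 indicates the guidelines yield consistent annotation; values much lower suggest the annotation scheme needs revision.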
The effort of developing annotated data for new areas of study in NLP, and of developing resources for languages that have not been studied previously, has often been considered time-consuming. However, current research shows that useful components can be built with astonishingly small amounts of annotated data. Furthermore, once a corpus exists, systems can be quickly trained to reproduce its level of annotation, and these systems improve as researchers gain more insight into the linguistic phenomena being modeled. The corpus also provides lasting value, as it is reused over time as more advanced training methods are developed.
Competitive evaluations can play a key role in accelerating the pace of NLP research. Progress is measured by evaluation of linguistic components or end-to-end systems. A shared task and evaluation metric can create a research community that shares a common goal and methodology and speaks a common language, enabling easier exchange of ideas between research groups. A credible automatic metric must be shown to correlate well with human evaluations and with a real-world task; we illustrate this by contrasting the experience of MT and summarization evaluations. Ideally, the evaluation method should be automatic, so that it can be used during system development.
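The requirement that a credible automatic metric correlate well with human judgments can be checked with a plain Pearson correlation between the two sets of scores. The system scores below are invented for illustration, not real evaluation data:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical automatic-metric scores vs. human adequacy judgments
# for five MT systems (illustrative numbers only).
metric = [0.21, 0.34, 0.28, 0.45, 0.39]
human  = [2.9, 3.6, 3.1, 4.2, 3.8]
print(round(pearson(metric, human), 3))
```

A metric validated this way, at the system level and against a real task, can then stand in for costly human evaluation during development.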
The NLP group in Microsoft Research has experience with developing heuristic components, but now works primarily with empirical algorithms. To motivate this shift, we will describe our efforts in Machine Translation (MT), focusing on the advantages and shortcomings of the heuristic, empirical, and hybrid approaches to MT, and emphasizing the role that data availability and evaluation methodology have played in this transition. We will also describe our recent efforts in creating a hybrid approach to Textual Entailment, an area of study with a long history that is now a new sub-area of Computational Linguistics.
For speaker details, please visit http://research.microsoft.com/~lucyv
|Invited talk||:||"Large Vocabulary Speech Recognition"|
|Speaker||:||S. Umesh, IIT Kanpur, India|
In this talk, I will present the current state of the art in transcribing continuous speech from any speaker. Current large vocabulary continuous speech recognition (LVCSR) systems have a vocabulary of around 60,000 words and are speaker-independent, i.e. they can handle speech from any speaker. Further, they can recognise continuous speech, i.e. natural speech without artificial pauses between words. In contrast, commercially available speech recognition software is typically trained for one or a few speakers, has a smaller vocabulary, and requires good microphones for good recognition performance, whereas LVCSR systems handle speech from many different speakers, different channels, and environmental noise. Using Cambridge University's Broadcast News Evaluation (CU-BNE) system as an example, I will describe the various modules in an LVCSR system and the recent trends and practices in building such a system. The Broadcast News Evaluation task involves the transcription of news broadcasts from many American TV and radio shows, and on this task the CU-BNE system has consistently outperformed systems from other laboratories and universities around the world.
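Transcription systems like the one described are conventionally scored by word error rate (WER): the word-level edit distance between the reference transcript and the system output, divided by the reference length. WER is the standard metric for such evaluations, though the abstract does not name it; the sentences below are invented examples:

```python
def word_error_rate(ref, hyp):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed as Levenshtein distance over words."""
    r, h = ref.split(), hyp.split()
    # d[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

ref = "the news broadcast begins at nine"
hyp = "the news cast begins at nine nine"
print(round(word_error_rate(ref, hyp), 3))
# 0.333  (one substitution + one insertion over six reference words)
```

Broadcast-news evaluations of the kind the talk describes report exactly this kind of score, aggregated over many hours of transcribed audio.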