It has generally been the case that in order to build tools for
annotating and analyzing language data, one must first manually annotate a
large body of language data over which such tools can be trained. Human
annotation, however, is an expensive and time-consuming endeavor, generally
requiring significant linguistic expertise on the part of the annotators.
Due to this expense, it has not been feasible to build substantial annotated
corpora for more than ten or so of the world's languages, tragically
leaving the vast majority of the world's 6,000 languages outside the reach of
substantive computational technology.
Some recent efforts have explored the potential for projecting annotations from one language to another. These efforts tap bilingual corpora between some majority language, for which annotated data or annotation tools exist, and some resource-poor language, for which such resources do not. However, these methods can only reliably be applied to very large bilingual corpora (tens of millions of words), limiting their utility to perhaps a few dozen additional languages.
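The core of such projection methods can be illustrated with a minimal sketch: given POS tags for the majority-language side of a sentence pair and a word alignment between the two sides, tags are copied across the alignment links. The function name, the placeholder tag "UNK", and the toy data below are illustrative assumptions, not taken from any particular system.

```python
def project_pos(source_tags, alignments, target_length):
    """Project POS tags across word alignments.

    source_tags: list of POS tags, one per majority-language word.
    alignments: list of (source_index, target_index) pairs.
    target_length: number of words in the resource-poor-language sentence.
    Target words left unaligned receive the placeholder tag "UNK".
    """
    target_tags = ["UNK"] * target_length
    for src_i, tgt_i in alignments:
        # Copy the tag of the aligned source word onto the target word.
        target_tags[tgt_i] = source_tags[src_i]
    return target_tags

# English "the dog sleeps", tagged DET NOUN VERB, aligned one-to-one
# to a hypothetical three-word target sentence.
tags = project_pos(["DET", "NOUN", "VERB"], [(0, 0), (1, 1), (2, 2)], 3)
```

In practice the alignments come from a statistical aligner trained on the bilingual corpus, which is why very large corpora are needed for the projected annotations to be reliable.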
A substantial portion of theoretical linguistic research has focused on minority languages. Field linguists have been collecting data for hundreds of the world's languages for well over a century, and much of the published data has been annotated or enriched by these or subsequent linguists. With the move toward electronic publication, a large portion of linguistic scholarly discourse, and with it a large amount of annotated linguistic data, is making its way to the Web.
Our research has focused on how we can harvest the wide body of data posted to the Web in order to build tools and resources for languages that are typically overlooked by the computational linguistics community. This work comprises four distinct lines: (1) harvesting data, (2) enriching data, (3) building tools and resources, and (4) search. For (1), we have focused on particular common, yet highly enriched, linguistic data types, which we have harvested for hundreds of the world's languages. We then take the harvested data and (2) enrich it further by adding parts of speech, phrase structures, dependency structures, etc., generally through alignment and projection. From these highly enriched snippets of data we (3) build enriched corpora and grammars, which we use to drive the development of taggers and parsers for these languages. The tools we build allow us to enrich more data and (4) make it available to linguistic search, providing new avenues for linguistic and typological discovery.
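One way step (3) can bootstrap a tagger from the enriched snippets is to aggregate the projected tags into a per-word tag lexicon, assigning each source word its most frequently projected tag. The sketch below is a hedged illustration of that idea under the stated assumptions; the function name and the toy Spanish-like data are ours, not from any particular tool.

```python
from collections import Counter

def build_tag_lexicon(projected_sentences):
    """Seed a simple unigram tagger from projection-enriched snippets.

    projected_sentences: list of (words, tags) pairs, where tags were
    projected onto the words via alignment. Returns a dict mapping each
    word to its most frequently projected tag.
    """
    lexicon = {}
    for words, tags in projected_sentences:
        for word, tag in zip(words, tags):
            # Accumulate a count of every tag projected onto this word.
            lexicon.setdefault(word, Counter())[tag] += 1
    # Keep only the majority tag per word.
    return {word: counts.most_common(1)[0][0]
            for word, counts in lexicon.items()}

# Two toy enriched snippets sharing the word "perro".
lexicon = build_tag_lexicon([
    (["perro", "duerme"], ["NOUN", "VERB"]),
    (["perro"], ["NOUN"]),
])
```

A lexicon of this kind is only a starting point; richer models (e.g. taggers trained on the full projected corpora) would smooth over unseen words and noisy projections.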