General techniques for creating treebanks


Abstract
 
Since treebanks for English and other languages were first released, beginning in the late 1980s, natural language processing has made tremendous progress. Treebanks allow researchers to build statistical models for tasks such as part-of-speech tagging and parsing, and the resulting technologies have proved far more robust than the hand-crafted systems popular in the 1980s. Today, treebanks have been constructed for many languages, including Arabic, Czech, Chinese, French, German, Korean, Spanish, and Turkish. So far, however, building treebanks has been an art rather than a science, as there are no hard and fast rules.

In this talk, I will go through several major issues in treebank development, including the creation of annotation guidelines, the use of annotation tools, and quality control. I will use the Chinese Penn Treebank as an example to illustrate the obstacles that treebank designers often encounter while building a large-scale treebank. I will then discuss several uses of treebanks, including POS tagging, parsing, and grammar extraction. I will conclude with remaining issues and future directions.


