Scalable Techniques for Information Extraction

Abstract

Information extraction from terabytes of text is becoming an increasingly important requirement. In this presentation, we will review the state of the art in scaling up information extraction from large volumes of unstructured text. Two key topics will be covered: (a) specialized indexing techniques and (b) grammar-based versus algebra-based paradigms for information extraction. We will also discuss the challenges involved in building and maintaining rule-based annotators for information extraction. To conclude, we will demonstrate two systems focused on scalable rule-based information extraction.