L01 Applications - Info Extraction

INFORMATION EXTRACTION
o Info of interest is specified - described in natural language
o Output database schema is specified
* Typically, sample training input texts are given
--------------
EXAMPLE TASK
o Specification: For each terrorist event, system should determine the type of attack (bombing, arson, etc.), date, location, perpetrator, targets, effects on target.
--------------
EXAMPLE INPUT
[SOURCE TYPE="newspaper", TITLE="The Hindu", DATE="27 Sept 1999", EDITION="Hyderabad", PAGE=1]
[HEADLINE text="JD(S) candidate escapes unhurt in mine blast"]
Srinagar, Sept 26, 1999. The Janata Dal (Secular) candidate for Anantnag parliamentary constituency, Peerzada Abdul Hamid, had a miraculous escape when a landmine planted by militants blew up his escort vehicle, seriously wounding eight CRPF personnel in south Kashmir today. ...
--------------
EXAMPLE OUTPUT
o TYPE = explosion
o DATE = 26.09.1999
o LOC = Anantnag
o PERPETRATOR = NOT_AVAILABLE
o PHYS_TARGET = NONE
o HUMAN_TARGET = Peerzada Abdul Hamid
o EFF_ON_HT = unhurt
o EFF_ON_OTHERS = eight CRPF personnel
o INSTR = landmine
--------------
MOTIVATION
o Large amount of info exists only in NL
o If the info is loaded in a more structured form, it can be processed more easily
  - NL understanding task broken up into more manageable stages
--------------
SOME RESULTS
Message Understanding Conference-1 (MUC-1) 1987
o Participants given description of scenario
o Templates to be extracted
o Training corpus
o Time: 1-6 months
o Test corpus
o Systems run and outputs returned
o Organizer compares with manually filled templates (answer key)
o Scores assigned - precision and recall
o F-measure combines precision and recall
--------------
SCORING METHOD
There is a universe of search.
o N-key = Total number of answer keys in the universe
o N-resp = Total number of system responses
o N-corr = Total number of correct responses
- Precision (quality) = N-corr/N-resp
- Recall (quantity) = N-corr/N-key
- F-measure = 2 * precision * recall / ( precision + recall )
--------------
MUC BEST SCORES - Precision and Recall
- MUC-3  1991  Terrorist attacks                    60% & 50%
- MUC-4  1992  Terrorist attacks                    65% & 60%
- MUC-5  1993  Joint ventures in microelectronics   57% & 53%
- MUC-6  1995  Changes in mgmt personnel            73% & 58%
- MUC-7  1998  Missile/air launches                 68% & 48%
MUC-5 domain was new and proved complex. Performance went down.
--------------
TYPICAL PROCESSING STAGES
o Text zoner
  - Break into text segments - tables, text, headers
o Pre-processor
  - Identify part-of-speech, dates, times, person and company names, locations, currency amounts
o Filter
  - Remove irrelevant sentences - simple techniques
o Preparser
o Parser
o Fragment combination
o Semantic interpretation
o Lexical disambiguation
o Coreference resolution
o Template generation
--------------
APPROACHES
o Knowledge engineering
  - Molecular approach
  - Atomic approach
o Learning and statistical
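--------------
SCORING SKETCH (illustrative)
The precision, recall and F-measure formulas under SCORING METHOD can be worked through in a few lines of Python. This is only a minimal sketch, not the official MUC scorer. The function names, the exact-match rule for a "correct" slot, the treatment of empty inputs as 0.0, and the system response shown are all assumptions made for illustration. The answer key uses a subset of the slots from EXAMPLE OUTPUT.

```python
def count_slots(key: dict, response: dict) -> tuple[int, int, int]:
    """Return (N-key, N-resp, N-corr) for one template.

    Assumption: a response slot is correct only if it matches the
    answer key exactly; real scorers may award partial credit.
    """
    n_key = len(key)
    n_resp = len(response)
    n_corr = sum(
        1 for slot, value in response.items()
        if slot in key and key[slot] == value
    )
    return n_key, n_resp, n_corr


def muc_scores(n_key: int, n_resp: int, n_corr: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) from the three counts.

    Assumption: a zero denominator yields 0.0 rather than an error.
    """
    precision = n_corr / n_resp if n_resp else 0.0
    recall = n_corr / n_key if n_key else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Answer key: a subset of the slots from EXAMPLE OUTPUT.
answer_key = {
    "TYPE": "explosion",
    "DATE": "26.09.1999",
    "LOC": "Anantnag",
    "INSTR": "landmine",
}

# Hypothetical system response: wrong location, missing instrument.
system_response = {
    "TYPE": "explosion",
    "DATE": "26.09.1999",
    "LOC": "Srinagar",
}

counts = count_slots(answer_key, system_response)
p, r, f = muc_scores(*counts)
print(f"precision={p:.2f} recall={r:.2f} F={f:.3f}")
# prints: precision=0.67 recall=0.50 F=0.571
```

Here the system gives 3 responses, 2 of them correct, against 4 answer keys, so precision is 2/3, recall is 2/4, and the F-measure (4/7) lies between the two.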