 

 L01 Applications - Info Extraction


      INFORMATION EXTRACTION


o Info of interest is specified
   - described in natural language

o Output database schema is specified

o Typically, sample training input texts are given
--------------
      EXAMPLE TASK

o Specification:

For each terrorist event, system should determine
the type of attack (bombing, arson, etc.), date,
location, perpetrator, targets, effects on target.

--------------
      EXAMPLE INPUT

[SOURCE TYPE="newspaper", TITLE="The Hindu", DATE="27 Sept 1999", 
EDITION="Hyderabad", PAGE=1]

[HEADLINE text="JD(S) candidate escapes unhurt in mine blast"]

Srinagar, Sept 26, 1999. The Janata Dal (Secular)
candidate for Anantnag parliamentary constituency,
Peerzada Abdul Hamid, had a miraculous escape when
a landmine planted by militants blew up his escort
vehicle, seriously wounding eight CRPF personnel in
south Kashmir today. ...
--------------
      EXAMPLE OUTPUT

o TYPE	= explosion
o DATE	= 26.09.1999
o LOC	= Anantnag
o PERPETRATOR	= NOT_AVAILABLE
o PHYS_TARGET 	= NONE
o HUMAN_TARGET 	= Peerzada Abdul Hamid
o EFF_ON_HT	= unhurt
o EFF_ON_OTHERS	= eight CRPF personnel
o INSTR		= landmine
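The filled template above is just a structured record; a minimal sketch as a Python dict, using the slot names and the slide's own sentinel values (`NOT_AVAILABLE`, `NONE`) for empty slots:

```python
# One filled template for the example article; slot names
# follow the slide, sentinel strings mark unfilled slots.
template = {
    "TYPE": "explosion",
    "DATE": "26.09.1999",
    "LOC": "Anantnag",
    "PERPETRATOR": "NOT_AVAILABLE",
    "PHYS_TARGET": "NONE",
    "HUMAN_TARGET": "Peerzada Abdul Hamid",
    "EFF_ON_HT": "unhurt",
    "EFF_ON_OTHERS": "eight CRPF personnel",
    "INSTR": "landmine",
}
```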
--------------
      MOTIVATION

o A large amount of info exists only in NL text
o If the info is converted to a more structured form,
  it can be processed more easily
   - NL understanding task is broken into more
     manageable stages
--------------
      SOME RESULTS

Message Understanding Conference (MUC-1), 1987

o Participants given description of scenario
o Templates to be extracted
o Training corpus
o Time: 1-6 months

o Test corpus
o Systems run and outputs returned
o Organizer compares with manually filled templates
  (answer key)
o Scores assigned - precision and recall
o F-measure combines precision and recall
--------------
      SCORING METHOD

Scoring is done over a fixed universe of test documents.

o N-key =  Total number of answer-key fills in the universe
o N-resp = Total number of system responses
o N-corr = Total number of correct responses

- Precision (quality)  = N-corr / N-resp
- Recall   (quantity)  = N-corr / N-key

- F-measure = 2 * precision * recall /
                ( precision + recall )
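The three scoring quantities above translate directly into a small function; a minimal sketch (function and argument names are illustrative, not from any MUC scorer):

```python
def score(n_key: int, n_resp: int, n_corr: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) per the definitions above."""
    precision = n_corr / n_resp   # quality: fraction of responses that are right
    recall = n_corr / n_key       # quantity: fraction of key fills that are found
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f
```

For example, 60 correct responses out of 80 returned, against 100 key fills, gives precision 0.75 and recall 0.60.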
--------------
      MUC BEST SCORES - Precision and Recall

- MUC-3	1991 Terrorist attacks	60% & 50%
- MUC-4	1992 Terrorist attacks	65% & 60%
- MUC-5	1993 Joint ventures 	57% & 53%
             in microelectronics
- MUC-6	1995 Changes in mgmt	73% & 58%
             personnel
- MUC-7	1998 Missile/air launches	68% & 48%

The MUC-5 domain was new and proved complex;
performance went down.
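Applying the F-measure formula from the previous slide to the table's precision/recall pairs makes the ranking concrete:

```python
# Best (precision, recall) pairs from the table above.
scores = {
    "MUC-3": (0.60, 0.50),
    "MUC-4": (0.65, 0.60),
    "MUC-5": (0.57, 0.53),
    "MUC-6": (0.73, 0.58),
    "MUC-7": (0.68, 0.48),
}
# F = 2PR / (P + R) for each conference.
f = {name: 2 * p * r / (p + r) for name, (p, r) in scores.items()}
```

MUC-6 comes out highest by F, and MUC-5 scores below its neighbour MUC-4, matching the note about its harder domain.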

--------------
      TYPICAL PROCESSING STAGES

o Text zoner
   - Break into text segments - tables, text, headers
o Pre-processor
   - Identify part-of-speech, dates, times, person 
     and company names, locations, currency amounts
o Filter
   - Remove irrelevant sentences - simple techniques
o Preparser
o Parser
o Fragment combination
o Semantic interpretation
o Lexical disambiguation
o Coreference resolution
o Template generation
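The front end of the stage list can be sketched as a chain of functions; a toy sketch with hypothetical names, where the zoner splits on blank lines and the filter uses the "simple techniques" (keyword matching) mentioned above:

```python
def text_zoner(doc: str) -> list[str]:
    """Split a document into zones (here, simply paragraphs)."""
    return [z.strip() for z in doc.split("\n\n") if z.strip()]

def filter_relevant(zones: list[str],
                    keywords=("blast", "bomb", "mine")) -> list[str]:
    """Keep only zones mentioning attack-related keywords."""
    return [z for z in zones if any(k in z.lower() for k in keywords)]

def pipeline(doc: str) -> list[str]:
    # Later stages (parsing, semantic interpretation, coreference,
    # template generation) are omitted; this is only the front end.
    return filter_relevant(text_zoner(doc))
```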
--------------
      APPROACHES

o Knowledge engineering
   - Molecular approach
   - Atomic approach

o Learning and statistical
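A minimal flavor of the knowledge-engineering route: hand-written patterns that fill template slots directly from text. This is an illustrative sketch only (the rules and slot names are made up, not any MUC participant's actual grammar):

```python
import re

# Hand-crafted rules, one per slot -- the knowledge-engineering style.
RULES = {
    "INSTR": re.compile(r"\b(landmine|bomb|grenade)\b", re.I),
    "DATE": re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\w*\.? "
        r"\d{1,2}, \d{4}"),
}

def extract(text: str) -> dict:
    """Apply each rule; slots with no match get NOT_AVAILABLE."""
    out = {}
    for slot, pattern in RULES.items():
        m = pattern.search(text)
        out[slot] = m.group(0) if m else "NOT_AVAILABLE"
    return out
```

A learning/statistical system would instead induce such patterns (or slot classifiers) from the annotated training corpus.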