Graph-Based Methods for Open Domain Information Extraction William W. Cohen Machine Learning Dept. and Language Technologies Institute School of Computer Science Carnegie Mellon University Traditional IE vs Open Domain IE Goal: recognize people, places, companies, times, dates, in NL text. Supervised learning from corpus completely annotated with target entity class (e.g. people) Linear-chain CRFs Language- and genrespecific extractors Goal: recognize arbitrary entity sets in text Minimal info about entity class Example 1: ICML, NIPS Example 2: Machine learning conferences Semi-supervised learning from very large corpora (WWW) Graph-based learning methods Techniques are largely language-independent (!) Graph abstraction fits many languages Examples with three seeds

Outline History Open-domain IE by pattern-matching The bootstrapping-with-noise problem Bootstrapping as a graph walk Open-domain IE as finding nodes near seeds on a graph Approach 1: A natural graph derived from a smaller corpus + learned similarity Approach 2: A carefully-engineered graph derived from huge corpus (e.gs above) History: Open-domain IE by patternmatching (Hearst, 92) Start with seeds: NIPS, ICML Look thru a corpus for certain patterns: at NIPS, AISTATS, KDD and other learning conferences Expand from seeds to new instances Repeat.until ___ on PC of KDD, SIGIR, and Bootstrapping as graph proximity NIPS at NIPS, AISTATS, KDD and other learning conferences SNOWBIRD For skiiers, NIPS, SNOWBIRD, and AISTATS SIGIR KDD

on PC of KDD, SIGIR, and AISTATS,KDD, shorter paths ~ earlier iterations many paths ~ additional evidence Outline Open-domain IE as finding nodes near seeds on a graph Approach 1: A natural graph derived from a smaller corpus + learned similarity with Einat Minkov (CMU Nokia) Approach 2: A carefully-engineered graph derived from huge corpus (above) with Richard Wang (CMU ?)) Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008) nsubj boys partmod like prep.with playing all kinds det NN

VB VB DT cars prep.of NN NN Dependency parsed sentence is a naturally represented as a tree Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008) Dependency parsed corpus is naturally represented as a graph Learning Similarity Measures for Parsed Text (Minkov & Cohen, EMNLP 2008) Open IE Goal: Find coordinate terms (eg, girl/boy, dolls/cars) in the graph, or find Similarity measure S so S(girl,boy) is high What about off-the-shelf similarity measures: Random Walk with Restart (RWR) Hitting time Commute time ?) Personalized PR/RWR

The graph A query language: Q: { , } Nodes Node type Edge label Edge weight graph walk parameters: edge weights , walk length K and reset probability . M[x,y] = Prob. of reaching y from x in one step: the edge weight from x to y, out of the outgoing weight from x. `Personalized PageRank: reset probability biased towards initial distribution. Returns a list of nodes (of type ) ranked by the graph walk probs. Approximate with power iteration, cut off after fixed number of iterations K. mention girls girls1

nsubj mention-1 like1 like mention mention nsubj-1 like2 boys2 -1 boys mention girls girls1 mention girls nsubj girls1 mention-1 like1 nsubj

like partmod like1 mention like2 mention-1 playing1 mention nsubj-1 boys2 mention playing -1 boys mention -1 boys mention girls girls1 nsubj

mention-1 like1 Prep.with playing1 dolls1 mention-1 dolls Useful but not our goal here Learning a better similarity metric Task T (query class) Query a + Rel. answers a Query b + Rel. answers b Query q + Rel. answers q GRAPH WALK node rank 1 node rank 1 node rank 1 node rank 2

node rank 2 node rank 2 node rank 3 node rank 3 node rank 3 node rank 4 node rank 4 node rank 4 node rank 10 node rank 10 node rank 10 node rank 11 node rank 11 node rank 11 node rank 12 node rank 12 node rank 12

node rank 50 node rank 50 node rank 50 Seed words (girl, boy, ) Potential new instances of the target concept (doll, child, toddler, ) Learning methods Weight tuning weights learned per edge type [Diligenti et-al, 2005] Reranking re-order the retrieved list using global features of all paths from source to destination [Minkov et-al, 2006] FEATURES Edge label sequences Lexical unigrams boys nsubj.nsubj-invnsubj-inv dolls

nsubj partmod prep.nsubj-invin nsubj partmod partmod-inv nsubjinv like, playing like, playing Learning methods: Path-Constrained Graph Walk PCW (summary): for each node x, learn P(xz : relevant(z) | history(Vq,x) ) History(Vq,x) = seq of edge labels leading from Vq to x, with all histories stored in a tree Vq girls nsubj x1 nsubj-inv boys partmod partmod-inv x2 prep.nsubj-invin x3 dolls nsubj-inv boys boys

nsubj.nsubj-invnsubj-inv nsubj partmod partmod-inv nsubjinv dolls nsubj partmod prep.nsubj-invin City and person name extraction Labeling words MUC Complete Partial/Noisy MUC+AP nodes edges NEs 140K 82K 244K 3K (true) 2,440K 1,030K 3,550K 36K (auto)

Vq = {sydney, stamford, greenville, los_angeles} Person names: Vq = {carter, dave_kingman, pedro_ramos, florio} City names: 10 (X4) queries for each task Train queries q1-q5 / test queries q6-q10 Extract nodes of type NE. GW: 6 steps, uniform/learned weights Reranking: top 200 nodes (using learned weights) Path trees: 20 correct / 20 incorrect; threshold 0.5 MUC precision City names Person names 1 1 0.9 0.9 0.8 0.8 0.7 0.7

0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 1 11 21 31 41 51 61

71 81 91 Graph Walk 1 11 21 31 41 51 61 71 81 91 rank MUC precision City names Person names 1

1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0

1 11 21 31 41 51 61 71 81 91 conj-and, prep-in, nn, appos Graph Walk Weight Tuning 1 11 21 31 41 51 61 71

subj, obj, poss, nn 81 91 rank MUC precision City names Person names 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4

0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 1 11 21 31 41 51 61 71 81 91 Graph Walk Weight Tuning PCW

1 11 21 31 41 51 61 71 conj-and, prep-in, nn, appos subj, obj, poss, nn prep-in-inv conj-and nn-inv nn nsubj nsubj-inv appos nn-inv 81 91 rank MUC precision City names Person names

1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0

0 1 11 21 31 41 51 61 71 81 91 Graph Walk Weight Tuning PCW Reranking 1 11 21 31 41 51 61

71 81 conj-and, prep-in, nn, appos subj, obj, poss, nn Prep-in-inv conj-and nn-inv nn nsubj nsubj-inv appos nn-inv LEX.based, LEX.downtown LEX.mr, LEX.president 91 rank Vector-space models Co-occurrence vectors (counts; window: +/- 2) Dependency vectors [Pad & Lapata, Comp Ling 07] A path value function: Length-based value: Relation based value: 1 / length(path) subj-5, obj-4, obl-3, gen-2, else-1 Context selection function: Minimal: verbal predicate-argument (length 1) Medium: coordination, genitive construction, noun compounds (<=3) Maximal: combinations of the above (<=4) Similarity function:

Cosine Lin Only score the top nodes retrieved with reranking (~1000 overall) GWs Vector models MUC City names precision 1 Person names 1 0.9 PCW 0.9 0.8 Rerank 0.8 CO 0.7 0.7 DV 0.6

0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 1 11 21 31 41 51 61 71

81 91 rank 1 11 21 31 41 51 61 The graph-based methods are best (syntactic + learning) 71 81 91 GWs Vector models MUC + AP City names precision 1 Person names 1 0.9

Rerank 0.9 0.8 PCW 0.8 DV 0.7 0.7 CO 0.6 0.6 0.5 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1

0.1 0 0 1 11 21 31 41 51 61 71 81 91 rank 1 11 21 31 41 51 61 71

The advantage of the graph based models diminishes with the amount of data. This is hard to evaluate at high ranks 81 91 Outline Open-domain IE as finding nodes near seeds on a graph Approach 1: A natural graph derived from a smaller corpus + learned similarity with Einat Minkov (CMU Nokia) Approach 2: A carefully-engineered graph derived from huge corpus with Richard Wang (CMU ?)) Set Expansion for Any Language (SEAL) (Wang & Cohen, ICDM 07) Basic ideas Dynamically build the graph using queries to the web Constrain the graph to be as useful as possible Be smart about queries Be smart about patterns: use clever methods for finding meaningful structure on web pages 1. 2. 3. Canon Nikon Olympus

System Architecture 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. Pentax Sony Kodak Minolta Panasonic Casio Leica Fuji Samsung Fetcher: download web pages from the Web that contain all the seeds Extractor: learn wrappers from web pages Ranker: rank entities extracted by wrappers The Extractor Learn wrappers from web documents and seeds on the fly Utilize semi-structured documents Wrappers defined at character level Very fast No tokenization required; thus language independent Wrappers derived from doc d applied to d only

See ICDM 2007 paper for details

The Ranker Rank candidate entity mentions based on similarity to seeds Noisy mentions should be ranked lower Random Walk with Restart (GW) As before Whats the graph? Building a Graph ford, nissan, toyota Wrapper #2 find northpointcars.com extract curryauto.com chevrolet 22.5% honda 26.1% Wrapper #3 acura 34.6% derive Wrapper #1 volvo chicago 8.4%
Wrapper #4 bmw pittsburgh 8.4% A graph consists of a fixed set of Node Types: {seeds, document, wrapper, mention} Labeled Directed Edges: {find, derive, extract} Each edge asserts that a binary relation r holds Each edge has an inverse relation r-1 (graph is cyclic) Intuition: good extractions are extracted by many good wrappers, and good wrappers extract many good extractions, Evaluation Datasets: closed sets Evaluation Method Mean Average Precision Commonly used for evaluating ranked lists in IR Contains recall and precision-oriented aspects Sensitive to the entire ranking Mean of average precisions for each ranked list Prec(r) = precision at rank r NewEntity(r ) = 1 if (a) and (b) are true otherwise 0 (a) Extracted mention at r matches any true mention where L = ranked list of extracted mentions, r = rank

Evaluation Procedure (per dataset) 1. Randomly select three true entities and use their first listed mentions as seeds 2. Expand the three seeds obtained from step 1 3. Repeat steps 1 and 2 five times 4. Compute MAP for the five ranked lists (b) There exist no other extracted mention at rank less than r that is of the same entity as the one at r # True Entities = total number of true entities in this dataset Experimental Results: 3 seeds Methods Overall MAP vs. Various Methods 100% 95% 80% 90% 60% 85% 40% MAP (%) 80% 20% 75% 93.13% 94.03% 82.39% 94.18%

93.13% 87.61% 82.39% 43.76% 14.59% 0% 70% E1+EF+100 G.Sets E2+GW+100 E2+EF+100 G.Sets (Eng) E2+GW+200 E2+GW+100 E1+EF+100 E2+GW+300 Methods Vary: [Extractor] + [Ranker] + [Top N URLs] Extractor: E1: Baseline Extractor (longest common context for all seed occurrences) E2: Smarter Extractor (longest common context for 1 occurrence of each seed) Ranker: { EF: Baseline (Most Frequent), GW: Graph Walk } N URLs: { 100, 200, 300 } Side by side comparisons Telukdar, Brants, Liberman, Pereira, CoNLL 06 Side by side comparisons EachMovie vs WWW Ghahramani & Heller, NIPS 2005

NIPS vs WWW A limitation of the original SEAL Preliminary Study on Seed Sizes 85% 84% 83% 82% 81% 80% 79% 78% RW PR BS WL 77% Average Precision Mean 76% 75% 2 3 4 # Seeds (Seed Size) 5 6 Proposed Solution: Iterative SEAL (iSEAL) (Wang & Cohen, ICDM 2008) Makes several calls to SEAL, each call Expands a couple of seeds Aggregates statistics

Evaluate iSEAL using Two iterative processes Supervised vs. Unsupervised (Bootstrapping) Two seeding strategies Fixed Seed Size vs. Increasing Seed Size Five ranking methods ISeal (Fixed Seed Size, Supervised) Initial Seeds Finally rank nodes by proximity to seeds in the full graph Refinement (ISS): Increase size of seed set for each expansion over time: 2,3,4,4, Variant (Bootstrap): use highconfidence extractions when seeds run out Ranking Methods Random Graph Walk with Restart H. Tong, C. Faloutsos, and J.-Y. Pan. Fast random walk with restart and its application. In ICDM, 2006. PageRank L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. 1998. Bayesian Sets (over flattened graph) Z. Ghahramani and K. A. Heller. Bayesian sets. In NIPS, 2005.

Wrapper Length Weights each item based on the length of common contextual string of that item and the seeds Wrapper Frequency Weights each item based on the number of wrappers that extract the item Fixed Seed Size (Supervised) 98% 97% 96% 95% 94% 93% 92% RW PR 91% Average Precision Mean BS 90% WL WF 89% 1 2 3

4 5 6 7 8 # Iterations (Cumulative Expansions) 9 10 Fixed Seed Size (Bootstrap) 92% 91% 90% 89% 88% RW PR Mean Average Precision BS WL 87% WF 86%

1 2 3 4 5 6 7 8 # Iterations (Cumulative Expansions) 9 10 Increasing Seed Size (Supervised) 97% 96% 95% 94% 93% RW 92% PR Mean Average Precision BS 91%

WL WF 90% 1 2 3 4 5 6 7 8 # Iterations (Cumulative Expansions) 9 10 Increasing Seed Size (Bootstrapping) 94% 93% 92% 91% RW PR Mean 90% Average Precision

BS WL WF 89% 1 2 3 4 5 6 7 8 # Iterations (Cumulative Expansions) 9 10 Fixed Seed Size (Supervised) 98% Increasing Seed Size (Supervised) 97% 97% 96% 96% 95%

95% 94% 94% 93% 93% 92% RW PR 91% Average Precision Mean 90% 89% 1 2 RW 92% PR Mean Average Precision Little differenceWLbetween 91% ranking methods for supervisedWFcase (all 90% seeds correct); 1 2 3 4 5 6

7 8 3 4 5 6 7 8 9 10 large differences when bootstrapping # Iterations (Cumulative Expansions) # Iterations (Cumulative Expansions) BS BS WL WF 9 10 Increasingmakes Seed Size (Bootstrapping) Increasing seed size {2,3,4,4,} all 94% ranking methods improve steadily in 93% bootstrapping case Fixed Seed Size (Bootstrap) 92%

91% 90% 92% 89% 91% 88% RW RW PR Mean Average Precision PR Mean 90% Average Precision BS WL 87% BS WL WF 86% 1 2 3 4

5 6 7 8 # Iterations (Cumulative Expansions) 9 10 WF 89% 1 2 3 4 5 6 7 8 # Iterations (Cumulative Expansions) 9 10 Fixed Seed Size (Supervised)

98% 97% 97% 96% 96% 95% 95% 94% 94% 93% 93% 92% RW WL 1 2 3 4 5 6 7

8 9 BS WL WF 90% WF 89% PR 91% BS 90% RW 92% Mean Average Precision PR 91% Average Precision Mean Increasing Seed Size (Supervised) 1 2 3

4 5 6 7 8 9 10 # Iterations (Cumulative Expansions) 10 # Iterations (Cumulative Expansions) Increasing Seed Size (Bootstrapping) 94% 93% Fixed Seed Size (Bootstrap) 92% 92% 91% 91% RW PR Mean

90% Average Precision 90% BS WL 89% WF 89% 1 2 3 4 5 6 7 8 # Iterations (Cumulative Expansions) 88% RW PR Mean Average Precision BS WL 87% WF

86% 1 2 3 4 5 6 7 8 # Iterations (Cumulative Expansions) 9 10 9 10 Current work Start with name of concept (e.g., NFL teams) Look for (language-dependent) patterns: for successful NFL teams (e.g., Pittsburgh Steelers, New York Giants, ) Take most frequent answers as seeds Run bootstrapping iSEAL with seed sizes 2,3,4,4. Datasets with concept names

Experimental results Direct use of text patterns Summary/Conclusions Open-domain IE as finding nodes near seeds on a graph NIPS SNOWBIRD at NIPS, AISTATS, KDD and other learning conferences For skiiers, NIPS, SNOWBIRD, and AISTATS SIGIR KDD on PC of KDD, SIGIR, and AISTATS,KDD, shorter paths ~ earlier iterations many paths ~ additional Summary/Conclusions Open-domain IE as finding nodes near seeds on a graph, approach 1: Minkov & Cohen, EMNLP 08: Graph ~ dependency-parsed corpus Off-the-shelf distance metrics not great With learning:

Results significantly better than state-of-the-art on small corpora (e.g. a personal email corpus) Results competitive on 2M+ word corpora Summary/Conclusions Open-domain IE as finding nodes near seeds on a graph, approach 2: Wang & Cohen, ICDM 07, 08: Graph built on-the-fly with web queries A good graph matters! Off-the-shelf distance metrics work Differences are minimal for clean seeds Modest improvements from learning w/ clean seeds E.g., reranking (not described here) Bigger differences in similarity measures with noisy seeds Thanks to DARPA PAL program Minkov, Cohen, Wang Sponsored links: http://boowa.com (Richards demo) Yahoo! Research Labs Minkov Google Research Grant program Wang The organizers for inviting me!