Speaker: Alexander Behm Space-Constrained Gram-Based Indexing for Efficient

Speaker: Alexander Behm Space-Constrained Gram-Based Indexing for Efficient

Speaker: Alexander Behm Space-Constrained Gram-Based Indexing for Efficient Approximate String Search Alexander Behm1, Shengyue Ji1, Chen Li1, Jiaheng Lu2 1 University of California, Irvine 2 Renmin University of China Speaker: Alexander Behm Motivation: Data Cleaning Should clearly be Niels Bohr Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope, Jan 2008 Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Motivation: Record Linkage

Phone Age Name Brad Pitt Arnold Schwarzeneger George Bush Angelina Jolie Forrest Whittaker No exact match! Name

Brad Pitt Forest Whittacker George Bush Angelina Jolie Arnold Schwarzenegger Hobbies Address Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Motivation: Query Relaxation

Actual queries gathered by Google http://www.google.com/jobs/britney.html Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm What is Approximate String Search? Query against collection: Find entries similar to Arnold Schwarseneger What do we mean by similar to? - Edit Distance - Jaccard Similarity - Cosine Similarity - Dice - Etc. String Collection Brad Pitt Forest Whittacker George Bush

Angelina Jolie Arnold Schwarzenegger How can we support these types of queries efficiently? Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Approximate Query Answering irvine Sliding Window 2-grams {ir, rv, vi, in, ne} Intuition: Similar strings share a certain number of grams Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Approximate Query Example Query: irvine, Edit Distance 1

2-grams {ir, rv, vi, in, ne} Lookup Grams 2-grams Inverted Lists (stringIDs) in tf vi ir ef rv ne un

1 3 4 5 7 9 5 9 1 5 1 2 3 9 3 9 7 9

5 6 9 1 2 4 5 6 Count >= 3 Candidates = {1, 5, 9} May have false positives Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm T-Occurrence Problem Merge Ascending order

Find elements whose occurrences T Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Motivation: Compression Inverted Index >> Source Data Fit in memory? Space Budget? Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Motivation: Related Work IR: lossless compression of inverted lists (disk-based) Delta representation + compact encoding

Inverted lists in memory: decompression overhead Tune compression ratio? Overcome these limitations in our setting? Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Main Contributions Two lossy compression techniques Answer queries exactly Index fits into a space budget

Queries faster on the compressed indexes Flexibility to choose space / time tradeoff Existing list-merging algorithms: re-use + compression specific optimizations Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Overview Motivation & Preliminaries Approach 1: Discarding Lists

Approach 2: Combining Lists Experiments & Conclusion Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Approach 1: Discarding Lists 2-grams Inverted Lists (stringIDs) in tf

vi ir ef rv ne un 1 3 4 5 7 9 5 9 1 5

1 2 3 9 3 9 7 9 5 6 9 1 2 4 5 6 Lists discarded, Holes

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Effects on Queries Decrease lower bound T on common grams Smaller T more false positives T <= 0 panic, scan entire string collection Surprise Fewer lists Faster Queries (depends) Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm

Query shanghai, Edit Distance 1 3-grams {sha, han, ang, ngh, gha, hai} 3-grams uni ing sha han ang ngh gha Basis: Edit Operations destroy q=3 grams No Holes: T = #grams ed * q = 6 1 * 3 = 3 With holes: T = T #holes = 0 Panic! Really destroy q=3 grams per edit operation? hai

ter Hole grams Regular grams Dynamic Programming for tighter T Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Choosing Lists to Discard Effect on Query Unaffected Panic Slower or Faster Good choice depends on query workload

Space budget: Many combinations of grams Make a reasonable choice efficiently? Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Choosing Lists to Discard INPUT: Space Budget, Inverted lists, Workload in tf vi ir ef

rv ne Estimated impact t OUTPUT: Lists to discard un Choose one list at a time Incremental Update Query1 Query2 Query3 Total estimated

running time t ALGORITHM: Greedy & Cost-Based Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Estimating Query Times List-Merging: cost function, offline with linear regression Panic: #strings * avg similarity time Post-Processing: #candidates * avg similarity time Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Estimating #candidates Incremental-ScanCount Algorithm un Counts

2 3 0 1 4 StringIDs 0 1 2 3 4 1 3

4 List to Discard BEFORE T=3 #candidates = 2 Decrement Counts StringIDs 2 2 0 0 3 1 2

3 4 0 AFTER T = T-1 = 2 #candidates = 3 Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Overview Motivation & Preliminaries Approach 1: Discarding Lists

Approach 2: Combining Lists Experiments & Conclusion Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Approach 2: Combining Lists 2-grams Inverted Lists (stringIDs) in tf vi

ir ef rv ne un 1 3 4 5 7 9 5 9 5 6 9 1

2 3 9 1 3 9 7 9 6 9 1 2 4 5 6 Lists combined Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai

Speaker: Alexander Behm Effects on Queries Lower bound T is unchanged (no new panics) Lists become longer: More time to traverse lists More false positives Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Speeding Up Queries Query 3-grams {sha, han, ang, ngh, gha, hai}

combined lists refcount = 2 combined lists refcount = 3 Traverse physical lists once. Count for stringIDs increases by refcount. Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Choosing Lists to Combine Discovering candidate gram pairs Frequent q+1-grams correlated adjacent q-grams Locality-Sensitive Hashing (LSH) Selecting candidate pairs to combine Basis: estimated cost on query workload Similar to DiscardLists

Different Incremental ScanCount algorithm Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Overview Motivation & Preliminaries Approach 1: Discarding Lists Approach 2: Combining Lists Experiments & Conclusion Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai

Speaker: Alexander Behm Experiments Datasets: Google WebCorpus Word Grams IMDB Actors DBLP Titles Overview: Performance & Scalability of DiscardLists & CombineLists Comparison with IR compression & VGRAM Changing workloads 10k Queries: Zipf distributed, from dataset q=3, Edit Distance=2, (also Jaccard & Cosine) Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Experiments DiscardLists Runtime decreases! CombineLists

Runtime decreases! Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Comparison with IR compression Carryover-12 Compressed Uncompressed Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Comparison with variable-length grams, VGRAM Uncompressed Compressed Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm

Future Work Combine: DiscardLists, CombineLists and IR compression Filters for partitioning, global vs. local decisions Dealing with updates to index Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Conclusions Two lossy compression techniques Answer queries exactly

Index fits into a space budget Queries faster on the compressed indexes Flexibility to choose space / time tradeoff Existing list-merging algorithms: re-use + compression specific optimizations Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm Thank You! This work is part of The Flamingo Project http://flamingo.ics.uci.edu

Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm More Experiments What if the workload changes from the training workload? Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai Speaker: Alexander Behm More Experiments What if the workload changes from the training workload? Space-Constrained Gram-Based Indexing for Efficient Approximate String Search, ICDE 2009, Shanghai

Recently Viewed Presentations

  • Chapter 7 Data Flow Diagramming Copyright  2016 McGraw-Hill

    Chapter 7 Data Flow Diagramming Copyright 2016 McGraw-Hill

    DFD symbols. Leveling and balancing. Database design. Normal forms. 7-Learning objectives. Explain the symbols and design considerations associated with DFDs. Compare and contrast flowcharts and DFDs with regard to purpose, content, structure, and use in accounting information systems.
  • Chapter 1

    Chapter 1

    Hasil-hasilnya adalah kejadian yang tidak terikat satu sama lain, Daftar hasilnya lengkap. Jadi jumlah probabilitas dari berbagai kejadian adalah 1. ... Sampling Techniques & Sample Size, Presentation Material of Biostatistic, High Institute of Public Health, University of Alexandria. Akhir ...
  • IR&D Studies of Light Weight Ablator for Future

    IR&D Studies of Light Weight Ablator for Future

    CALCARB/PI <±3% Grafoam/Ph <±3% The insulative peformance of current LWA's is judged acceptable. But the Arcjet tests revealed signs of spallation-driven recession especially for our domestic RVC/LWA, which may be attributed to the coarse microstructures peculiar to our current RVC's.
  • Industry and Market Analysis - Mrs. Gallegos

    Industry and Market Analysis - Mrs. Gallegos

    Market Segment Analysis - Why does this group want this good or service? Refer to the database to support your response. Conduct additional online research about the good, service or the market segment - You will need a total of...
  • Functional Assessment &amp; Positive Support Planning

    Functional Assessment & Positive Support Planning

    Attempts to decrease or eliminate the need for more restrictive measures. Positive Support Plan. A systematic analysis of factors, both internal and external to the person, which may be contributing to his/her Challenging Behavior ... The functional assessment must address...
  • Astronomy by Sight!

    Astronomy by Sight!

    Objects always cast shadows in the opposite direction with respect to the light source. The angle of the light source GREATLY affects the shadow's size and length. The same size/length shadow can be produced by keeping the angle the same,...
  • Lincoln and the Civil War - Commack Schools

    Lincoln and the Civil War - Commack Schools

    Abraham Lincoln's Gettysburg Address "Four score and seven years ago, our father's brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal." "Now we are engaged in a...
  • Make an Impact on Students - Impact Therapy Associates

    Make an Impact on Students - Impact Therapy Associates

    Life is a series of choices. Stages of Change Precontemplation Contemplation Preparation Action Maintenance A Simple, Useful Map to Follow if You Currently Don't Follow A Map Rapport Contract Focus Funnel Rational Emotive Behavior Therapy THOUGHTS CAUSE FEELINGS Sustained negative...