Algorithmic High-Dimensional Geometry 2 Alex Andoni (Microsoft Research SVC) The NNS prism High dimensional geometry NNS dimension reduction space partitions small dimension embedding sketching Small Dimension What if is small?

Can solve approximate NNS with space query time [AMNSW98,] OK, if say ! Usually, is not small effectively What if is small? Eg:

-dimensional subspace of , with Obviously, extract subspace and solve NNS there! Not a robust definition More robust definitions: KR-dimension [KR02] Doubling dimension [Assouad83, Cla99, GKL03, KL04] Smooth manifold [BW06, Cla08] other [CNBYM01, FK97, IN07, Cla06] Doubling dimension Definition:

for any point , radius , consider ball of points within distance of can cover by balls , Sanity check: pointset has doubling dimension if: -dimensional subspace has n points always have dimension at most Can be defined for any metric space! balls to cover NNS for small doubling dimension Euclidean space [Indyk-Naor07] JL into dimension works ! Contraction of any pair happens with very small probability Expansion of some pair happens with constant probability Good enough for NNS!

Arbitrary metric Navigating nets/cover trees [Krauthgamer-Lee04, HarPeled-Mendel05, Beygelzimer-Kakade-Langford06,] Algorithm: A data-dependent tree: recursive space partition using balls At query , follow all paths that intersect with the ball Embeddings General Theory: embeddings General motivation: given distance (metric) , solve a computational problem under Hamming distance Compute distance between two points Euclidean distance (2)

Nearest Neighbor Search Diameter/Close-pair of Edit distance between two set S strings Earth-Mover (transportation) Clustering, MST, etc Distance Reduce problem to Embeddings: landscape Definition: an embedding is a map of a metric into a host metric such that for any : where is the distortion (approximation) of the embedding .

Embeddings come in all shapes and colors: Source/host spaces Distortion Can be randomized: with probability Time to compute Types of embeddings: From norm to the same norm but of lower dimension (dimension reduction) From one norm () into another norm () From non-norms (edit distance, Earth-Mover Distance) into a norm () From given finite metric (shortest path on a planar graph) into a norm () not a metric but a computational procedure sketches

Earth-Mover Distance Definition: Which metric space? Given two sets of points in a metric space = min cost bipartite matching between and Can be plane, Applications in image vision Images courtesy of Kristen Grauma Embedding EMD into

At least as hard as Theorem [Cha02, IT03]: Can embed EMD over into with distortion . Time to embed a set of points: . Consequences: Nearest Neighbor Search: approximation with space, and query time. Computation: approximation in time Best known: approximation in time [SA12] The higher-dimensional variant is still fastest via embedding [AIK08] High level embedding Sets of size in box Embedding of set :

take a quad-tree 00 randomly shift it 02 Each cell gives a coordinate: =#points in the cell Need to prove 01 13 00 11 12 01 22 20 00

2210 0002001101000000 1202 0100001100001100 Main Approach Idea: decompose EMD over []2 into EMDs over smaller grids Recursively reduce to =O(1) + EMD over small grid

Suppose =3 f(A) has nine coordinates, counting # points in each joint f(A)=(2,1,1,0,0,0,1,0,0) f(B)=(1,1,0,0,2,0,0,0,1) Gives O(1) distortion Decomposition Lemma [I07] For randomly-shifted cut-grid G of side length k, we have: lower bound EEMD(A,B) EEMDk(A1, B1) + EEMDk(A2,B2)+ on cost + k*EEMD/k(AG, BG)

upper EEMD(A,B) [ EEMDk(A1, B1) + EEMDk(A2,B2)+bound k ] EEMD(A,B) [ k*EEMD/k(AG, BG) ] The distortion will /k follow by applying the lemma recursively to (AG,BG) Part 1: lower bound For a randomly-shifted cut-grid G of side length k, we have: EEMD(A,B) EEMDk(A1, B1) + EEMDk(A2,B2)+

+ k*EEMD/k(AG, BG) Extract a matching from the matchings on right-hand side k For each aA, with aAi, it is either: matched in EEMD(Ai,Bi) to some bBi or aAi\Bi, and it is matched in EEMD(AG,BG) to some bBj Match cost in 2nd case: Move a to center () paid by EEMD(Ai,Bi) Move from cell i to cell j

paid by EEMD(AG,BG) /k Parts 2 & 3: upper bound For a randomly-shifted cut-grid G of side length k, we have: EEMD(A,B) [ EEMDk(A1, B1) + EEMDk(A2,B2)+ ] EEMD(A,B) [ k*EEMD/k(AG, BG) ] Fix a matching minimizing EEMD(A,B) Will construct matchings for each EEMD on RHS

Uncut pairs (a,b) are matched in respective (Ai,Bi) Cut pairs (a,b) are matched in (AG,BG) and remain unmatched in their mini-grids Part 2: Cost? EEMD(A,B) [ i EEMDk(Ai, Bi)] Uncut pairs (a,b) are matched in respective (Ai,Bi) Contribute a total EEMD (A,B) Consider a cut pair (a,b) at distance Contribute 2k to i EEMDk(Ai, Bi)

Pr[(a,b) cut] = Expected contribution Pr[(a,b) cut] In total, contribute EEMD (A,B) k dx Wrap-up of EMD Embedding In the end, obtain that EMD(A,B) sum of EMDs of smaller grids in expectation Repeat times to get to grid approximation it total! Embeddings of various metrics into

Metric Upper bound Earth-mover distance (-sized sets in 2D plane) [Cha02, IT03] Earth-mover distance (-sized sets in ) [AIK08] Edit distance over (#indels to transform x->y) Ulam (edit distance Ulam (edit distance between permutations) between permutations) Block edit distance Block edit distance

edit( banana , 2 [OR05] [CK06] [MS00, CM07] ananas ) = edit(1234567, 7123456) =2 Non-embeddability into Metric Earth-mover distance (-sized sets in 2D plane) Earth-mover distance (-sized sets in ) Edit distance over (#indels to transform x->y) Ulam (edit distance Ulam (edit distance between

permutations) between permutations) Block edit distance Block edit distance Upper bound Lower bounds [Cha02, IT03] [NS07] [KN05] [AIK08] [KN05,KR06] [OR05] [AK07] [CK06] [MS00, CM07] 4/3 [Cor03]

Non-embeddability proofs Via Poincar-type inequalities [Enflo69]: embedding into (any dimension) must incur distortion Proof [Khot-Naor05] Suppose is the embedding of into Two distributions over pairs of points : Two steps: C: for random and index F: are random

(short diagonals) Implies lower bound! Other good host spaces? sq- ,etc What is good: is algorithmically tractable is rich (can embed into it) ??? sq-2=real space with distance: ||x-y||22 Lower bound into Metric Edit distance over Ulam (edit distance

[KN05, KR06] between permutations) [AK07] Earth-mover distance (-sized sets in ) [KN05] sq-2, hosts with very good LSH (lower bounds via communication complexity) [AK07] [AK07] [AIK08] The story [Mat96]: Can embed any metric on points into Theorem [I98]: NNS for with

Dimension is an issue though Smaller dimension? approximation space, query time Possible for some: Hausdorff, [FCI99] But, not possible even for [JL01] Other good host spaces? What is good: sq- ,etc

algorithmically tractable rich (can embed into it) But: combination sometimes works! Meet our new host [A-Indyk-Krauthgamer09] Iterated product space 2 sq ( ( 27

1 )) d1 d1 d1 d,1 d,1 d,1 d22,,1 Why ? [Indyk02, A-Indyk-Krauthgamer09]

edit distance between permutations ED(1234567, 7123456) = 2 Because we can Embedding: embed Ulam into with constant distortion Algorithmically Rich tractable dimensions = length of the string NNS: Any t-iterated product space has NNS on n points with

approximation near-linear space and sublinear time Corollary: NNS for Ulam with approx. Better than via each component separately! Sketching x y Computational view { 0,1 } { 0,1 } { 0,1 } Arbitrary computation

No/little structure (e.g., not metric) Pros: F Cons: More expressability: may achieve better distortion (approximation) F(y) smaller dimension F(x) Sketch : functional compression scheme for estimating distances

almost all lossy ( distortion or more) and randomized 2 ) , ( ) ) ( )) d M (x , y) ( () (( =1 Why? 1) Beyond embeddings: 2) A waypoint to get embeddings:

can more do if embed into computational space computational perspective can give actual embeddings 3) Connection to informational/computational notions 31 communication complexity Beyond Embeddings: Dimension reduction in ! Lemma [I00]: exists linear , and where

achieves: for any , with probability : Where with each distributed from Cauchy distribution (stable distribution) ( ) = 1 2 ( + 1) While does not have expectation, it has median! Waypoint to get embeddings Embedding of Ulam metric into was obtained via geometrization of an algorithm/characterization: sublinear (local) algorithms: property testing & streaming [EKKRV98, ACCL04, GJKK07, GG07, EJ08] sum of squares (sq-2) edit(X,Y) max ()

sum (1) X Y Ulam: algorithmic characterization [Ailon-Chazelle-Commandur-Lu04, Gopalan-JayramKrauthgamer-Kumar07, A-Indyk-Krauthgamer09] E.g., a=5; K=4 Lemma: Ulam(x,y) approximately X[5;4] equals the number of faulty x= 123456789 characters a satisfying: there exists K1 (prefix-length) s.t. y= 123467895 the set of K characters preceding a Y[5;4] in x differs much from the set of K characters preceding a in y Connection to communication complexity

Enter the world of Alice and Bob Communication complexity shared randomness CC bits sketch() sketch() Referee decide whether: or 35 model: Two-party protocol

Shared randomness Promise (gap) version c = approximation ratio CC = min. # bits to decide (for 90% success) Sketching model: Referee decides based on sketch(x), sketch(y) SK = min. sketch size to decide Fact: SK CC Communication Complexity VERY rich theory [Yao79, KN97,] Some notable examples:

Connection to NNS: [KOR98]: if sketch size is , then NNS with space and one memory lookup! From the perspective of NNS lower bounds, communication complexity closer to ground truth Question: do non-embeddability result say something about nonsketchability? are sketchable with bits! [AMS96,KOR98] hence also everything than embeds into it! is tight [IW03,W04, BJKS08,CR12] requires bits [BJKS02]

Coresets: sketches of sets of points for geometric problems [AHV04] also Poincar-type inequalities [AK07,AJP10] Connections to streaming: see Graham Cormodes lecture 36 High dimensional geometry ??? Closest Pair Problem: n points in d-dimensional Hamming space, which are random except a planted pair at distance - Solution 1: build NNS and query times LSH-type algo would give ~ [PRR89,IM98,D08] Theorem [Valiant12]: time

p1 p1 p1+p2 p2 p2 =M pn pn 38 Find max entry of MMt using subcubic MM algorithms What I didnt talk about:

Too Includes embedding of fixed finite metric into simpler/morestructured spaces like Tiny sample among them: many things to mention [LLR94]: introduced metric embeddings to TCS. E.g. showed can use [Bou85] to solve sparsest cut problem with approximation [Bou85]: Arbitrary metric on points into , with distortion [Rao99]: embedding planar graphs into , with distortion [ARV04,ALN05]: sparsest cut problem with approximation [KMS98,]: space partition for rounding SDPs for coloring

Lots others A list of open questions in embedding theory Edited by Ji Matouek + Assaf Naor: http://kam.mff.cuni.cz/~matousek/metrop.ps High dimensional geometry via NNS prism High dimensional geometry NNS dimension reduction space partitions small dimension embedding sketching +++