Bridging the Theory-Practice Gap in Multi-Commodity Flow Routing Siddhartha Sen, DISC 2011 Joint work with Sunghwan Ihm, Kay Ousterhout, and Mike Freedman Princeton University Routing flows in data center networks A B

Network utilization suffers when flows collide A B E D F C but there is available capacity!

A B E D F C but there is available capacity! Must compute routes repeatedly:

real workloads are dynamic (ms)! A B E D F C Multi-commodity flow problem Input:

links Network G = (V,E) of switches and Flows K = {(si,ti,di)} of source, target, demand tuples Goal: Compute flow that maximizes minimum fraction of any di routed Requires fractionally splitting flows, otherwise no O(1)-factor approximation

Prior solutions Sequential model Theory: [Vaidya89, PlotkinST95, GargK07, ] Practice: [BertsekasG87, BurnsOKM03, Hedera10, ] Billboard model more decentralized Theory: [AwerbuchKR07, AwerbuchK09, ] Practice: [MATE01, TeXCP05, MPTCP11, ...] Routers model

Theory: [AwerbuchL93, AwerbuchL94, AwerbuchK07, ] Practice: [REPLEX06, COPE06, FLARE07, ] Prior solutions Sequential model Theory: [Vaidya89, PlotkinST95, GargK07, ] Practice: [BertsekasG87, BurnsOKM03, Hedera10, ] Billboard model Theory: [AwerbuchKR07, AwerbuchK09, ] Practice: [MATE01, TeXCP05, MPTCP11, ...]

Routers model Theory: [AwerbuchL93, AwerbuchL94, AwerbuchK07, ] Practice: [REPLEX06, COPE06, FLARE07, ] Prior solutions Sequential model Theory: [Vaidya89, PlotkinST95, GargK07, ] Practice: [BertsekasG87, BurnsOKM03, Hedera10, ] Theory-practice gap: Billboard model

1. Models unsuitable for dynamic workloads Theory: [AwerbuchKR07, AwerbuchK09, ] 2. Splitting flows difficult in practice Practice: [MATE01, TeXCP05, MPTCP11, ...] Routers model Theory: [AwerbuchL93, AwerbuchL94, AwerbuchK07, ] Practice: [REPLEX06, COPE06, FLARE07, ] Contributions Demonstrate why prior solutions fail, both

theoretically and practically Propose theoretical + practical fixes Devise algorithms in new framework Goal: Provably optimal + practical multi-commodity flow routing Problems 1. Dynamic workloads Solutions 1. Routers Plus Preprocessing

(RPP) model Poly-time preprocessing is free In-band messages are free 2. Fractionally splitting flows 2. Splitting technique Group flows by target, split aggregate flow Group contiguous packets into flowlets to reduce reordering

3. Switch end host Limited processing, high-speed matching on packet headers 3. Add forwarding table rules to programmable switches Match TCP seq num header, use bit tricks to create flowlets Sequential solutions dont scale Controller

Sequential solutions dont scale Controller Billboard solutions require link utilization information in-band message A

B and use path-based (exponential) representations A B Routers solutions are local and hence scalable

but lack knowledge of global routes A B Problems 1. Dynamic workloads Solutions 1. Routers Plus Preprocessing (RPP) model Poly-time preprocessing is free

In-band messages are free 2. Fractionally splitting flows 2. Splitting technique Group flows by target, split aggregate flow Group contiguous packets into flowlets to reduce reordering 3. Switch end host Limited processing, high-speed

matching on packet headers 3. Add forwarding table rules to programmable switches Match TCP seq num header, use bit tricks to create flowlets Problems 1. Dynamic workloads Solutions 1. Routers Plus Preprocessing

(RPP) model Poly-time preprocessing is free In-band messages are free 2. Splitting flows 2. Splitting technique Group flows by target, split aggregate flow Group contiguous packets into flowlets to reduce reordering

3. Switch end host Limited processing, high-speed matching on packet headers 3. Add forwarding table rules to programmable switches Match TCP seq num header, use bit tricks to create flowlets No splitting + collisions A

B E D F C Proactive splitting A B E

D F C Proactive splitting Problems: Split every flow? Inefficient What granularity to split at? Per-packet too much reordering

A B E D F C Problems 1. Dynamic workloads

Solutions 1. Routers Plus Preprocessing (RPP) model Poly-time preprocessing is free In-band messages are free 2. Splitting flows 2. Splitting technique Group flows by target, split aggregate flow Group contiguous packets into

flowlets to reduce reordering 3. Switch end host Limited processing, high-speed matching on packet headers 3. Add forwarding table rules to programmable switches Match TCP seq num header, use bit tricks to create flowlets Conclusion

Both theoretical and practical innovations needed to bridge theory-practice gap LOCALFLOW: optimal algorithm in new framework for data center networks Additional slides Granularity of splitting Optimal routing High reordering Suboptimal routing

Low reordering Per-Packet Per-Flow flowlets Line rate splitting Flow 1/2 AB

TCP seq num Link *0***** 1 A 1/4 B

*10**** 2 AB 1/4 *11**** 3 flowlet =

16 packets LOCALFLOW: Frequency of splitting Group flows by target LOCALFLOW: Frequency of splitting -approximate splitting