Succinct Priority Indexing Structures for the Management of Large Priority Queues Hao Wang and Bill Lin University of California, San Diego IEEE IWQoS 2009 Charleston, South Carolina July 13-15, 2009 1 Introduction Priority queues used many network applications Per-flow advanced QoS scheduling

Management of per-flow DRAM packet buffers Maintenance of per-flow statistics counters for real-time network measurements Items in priority queue sorted at all times (e.g. smallest key first) Common operations: INSERT, FINDMIN, DELETE Challenge: Need to operate at high speeds (e.g. 40+ Gb/s) 2 Introduction Binary heap common structure for priority queues

Has (log2n) time complexity, where n is # items e.g. in fine-grained per-flow scheduling, n can be very large (e.g. 1 million) But (log2n) may be too slow for high line rates Pipelined heaps [Bhagwan, Lin 2000][Ioannou 2001][Wang, Lin 2006] Reduced amortized time complexity to constant time At the expense of (log2n) pipeline stages 3 Introduction

van Emde Boas (vEB) trees Instead of maintaining priority queue of sorted items, maintain sorted dictionary of keys In many applications, since keys are represented by a k-bit integer, possible keys can only be from a fixed universe of U = 2k values Only (log2log2U) complexity vs. (log2n) for heaps Pipelined vEB trees [Wang, Lin 2007] Reduced amortized time complexity to constant time At the expense of (log2log2U) pipeline stages 4

This Talk Propose 3 related Priority Indexing (PI) structures that leverage built-in hardware optimized instructions in modern 64-bit x86 processors (both Intel and AMD) Specifically, given a W=64 bit word, the instructions BSR (bit-scan-reverse) and BSF (bit-scan-forward) return the positions of the most-significant and leastsignificant bits, respectively 0 0 1 0 1 1 BSR

0 0 1 0 0 0 BSF 5 This Talk Most-significant (least-significant) bit positions can also be easily implemented using efficient priority encoder designs in custom hardware 6 Basic Priority Indexing Structure

Essentially a W-way tree. Maintains sorted subset S of N elements from a fixed universe of size U = Wh, where N U. Each element i of the universe is associated with a binary bit bi Leaf node contains W bits of bi: bi = 1 if element i is in the set Non-leaf node serves as summary of child nodes: bit in non-leaf node set to 1 if its child node has at least one non-zero bit Data Structure of PI with h = 3 7 Example Operations TEST(i): Just check bi

INSERT(i): Start at leaf, set bi. Set corresponding bit in parent. Repeat until root. FINDMIN(): Start at root. Find MSB (most-significant-bit) and traverse sub-tree. Repeat until leaf. DELETE(i): Start at leaf, clear bi. If word = 0 (no more bits set), clear corresponding bit in parent. Repeat until root. Data Structure of PI with h = 3 8 Time/Memory Complexity of PI TEST(i) takes constant time. All other operations take

(logwU) time, which is asymptotically not as good as the (log2log2U) time complexity of a van Emde Boas tree However, for W = 64, PI requires fewer or same number of operations for U 64 billion (h 6), but much simpler For PI of size U, memory size only 1.016U bits, (U) space 9 Motivation for Modified Structures PI is fast, but (logwU) time may still not be fast enough for high-performance applications

Want constant time operations (issue new operation every cycle) But PI cannot be readily pipelined Some PI operations are top-down (e.g. FINDMIN), but others are bottom-up (e.g. DELETE) Propose 2 modified structures Counting Priority Index (CPI) Pipelined Counting Priority Index (Pipelined CPI) 10 Counting-Priority-Index In addition to having a bit set to indicate a child node has at least one bit set, add counter to keep track of how many

bits in a child node are set. Enables all top-down operations. Data Structure of CPI with h = 3 11 Example CPI Operations FINDMIN(): Start at root. Find MSB (most-significant-bit) and traverse sub-tree. Repeat until leaf. Same as before. DELETE(i): Start at root. Decrement counter. If count = 0, clear bit. Go down corresponding sub-tree. Repeat until leaf. Data Structure of CPI with h = 3

12 Time/Memory Complexity of CPI TEST(i) takes constant time. All other operations take (logwU) time, same as basic PI structure. But all operations supported in top-down fashion For CPI of size U, memory size only 1.11U bits, still (U) space 13 Pipelined Counting-Priority-Index

Reduced amortized time complexity to constant time At the expense of (logwU) pipeline stages Memory size also only 1.11U bits, (U) space Data Structure of Pipelined CPI with h = 3 14 Operations Supported Operations supported by all 3 priority indexing structures TEST(i) INSERT(i) DELETE(i) FINDMIN

FINDMAX EXTRACTMIN EXTRACTMAX SUCCESSOR(i) PREDECESSOR(i) EXTRACTSUCC(i) EXTRACTPRED(i) Test if index i is in set S Insert a new index i to set S Delete index i from set S Find the smallest index in set S Find the largest index in set S

Delete the smallest index in set S Delete the largest index in set S Find the successor of index i in set S Find the predecessor of index i in set S Delete the successor of index i in set S Delete the predecessor of index i in set S 15 Comparison Time Hardware

Memory PI (logwU) constant 1.016 U CPI

(logwU) constant 1.11 U Pipelined CPI constant (logwU) 1.11 U

16 Hardware Complexity of Pipelined CPI Number of Pipeline Stages in the Data Structures 17 Summary Fast sorting data structures Fast and scalable succinct data structures for the implementation of priority queues The Pipelined CPI supports constant time priority

management operations The hardware complexity is only (logwU) with (U) memory space 18 Thank You 19