Big Data Analytics over Encrypted Datasets with Seabed

Big Data Analytics over Encrypted Datasets with Seabed

Big Data Analytics over Encrypted Datasets with Seabed SEABED Antonis Papadimitriou, Ranjita Bhagwan, Nishanth Chandran , , Nishanth Chandran , Nishanth Chandran , , Ramachandran Ramjee , Nishanth Chandran , , Andreas Haeberlen , Harmeet Singh , Nishanth Chandran , , Abhishek Modi , Nishanth Chandran , , Saikrishna Badrinarayanan University of Pennsylvania, , Nishanth Chandran , Microsoft Research India, UCLA Motivation: Big data analytics on sensitive data Retail business customer gender country payment Alice female CAN 12 Bob male USA 4 Charlie female

USA 1 Deborah female USA 15 Cloud Provider Ah, now I know everyones purchases! Analyst What is the total revenue in the USA? Goal: Outsource big data analytics Store database at a cloud provider Perform analytical queries remotely Problem: Rogue cloud admins or hackers could have access to data Sensitive information can be exposed 2 Prior work: Encrypted databases Retail business advertiser gender isFraud customer country country revenue payment

Alpha Co. %Th6j legit h4$89 CAN 548yvg 12 439856 Beta Inc. Fjg893 suspicious a3vbt9a USA sfbg43 4 582650 Psi Intl %gTHR legit h4$89 USA a3vbt9a 1 143759 Omega Ltd h4$89 legit 34%^d USA a3vbt9a

15 874563 Cloud Provider Analyst What can we do? Use encryption! Examples: CryptDB/Monomi [SOSP11, VLDB13], MS SQL Server [SQL16] These support SQL queries on encrypted data 3 Encrypted databases Challenges Retail business Cost of addition (single core) 3800ns customer gender country payment %Th6j h4$89 548yvg 439856 Fjg893 sfbg43 a3vbt9a 582650

%gTHR h4$89 a3vbt9a 143759 34%^d h4$89 a3vbt9a 874563 Cloud1ns Provider Analyst Encrypted Plaintext Challenge 1: Performance Aggregations on encrypted data are slower Ciphertext addition is > 3000x slower than plaintext Adding 100 million values takes 6 minutes instead of 100ms Not good for big data! More coffee breaks! 4 Encrypted databases Challenges Retail business customer gender country

payment %Th6j h4$89 548yvg 439856 Fjg893 sfbg43 a3vbt9a 582650 %gTHR h4$89 a3vbt9a 143759 34%^d h4$89 a3vbt9a 874563 Cloud Provider Analyst Challenge 2: Security Encrypted databases use cryptographic schemes with weaker guarantees Example: deterministic encryption (DET) reveals equality

Recent attack [CCS15] recovered > 60% from certain DET columns 5 Our approach payment male gender male customer h4$89h 987242 rb%Gj4 459239 revenue593292 genderkb3i&Qpayment %Th6j& 987242 h4$89h Epvi#$R 439856 Fjg893n sfbg43q %gTHR3 34%^db customer 742063 629459

582650 593292 payment h4$89h gender 143759 742063 female h4$89h female 874563654929 gfv941E sfbg43q 629459 H7&fgh1 243995 D23gr$w 592623 ASHE %Th6j& Fjg893n %gTHR3 34%^db SPLASHE Goal 1: Improve performance ASHE New cryptographic scheme that allows fast aggregation on encrypted data Goal 2: Improve security SPLASHE: DB transformation that enables more queries without using weaker crypto 6 Seabed: Big data analytics for encrypted datasets ASHE

SEABED Analyst SPLASHE We built Seabed on top of Spark Seabed leverages ASHE and SPLASHE Seabed runs SQL queries on encrypted data Examples: Group-by queries and aggregations (sum, average, variance) Seabed is fast enough for big data Up to 100x faster than previous systems 7 Outline Motivation & prior work Approach Improving performance ASHE Improving security SPLASHE System design Evaluation 8 Why is aggregation slow in encrypted databases? Plaintext DB Encrypted DB payment payment 439856 12 4

1 15 582650 Integer addition + 143759 Sum = 32 Homomorphic addition 874563 Sum = Enc(32) We need to sum up encrypted data This is impossible with schemes like AES We need an additively homomorphic cryptosystem Example: Paillier encryption [EUROCRYPT99] 9 Why is aggregation slow in encrypted databases? Plaintext DB Encrypted DB payment payment 439856

12 4 1 15 582650 Integer addition + 143759 Sum = 32 Homomorphic addition 874563 Sum = Enc(32) Most homomorphic cryptosystems are expensive! Example: Paillier ciphertexts need to be 2048-bit Homomorphic addition: > 3000x slower than plain addition 10 Can we have faster homomorphic cryptosystems? Retail business Public key Symmetric key Private key Analyst

But why does Paillier have so large ciphertexts? Because it is an asymmetric scheme based on large integers Encrypt with public key decrypt with private key Do we need asymmetric crypto in outsourced databases? Analysts and data collector usually work for the same organization We could use fast symmetric crypto! 11 ASHE Additive Symmetric Homomorphic Encryption Plaintext DB Encrypted DB Encrypted DB payment ID payment 12 12+439 1 12+F(1) 4 4 - 56 2 4 + F(2) 3 1 + F(3)

4 15+F(4) payment 1 15 1 + 379 15+763 Sum = 32 + 1525 - 1525 Sum = 32 + 1525 ID list: {1, 2, 3, 4} Encrypt by masking values with random numbers ASHE is semantically secure (IND-CPA) No need to remember random numbers Use pseudorandom function F(ID) ASHE ciphertexts are 32/64-bit integers Homomorphic addition only takes a few nanoseconds! 12 ASHE Optimizations Encrypted DB ID Aggregation Decryption payment 1 12 + F(1) - F(0)

2 4 + F(2) - F(1) 3 1 + F(3) - F(2) 4 15 + F(4) - F(3) Sum = 32 + Compute ID list: {1, 2, 3, 4} Decrypt: Sum - F F AES-NI Challenge: Aggregation and decryption cost depends on ID list length Optimizations: Optimize encryption so that the randomness cancels out for consecutive IDs Fast evaluation of pseudorandom function via AES-NI Compression techniques to make ID list as small as possible Outcome: ASHE enables fast aggregation even when the DB is very large 13 Outline Motivation & prior work Approach Improving performance ASHE

Improving security SPLASHE System design Evaluation 14 Why are encrypted databases vulnerable? customer gender customer gender Alice female %Th6j& h4$89 Bob male Fjg893n sfbg43 Charlie female %gTHR3 h4$89 Deborah

female 34%^db h4$89 Auxiliary information 3 69% 1 h4$89h sfbg43q + 31% female male h4$89h = female Some columns use deterministic encryption (DET) sfbg43q = male This leaks the distribution of values An adversary with auxiliary information can do a frequency attack [CCS15] In the example, the gender is revealed 15 How can we avoid deterministic encryption? SELECT sum (revenue) FROM purchases WHERE gender = female customer gender payment

Alice female 12 Bob male 4 Charlie female 1 Deborah female 15 customer customer gender gender female female gender gender male male payment female female payment

male male %Th6j& Alice 476529 1 459220 0 439856 12 314437 0 Fjg893n Bob 956204 0 953265 1 582650 0 207465 4 %gTHR3 Charlie 529482 1 234599 0

143759 1 958922 0 34%^db Deborah 459283 1 562087 0 874563 15 996324 0 Support single-table aggregation queries without DET SPLASHE: Transform DB schema to avoid DET Sum = 28 Answer single-table aggregation queries using additions only Some storage overhead Reduced by Enhanced SPLASHE (see paper) 16 Seabed System design SEABED Proxy Analyst We implemented Seabed on top of unmodified Spark ASHE and SPLASHE implemented in Scala

Seabeds high-level design is similar to CryptDBs Accepts SQL queries; transparently answers them on encrypted data Client proxy handles encryption/decryption 17 Outline Motivation & prior work Approach Improving performance ASHE Improving security SPLASHE System design Evaluation 18 Evaluation: Questions End-to-end latency of aggregation? Storage overhead of SPLASHE? End-to-end latency in Bing Ads analytics? How scalable is aggregation? How effective are Seabeds optimizations? Latency of group-by queries? Latency of batch queries (Big Data Benchmark)? Experimental setup: Spark with 100 cores On MS Azure Memory-resident data 19 Paillier 1200 No Enc. 16.6 min

1000 800 600 400 1 sec 200 0 0 500 End-to-end time (s) End-to-end time (s) How efficient is ASHE aggregation? 10 sec 15 Seabed (Worst) Seabed (Best) 10 5 0 No Enc. 0 500 2000 1500 1000 Dataset size (millions of rows)

1000 1500 2000 Dataset size (millions of rows) Synthetic data: up to 1.75 billion rows - Query: single column aggregation Results Paillier: up to 16.6 minutes No encryption: <1 second How does Seabed do? Seabed is 100x faster than Paillier, even in the worst case! 20 How much storage does SPLASHE need? Size increase relative to plaintext 1000x 100x 10x SPLASHE Enhanced SPLASHE 5 Cols 10 Cols 1 Col DET columns replaced with SPLASHE Dataset 760M rows, real ad-analytics application from Microsoft We replaced 10 DET columns with SPLASHE, one by one Measured: Relative size increase vs. plaintext dataset Results SPLASHE has substantial storage cost Enhanced SPLASHE reduces this cost by up to 10x With 10x more storage, we avoid DET entirely! Reduces risk of information leaks

21 End-to-end latency (s) How efficient is Seabed for real-world applications? 400 Paillier 300 Seabed 200 No Enc. 100 Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12 Q13 Q14 Q15 Queries 1 to 15 Same ad-analytics application from Microsoft Measured: End-to-end latency of 15 queries Results No encryption is about 10x faster than Paillier across all queries Seabed is almost as fast as no encryption (within 15-44%)

It is possible to do analytics on encrypted big data! 22 Summary Big-data analytics on encrypted data is difficult Key challenges: Performance, security We introduce additive symmetric homomorphic encryption (ASHE) Result: much better performance when analyst and data owner trust each other We present a schema transformation called SPLASHE Result: Often avoids the need for weaker encryption better security Seabed: an extension of Spark that uses ASHE and SPLASHE Up to 100x faster than previous systems Seabed is fast enough for real-world big data applications Any Questions? 23 References [EYROCRYPT99] P.Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Proc. EURO- CRYPT, 1999. [SOSP11] Popa, Raluca Ada, et al. "CryptDB: protecting confidentiality with encrypted query processing." Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles. ACM, 2011. [VLDB13] Tu, S., Kaashoek, M. F., Madden, S., & Zeldovich, N. (2013, March). Processing analytical queries over encrypted data. In Proceedings of the VLDB Endowment (Vol. 6, No. 5, pp. 289-300). VLDB Endowment. [SQL16] https://www.microsoft.com/en-us/cloud-platform/sql-server [CCS15] Naveed, Muhammad, Seny Kamara, and Charles V. Wright. "Inference attacks on property-preserving encrypted databases." Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. ACM, 2015. 24

Recently Viewed Presentations

  • The Existence of God - Michael Johnson

    The Existence of God - Michael Johnson

    Paley's Analogy "In crossing a heath, suppose I pitched my foot against a stone and were asked how the stone came to be there, I might possibly answer that for anything I knew to the contrary it had lain there...
  • HUMAN GROWTH AND DEVELOPMENT - teachercb.com

    HUMAN GROWTH AND DEVELOPMENT - teachercb.com

    Arial Calibri Wingdings Office Theme HUMAN GROWTH AND DEVELOPMENT LIFESPAN CHAPTER 1 INTRODUCTION 1. How do the lives of Ted Kaczynski and Alice Walker illustrate the questions explored in the course textbook?
  • Face Recognition and Detection The Margaret Thatcher Illusion,

    Face Recognition and Detection The Margaret Thatcher Illusion,

    Sample results Summary (Viola-Jones) Fastest known face detector for gray images Three contributions with broad applicability: Cascaded classifier yields rapid classification AdaBoost as an extremely efficient feature selector Rectangle Features + Integral Image can be used for rapid image analysis...
  • Speed - Fiendishlyclever

    Speed - Fiendishlyclever

    Calculating your speed on a motion graph. To calculate speed at the fastest point. Look for the steepest line. Mark the start and end. Work out the distance and time . Use the speed equation to calculate speed Speed =...
  • Chapter 1: Drawings - site.iugaza.edu.ps

    Chapter 1: Drawings - site.iugaza.edu.ps

    Chapter 1: Drawings. A. Drawing types and scales. In engineering, most design information is shown on drawings. Today, drawings are generally not drawn by hand. They are produced on computer, using CAD (computer-aided design) systems. ... an oblique projection. an...
  • Diffusion &amp; Osmosis - Amazon S3

    Diffusion & Osmosis - Amazon S3

    Diffusion & Osmosis Define Diffusion The movement of molecules from a area in which they are highly concentrated to a area in which they are less concentrated.
  • Our 50 States: [Name of Your State]

    Our 50 States: [Name of Your State]

    Select the best antonym for the word adroit. A. adept B. dexterous C. awkward D. nimble Although some women are loquacious and conversational, others hardly talk at all. Select the best antonym for the word loquacious: extroverted communicative sociable reserved...
  • Chapter 16

    Chapter 16

    A/an _____ is any new thing, idea, or behavior pattern that emerges from within a society. Creation Innovation Construct invention Answer: d An invention is any new thing, idea, or behavior pattern that emerges from within a society. 3. _____...