Improving ML Applications in Shared Computing Environments

Aaron Harlap
Parallel Data Laboratory, Carnegie Mellon University
http://www.pdl.cmu.edu/
April 19

Talk Outline
Background
Thesis statement
Three case studies

Iterative Convergent ML
Matrix Factorization, LDA, Neural Networks, etc.
Start with an initial guess
Iterate over training data, improving the solution
Converge to a good solution
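As a minimal sketch of the iterative-convergent pattern (initial guess, repeated passes over the training data, convergence), the following fits a single weight by gradient descent. The data, the learning rate, and the name `fit_slope` are illustrative assumptions and do not come from the talk.

```python
def fit_slope(xs, ys, lr=0.01, iterations=200):
    """Fit y ~ w * x by repeatedly nudging w to reduce the squared error."""
    w = 0.0  # start with an initial guess
    for _ in range(iterations):  # iterate over the training data
        for x, y in zip(xs, ys):
            grad = 2.0 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
            w -= lr * grad  # improve the solution a little
    return w  # approaches the least-squares slope


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x, so the true slope is 2
w = fit_slope(xs, ys)
```

Each pass shrinks the error of `w`, which is the property that lets such algorithms tolerate slightly stale parameter values.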

Data-Parallel ML
[Figure: parallel iterative workers process input data (training data) and issue READ, INC, and CLOCK operations against model parameters (solution) held in a parameter server]
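The READ, INC, and CLOCK labels on the Data-Parallel ML slide suggest a small key-value interface. The sketch below is a single-process toy, assuming READ fetches a parameter, INC applies an additive update, and CLOCK marks the end of a worker's iteration. The class and method names are hypothetical and are not the API of any real parameter server.

```python
from collections import defaultdict


class ToyParameterServer:
    """Single-process stand-in for a parameter server (illustrative only)."""

    def __init__(self, num_workers):
        self.params = defaultdict(float)  # model parameters (the solution)
        self.clocks = [0] * num_workers   # iterations completed per worker

    def read(self, key):
        """READ: fetch the current value of one parameter."""
        return self.params[key]

    def inc(self, key, delta):
        """INC: apply an additive update to one parameter."""
        self.params[key] += delta

    def clock(self, worker_id):
        """CLOCK: signal that a worker finished one iteration."""
        self.clocks[worker_id] += 1
        return self.clocks[worker_id]


ps = ToyParameterServer(num_workers=2)
ps.inc("w0", 0.5)   # worker 0 contributes an update
ps.inc("w0", -0.2)  # worker 1 contributes an update
ps.clock(0)
ps.clock(1)
```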

Shared Computing Environments
AWS, GCE, Azure, private clusters
Challenges:
- Performance jitter
- Heterogeneous resources
- Transient resources
- Limited network bandwidth

Thesis Statement
Improvements of 5x or more can be achieved for training ML models in shared computing environments by structuring software frameworks and work distribution to mitigate performance jitter, exploit transient resources, and address communication bandwidth limitations.

Three Case Studies
Addressing the straggler problem in iterative convergent ML
- FlexRR [SoCC 16]
Agile ML elasticity through tiered reliability in dynamic resource markets
- Proteus [EuroSys 17]
Pipeline parallelism for DNN training
- PipeDream [Under submission]

Parallelization Models
BSP: wait at each clock (barrier)
SSP: fastest <= slack + slowest
Increasing the slack bound lowers the quality of work
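The two models differ only in how far ahead a worker may run. A minimal sketch of that check, assuming each worker's progress is tracked as a count of completed clocks (the function name and data layout are illustrative):

```python
def may_start_next_clock(completed, worker_id, slack):
    """Decide whether a worker may begin its next clock.

    completed[i] is the number of clocks worker i has finished.
    slack == 0 gives BSP: nobody starts clock c+1 until all finish clock c.
    slack > 0 gives SSP: a worker may run up to `slack` clocks ahead
    of the slowest worker.
    """
    if slack < 0:
        raise ValueError("slack must be non-negative")
    return completed[worker_id] <= min(completed) + slack


completed = [5, 3, 4]  # worker 0 is fastest, worker 1 is the straggler
tight = [may_start_next_clock(completed, w, slack=0) for w in range(3)]
loose = [may_start_next_clock(completed, w, slack=2) for w in range(3)]
```

With `slack=0` only the straggler may proceed. With `slack=2` all three workers continue, at the cost of reading staler parameter values.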

Origin of Stragglers
One worker slower than others
Short-term causes
- Garbage collection, objective function computation (computing stopping criteria), resource contention
Long-term causes
- Load imbalance, heterogeneity of hardware

Effect of Stragglers
[Figure: effect of stragglers, obtained by injecting artificial stragglers]

Quick Preview of our Results
[Figure]

New Approach: FlexRR
[Figure: initial work assignments vs. rebalanced work assignments across Workers 1-4]
FlexRR uses:
- Flexible consistency bounds (SSP)
- Temporary work re-assignment (RapidReassignment)

RR Design
Constraints:
- Input data is too big to fit all of it into memory
- All-to-all communication/synchronization is expensive
- A central arbiter can be a bottleneck
Solution: Helper Groups
- Helpers: workers eligible to help; if they are ahead, they help
- Helpees: workers that a helper is eligible to provide help to

Helper Groups
Helpers pre-load input data
- Only 25% replication required
- Avoids costly disk reads
Limited P2P communication
- Cheap messages, no overhead
Unique set of helpers & helpees
- Provides waterfall effect
- 4 of each

RR Protocol
Driven by fast workers
Multicast to preset eligible helpees
No additional resources
[Figure: message exchange between a fast and a slow worker. Messages: "I'm this far"; "Ignore" (I don't need help); "I'm behind" (I need help); "Help with N-10 to N"; "Ok"; "Started working"]
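One plausible reading of the RR Protocol slide is that a fast worker multicasts its progress and each helpee either ignores the message or hands off the tail of its work. The sketch below covers only the helpee's decision. The function name, the use of progress fractions, and the default chunk of 10 items (taken from the "Help with N-10 to N" label) are assumptions, not the published protocol.

```python
def helpee_reply(helper_progress, my_progress, my_end, chunk=10):
    """Reply of a helpee that receives a helper's "I'm this far" message.

    Progress values are fractions of the current iteration completed (0.0-1.0).
    my_end is N, the last work item in the helpee's assignment.
    Returns None to ignore the message, or the (start, end) range of
    work items the helper is asked to take over.
    """
    if my_progress >= helper_progress:
        return None  # "I don't need help": ignore the message
    # "I'm behind": hand off the tail of the assignment, items N-chunk..N
    return (my_end - chunk, my_end)


needs_help = helpee_reply(helper_progress=0.8, my_progress=0.5, my_end=100)
no_help = helpee_reply(helper_progress=0.4, my_progress=0.5, my_end=100)
```

Handing off work from the end of the assignment fits the pre-loading idea on the Helper Groups slide: a helper only needs a replica of the tail of its helpees' input data.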

Significant Stragglers on EC2
Netflix (MF) workload (EC2 clusters)
[Figure: 53% improvement]

Take-away Messages From FlexRR
Stragglers negatively impact ML training
Need to combine techniques to address stragglers
- Flexible synchronization (SSP)
- Work re-assignment (RR)
Work published and presented at SoCC 16

Thesis Statement
Improvements of 5x or more can be achieved for training ML models in shared computing environments by structuring software frameworks and work distribution to mitigate performance jitter, exploit transient resources, and address communication bandwidth limitations.

Three Case Studies
Addressing the straggler problem in iterative convergent ML
- FlexRR [SoCC 16]
Agile ML elasticity through tiered reliability in dynamic resource markets
- Proteus [EuroSys 17]
Pipeline parallelism for DNN training
- PipeDream [Under submission]

Dynamic Resource Availability
Revocable resources are common in clusters
- Best-effort resources that can be preempted
- Yarn, Borg, Mesos, etc.
Adding the element of cost savings in clouds
- Preemptible Instances in Google Compute Engine
- Spot Instances in Amazon EC2

Transient Resources Often Cheaper
Often 75-85% cheaper to use Spot Instances
[Figure label: Low Cost]

Efficient Use of Transient Resources
Support agile elasticity
- Scale in and out efficiently and quickly
Handle bulk revocations/evictions efficiently
- Don't lose progress

AgileML: New Approach to Elasticity
Use tiers of reliable and unreliable resources
- Revocable resources are unreliable (transient)
Maintain all state on reliable resources
- E.g., Parameter Servers only on On-Demand Instances
- Spot Instances run workers only (initially)
3 architecture stages
- Based on ratio of transient to reliable resources

Building the Stages of Reliability
[Figure: Stage #1, Stage #2, and Stage #3 layouts of ParamServ and Worker processes across On-Demand Instances (reliable) and Spot Instances (cheap), coordinated by an Elasticity Controller]
Transition between stages at run-time
- Little/no overhead for transitions
- Transitions based on ratios
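Since stage transitions depend on the ratio of transient to reliable resources, stage selection can be sketched as a threshold check. The slides do not state the thresholds, so the values `1.0` and `15.0` below, as well as the function name, are placeholders chosen for illustration.

```python
def choose_stage(num_transient, num_reliable, stage2_ratio=1.0, stage3_ratio=15.0):
    """Pick an architecture stage from the transient-to-reliable ratio.

    The two threshold values are placeholders chosen for illustration.
    """
    if num_reliable <= 0:
        raise ValueError("at least one reliable machine is required")
    ratio = num_transient / num_reliable
    if ratio <= stage2_ratio:
        return 1  # few transient machines relative to reliable ones
    if ratio <= stage3_ratio:
        return 2
    return 3  # transient machines vastly outnumber reliable ones


stages = [choose_stage(4, 4), choose_stage(32, 4), choose_stage(63, 1)]
```

Re-evaluating this check whenever machines are added or evicted would give the run-time transitions the slide describes.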

Resource Manager for AgileML
Acquires resources for AgileML
- Minimizes expected cost per unit of work
Analyzes resource availability
- EC2 Spot Market
Takes into account application characteristics
- Scalability
- Scale in/out overhead
- Eviction overhead

Need Elasticity and Smart Resource Manager
[Figure: comparison of Standard + CKPts, Standard + AgileML, BidBrain + CKPts, and Proteus (BidBrain + AgileML)]

Take-away Messages From Proteus
Agile elastic ML system
- Combined with RM for transient resources

Need agile elasticity & smart RM
- 85% cost savings
Published and presented at EuroSys 17

Thesis Statement
Improvements of 5x or more can be achieved for training ML models in shared computing environments by structuring software frameworks and work distribution to mitigate performance jitter, exploit transient resources, and address communication bandwidth limitations.

Three Case Studies
Addressing the straggler problem in iterative convergent ML
- FlexRR [SoCC 16]
Agile ML elasticity through tiered reliability in dynamic resource markets
- Proteus [EuroSys 17]
Pipeline parallelism for DNN training
- PipeDream [Under submission]

Example DNN
[Figure: example DNN; label: 1000s]

DNN Training - How Do They Learn
Forward Pass - Make a Prediction
Backward Pass - Update Solution Depending on Error
[Figure: network from Input to Output]

Data-Parallel Training
Separate copy of model on each machine
Communicate updates to model parameters
- Various synchronization models can be used
[Figure, shown over three further slides]

Overhead of Data Parallel Training
Significant communication overheads
Larger clusters increase communication
Faster compute increases communication overhead

Overhead of Data Parallel Training (continued)
[Figure]

New Approach: Pipeline Parallel
Assign layers to machines
- Communicate inter-layer activations
[Figure: layers assigned to Machines 1-4]

Naive Scheduling is Inefficient
[Figure]

Making Pipelining Possible
Need to work on multiple mini-batches simultaneously
- Inter-batch parallelism
- Same quality of work as data-parallel training
During backward pass:
- Need activations from forward pass
- Need model parameters from forward pass
PipeDream:
- Stashes activations & versions model parameters

Alternate Forward / Backward Work
[Figure]
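The stashing idea from the Making Pipelining Possible slide can be illustrated with a toy pipeline stage that holds a single scalar weight. This is a sketch under simplifying assumptions (one weight, plain SGD, a hypothetical `StashingStage` class) and not PipeDream's implementation.

```python
class StashingStage:
    """One pipeline stage with a single scalar weight (toy example).

    forward computes y = w * x; backward applies one SGD step.
    """

    def __init__(self, weight, lr=0.1):
        self.weight = weight
        self.lr = lr
        self.stash = {}  # mini-batch id -> (input activation, weight used)

    def forward(self, batch_id, x):
        # Remember the input and the weight version used for this mini-batch.
        self.stash[batch_id] = (x, self.weight)
        return self.weight * x

    def backward(self, batch_id, grad_output):
        # Use the stashed values, not the latest weight, so the gradient
        # matches the forward pass of the same mini-batch.
        x, w_used = self.stash.pop(batch_id)
        grad_weight = grad_output * x      # dL/dw for y = w * x
        grad_input = grad_output * w_used  # dL/dx, passed to the previous stage
        self.weight -= self.lr * grad_weight
        return grad_input


stage = StashingStage(weight=2.0)
stage.forward(0, x=1.0)  # mini-batch 0 sees weight 2.0
stage.forward(1, x=3.0)  # mini-batch 1 also sees weight 2.0
g0 = stage.backward(0, grad_output=1.0)  # updates the weight to 1.9
g1 = stage.backward(1, grad_output=1.0)  # still uses the stashed weight 2.0
```

Without the stash, the backward pass for mini-batch 1 would use weight 1.9, a value its forward pass never saw.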

Overlaps Computation & Comm
[Figure, shown over two slides]

Simple Case of Straight Pipeline
[Figure: Machines 1-3]

Combine Data & Pipeline Parallel
[Figure: Machines 1-4]

How to Split Layers
Measure forward and backward compute time
Compute communication time
- GPU and network speed
- Size of layer being cut
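The two measurements listed under How to Split Layers are enough to compare candidate cuts. The sketch below enumerates every contiguous split and keeps the one with the smallest bottleneck, assuming communication overlaps with computation so that throughput is set by the slowest compute stage or transfer. It is an exhaustive search for illustration, not PipeDream's partitioning algorithm, and all names and numbers are hypothetical.

```python
from itertools import combinations


def split_layers(compute_ms, output_mb, num_stages, mb_per_ms):
    """Choose cut points that minimize the slowest pipeline step.

    compute_ms[i]: measured forward + backward time of layer i.
    output_mb[i]:  size of the activations layer i sends to layer i + 1.
    mb_per_ms:     link bandwidth between neighbouring machines.
    Returns (bottleneck_ms, cuts), where a cut c means that layers
    before index c and layers from index c on sit on different machines.
    """
    n = len(compute_ms)
    best = None
    for cuts in combinations(range(1, n), num_stages - 1):
        bounds = (0,) + cuts + (n,)
        stage_ms = [sum(compute_ms[a:b]) for a, b in zip(bounds, bounds[1:])]
        comm_ms = [output_mb[c - 1] / mb_per_ms for c in cuts]
        bottleneck = max(stage_ms + comm_ms)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, cuts)
    return best


bottleneck, cuts = split_layers(
    compute_ms=[4.0, 6.0, 5.0, 5.0],
    output_mb=[8.0, 2.0, 16.0, 1.0],
    num_stages=2,
    mb_per_ms=1.0,
)
```

In this example the chosen cut falls after the second layer, where the compute is balanced and the activations being sent are small.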

Putting it all Together
[Figure, shown over four slides]

Evaluating PipeDream
Evaluated on seven DNN models
- Including GNMT, VGG-16, Resnet-50, AWD-LM
Cluster Setups

Summary of Results
[Figure]

PipeDream 5x Better
VGG-16 on V100 GPUs (AWS p3.2xlarge)
- PipeDream reduces communication by 95%
[Figure label: 5x]

PipeDream Reduces Communication
PipeDream chooses DP only for Resnet-50
[Figure]

Take-away Messages From PipeDream
Combine pipeline, data, and model parallelism
- Inter-batch parallelism
Reduces time to target accuracy by up to 5x
- Reduces communication
- Improves compute accelerator efficiency

Thesis Contributions
Addressing stragglers in iterative ML
- Combined flexible synchronization and work re-assignment
Agile elastic ML system
- Designed a parameter server system for transient resources
- Designed RM for batch workloads on EC2 spot market
Pipeline parallelism for DNN training
- Designed a system that combines model parallelism, pipelining, and data parallelism

Thesis Statement
Improvements of 5x or more can be achieved for training ML models in shared computing environments by structuring software frameworks and work distribution to mitigate performance jitter, exploit transient resources, and address communication bandwidth limitations.

Thank You
My advisors: Greg Ganger & Phil Gibbons
My thesis committee: Greg, Phil, Amar Phanishayee, Ameet Talwalkar

My collaborators: Henggang Cui, Wei Dai, Jinliang Wei, Greg Ganger, Phillip Gibbons, Garth Gibson, Kevin Hsieh, Nandita Vijaykumar, Onur Mutlu, Alexey Tumanov, Andrew Chung, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri
Many others, and you, the listeners!

Backup Slides

Long Term Stragglers
50% of machines given 75% of the workload
[Figure]

Works well w/ Partial Replication
Netflix workload
Replicate from the end of the input data / work assignment
[Figure]

LDA Class Comparison
[Figure]

Parameter Servers are Great for Iterative ML
Parameter Servers shard solution state across machines
Traditional architecture has servers and workers on all machines
Used by IterStore, MXNet, Bosen

Stage #1 has a Weakness
ParamServs = Instances running server processes
64 c4.2x EC2 machines, MF on Netflix dataset
[Figure: On-Demand Instances (reliable) vs. Spot Instances (cheap); labels: Costly, Cheaper, Cheap]

ActivePS Helps a Lot
64 c4.2x machines, 4 On-Demand machines, MF on Netflix dataset
[Figure; labels: Costly, Cheap, Cheap]

Becomes Slow at High Ratios
64 c4.2x EC2 machines, 1 server, MF on Netflix dataset
[Figure; labels: Costly, Cheap]

Proteus is also Faster
[Figure: Proteus (BidBrain + AgileML) vs. Bid-on-demand + CKPts]

BidBrain TierML Implementation
[Figure]

Layer Sizes
[Figure]

Large Batch Training
[Figure]

Accuracy vs Epoch
[Figure]

Memory Overheads
[Figure]

Optimizer
[Figure]

Longer Pipeline
[Figure]

Pipelining and PipeDream are Faster
PipeDream = pipelining + replication
- 4 V100 GPUs on 1 server (Azure)
