Programming Languages & Software Engineering

Programming Languages & Software Engineering

Collaborating at the Hardware/Software Interface: A Programming-Languages Professors View Dan Grossman University of Washington Spring 2011 Executive Summary Working on many things, thanks to co-advising Brief overview, then focus on a few Chosen focus: Projects spanning hardware and software Helping Luis Ceze advise the SAMPA group Reinvigorate research across the HW/SW divide HW/SW division sometimes a key research question Apply PL principles even when not doing PL research Four ongoing projects DMP: Deterministic execution of multiple threads OSHA: Code-centric thread-sharing partial specifications EnerJ: Language support for low-power approximate computing

Concurrency Exceptions Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 2 Biography / group names PLDI, ICFP, POPL feel like home 1997PhD for Cyclone: type system, compiler for memory-safe C dialect UW faculty 200330% 80% focus on multithreading, 2005 Lots of transactional memory, 2005-2008: meet HW people Co-advising 3-4 students with architect+ Luis Ceze, 2007 Two groups for marketing purposes WASP wasp.cs.washington.edu SAMPA sampa.cs.washington.edu

Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 3 Other ongoing projects Client-side third-party web extensions [with MSR] Aspects for JavaScript (?!) [OOPSLA2010] Modular memory-model semantics And optimizations allowed by data races illegal [with HP] TM semantics issues Dynamic tracking of what can be used in transactions Abstraction violation of escape actions MapReduce progress estimation [SIGMOD2010] Department + international work on undergraduate curriculum

Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 4 Outline DMP: Deterministic execution of multiple threads OSHA: Code-centric thread-sharing partial specifications EnerJ: Language support for low-power approximate computing Concurrency Exceptions Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 5

Deterministic Execution Why? Everybody knows (shared-memory) multithreading makes programming more difficult Why? Many reasons with cumulative effect One primary cause: non-determinism Run each test many times Hard to reproduce bug reports Cant keep replicas in sync Timing-based security attacks Cant reason sequentially (much less modularly) Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 6 Two levels of non-determinism Most languages are non-deterministic

Set of legal behaviors for a single program+input Often exponential in program size (Undefined evaluation order has an analogous problem) Most language implementations are non-deterministic Subject to whims of scheduler, hardware, temperature, (Undefined evaluation order not typically like this) Our focus: Implement non-deterministic languages deterministically: execution-level determinism Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 7 Complementary advantages Execution-Level Determinism Works for todays PLs for

wider range of applications No programmer effort Solves (1)-(4), with caveats Can leverage language nondeterminism for performance 1. 2. 3. 4. 5. Spring 2011 Language-Level Determinism Sequential semantics eases reasoning/verification More programmer control over run-time behavior Solves (1)-(5), fewer caveats Can leverage language

restrictions for performance Run each test many times Hard to reproduce bug reports Cant keep replicas in sync Timing-based security attacks Cant reason sequentially (much less modularly) Dan Grossman: Collaborating at the HW/SW Interface 8 Work at UW to date DMP: Special-purpose HW [Devietti et al, ASPLOS09] Initial algorithms, speculative or not dOS: Page-based OS service

[Bergan, OSDI10] Deterministic process groups, dealing with asynchronous inputs CoreDet: Compiler + Run-Time [Bergan et al, ASPLOS10] Static optimizations, parallel SW commit, non-speculative buffering RCDC: HW/SW hybrid [Devietti et al, ASPLOS11] HW only counts and buffers, more scalable algorithms Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

9 Determinism is easy Conceptual starting point: Deterministic serial execution deterministic quantum size (e.g., count instructions) + deterministic scheduling = determinism performance is hard Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 10

Can we do better? No need to serialize non-communicating instructions Ensure non-communication? Hold that thought Split a quantum at first (dynamic) may-communicate instruction parallel mode prefix and serial mode suffix Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 11 First approach: Ownership For each object, track ownership: owned by Ti or shared Any granularity of tracking is sound (cf. cache-line size) In parallel mode, can read owned by me or shared In parallel mode, can write owned by me

Therefore: no communication in parallel mode So how should we initialize and update ownership? Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 12 Ownership update So how should we initialize and update ownership? Correctness vs. performance: Any deterministic policy is correct Our policy, more or less: In serial mode, read changes owned by other to shared

In serial mode, write changes anything to owned by me (Cf. conventional multiprocessor cache, same locality reasons) Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 13 Performance needs Need: Short or non-existent serial mode Coarse-grained communication right ownership Balanced quanta Relatively short quanta, with rough instruction-cost model No run-time timing allowed Low overhead for quanta stopping, starting Relatively long quanta (uh-oh, see above) Low overhead for read/write barriers

Including static techniques to remove unnecessary barriers Ensuring correctness was how I got involved Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 14 An interesting optimization Suppose reads R1 and R2: Must alias (w.r.t. ownership tracking; can be different fields) R2 must follow R1 in same quantum (e.g., same basic block) Then R2 needs no instrumentation Are you sure? Easy case: R1 and R2 both happen in parallel mode Easy case: R1 and R2 both happen in serial mode Hard case: R1 in parallel mode, R2 in serial mode

Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 15 Why is this okay? Parallel Serial X : shared X : ??? T1: T2:

read(X) write(X) ... read(X) time After this round: Without optimization, X : shared With optimization, X : owned-by T1 Choice observably affects subsequent rounds Where parallel mode ends affects order of writes But its okay: Executable with or without optimization is deterministic Different deterministic executables (key degree of freedom) Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

16 The limitation So far: The ownership approach Plus: Low-ish overhead Check ownership and proceed Minus: Unnecessary transfers to serial mode Example: A lock acquire may transfer a whole data structure to Ti, but each quantum that first accesses a node has to serialize A scalability problem: Worse with more cores (Amdahl) So: The buffering approach More overhead but better scalability Standard trade-off in parallel systems Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

17 Buffering Any thread can access any address in parallel mode But: All writes are buffered All reads consult threads buffer (overhead!) Only synchronization operations end parallel mode before quantum end Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

18 Buffering After parallel mode, deterministically commit buffer Semantically serial But mostly in parallel Cf. graphics z-buffers Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 19 Buffering Short serial mode just for synchronization operations

The fence semantics of these operations forbids reordering with (buffered) writes Fixed in our ASPLOS11 paper (but HW/SW hybrid) Hopefully synchronization rare relative to quantum size Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 20 Buffering Determinism: Parallel modes isolated And end deterministically

Commit is semantically serial Serial is serial Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 21 Results: the short version Pure SW overhead is substantial But if it scales, a great use of extra cores Scalability depends on application (means can mislead) Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

22 Attacking overhead and scalability After HW in ASPLOS09 and SW in ASPLOS10, what are the general principles? Overhead is a SW problem that HW-assist can solve Count instructions, provide local buffers, expose via ISA Scalability is an algorithm problem Quantum balance requires careful tuning okay Buffering helps some applications, but fine-grained synchronization still suffers So a new algorithm that avoids serial mode Is ASPLOS11 just the 3rd of N years of more clever algorithms? No Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

23 Memory Models Old HW lesson: Weaker memory-consistency models scale better The reason for weak models Ownership approach was sequentially consistent (SC) Buffering was total-store order (TSO) May as well relax to happens-before (HB) The PL (cf. Java, C++) only gives you HB anyway But the PL does give you HB, so cant go weaker ASPLOS11 results impressive: 5-70% overhead But incomparable because assume HW/SW hybrid For some applications, HB significantly outperforms TSO Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

24 Secret Sauce Why does it matter that this has been a HW+SW project? Same students can argue the semantics, hack the compiler, run architectural simulations They dont find it weird to go to PLDI and ISCA Manage co-advisors with different skill sets Complementary perspectives on what is efficient Counting instructions vs. identifying thread-local data Distinguish algorithmic technique from implementation Can track ownership in HW, SW, or OS Its fun Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 25

Ongoing work The robustness problem Biggest caveat with respect to repeatability Benefits to testing Incorporating acceptable non-determinism Application to distributed systems Build on OS work Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 26 Related work Determinism is a hot topic, so please pardon omissions Deterministic languages:

DPJ [OOSPLA09, POPL11] NESL, Jade, ORCS, Execution-level determinism Kendo [ASPLOS09], good performance but assumes no data races Grace [OOPSLA09] System-level support: entire session at OSDI10 Record-and-replay: some of determinisms benefits Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 27 Outline DMP: Deterministic execution of multiple threads OSHA: Code-centric thread-sharing partial specifications [OOPSLA10]

EnerJ: Language support for low-power approximate computing Concurrency Exceptions Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 28 Shared memory For better or worse (probably worse), shared memory is one dominant paradigm for concurrent programming At PL and HW levels, default is implicit sharing But sharing/communication/interleaving should be rare Rare sharing enables successful program-analysis tools Data-centric: Race detectors, types for safe locking, sharing specs Isolation-centric: Atomicity checkers, TM, Code-centric: our work

Complementary: partially overlapping set of bugs (Orthogonal to static vs. dynamic analysis, language vs. tool, etc.) Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 29 Code-centric Consider shared-memory communication as between code points executing in different threads The writer and the subsequent reader Partial specification: which directed pairs are allowed enqueue dequeue

Specification meaning: At memory read, consider last write to same location If write performed by different thread, then pair must be in spec Code-centric: No mention of which memory, how much memory, synchronization protocol, etc. Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 30 Nice idea, but enqueue dequeue

Basic idea doesnt work without solving several problems 1. Helper functions Many writes delegated to callee unaware of sharing 2. Multiple abstraction levels Want to specify/check communication in terms of libraries and library clients (N levels deep, naturally) 3. Encapsulated sharing Often callers should be unaware of some communication 4. Convenient specifications App-specific properties require programmer annotations 5. Efficient enough checking to be more than documentation Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 31 Helper functions

enqueue if(buffer full) Util.resize() dequeue ?? Util.resize By our definition, Util.resize communicates to dequeue But dont want to specify that Util.resize communicates on behalf of its caller So specify it is conceptually inlined Actually the default Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

32 Multiple abstraction levels producer consumer enqueue dequeue array_put array_get Want to specify communication at each abstraction level separately Want to check all the levels

So the dynamic check requires matching call stacks Modulo inlining Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 33 Encapsulated Sharing producer consumer ?? enqueue size++ array_put

dequeue size-array_get Abstractions have internal communication that should not propagate Extend specification language to describe interfaces Dynamic checker stops stack processing after the (dequeue, enqueue) pair Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 34 Convenient specifications Use groups of writers & readers to avoid quadratic blowup Make inlining the default

Evaluation is empirical Thankfully, in good software, communication is sparse Caveat (feature?): No annotations is correct Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 35 A dynamic checker Semantically: On each write, store thread-id and call-stack with location On each read, compare thread-ids and call-stacks To get reasonable overhead for a debugging tool: Check thread-ids first Global memo-table of stack-pairs with full hash-consing For hot reader-stacks, an inline cache for hot writer-stacks

Bottom line: These fast-paths work! 1 execution with over 6 billion reads needed 697 stack-walks Unique communicating stack-pairs are rare Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 36 Debugging tool Usable heavyweight debugging tool # of different stacks matters more than amount of communication Not fast, but useful for humans handling concurrency errors Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

37 The Hardware Connection Turns out our OOPSLA10 paper has nothing to do with HW But across the lab, a different code-centric technique using history of HW coherence events and machine learning for automatic bug diagnosis [MICRO09] Hmm those ideas transfer/extend to software So while Im doing other things , they write a PLDI11 paper: Isolating and Understanding Concurrency Errors Using

Reconstructed Execution Fragments Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 38 Outline DMP: Deterministic execution of multiple threads OSHA: Code-centric thread-sharing partial specifications EnerJ: Language support for low-power approximate computing [PLDI2011] Concurrency Exceptions Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

39 Saving energy Modern platforms expend a huge amount of energy Largely to ensure you get the right answer Often not the right trade-off! Audio, Video Clustering, classification, information retrieval Monte Carlo simulation Game AI Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 40

Many ways to save Energy savings via approximation at various abstraction levels Hardware Lower voltage memory more likely to flip bits Smaller functional units with less precision Compiler Data placement Heuristically skip computations [Agarwal et al] Application Imprecise algorithm (e.g., smaller sample size) Our work: Use PL to make existing techniques safer, easier to use Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 41 Idea #1

Higher levels must tell lower levels what is approximable Ignoring how the lower level achieves approximation Leave amount-of-approximation (QoS) to best effort for now Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 42 Idea #2 Different libraries have different ways to approximate Clients should just indicate which uses are approximable Examples: Image data: okay to flip some bits (HW storage) Average: okay to round floating-point (HW computation) Average: okay to average a sample (alternate algorithm)

Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 43 Idea #3 Programmer must indicate where approximate can influence precise Catch bugs, control propagation of bad computations Just an information-flow problem! (Good!) Comfortable with Very simple but sound information-flow tracking Liberal use of explicit endorsement Most code is precise (the default) But hot computation often approximable Spring 2011

Dan Grossman: Collaborating at the HW/SW Interface 44 Putting it together With these ideas, a very nice area-opening project: Type qualifiers for precise, approx, and inherit-from-context Approx bits can be stored in low-energy hardware, but array lengths, pointers, and type tags cannot Overload methods with approximate versions Client chooses when approximation is allowed Run-time chooses when approximation is used Information-flow type system Keep it simple: No branches on approx data (endorse it) Formal semantics to show non-interference Spring 2011

Dan Grossman: Collaborating at the HW/SW Interface 45 More on data storage [my little piece] For memory safety, only (some) ints and floats can go in approximate storage How can the hardware provide a usable memory hierarchy How can the PL implementation use it (Not carefully considered by prior work on approx caches) Our answer (requiring knowledge of HW and SW) Each cache line can have voltage scaled Object layout just got interesting Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface

46 Example (32-byte cache lines) class X { @precise int a,b,c,d,e; @approx int f,g; } unpadded (bad for Y) padded (bad for X) class Y extends X { @precise int h,i; @approx int j,k; } X a b c

d e f g Y a b c h i j k d e f g X a b c f g d e -- -- Y a b c

f g j k d e h i Note: Partial approx lines waste space & energy Cant start next object on approx line Interesting, but in practice arrays matter the most First few elements in precise line with length (less savings) Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 47 Evaluation Annotate benchmark programs to identify legal approximation Measure burden (its low, details omitted) Using prior work, fake-but-realistic HW model to estimate

savings / error probability curve A simulator that injects faults probabilistically Get details like cache-line granularity right (chose unpadded) Measure energy savings and (application-defined) QoS Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 48 Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 49 Spring 2011

Dan Grossman: Collaborating at the HW/SW Interface 50 Future work What if we didnt assume novel hardware? Distinguish low-probability bit-flip from guaranteed rounding? Specifying more than best effort? Is our information-flow system too strong or too weak? A vision of controlled pay-as-you-go precision Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 51

Outline DMP: Deterministic execution of multiple threads OSHA: Code-centric thread-sharing partial specifications EnerJ: Language support for low-power approximate computing Concurrency Exceptions Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 52 Data-race exceptions Data race: read/write or write/write of same location by different threads without intervening synchronization What if the execution platform (e.g., HW) guaranteed precise data-race detection at run-time (e.g., trap)?

1. Catch bugs 2. Drastically simplify defining and implementing memoryconsistency models Everyone agrees on DRF implies SC, i.e., interleaving semantics in the absence of data races But the time between the two accesses in a data race isnt bounded Expensive to maintain all the metadata Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 53 Conflict exceptions [ISCA10]* A slightly weaker guarantee that lets the platform bound checking: (no exception) (sequential consistency) (sequential consistency) (data race) (exception) (no data race)

Advantages Still get some bug-checking, but allows missing long races Still simplify language semantics! *Lucia, Ceze, Strauss, Qadeer, Boehm [not me] Closely related work: DRFx [PLDI10] Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 54 Ongoing work Assuming data-race exceptions

or, in contrast, conflict exceptions what else could a compiler + run-time system do with them? Key idea: Data races may be really unlikely but not errors Hint: Simpler read and write barriers (recover from trap) Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 55 Thanks!

DMP: Deterministic execution of multiple threads OSHA: Code-centric thread-sharing partial specifications EnerJ: Language support for low-power approximate computing Concurrency Exceptions Focus on technology trends: multicore and power Let HW or SW or both be a question, but not the first question And, by the way, 3 of the 4 projects have used formal semantics http://sampa.cs.washington.edu Spring 2011 Dan Grossman: Collaborating at the HW/SW Interface 56 Owen Anderson

Magdalena Balazinska Tom Bergan Luis Ceze Joe Devietti Werner Dietl Kristi Morton

Laura Effinger-Dean Jacob Nelson Emily Fortuna Mark Oskin Benjamin Lerner Adrian Sampson

Brandon Lucia Benjamin Wood

Recently Viewed Presentations

  • PowerPoint Presentation

    PowerPoint Presentation

    Past: what is the likelihood that Marilyn Monroe committed suicide? Combining evidence and non-evidence. Always: Representation & Inference Basic Idea Attach degrees of belief to proposition. Theorem (de Finetti): Probability theory is the only way to do this. if someone...
  • Programming - 2005 New Graduation Program

    Programming - 2005 New Graduation Program

    Biol., Chem., Phys., Earth Sc., Sc&Tech. 11 4 Credits A Math 10 4 Credits A Math 11 4 Credits Physical Ed. 10 4 Credits Career Life Education 10 4 Credits ... UBC and Challenge. Advanced Placement . AP exams are...
  • Education Jobs Fund Program (MS PowerPoint)

    Education Jobs Fund Program (MS PowerPoint)

    The school agrees to share their academic progress monitoring data (ie. NWEA) , Native Star and NASIS data or any other data stated in the application with the Bureau of Indian Education as a means of documenting progress of the...
  • Domino's Pizza - New York University

    Domino's Pizza - New York University

    Corporate Governance is a relationship among stakeholders that is used to determine and control the strategic direction and performance of organizations Concerned with identifying ways to ensure that strategic decisions are made effectively Used in corporations to establish order between...
  • Ownership and Relevance

    Ownership and Relevance

    NORDEFCO "The mainaim and purpose of the Nordic DefenceCooperation (NORDEFCO) is to strengthen the participating nations' national defence, explorecommonsynergies and facilitateefficientcommon solutions."
  • Lrg Fcc 3-16

    Lrg Fcc 3-16

    Netflix and Hulu 0.05 All three 0.06 Amazon Prime only 7.0000000000000007E-2 Amazon and Hulu Hulu only 0.01 None 0.43. Evolution of the Video Marketplace and the Future of Television. FCC Workshop on the State of the Video Marketplace.
  • Successful Economic Union Requirements

    Successful Economic Union Requirements

    Successful Economic Union Requirements European Trade Areas - 1997* The European Economic Area: EU, EFTA, and Associates A Comparison of the EU and NAFTA Commonwealth of Independent States Comparison of Exports Among Members of APEC and the EC Future Multinational...
  • ATM OCN 100 Summer 2002

    ATM OCN 100 Summer 2002

    Times New Roman Arial Monotype Sorts Arial Narrow Arial Rounded MT Bold as02-00 Microsoft Word Document Microsoft Graph 2000 Chart Microsoft Excel Worksheet ATM OCN 100 - Summer 2001 LECTURE 1B Announcements Announcements ATM OCN 100 - Summer 2001 LECTURE...