Towards the Design of Tomorrow's Reliable Computing Systems

Reiley Jeyapaul, PhD, Compiler Microarchitecture Lab, ASU
Research Presentation @ ARM, Jan 08, 2015

Reliability in Computing?
  Definition of reliability: the quality of being dependable or reliable.
  In computing, how would you define reliability?
  Who cares about reliability?
    System user (you and me)
    System architect/designer
  Why does it matter, and to what extent? It is application dependent:
    Media applications: lower priority
    Financial applications: high priority
    Medical applications: critical priority

Faults in Systems
  Hard faults (permanent faults)
    Stuck-at faults: faulty circuit element (e.g., a wire or the output of a gate)
    Delay faults: effects of temperature, process variations
    Ageing: onset of physical wear-out
      Electromigration
      NBTI (Negative Bias Temperature Instability)
  Soft faults (temporary faults)
    Program errors: software bugs, incorrect initialization, etc.
    Environmental factors: cosmic particles, temperature, physical effects, electromagnetic interference
    Non-environmental factors: loose connections, ageing, process variations, noise

Saving Galileo
  1978: Galileo commissioned for Jupiter exploration
  1980: Design and architecture decided
  1982: Voyager reaches Jupiter
    Use of AT 2901 for attitude control
    Intermittent resets: sulfur ions from Jupiter's volcanic moon were being whipped up to high energy by the Jovian gravity.
  After extensive testing of Galileo, the chief engineer decided it was not worth flying if the soft error problem was not solved.
  Overheads: 5 years, 5 million dollars
    Sandia National Laboratories was subcontracted to custom-make a radiation-hardened AT 2901.

It started with nuclear tests
  1954-57: Nuclear tests
    Electronic anomalies in monitoring equipment
    Could not be traced to any hardware fault
    Equipment worked properly after restart
  1962: Wallmark and Marcus (RCA Labs, Princeton)
    "Minimum Size and Maximum Packing Density of Nonredundant Semiconductor Devices", March 1962
    Predicted that cosmic rays would start affecting microelectronics
  1962: Telstar, the first communication satellite

  July 9, 1962: Starfish Prime
    The United States tested a high-altitude nuclear device (called Starfish Prime), which super-energized the Earth's Van Allen Belt, where Telstar was in orbit.
    100X increase in radiation
    Rendered the satellite inoperable; it worked after reboot

Radioactive Contamination
  1978: Intel could not deliver chips to AT&T to upgrade its switching system from mechanical relays to ICs.
    May and Woods traced the problem to packaging.
    Packaging modules were contaminated with uranium from an old uranium mine upstream.
    They also proposed the Q_critical model of soft errors: the accumulated charge generated by a particle strike must exceed Q_critical to cause a fault.
  1986-87: IBM faced problems of radioactive contamination.
    Traced the problem to a distant chemical plant that used a radioactive contaminant to clean bottles that were used to store an acid required in the chip manufacturing process.

Fiscal Losses Mount
  2000: Sun Microsystems
    Cosmic ray strikes on L2 cache with defective error protection
    Caused Sun's flagship servers to suddenly and mysteriously crash!
    Baby Bell (Atlanta), America Online, eBay, and dozens of other corporations affected
    Verisign moved to IBM Unix servers (for the most part)
  2000: Cisco line routers
    Intermittent router resets, due to soft errors in the processor memory, affecting operation
  2004: Cypress Semiconductor
    Reported a number of incidents of soft errors
    A single soft error crashed an entire system farm
    Brought a billion-dollar automotive factory to a halt every month
  2005: HP server farm
    2048-CPU server at LANL crashed frequently

Reactions from Companies
  Fujitsu SPARC in 130 nm technology
    80% of 200k latches protected with parity
    Compare with very few latches protected in McKinley
    ISSCC, 2003
  IBM declared 1000 years system MTBF as a product goal
    Very hard to achieve this goal in a cost-effective way
    Bossen, 2002 IRPS Workshop Talk
  NVIDIA Fermi GPUs
    Protect all memory and register files using ECC

Soft Error Skeptics
  Applications crash more often due to software bugs

    Limited number of bugs in mature software (e.g., servers, company environment)
    If we don't do anything, soft errors will be the dominant failure rate
  This is a server problem, not a desktop problem
    Definitely a server (e.g., data center) problem
    Desktop problem from the IT manager's point of view
  Soft error rates are increasing exponentially with scaling
    Will soon become a problem even for embedded systems
  Soft error is not a problem today
    Industry is at the cross-over point
    The future is worse, IF we don't do anything

Radiation Induced Soft Errors
  Typically, the two time constants of the induced current pulse are 1.64 x 10^-10 sec and 5.10 x 10^-11 sec
  Induced current has a rapid rise time but a more gradual fall time

Soft Error Trends
  DRAM: system error rate of DRAMs is fairly constant
  SRAM: increasing exponentially
  Logic: increasing exponentially

Increasing Soft Error Rates
  Reducing feature sizes and lower supply voltage
    More transistors per chip
    Decreasing node capacitance and noise margins

    Q_critical reducing
    Exponentially more low-energy particles than high-energy ones
  More functionality is moving on-chip
    Higher probability of error due to more faults
  Increasing clock rates
    Larger fraction of time between setup and hold times, so transient pulses are latched more readily

One Failure per Day per Chip [Shivakumar et al 2002]
  Soft error rates could increase from one error per year to one error per day in a decade!

Transient Faults, Bit Flips, Soft Errors, etc.
  [Figure: FAULT, then activation after the fault latency, then ERROR, then propagation after the error latency, then FAILURE. Labels: Transient Faults; Circuit Masking; Logical Device (e.g., ALU); Sequential Device (e.g., FF); Storage Device (e.g., Memory, Cache, Registers); Bit Flips = Transient Faults = Soft Errors; MA/SW Masking; Processor Pipeline; System (e.g., Crash); System Failures]

Approaching the Soft Error Problem: Hardware Interface
  Visible: processor hardware, chip design
  Obscured: application behavior
  Soft error perspective: Fault -> Soft Error (bit-flip)
  Protection: correct the soft error

Processor Hardware Layers
  Device Level
  Circuit Level
  Chip Packaging

  Processor Microarchitecture

Hardware Techniques
  Transient faults are incident on the hardware bit
    Transistors and gates are affected by transient faults
  Attack the problem at its source
    Packaging techniques to shield the transistors
      Refining packaging material
    Circuit and device techniques to make them resilient to transient faults
      SOI (Silicon on Insulator), fault-resilient (hardened) latches
    Architecture techniques to detect and correct bit-faults
      Parity (detect only)
      ECC (1-bit detect and correct)
      SECDED (1-bit detect and correct, 2-bit detect)

Circuitry: Electrical Masking
  Pulse attenuated by electrical resistance in the circuit
  Pulse still strong enough to be latched at the output

Circuitry: Logical Masking
  Value unchanged at the gate
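Logical masking at a single gate can be sketched as follows. This is a minimal illustration, assuming a transient fault flips exactly one input of a two-input gate; the gate functions and the helper names `is_logically_masked` and `masking_fraction` are hypothetical and do not come from the presentation.

```python
from itertools import product


def and_gate(a, b):
    return a & b


def or_gate(a, b):
    return a | b


def xor_gate(a, b):
    return a ^ b


def is_logically_masked(gate, a, b):
    """Return True if flipping input `a` leaves the gate output unchanged."""
    return gate(a, b) == gate(a ^ 1, b)


def masking_fraction(gate):
    """Fraction of the four input combinations for which a flip on `a` is masked."""
    cases = list(product((0, 1), repeat=2))
    masked = sum(is_logically_masked(gate, a, b) for a, b in cases)
    return masked / len(cases)


if __name__ == "__main__":
    for name, gate in (("AND", and_gate), ("OR", or_gate), ("XOR", xor_gate)):
        print(name, masking_fraction(gate))
```

An AND gate masks the flipped input whenever the other input is 0, an OR gate whenever the other input is 1, and an XOR gate never masks it, so the flipped value always propagates to the output.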

Circuitry: Logical Masking
  Error propagated to the output

Circuitry: Temporal Masking
  Transient Fault -> Soft Error
  A transient pulse relative to the latching window:
    1) Before t_setup: masked (not latched)
    2) After t_setup, before t_hold: race condition
    3) At the latching window: not masked (latched)
  [Firouzi ROCS 2010]

Hardware Techniques for Protection
  Device level techniques
    Shielding at the package layer
    Method to prevent strikes
    Limited by packaging design/cost and technology available
    Scalability, design flexibility and cost are concerns
    Fabrication cost governs commercial applicability
  Circuit level techniques
    Masking effects: electrical masking, temporal masking, logical masking
    Circuit implementations enhance the masking effects to protect against SEUs
    Limitations: hardware overheads of area/power and design cost
    Overhead vs. need governs commercial applicability of methods
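The parity and SECDED schemes named among the architecture techniques can be illustrated with a small sketch. It assumes a 4-bit data word and an extended Hamming (8,4) code, chosen only to keep the example short; the presentation does not say which code or word size any real design uses, and the function names are hypothetical.

```python
def even_parity(bits):
    """Parity bit for detect-only protection: XOR of all bits."""
    p = 0
    for b in bits:
        p ^= b
    return p


def secded_encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity; indices 1, 2, 4 hold Hamming parity bits;
    indices 3, 5, 6, 7 hold the data bits.
    """
    c = [0] * 8
    c[3], c[5], c[6], c[7] = data
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = even_parity(c[1:])
    return c


def secded_decode(codeword):
    """Return (data, status); corrects 1-bit errors, detects 2-bit errors."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = even_parity(c)         # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        c[syndrome] ^= 1             # single flip: syndrome is its position
        status = "corrected"
    else:
        status = "double-error detected"
    return [c[3], c[5], c[6], c[7]], status


if __name__ == "__main__":
    word = [1, 0, 1, 1]
    cw = secded_encode(word)
    cw[5] ^= 1                       # inject a single bit flip
    print(secded_decode(cw))
```

Plain parity costs one extra bit per word but can only signal that an odd number of bits flipped. The SECDED sketch spends four check bits on four data bits to locate and repair one flipped bit and to flag two; real memories amortize this over wider words.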

Approaching the Soft Error Problem: Software Interface
  Visible: application behavior
  Obscured: processor microarchitecture, chip design
  Soft error perspective: anomaly in program execution
  Protection: recover from incorrect execution

How do Soft Errors Manifest in the System?
  [Figure: Application/Software -> Compiler -> Executable Binary -> PROCESSOR -> Program Output. Random charged particles cause bit-errors in h/w components.]
  Outcomes from a soft error:
    1) Output data corruption
    2) Incorrect program execution
    3) System crash
    4) Silent undetected data corruption
  Masked soft errors:
    5) Correct program execution
    6) No system crash
  Not all bit-errors in the processor hardware translate into system-level errors (or failures). The reason is masking.

Software-level Masking Effects: Logical/Arithmetic Masking
  Program: If A > 0 Block 1 Then Block 2 EndIf
  Scenario 1: A = 5 (binary 0101) becomes A = 7 (binary 0111) after a bit flip. Expected: Block 1 executed.
  Scenario 2: A = 0 (binary 0000) becomes A = 2 (binary 0010) after a bit flip. Expected: Block 2 is NOT executed.
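The two scenarios can be replayed in a few lines. This is a sketch under stated assumptions: `A` is treated as an unsigned 4-bit value, only the branch condition `A > 0` is modeled (not the blocks themselves), and the helper names are hypothetical.

```python
def flip_bit(value, bit):
    """Model a single-bit soft error by toggling one bit of `value`."""
    return value ^ (1 << bit)


def branch_taken(a):
    """The condition used on the slide."""
    return a > 0


def flip_is_masked(a, bit):
    """True if the bit flip does not change the branch decision."""
    return branch_taken(a) == branch_taken(flip_bit(a, bit))


def masked_bits(a, width=4):
    """Bit positions of an unsigned `width`-bit value whose flip is masked."""
    return [bit for bit in range(width) if flip_is_masked(a, bit)]


if __name__ == "__main__":
    print(flip_bit(5, 1), flip_is_masked(5, 1))   # Scenario 1: 5 becomes 7
    print(flip_bit(0, 1), flip_is_masked(0, 1))   # Scenario 2: 0 becomes 2
```

With `A = 5`, any single flip in the low four bits still leaves a positive value, so the comparison masks the error. With `A = 0`, every single flip makes the value positive and changes the control flow.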

Software-level Masking Effects: Control Flow Masking
  Program: If A > 0 B = B + 5 Then C = C + 2 EndIf
  Scenario 1: A = 5, B = 5, C = 10
  Scenario 2: A = 0, B = 5, C = 10
  In one scenario the error in B is masked; in the other, the error in B manifests into a failure. The error is masked when the statement that uses B is not executed.

Software-level Masking Effects: Other Masking Effects
  Data-value masking: if the value of one variable in a multiplication is 0, an error in the other variable is masked.
  An error in a variable that is reset before its future use in the program is masked, e.g., an error in the index variable of a for loop.
  Dynamic dead code: an error in a variable used in a computation block results in incorrect temporary data. But if this temporary data is not used in the computation of the program output or in execution, the error is masked.
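The three effects listed under other masking effects can each be shown with a toy function. This is only a sketch: the functions, values, and the `fault_bit` parameter are hypothetical and stand in for a fault injected into one variable.

```python
def flip_bit(value, bit):
    """Model a single-bit soft error by toggling one bit of `value`."""
    return value ^ (1 << bit)


def multiply(x, y, fault_bit=None):
    """Data-value masking: a fault in x is invisible when y is 0."""
    if fault_bit is not None:
        x = flip_bit(x, fault_bit)
    return x * y


def sum_values(values, stale_index):
    """Reset-before-use: the loop overwrites i before it is ever read."""
    i = stale_index          # possibly corrupted value left over in i
    total = 0
    for i in range(len(values)):
        total += values[i]
    return total


def output_with_dead_temp(x, y, fault_bit=None):
    """Dynamic dead code: temp is computed from x but never reaches the output."""
    if fault_bit is not None:
        x = flip_bit(x, fault_bit)
    temp = x * 3             # incorrect if x was corrupted, but never used
    return y + 1


if __name__ == "__main__":
    print(multiply(6, 0), multiply(6, 0, fault_bit=2))   # same result
    print(multiply(6, 3), multiply(6, 3, fault_bit=2))   # results differ
```

In each masked case the corrupted value is either multiplied away, overwritten, or discarded before it can influence the program output.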

[Figure repeated: Transient Faults, Bit Flips, Soft Errors, etc.]

Software Techniques for Protection
  The key ideas include:
    Reduce the time that vulnerable data resides on the components
    Detect and correct errors (if any) after execution through the components
  Software-based techniques vary based on the components protected (coverage):
    L1 cache
    Register file protection
    Pipeline core and buffers
  Redundancy-based techniques
  Control-flow-based techniques

Research Front
  Error detection and recovery techniques
    Recovery through re-start may be an acceptable option
    Cost of an integrated recovery mechanism is substantial
    Recovery through re-start may not be acceptable for HPC systems
  Advantages of compiler-based methods
    Can analyze soft errors that transcend the microarchitecture- and software-level masking effects
    Can implement smart optimizations for efficient protection
  Limitations of software approaches
    Granularity of optimizations is larger, which is a limitation
    Code added for detection/correction is itself vulnerable to soft errors and may therefore cancel out the protection achieved

Approaching the Soft Error Problem: Hardware-Software Interface
  Visible: processor hardware, application behavior

  Obscured: chip design
  Soft error perspective: Fault -> Soft Error (bit-flip) -> Failure
  Protection: protect against soft-error-induced failures

How to Estimate Soft Errors?
  Soft error: a data bit-flip during program execution that translates into erroneous output.
  [Figure: Application Binary -> Processor Pipeline (Buffers, Register File, Cache (Instruction/Data)) -> Output]
  Analysis is most relevant at the interface between a data bit and a sequential element.

Data Vulnerability & Soft Errors
  Data processed by the system is exposed by the hardware components in the processor to:
    charged particles that strike the processor
    other sources of transient errors: electrical noise, cross-talk, etc.
  Exposed data is vulnerable to bit-flips that manifest into failures in the system as:
    Erroneous output
    System failure
    Output data errors
  The probability of a bit-error manifesting into a soft error depends on the time duration the data is exposed in h/w.
  An error on an exposed data bit in the processor could lead to system errors if the bit will be used in the process execution in the system.
  Only the exposed and actively used data bits in the system are deemed vulnerable during process execution.

Vulnerability in the Cache
  [Figure: Data Cache and Instruction Cache timelines with events I, R, W, and E marked at times t0 through t7]

Vulnerability Distribution in the processor with protected Cache
  [Figure]
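The idea that only exposed and actively used data counts as vulnerable can be sketched for a single cached datum. The event meanings are an assumption, since the figure did not survive extraction: `I` is read here as the incoming fill, `R` as a read, `W` as a write that overwrites the whole datum, and `E` as eviction. The accounting rule (time before a read counts, time before a write does not, time before eviction counts only for dirty data) is likewise a common convention assumed for illustration, not something stated on the slides.

```python
def vulnerable_time(trace):
    """Sum the vulnerable intervals of one cached datum.

    `trace` is a time-ordered list of (time, event) pairs with events
    "I" (fill), "R" (read), "W" (write), "E" (eviction).
    """
    total = 0
    dirty = False
    last = None                      # time of the previous event while resident
    for time, event in trace:
        if event == "I":
            dirty = False            # freshly filled data matches memory
        elif last is None:
            continue                 # access to non-resident data: ignored here
        elif event == "R":
            total += time - last     # a flip in this interval would be consumed
        elif event == "W":
            dirty = True             # old value is overwritten, so not vulnerable
        elif event == "E":
            if dirty:
                total += time - last # a flip would be written back to memory
            last = None
            continue
        last = time
    return total


if __name__ == "__main__":
    data_line = [(0, "I"), (2, "R"), (3, "W"), (5, "R"), (6, "W"), (9, "E")]
    clean_line = [(0, "I"), (1, "R"), (4, "E")]
    print(vulnerable_time(data_line), vulnerable_time(clean_line))
```

Under these assumptions a clean line stops being vulnerable after its last read, which is why instruction and read-only data tend to accumulate less vulnerable time than frequently written-back data.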

Code Transformations for Vulnerability Reduction
  Vulnerability trend is not the same as performance
  Interesting configurations exist, with either low vulnerability or low runtime. [Shrivastava et al 2010]
  Loop interchange on matrix multiplication: 52X variation in vulnerability for 1% variation in runtime
  Opportunities may exist to trade off a little runtime for large savings in vulnerability

Vulnerability depends on the data access pattern

  Low vulnerability but bad performance:

    for (i = 0; i < N; i++) {
      for (k = 0; k < N; k++) {
        for (j = 0; j < N; j++) {
          A[i][k] += B[i][j] * C[j][k];
        }
      }
    }

    A[i][k] is completely computed in the innermost loop: shorter lifetime of A[i][k]

  High vulnerability but good performance:

    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
          A[i][k] += B[i][j] * C[j][k];
        }
      }
    }

    A[i][k] is needed across iterations of the outer (j) loop: longer lifetime of A[i][k]

Soft Error Protection at H/w-S/w Interface
  Vulnerability of data is directly proportional to the probability of soft error failures
    Fault injection simulation
    Static estimation
  Estimation of vulnerability is a key factor for:

    Comparative analysis of two protection designs
    Design space exploration of architecture designs

Interesting Research Questions
  How to design cross-layer protection?
    What component should be protected, and at which layer?
    How to coordinate among the techniques across layers?
  Evaluation metric for soft error protection
    Compiler has immense potential to contribute, but lacks comprehensive static estimation methods
  Time-based vulnerability in systems
    Vulnerability of an application is not static through time
    No comprehensive system-level estimation method
  Vulnerability of multi-core systems
    Inter-thread communication affects vulnerability analysis
  Soft error protection in HPC systems
    Communication overhead limits use of multi-core techniques

Thank You!
