BIG DATA TECHNOLOGIES LECTURE 2: ARCHITECTURES AND SYSTEMS FOR PARALLEL AND DISTRIBUTED COMPUTING Assoc. Prof. Marc FRNCU, Phd. Habil. [email protected] WHAT IS A PARALLEL ARCHITECTURE A collection of processing elements that cooperate to solve large problems fast Almasi and Gottlieb, 1989 ARM11 (2002-2005) 90% of the embedded systems is based on ARM processors Rasberry Pi Pipeline 8

in 8 steps in-flight instructions (out-of-order) Intel Pentium 4 (2000-2008) 124 in-flight instructions in 31 steps Superscalar: in processor instruction level parallelism Pipeline Intel Skylake (august 2015) Quad core threads per core GPU 2

Tianhe-2 (2013) 32.000 Intel Xeon CPUs with12 cores Intel Xeon Phi CPUs with 57 cores 34 petaflops 48.000 Sunway Taihulight First place in the world in November 2017 SW26010 CPUs with 256 cores 125 petaflops 40.960 Google Cloud Cluster 10

farms Tbps bandwidth (US-Japan) MOTIVATION The performance of sequential systems is limited Computation/data transfer through logic gates, memory devices Latency <> 0 Assuming an ideal environment we still have the limitation due to the light speed Many applications requires performance Nuclear

reaction modelling Climate modelling Big Data Google: 40.000 searches/min Facebook: 1 billion users every day Technological breakthrough What do we do with so many transistors? EXEMPLES Titan Supercomputer 299,000 AMD x86 cores 186,000 NVIDIA GPUs

TRANSISTORS VS. SPEED More transistors pee chip equals more speed Electronic devices that act as switches built to act as logical gates The speed of each operation is given by the time required by each transistor to stop without causing any errors A small transistor will stop/start faster Exemple: 3Ghz = 3 billion ops/sec Increase density to increase speed Intel P4: 170 milllion transistors Intel 15-core Xeon IVY Bridge: 4.3 billion

On average, each year the processing power increased by 60% TRANSISTORS VS. SPEED Moore law Ideally: No. transistors doubles each year (2x) In reality: 5x every 5 years (1,35x) TRANSISTORS VS. SPEED Why cant we increase speed forever? Size of chip = constant But, density of transistors increases 2018: 10nm Intel Cannonlake

Dennards scalability The power (V) needed to operate the transistors = constant even if the number of transistors per chip increases But this is no longer true (as of 2006)! The transistors is are becoming so small that their integrity breaks and they leak current The faster we turn off/on a transistor the more heat it generates For 8,5-9 GHz we require liquid nitrogen! HOW DO WE REDUCE THE POWER CONSUMPTION? Frequency voltage Gate energy voltage2 Executing the same number of cycles at a lower voltage and speed power economy

Example Task with deadline of 100ms Method #1: 50ms at full speed then 50 ms idle Method #2: 100ms at frequency/2 and voltage/2 Energy requirement: energy = voltage/4 4x cheaper SEQUENTIAL PROCESSING 6 hours to wash 4 rows of clothes

PIPELINE SEQUENTIAL PROCESSING Pipeline = start processing IMMEDIATELY Improves system throughput Multiple tasks operate in parallel Reduces time to 3.5 hours INSTRUCTION LEVEL PARALLELISM (ILP) CPU PIPELINE CONDITIONAL PIPELINE (BRANCHING) What happens when we have dependencies between instructions? Especially for if-else branching The processor must erase instructions fetched through the pipeline since reaching a branch means they were incorrectly fetched, i.e., we must fetch the instructions corresponding to the chosen branch.

AMD Ryzen (2017) generate fetch neural networks to predict the uses fetch decode executestore address operands execution path i+2 is a branch instruction: execute either j or i+3 https:// SUPERSCALAR ARCHITECTURES Execute more than one instruction per CPU cycle Send multiple instructions to redundant units Hyperthreading (Intel)

Example Intel P4 One instruction (thread) processes integers (ALU unit for integers) another processes floating point numbers (ALU unit for floats) The OS thinks it deals with 2 processors Is accomplished by combining a series of shared, replicated or partitioned resources: Registers Arithmetic units Cache memory AMDAHLS LAW

How much can we improve by parallelizing? Example Floating point instructions Theoretical speedup: 2x Percentage of the total #instructions:10% 1,053 END OF THE SINGLE CORE? Frequency stopped scalability Dennards scalability

Memory wall Data and instruction must be fetched to the registries (cache) Memory becomes the critical point The ILP wall Dependencies between instructions limit the ILP efficiency MULTI-CORE SOLUTION More cores on the same CPU Better than hyperthreading Real parallelism

Example Reducing speed (frequency) by 30% reduces power by 35% Power frequency3 (or worse) But performance is also reduced by 30% Having 2 cores per chip at 70% speed 140% of the original performance at 70% of the power 40% increase in power at 30% savings in energy IBM POWER4 Introduced n 2001 MULTICORE Intel i7

6 cores 12 threads Parallelism through hyperthreading DIFFERENT ARCHITECTURES Intel Nehalem (Atom, i3, i5, i7) Cores are linked through the QuickPathInterconnect (QPI) Mesh architecture (all with all) 25,6 GB/s Older Intel versions (Penryn) Cores are linked through the FSB (Front Side Bus) Sequential architecture

3,2 GB/S (P4) 12,8 GB/s (Core 2 Extreme) AMD INFINITY FABRIC HyperTransport protocol 30-50 GB/s 512 GB/s for GPU Vega Mesh network Network on a chip, clustering Link between GPUs and SoC CCIX standard: accelerators, FPGA-uri MANY CORE Systems with tens or hundreds of cores Developed for parallel computing High Throughput and low energy consumption (sacrifices

latency) Problems such as cache coherence in multi-core systems (few cores) They use: mesage passing, DMA, PGAS (Partitioned Global Access Space) Not efficient for applications using just one thread Example Xeon Phi with 59-72 cores GPUs: Tesla K80 with 4992 CUDA cores CPU VS. GPU ARCHITECTURE GPU has more transistors for computations INTEL XEON PHI ARCHITECTURE PARALLEL PROCESSING MODELS Flynn classification SISD (Single Instruction Single Data)

Uniprocesor MISD (Multiple Instruction Single Data) Multiple processors on a single data stream Sistolic processor Stream processor SIMD (Single Instruction Multiple Data) Same instruction is executed on multiple processors Each processor has its own memory (different data) Shared memory for control and instructions Good for data level parallelism Vector processor GPUs (partialy) MIMD (Multiple Instruction Multiple Data)

Each processor has its own data and instructions Multiprocessor Processor multithread CENTRALIZED SHARED MEMORY MULTIPROCESSORS Symmetric MultiProcessor (SMP) Processors are interconected through a UMA (Uniform Memory Access) backbone Do no scale well DISTRIBUTED SHARED MEMORY MULTIPROCESSORS SMP clusters NUMA (Non Uniform Memory Access)

Physical memory for each processor (but address space is shared) Avoids starvation: memory is accessed by one processor at a time AMD processors implement the model through HyperTransport (2003), and Intel through QMI (2007) DISTRIBUTED SHARED MEMORY MULTIPROCESSORS Enhanced scalability Low latency for local access Scalable memory bandwidth at low costs But

Higher interprocessor communication times Network technology is vital! Complex software model MESSAGE PASSING MULTIPROCESSORS Multicomputers Communication is based on message passing between processors and not shared memory access Can call remote methods: Remote Procedure Call (RPC) Libraries: MPI (Message Passing Interface) Synchronous communication

Causes process synchronization Address space allocated private addresses to each distinct processors Example Massive Parallel Processing (MPP) IBM Bluegene Clusters Created by linked computers in a LAN SHARED MEMORY VS. MESSAGE

PASSING Shared memory Easy programming Hides the network but does not hide its latency Hardware controlled software To reduce communication overhead Message passing Explicit Can be optimized Natural

communication synchronization When sending messages Programming is challenging as it must consider aspects hidden by the shared memory systems Transport cost DISTRIBUTED SYSTEMS A collection of (probably heterogeneous) automata whose distribution is transparent to the user so that the system appears as one local machine. This is in contrast to a network, where the user is aware that there are several machines, and their location, storage replication, load balancing and functionality is not transparent. Distributed systems usually use some kind of clientserver organization. FOLDOC

A Distributed System comprises several single components on different computers, which normally do not operate using shared memory and as a consequence communicate via the exchange of messages. The various components involved cooperate to achieve a common objective such as the performing of a business process. Schill & Springer PARALLEL VS. DISTRIBUTED SYSTEMS Parallelism Executing multiple tasks at the same time True parallelism requires having as many cores as parallel tasks Concurrent execution Thread based computing Can use hardware parallelism but usually derives from software

requirements Example: the effects of multiple system calls Become parallelism if parallelism is true Distributed computing Refers to where the computation is performed Computers are linked in a network Memory is distributed Is usually part of the objective If resources are distributed then we have a distributed system Raises

many problems from a programming point of view No global clock, synchronization, unpredicted errors, variable latency, security, interoperatibiliy. DISTRIBUTED SYSTEMS MODELS Miniccomputer Workstation Server workstation Processor pool Cluster Grid Cloud MINICOMPUTER Extension of time sharing systems The user logs on the machine Authenticates remotely though telnet Shared resources

Data bases HPC Minicomputer Minicomputer ARPA net Minicomputer WORKSTATION Process migration The user authenticates on the machine If any networked resources are available then the process migrates there Issues

How do you identify available resources? How do we migrate a process? What happens if another user logs on the available resource? Workstation Workstation 100Gbps LAN Workstation Workstation Workstation WORKSTATION SERVER Client stations

Workstation Servers No hard memory Interactive/graphical processes are executed locally All files and computations are sent on the server Each machine is dedicated to a certain type of job Communication model RPC (Remote Procedure Call)

Workstation C RMI (Remote Method Invocation) Workstation Java A client process invokes a server process There is no migration between machines 100Gbps LAN MiniMiniMiniComputerComputer Computer file server http servercycle server PROCESSOR POOL

Client User authenticates on remote machine All services are sent to servers Server Allocates the required number of processors to each client Better usage but less interaction 100Gbps LAN Server 1 Server N CLUSTER

Client Client-server model Server Consists of many interconnected machines through a high speed network The aim is performance Parallel processing Workstation Workstation Workstation

100Gbps LAN http server2 http server1 http server N MasterSlave Slave node 1 2 1Gbps SAN Slave N GRID Aim

Collect processing power from many clusters or parallel systems and make it available to users Similar in concept with a power grid Remote resources are integrated with local ones SuperMinicomputer computer Cluster High-speed Information high way Supercomputer Cluster Big Data

Large problems requiring many resources On demand You just buy what you use HPC distributed computing Workstation Data is distributed Shared computing

Communication between resources Workstation Workstation CLOUD Distributed systems where access to resources is virtualized and on demand, while keeping the topology hidden Per per use (per second, GB, query, etc.) Access levels

Infrastructure (Iaas) Platform (Paas) Services (Saas) Data (DaaS) Amazon EC2 Google Compute Cloud Microsoft Azure Workstation Specific services

VM VM Internet VM Database Workstation Workstation SUMMARY Shared memory Homogeneous resources Nsec access Message passing Homogeneous/heterogeneous resources Microsec access

Distributed systems Heterogeneous resources Msec access SOURCES sed-last-years / http:// https:// nts http:// skadron/cs433_s09_processors/arm11.pdf wuch/course/eef011/4p/eef011-6.pdf NEXT LECTURE Parallelizing algorithms APIs and platforms for parallel and distributed computing OpenMP MPI Uniform CUDA Hadoop Parallel C

Recently Viewed Presentations

  • Historical Fiction WHAT IS HISTORICAL FICTION? Historical fiction

    Historical Fiction WHAT IS HISTORICAL FICTION? Historical fiction

    Bud, Not Buddy. Fictional Story: Ten-year-old Bud is a run-away orphan on a mission. His mother never told him who his father was, but she left Bud some clues before he died. Some of those clues included posters of jazz...
  • Department of Forest Biomaterials 1 WHAT IS MARKETPLACE?

    Department of Forest Biomaterials 1 WHAT IS MARKETPLACE?

    Complete FedEx shipping section on the 'My Profile' page in MarketPlace (see below). ... ("5" account) and other accounts. Neither approver will take action on the order and it will "sit" indefinitely. When satisfied with the entry,
  • Shastri School - Amazon Web Services

    Shastri School - Amazon Web Services

    By: Sruthi Ramakrishnan and Swathi Ramakrishnan Project Stewards, Swathi and Sruthi, in front of the school building This is our sixth visit to Shastri School. We have enjoyed seeing the school develop through the years since we first got involved...
  • Shared Culture  Both were independent city-states  Both speak

    Shared Culture Both were independent city-states Both speak

    Girls received little to no education. Most learned household crafts (weaving/sewing) ... First priority of citizens was to be a good soldier. All men served until they were 60. Most disciplined, well-trained, and feared military force in the world ......
  • CVPR05 tutorial - Cornell University

    CVPR05 tutorial - Cornell University

    A good property of a transform is invertibility. Both Fourier and wavelet transforms are invertible. Many other image-based processes are not invertible. E.g. Distance transform, JPEG compression, edge detection, blurring
  • The &quot;Red&quot; Scare -

    The "Red" Scare -

    The "Red" Scare McCarthyism in the 40's and 50's The "Red" Scare During the late 40's and early 50's Americans feared that communists were trying to take over the US. This was called "red" scare. Spy Trials Several Americans were...
  • Basic Presentation Outline for a 10 Minute Speech (adjust ...

    Basic Presentation Outline for a 10 Minute Speech (adjust ...

    Basic Presentation Outline for a 10 Minute Speech (adjust accordingly) Body (7 minutes) ... Word choice: Know your definitions, avoid speaking like a thesaurus. Avoid jargon and technical terms when possible—if they are necessary, define any terms people might not...
  • # of Experts # of Experts Very Essential

    # of Experts # of Experts Very Essential

    Small Scale - Top Two By Area Clinical Trials in Schizophrenia Test-Retest Reliability Lack of Ceiling/Floor Effects Cognitive Neuroscience of Schizophrenia Construct Validity Test-Retest Reliability Cognitive Rehab of Schizophrenia Construct Validity Test-Retest Reliability Human Cognitive Neuroscience Construct Validity Lack of...