Multicore Cache Coherence Control by a Parallelizing Compiler

Multicore Cache Coherence Control by a Parallelizing Compiler

Software Cache Coherent Control by Parallelizing Compiler Boma A. Adhi, Masayoshi Mase, Yuhei Hosokawa, Yohei Kishimoto Taisuke Onishi, Hiroki Mikami, Keiji Kimura, Hironori Kasahara Department of Computer Science and Engineering Waseda University Waseda University @ LCPC 2017, Texas A&M University What is Cache Coherency? PE0 PE0 Cache - - - - PE1 CPU CPU cache cache - - - - Cache coherency: Data should be

consistent from all CPU Core Interconnection Network Shared memory 1 x Waseda University @ LCPC 2017, Texas A&M University - - - PE1 Cache What is Cache Coherency? PE0 PE1 load x; load x; PE0 Cache 1 - - - CPU CPU

cache cache 1 - - - x x Interconnection Network Shared memory 1 x Waseda University @ LCPC 2017, Texas A&M University - - - PE1 Cache What is Cache Coherency? PE0 PE1 x = 2; PE0 Cache

2 - - - CPU 1 CPU - - - PE1 Cache x x x is now invalid cache cache Interconnection Network Shared memory 1 x Waseda University @ LCPC 2017, Texas A&M University -

- - Why Compiler Controlled Cache Coherency? Current many-cores CPU (Xeon Phi, TilePro 64, etc) uses hardware cache coherency mechanism. Hardware cache coherency for many-cores (hundreds to thousand cores) : Hardware will be complex & occupy large area in silicon Huge power consumption Design takes long time & larger cost Software based coherency: Small hardware, low power & scalable No efficient software coherence control method has been proposed. Waseda University @ LCPC 2017, Texas A&M University Proposed Software Coherence Method on OSCAR Parallelizing Compiler Coarse grain task parallelization with earliest condition analysis (control and data dependency analysis). OSCAR compiler automatically controls coherence using following simple program restructuring methods: To cope with stale data problems: Data synchronization by compilers To cope with false sharing problem: Data Alignment Array Padding Non-cacheable Buffer MTG generated by earliest executable condition analysis Waseda University @ LCPC 2017, Texas A&M University The Renesas RP2 Embedded Multicore

2 clusters of 4 cores hardware cache coherent control SMP, No coherence support for more than 5 cores Waseda University @ LCPC 2017, Texas A&M University Non Coherent Cache Architecture Local Cache V D tag data V: valid bit D: dirty bit local cache control instructions from the owner CPU self-invalidate Writeback flush PE0 PE1 PEn CPU CPU CPU Local Cache

Local Cache Local Cache Interconnection Network Shared Memory Waseda University @ LCPC 2017, Texas A&M University Problem in Non-coherent Architecture (1): Stale Data a b c Stale Data: Global Variable Declaration int a = 0; int b = 0; int c = 10; Shared Memory PE0 PE1 b = a; An obsolete shared data exists on a processor cache after a new value is updated by another processor core

a = 0; b = 0; PE1 Cache Time PE0 Cache a = 20; a = 20; If the data on the cache continue to exist c = a; Correct value with coherent control PE1 Cache a=20 by PE0 is not published and PE1 calculates c using stale data Value is not coherent Waseda University @ LCPC 2017, Texas A&M University a = 0; c = 0; Solving Stale Data Problem The compiler automatically inserts a writeback instruction and a synchronization barrier on PE0 code. Then, it inserts a selfinvalidate instruction on PE1 after the synchronization barrier. Waseda University @ LCPC 2017, Texas A&M University

Problem in Non-coherent Architecture (2): False Sharing Global Valuable False sharing: A condition which multiple cores share the same memory block or cache line. Then, each processor updates the different elements on the same block/cache line. Declaration int a = 0; int b = 0; PE0 a and b are different elements on the same cache line PE1 a = 10; b = 20; Shared Memory PE0 Cache line a b Line replacement

Line replacement With cache coherency: PE1 Cache line Memory a b Depending on the timing of the line replacement, the stored data will be inconsistent. Waseda University @ LCPC 2017, Texas A&M University time Solving False Sharing (1): Data Alignment Data alignment solve the false sharing problem. Different data accessed by different PE should not share a single cache line. Variable Declaration PE0 int a __attribute__((aligned(16))) = 0; int b __attribute__((aligned(16))) = 0; /* a, b are assigned to the first element of a = 10; each cache line */ a and b are assigned to the first PE0 Cache elements of different cache lines line1 This approach works for scalar variables

and small-sized onedimensional array. line2 Mem Updates by PE0 and PE1 are separately performed by write back Waseda University @ LCPC 2017, Texas A&M University PE1 b = 20; PE0 Cache Two-dimensional Array Problem Splitting an array cleanly along the cache line is not always possible. Lowest dimension is not integer multiply of cache line cache line (0,0) (1,0) PE0 write (2,0) (3,0) PE0 & PE1 share cache line (4,0) int a[6][6]; PE0 PE1 for (i = 0; i < 3; i++) { for (j = 0; j < 6; j++) {

a[i][j] = i * j; } } for (i = 3; i < 6; i++) { for (j = 0; j < 6; j++) { a[i][j] = i * j; } } (5,0) PE1 write Waseda University @ LCPC 2017, Texas A&M University Solving False Sharing (2): Array Padding Memory The compiler inserts a padding (dummy data) to the end of the array to match the cache line size a (1,0) (1,1) (1,2) (1,3) cache line (1,4) (1,5) (1,6) (1,7) (2,0) PE0 write (3,0) (4,0) int a[6][8] __attribute__((aligned(16))); /* a is aligned to the first element of the cache line */

PE0 PE1 for (i = 0; i < 3; i++) { for (j = 0; j < 6; j++) { a[i][j] = i * j; } } for (i = 3; i < 6; i++) { for (j = 0; j < 6; j++) { a[i][j] = i * j; } } (5,0) PE1 write (6,0) Waseda University @ LCPC 2017, Texas A&M University a Solving False Sharing (3): Non-cacheable Buffer Data transfer using non-cacheable buffer: The idea is to put a small area in the main memory that should not be copied to the cache along the border between area modified by different processor core.

int a[6][6] __attribute__((aligned(16))); /* a is assigned to the first element of the cache line */ int nc_buf[1][6] __attribute((section(UNCACHE))); /* nc_buf is prepared in non-cacheable area */ cache line (1,0) (2,0) (3,0) PE0 write (4,0) (5,0) PE1 write PE1 PE0 for (i = 1; i < 4; i++) { for (j = 0; j < 6; j++) { a[i][j] = i * j; b[i-1][j] = i * j; } } for (i = 4; i < 5; i++) { for (j = 0; j < 6; j++) { a[i][j] = i * j; nc_buf[i-4][j] = i * j; } } for (i = 4; i < 5; i++) { for (j = 0; j < 6; j++) { b[i-1][j] = nc_buf[i-4][j]; } }

for (i = 5; i < 6; i++) { for (j = 0; j < 6; j++) { a[i][j] = I * j; b[i-1][j] = i * j; } } Waseda University @ LCPC 2017, Texas A&M University b cache line (0,0) (1,0) PE0 write (3,0) Data transfer using buffer (5,0) PE1 write (2,0) (4,0) Performance Evaluation 10 Benchmark Application in C In Parallelizable C format (Similar to Misra C in embedded field) Fed into source to source automatic parallelization OSCAR Compiler with noncache coherence control Parallelized code then fed into Renesas C Compiler Evaluated in RP2 Processor 8 core Dual 4 core modules Hardware cache coherency for up to 4 cores in the same module

Waseda University @ LCPC 2017, Texas A&M University Performance of Hardware Coherence Control & Software Coherence Control on 8-core RP2 6.00 5.66 SMP(Hardware Coherence) Speedup 5.00 4.76 4.63 4.37 4.00 3.00 2.95 2.65 2.63 2.52 2.00 1.00 NCC(Software Coherence) 3.71 3.34 3.65 1.45 1.38

1.07 1 3.28 2.9 3.19 2.99 3.67 3.49 3.32 2.87 2.86 2.86 2.7 2.36 2.36 1.76 1.67 11.1 1.9 1.76 1.06 1 1.81 1.79 1.01 1 4.92 2.01 1.84

1.07 1 1.95 1.87 1.32 1.32 1.03 1 1.05 1 0.00 Application/the number of processor core Waseda University @ LCPC 2017, Texas A&M University 1.79 1.77 1.05 1 3.17 3.02 2.19 2.19 1.89 1.7 1.67 1.55 1.4 1.07 1.02 1 1 Speedup SPEC Result 5.00

4.50 4.00 3.50 3.00 2.50 2.00 1.50 1.00 0.50 0.00 SMP(Hardware Coherence) NCC(Software Coherence) 4.76 4.63 4.37 3.65 2.63 2.52 11.07 1.45 1.38 3.28 2.9 2.95 2.65 1.761.9 1.76 1.67 1 1.1 3.19 2.99 11.06

Application/the number of processor core Waseda University @ LCPC 2017, Texas A&M University 1.81 1.79 11.01 NAS Parallel Benchmark and MediaBench II 6.00 5.66 SMP(Hardware Coherence) NCC(Software Coherence) Speedup 5.00 3.71 3.34 4.00 1.00 3.49 3.32 2.87 2.86 3.00 2.00 3.67 2.01 1.84

11.07 2.36 2.36 1.32 1.32 11.03 1.95 1.87 11.05 2.86 2.70 1.79 1.77 11.05 0.00 Application/the number of processor core Waseda University @ LCPC 2017, Texas A&M University 4.92 3.17 3.02 2.19 2.19 1.89 1.7 1.67 1.55 1.40 11.07 11.02 Software Coherence Impact on Performance SMP: Hardware cache coherence (baseline). NCC(soft): Both stale data handling and false sharing

turned on. Hardware CC off. NCC(hard): All stale data handling , false sharing turned on and also Hardware CC on. Stale Data Handling only. Hardware CC off. False Sharing Avoidance only. Hardware CC off. Waseda University @ LCPC 2017, Texas A&M University Conclusions OSCAR compiler automatically controls coherence using following simple program restructuring methods: To cope with stale data problems: Data synchronization by compilers To cope with false sharing problem: Data Alignment Array Padding Non-cacheable Buffer Evaluated using 10 benchmark applications Renesas RP2 8 core multicore processor. SPEC2000 SPEC2006 NAS Parallel Benchmark MediaBench II Waseda University @ LCPC 2017, Texas A&M University Conclusion (contd) Provides good speedup automatically, few examples with 4 cores:

2.63 times SPEC2000 equake (2.52 with hardware) 3.28 times SPEC2009 lbm (2.9 with hardware) 3.71 times on NPB cg (3.34 with hardware) 3.02 times on MediaBench MPEG2 Encoder (3.17 with hardware) Enable usage of 8 cores automatically on RP2 with good speedup: 4.37 times SPEC2000 equake 4.76 times SPEC2009 lbm 5.66 times on NPB cg 4.92 times on MediaBench MPEG2 Encoder Waseda University @ LCPC 2017, Texas A&M University Thank you Waseda University @ LCPC 2017, Texas A&M University

Recently Viewed Presentations

  • Mitosis - MS. MCCABE&#x27;S CLASSES - Home

    Mitosis - MS. MCCABE'S CLASSES - Home

    UNIT 3 QUIZ. Write your name, date, and period on the quiz. Eyes on your own paper, zero/F for even attempting to cheat. No talking. When you finish, turn in your quiz at the blue basket on top of the...
  • Lecture Presentation to accompany Investment Analysis ...

    Lecture Presentation to accompany Investment Analysis ...

    The rest of the portfolio would be managed actively in one or several additional "plus" sectors, where it is felt that there is a higher probability of achieving positive abnormal rates of return because of potential inefficiencies Immunization Strategies A...
  • Marketing Strategy: Implementing Marketing Principles and Data Analytics

    Marketing Strategy: Implementing Marketing Principles and Data Analytics

    STP is often combined with a customer-centric approach or strategy, in which the firm recognizes the long-term value of its core customer segment and puts it at the core of all major internal business processes and decisions. ... Finally, sustainable...
  • WRITING Implementation 2017 Oregon Official Scoring Guide ...

    WRITING Implementation 2017 Oregon Official Scoring Guide ...

    Sentence Fluency. At the high school level, Sentence Fluency is not explicitly addressed in the Common Core. In the Smarter Balanced rubric, reference is made to "syntactic control is strong." This phrase is used . at . the "6" level.
  • Quarter 3- Week 2: The Empire at Its Peak; Persecution

    Quarter 3- Week 2: The Empire at Its Peak; Persecution

    Quarter 3- Week 2: The Empire at Its Peak; Persecution. Universal Theme Five: The Liberation of God's People (50 BC - 300 AD).Like last week, the events covered in the readings this week look like anything but "the liberation of...
  • OSHA Bloodborne Pathogen and Tuberculosis Training

    OSHA Bloodborne Pathogen and Tuberculosis Training

    ECU Infection Control Tuberculosis Airborne Pathogen Old Enemy New Battle TB Trends by Case Rate Per 100,000 Population TB in NC and Pitt Co 2007 2007=335 cases reported in NC, ranking NC 22nd in the nation 2004= 7 cases 2006=...
  • Arachnoid cysts - wikifoundryattachments.com

    Arachnoid cysts - wikifoundryattachments.com

    Arachnoid cysts can be relatively asymptomatic. Most of time . incidentally. diagnosed. We have demonstrated the rare complication of rupture of a middlecranial fossa arachnoid cyst into the . subdural space without. hemorrhage. Rupture arachnoid cyst is very very rare....
  • An introduction to IT4IT Charles Betz Charlie Betz

    An introduction to IT4IT Charles Betz Charlie Betz

    TOGAF, ArchiMate, and IT4IT at The Open Group. Eclipse Process Framework (EPF) at the Eclipse Foundation. Business partners with Sparx, HP, and IBM. The Open Group. Enables all organizations that use information technology to do things better, faster and cheaper.