13. • Without global data communication
• Better precision of summation (same as B2)
• Systolic output path (or use the next row in 2D)
• Nodes are activated half of the time.
27. Priority Queue Operations
• insert()
• delete()
• extract_min()
For n operations: O(n log n) vs. O(n)
Key: one operation can be issued after another in O(1) time.
28. Priority Queue Operations
• insert(k): sink down the element with key k.
• delete(k):
  A) Sink down a fake element with key k to find the target.
  B) Remove the target.
  C) Bubble up the elements below it.
• extract_min():
  A) Take the first element.
  B) Bubble up the elements below it.
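These operations map naturally onto a linear array of cells that keeps itself sorted. Below is a minimal sequential sketch of that behavior in plain Python (class and variable names are placeholders, not from the slides); the actual systolic queue does the sinking and bubbling in parallel across neighboring cells, which is what lets each operation be issued in O(1) time.

# Sequential sketch of the sink-down / bubble-up behavior described above.
# A sorted Python list stands in for the linear array of cells, so each call
# here costs O(n) work, whereas the systolic version overlaps that work
# across cells.
import bisect

class LinearPriorityQueue:
    def __init__(self):
        self.cells = []                      # kept sorted, smallest key first

    def insert(self, k):
        # "Sink down the element with key k": the new key moves along the
        # array until it meets a larger neighbor.
        bisect.insort(self.cells, k)

    def delete(self, k):
        # A) a fake element with key k sinks to the target's position,
        # B) the target is removed,
        # C) the elements below bubble up to close the gap (list shift here).
        i = bisect.bisect_left(self.cells, k)
        if i < len(self.cells) and self.cells[i] == k:
            self.cells.pop(i)

    def extract_min(self):
        # A) take the first element; B) the rest bubble up by one cell.
        return self.cells.pop(0) if self.cells else None

q = LinearPriorityQueue()
for key in (5, 1, 7, 3):
    q.insert(key)
q.delete(7)
print(q.extract_min(), q.extract_min())      # -> 1 3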
35. Cloud TPU
Google Cloud Platform Blog
https://cloud.google.com/tpu/
TPU V2, TPU V3, TPU V2 Pod
36. TPU Programming
• A Cloud TPU has 4 chips x 2 cores x 1 or 2 MXUs
• MXU
  • 128x128 systolic array
  • 16K MACs / cycle
  • bfloat16
• TPU memory prefers 8-byte alignment.
• 8 or 16 GB HBM2 / core
https://cloud.google.com/tpu/docs/tpus
https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/
For comparison, a Titan X has 3.5K CUDA cores.
So each TPU V3 card has 4 chips x 2 cores x 2 MXUs x 16K MACs / cycle
= 256K MACs / cycle at most.
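The 256K figure is just the product of the numbers quoted above; a quick sanity check of the arithmetic:

# Peak-throughput arithmetic for a TPU V3 card, from the figures above.
chips, cores_per_chip, mxus_per_core = 4, 2, 2
macs_per_mxu = 128 * 128                     # one 128x128 systolic array -> 16,384, i.e. ~16K MACs/cycle
peak_macs_per_cycle = chips * cores_per_chip * mxus_per_core * macs_per_mxu
print(peak_macs_per_cycle)                   # 262,144, i.e. ~256K MACs per cycle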
38. TPU Programming
• XLA compiler for TensorFlow programs.
• Tiling => needs reshapes
• Shape => no dynamic batch size (see the sketch below)
• Padding => under-utilizes the TPU, uses more memory
• op_profile tool
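For the shape constraint, the usual fix is to make the batch dimension static, for example by dropping the final partial batch in the input pipeline. A minimal sketch (the batch size is an arbitrary example, not from the slides):

# XLA compiles one program per shape, so the batch dimension should be fixed.
# Dropping the last partial batch keeps every batch at exactly batch_size.
import tensorflow as tf

batch_size = 1024                                           # example value
dataset = tf.data.Dataset.range(10_000)
dataset = dataset.batch(batch_size, drop_remainder=True)    # static batch dimension
# With drop_remainder=False the final batch could be smaller, producing a
# dynamic shape that forces recompilation (or fails on TPU).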
39. TPU Programming
• Dense vector and matrix computations are fast
  • M x M, M x v, convolution
• Data movement over PCIe is slow.
• Only the dense parts of the model, plus the loss and gradient subgraphs, run on the TPU.
• I/O, reading data, writing checkpoints, and preprocessing data stay on the CPU.
  • decoding compressed images, randomly sampling/cropping, assembling training minibatches
• Non-matrix operations will likely not achieve high MXU utilization.
  • add, reshape, or concatenate
• Feature dimension => multiple of 128
• Batch dimension => multiple of 8 (padding sketch below)
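A small illustration of the dimension-rounding guideline above, with hypothetical sizes: pad the batch dimension up to a multiple of 8 and the feature dimension up to a multiple of 128 so the MXU tiles are fully used instead of being implicitly zero-padded by the compiler.

# Pad a (batch, features) array up to TPU-friendly multiples (example sizes).
import numpy as np

def round_up(n, multiple):
    return ((n + multiple - 1) // multiple) * multiple

x = np.random.rand(30, 200).astype(np.float32)       # batch=30, features=200
padded_batch = round_up(x.shape[0], 8)                # 32
padded_feat = round_up(x.shape[1], 128)               # 256
x_padded = np.zeros((padded_batch, padded_feat), dtype=x.dtype)
x_padded[:x.shape[0], :x.shape[1]] = x
print(x_padded.shape)                                 # (32, 256)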
40. TPUEstimator
• TPUEstimator provides a graph operator to build and run a replicated computation.
https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimator
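The linked TF 1.x API works roughly as follows: you write a model_fn that returns a TPUEstimatorSpec, and TPUEstimator replicates it across the cores. A hedged sketch; all names, sizes, and paths below are placeholders, not from the slides.

# Minimal TPUEstimator flow (TF 1.x, tf.contrib.tpu), placeholder values throughout.
import tensorflow as tf

def model_fn(features, labels, mode, params):
    logits = tf.layers.dense(features, 10)                   # toy model
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    # CrossShardOptimizer averages gradients across the replicated cores.
    optimizer = tf.contrib.tpu.CrossShardOptimizer(optimizer)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.contrib.tpu.TPUEstimatorSpec(mode=mode, loss=loss, train_op=train_op)

resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='my-tpu')   # hypothetical TPU name
run_config = tf.contrib.tpu.RunConfig(
    cluster=resolver,
    model_dir='gs://my-bucket/model',                                     # hypothetical path
    tpu_config=tf.contrib.tpu.TPUConfig(iterations_per_loop=100))

estimator = tf.contrib.tpu.TPUEstimator(
    model_fn=model_fn,
    config=run_config,
    use_tpu=True,
    train_batch_size=1024)        # global batch size, split across the cores
# estimator.train(input_fn=my_input_fn, max_steps=1000)      # my_input_fn is user-supplied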