Tensor Core


Mindos Cheng

A brief study of NVIDIA Tensor Cores.

Tensor Core
"SIMD" for GPU
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

Tensor Cores
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/

Tensor Cores
https://www.nvidia.com/en-us/data-center/tensorcore/

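12X (NVIDIA's headline figure for peak Tensor Core throughput compared with the previous GPU generation; see the page below)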
https://www.nvidia.com/en-us/data-center/tensorcore/

Supported Types
• Input: FP16, u8, s8, u4, s4, b1

• Accumulator: FP16, FP32, int

• Also in experimental:
namespace experimental {
    namespace precision {
        struct u4; // 4-bit unsigned
        struct s4; // 4-bit signed
        struct b1; // 1-bit
    }
    enum bmmaBitOp { bmmaBitOpXOR = 1 };
    enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 };
}
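
As an illustration of the integer input types listed on this slide, here is a minimal sketch of one warp multiplying two 16x16 s8 tiles into an int accumulator. The kernel name, the driver `run_int8_tile_demo`, and the test data are hypothetical. It assumes a GPU with compute capability 7.2 or newer and a build such as `nvcc -arch=sm_75`.

```cuda
#include <mma.h>
#include <cuda_runtime.h>
#include <vector>
using namespace nvcuda;

// One warp computes a 16x16 int tile from two 16x16 s8 tiles.
__global__ void int8_tile_mma(const signed char* a, const signed char* b, int* d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> acc_frag;

    wmma::fill_fragment(acc_frag, 0);
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension of A is 16
    wmma::load_matrix_sync(b_frag, b, 16);   // leading dimension of B is 16
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Hypothetical host driver: A is all 1s and B is all 2s,
// so every output element is a dot product of 16 terms of 1 * 2.
int run_int8_tile_demo() {
    std::vector<signed char> ha(256, 1), hb(256, 2);
    std::vector<int> hd(256, -1);
    signed char *da, *db;
    int* dd;
    cudaMalloc(&da, 256);
    cudaMalloc(&db, 256);
    cudaMalloc(&dd, 256 * sizeof(int));
    cudaMemcpy(da, ha.data(), 256, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), 256, cudaMemcpyHostToDevice);
    int8_tile_mma<<<1, 32>>>(da, db, dd);    // exactly one warp
    cudaMemcpy(hd.data(), dd, 256 * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dd);
    return hd[0];
}
```

The sub-byte types (u4, s4, b1) use the same fragment pattern but live in `wmma::experimental::precision` and come with their own tile shapes, so they are left out of this sketch.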

Mixed Precision
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
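
Why accumulate in FP32 when the inputs are FP16? A minimal sketch, not taken from the post: it adds 1.0 repeatedly into an FP16 accumulator and into an FP32 accumulator. The kernel and the driver `run_accumulate_demo` are hypothetical names. The FP16 accumulator is emulated by rounding to FP16 after every add, which keeps the code portable across architectures.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Adds 1.0 n times into an FP16 accumulator and into an FP32 accumulator.
__global__ void accumulate_ones(int n, float* out) {
    __half acc16 = __float2half(0.0f);
    float acc32 = 0.0f;
    for (int i = 0; i < n; ++i) {
        // Round to FP16 after every add, emulating an FP16 accumulator.
        acc16 = __float2half(__half2float(acc16) + 1.0f);
        acc32 += 1.0f;
    }
    out[0] = __half2float(acc16);
    out[1] = acc32;
}

// Hypothetical host driver: returns the FP16 sum in .x and the FP32 sum in .y.
float2 run_accumulate_demo(int n) {
    float* d_out;
    float h_out[2] = {0.0f, 0.0f};
    cudaMalloc(&d_out, sizeof(h_out));
    accumulate_ones<<<1, 1>>>(n, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return make_float2(h_out[0], h_out[1]);
}
```

With n = 4096 the FP16 sum stalls at 2048, because FP16 values are spaced 2 apart at that magnitude and 2048 + 1 rounds back to 2048, while the FP32 sum reaches 4096.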

Programming

CUDA Library
cuBLAS, cuDNN
also in TensorRT 3
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
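
On the library route, a hedged sketch of a cuBLAS call that opts in to Tensor Core math, following the pattern in the linked post. The wrapper `gemm_tensor_op`, the helper `fill_half`, and the driver `run_cublas_demo` are hypothetical names. The version guard is an assumption to cover both the CUDA 9 era signature and the `cublasComputeType_t` form of newer toolkits. Link with `-lcublas`.

```cuda
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

__global__ void fill_half(__half* p, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = __float2half(v);
}

// C (m x n, FP32) = alpha * A (m x k, FP16) * B (k x n, FP16) + beta * C, column-major.
cublasStatus_t gemm_tensor_op(cublasHandle_t handle, int m, int n, int k,
                              const __half* A, const __half* B, float* C,
                              float alpha, float beta) {
    // Allow cuBLAS to use Tensor Cores (deprecated but still accepted in newer releases).
    cublasStatus_t st = cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);
    if (st != CUBLAS_STATUS_SUCCESS) return st;
#if defined(CUBLAS_VER_MAJOR) && CUBLAS_VER_MAJOR >= 11
    const cublasComputeType_t compute = CUBLAS_COMPUTE_32F;
#else
    const cudaDataType compute = CUDA_R_32F;
#endif
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha, A, CUDA_R_16F, m,
                        B, CUDA_R_16F, k,
                        &beta, C, CUDA_R_32F, m,
                        compute, CUBLAS_GEMM_DFALT_TENSOR_OP);
}

// Hypothetical driver: 16x16x16 product of all-ones matrices; returns C[0] or -1 on failure.
float run_cublas_demo() {
    const int n = 16;
    __half *a, *b;
    float* c;
    cudaMalloc(&a, n * n * sizeof(__half));
    cudaMalloc(&b, n * n * sizeof(__half));
    cudaMalloc(&c, n * n * sizeof(float));
    fill_half<<<1, 256>>>(a, n * n, 1.0f);
    fill_half<<<1, 256>>>(b, n * n, 1.0f);
    cudaMemset(c, 0, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasStatus_t st = gemm_tensor_op(handle, n, n, n, a, b, c, 1.0f, 0.0f);
    float c0 = -1.0f;
    if (st == CUBLAS_STATUS_SUCCESS)
        cudaMemcpy(&c0, c, sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(a);
    cudaFree(b);
    cudaFree(c);
    return c0;
}
```

Whether Tensor Cores are actually used is up to cuBLAS; the math mode and algorithm enum only permit it.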

CUDA WMMA API
https://en.wikipedia.org/wiki/Joanna_J%C4%99drzejczyk

CPU Level
simpleTensorCoreGEMM.cu
https://github.com/parallel-forall/code-samples/blob/master/posts/tensor-cores/simpleTensorCoreGEMM.cu
Host code launches the kernel function, which then runs in each warp
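
A sketch of the host-side launch arithmetic, modelled on simpleTensorCoreGEMM.cu: choose the grid so that every 16x16 output tile gets one warp. The helper `wmma_grid` and the constant names are hypothetical; the sample computes the same values inline.

```cuda
#include <cuda_runtime.h>

constexpr int WMMA_M = 16;     // tile rows handled by one warp
constexpr int WMMA_N = 16;     // tile columns handled by one warp
constexpr int WARP_SIZE = 32;

// Grid size so that each 16x16 tile of an M x N output gets one warp.
// block.x must be a multiple of 32: a block holds block.x / 32 warps along x
// and block.y warps along y.
dim3 wmma_grid(int M, int N, dim3 block) {
    const int rows_per_block = WMMA_M * static_cast<int>(block.x) / WARP_SIZE;
    const int cols_per_block = WMMA_N * static_cast<int>(block.y);
    dim3 grid;
    grid.x = (M + rows_per_block - 1) / rows_per_block;   // round up
    grid.y = (N + cols_per_block - 1) / cols_per_block;   // round up
    return grid;
}
```

With `dim3 block(128, 4)` a block holds 16 warps and covers a 64x64 patch of the output, so the launch would look like `kernel<<<wmma_grid(M, N, block), block>>>(...)`.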

Warp-Level
http://on-demand.gputechconf.com/gtc/2017/presentation/s7132-mark-harris-new-cuda-features-and-beyond.pdf
(In short)

Warp-Level: Initialization Values
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
simpleTensorCoreGEMM.cu
Kernel function in warp
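
The initialization values are mostly index arithmetic: each warp works out which output tile it owns. A minimal sketch, using the same warpM / warpN formulas as the post. The recording kernel and the driver `run_tile_coords_demo` are hypothetical and exist only to make the values visible.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread records the tile coordinates (warpM, warpN) of its warp.
__global__ void record_tile_coords(int* warp_m, int* warp_n) {
    // Along x, 32 consecutive threads form one warp; along y, each row is its own warp row.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    // Flat thread id, used here only to store the result.
    int block_id = blockIdx.y * gridDim.x + blockIdx.x;
    int tid = block_id * (blockDim.x * blockDim.y) + threadIdx.y * blockDim.x + threadIdx.x;
    warp_m[tid] = warpM;
    warp_n[tid] = warpN;
}

// Hypothetical driver: one 128x4 block, returns the coordinates seen by the last thread.
int2 run_tile_coords_demo() {
    const int threads = 128 * 4;
    int *d_m, *d_n;
    cudaMalloc(&d_m, threads * sizeof(int));
    cudaMalloc(&d_n, threads * sizeof(int));
    record_tile_coords<<<dim3(1, 1), dim3(128, 4)>>>(d_m, d_n);
    std::vector<int> h_m(threads), h_n(threads);
    cudaMemcpy(h_m.data(), d_m, threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_n.data(), d_n, threads * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_m);
    cudaFree(d_n);
    return make_int2(h_m[threads - 1], h_n[threads - 1]);
}
```

A 128x4 block therefore holds a 4x4 arrangement of warps, with warpM and warpN each running from 0 to 3.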

Warp-Level: Fragments on Registers
Fragment type
Clear the accumulator (Acc)
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
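
A sketch of the fragment declarations and the accumulator clear, using the WMMA types from the post (16x16x16 tiles, FP16 inputs, FP32 accumulator). The kernel `clear_accumulator` and the driver are hypothetical; the kernel stores the cleared tile only so the effect can be checked. Build with `nvcc -arch=sm_70` or newer.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>
using namespace nvcuda;

const int WMMA_M = 16, WMMA_N = 16, WMMA_K = 16;

__global__ void clear_accumulator(float* out) {
    // A fragment is a tile spread across the registers of the 32 threads of a warp.
    // The template fixes its role, tile shape, element type and (for inputs) layout.
    wmma::fragment<wmma::matrix_a, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, WMMA_M, WMMA_N, WMMA_K, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, WMMA_M, WMMA_N, WMMA_K, float> acc_frag;
    (void)a_frag;   // declared only to show the input fragment types
    (void)b_frag;

    // Clear Acc: every element of the accumulator tile becomes 0.
    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::store_matrix_sync(out, acc_frag, WMMA_M, wmma::mem_col_major);
}

// Hypothetical driver: pre-fills the output with 1s and returns the sum after the kernel.
float run_clear_demo() {
    const int n = WMMA_M * WMMA_N;
    std::vector<float> h(n, 1.0f);
    float* d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    clear_accumulator<<<1, 32>>>(d);   // exactly one warp
    cudaMemcpy(h.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    float sum = 0.0f;
    for (float v : h) sum += v;
    return sum;
}
```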

Warp-Level: Tile Calculation (compute one tile of the output matrix per warp)
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
D = A × B + C, where A is m × k, B is k × n, and C and D are m × n
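
The tile calculation can be sketched as a loop over K in steps of 16, with one `mma_sync` per step. This is a simplified variant of the kernel in the post: it launches one warp per block, so the tile coordinates are just the block indices, and it assumes M, N and K are multiples of 16 with column-major storage. The names `wmma_tiles`, `fill_half` and `run_tile_demo` are hypothetical. Build with `nvcc -arch=sm_70` or newer.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <vector>
using namespace nvcuda;

__global__ void fill_half(half* p, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = __float2half(v);
}

// One warp per block; block (bx, by) owns output tile (bx, by). Computes D = A x B.
__global__ void wmma_tiles(const half* a, const half* b, float* d, int M, int N, int K) {
    const int warpM = blockIdx.x, warpN = blockIdx.y;
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::col_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        const int aRow = warpM * 16, aCol = k;   // tile of A: rows of this warp, columns k..k+15
        const int bRow = k, bCol = warpN * 16;   // tile of B: rows k..k+15, columns of this warp
        wmma::load_matrix_sync(a_frag, a + aRow + aCol * M, M);   // lda = M
        wmma::load_matrix_sync(b_frag, b + bRow + bCol * K, K);   // ldb = K
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);       // acc += A_tile x B_tile
    }
    wmma::store_matrix_sync(d + warpM * 16 + warpN * 16 * M, acc_frag, M, wmma::mem_col_major);
}

// Hypothetical driver: all-ones inputs, so every output element should equal K.
bool run_tile_demo() {
    const int M = 32, N = 32, K = 48;
    half *a, *b;
    float* d;
    cudaMalloc(&a, M * K * sizeof(half));
    cudaMalloc(&b, K * N * sizeof(half));
    cudaMalloc(&d, M * N * sizeof(float));
    fill_half<<<(M * K + 255) / 256, 256>>>(a, M * K, 1.0f);
    fill_half<<<(K * N + 255) / 256, 256>>>(b, K * N, 1.0f);
    wmma_tiles<<<dim3(M / 16, N / 16), 32>>>(a, b, d, M, N, K);
    std::vector<float> h(M * N, -1.0f);
    cudaMemcpy(h.data(), d, M * N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    cudaFree(d);
    for (float v : h) if (v != static_cast<float>(K)) return false;
    return true;
}
```

The kernel in the post packs several warps into each block and adds bounds checks; the loop body is the same.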

Warp-Level: Finishing
Optional Scaling
C = alpha * Acc + beta * C
Store to Memory
https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/
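
A sketch of the finishing step on a single 16x16 tile: load the existing C tile, apply C = alpha * Acc + beta * C element by element, and store the result. To keep the example self-contained, the accumulator is loaded from memory (`acc`) instead of coming from an `mma_sync` loop; that stand-in, the kernel name and the driver are assumptions. Build with `nvcc -arch=sm_70` or newer.

```cuda
#include <mma.h>
#include <cuda_runtime.h>
#include <vector>
using namespace nvcuda;

// One warp: C = alpha * Acc + beta * C on one 16x16 column-major tile.
__global__ void finish_tile(const float* acc, float* c, float alpha, float beta) {
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Stand-in for the accumulator that the tile calculation would have produced.
    wmma::load_matrix_sync(acc_frag, acc, 16, wmma::mem_col_major);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_col_major);

    // Both fragments have the same type, so element i refers to the same tile
    // position in each; every thread scales the elements it holds.
    for (int i = 0; i < c_frag.num_elements; i++) {
        c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
    }

    // Store to memory.
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_col_major);
}

// Hypothetical driver: Acc = 3 and C = 5 everywhere, alpha = 2, beta = 0.5.
bool run_finish_demo() {
    const int n = 16 * 16;
    std::vector<float> h_acc(n, 3.0f), h_c(n, 5.0f);
    float *d_acc, *d_c;
    cudaMalloc(&d_acc, n * sizeof(float));
    cudaMalloc(&d_c, n * sizeof(float));
    cudaMemcpy(d_acc, h_acc.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, h_c.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    finish_tile<<<1, 32>>>(d_acc, d_c, 2.0f, 0.5f);
    cudaMemcpy(h_c.data(), d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_acc);
    cudaFree(d_c);
    for (float v : h_c) if (v != 8.5f) return false;   // 2 * 3 + 0.5 * 5
    return true;
}
```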

Availability
• V100, Titan V

• RTX 2070, RTX 2080, RTX 2080 Ti, etc.
