The Symbiotic Relationship Between
Game Technology and
Supercomputing

Taeyong Kim, NVIDIA
taeyongk@nvidia.com
About the speaker

• Studied computer graphics
• Developed special-effects technology used in Hollywood films
• Currently at NVIDIA, researching GPU-based parallel algorithms for real-time game effects
Introduction

Computers are no longer getting faster
(they are only getting wider)
A brief history of parallel computing

1985: The first distributed-memory parallel system (Intel iPSC/1, 32 CPUs) is delivered to Oak Ridge National Laboratory (ORNL)

2012: The Titan system comes online at ORNL, built from 18,688 AMD Opteron / NVIDIA Tesla nodes

[Cartoon: the 1980s vs. today — image source: The Far Side, Gary Larson]
The new wall in computing

• Single-core CPU performance, driven by clock speed, has hit its limits
• Heat
• Energy

"The party isn't over yet. But the police have shown up, and the music has already stopped."
– Peter Kogge
The problem is power!

Jaguar (Nov. '11): 2.3 petaflops @ 7 megawatts
(224,256 x86 CPU cores)

7 megawatts = 7,000 homes — the total power consumption of a small town!

Scaling to 120 petaflops would take 376 megawatts — the total power consumption of San Francisco!
A hero for troubled times — ta-da!

(Unreal Engine 4 © Epic Games)
The GPU (Graphics Processing Unit)

The ever-growing game industry is the investor behind supercomputing technology:

• The game market is projected to reach $82 billion (about 87 trillion KRW) by 2016
• Today's real-time graphics require billions of parallel operations

수백만개의 삼각형들

수백만개의 픽셀들

Image plane

입력된 삼각형

버텍스 변환

표면 분할
(Tessellation
)

Camera

카메라 변환

Rasterize

칼라생성
(Shading)
Game technology is no longer just for games?

Scientific computing on supercomputers requires quadrillions of parallel operations per second.
Example: why was today's weather forecast off?
GPU = graphics by day, but a supercomputer inside

CPU = a few processors specialized for complex operations
• Designed to deliver maximum single-threaded performance on general-purpose work

GPU = a large number of processors optimized for simple operations
• Slower at general-purpose work, but designed for total throughput
• Optimized for performance per watt
The evolution of the GPU

• 1995: RIVA 128, 3M transistors
• 2000: GeForce 256, 23M transistors — fixed-function processing
• 2001: GeForce 3, 60M transistors — programmable shaders
• 2003: GeForce FX, 250M transistors
• 2006: GeForce 8800, 681M transistors — general-purpose programmability (CUDA)
• 2012: "Kepler", 7B transistors
Moore's Law

• Transistor density doubles every 18 months (or every two years)
• For CPUs, this delivered decades of single-core performance gains
• Then came "The Power Wall"
Moore's Law (for GPUs)

• Transistor density doubles every 18 months (or every two years)
• The added transistors translate into more cores
• Hundreds to thousands of 'cores' per chip
The "new" Moore's Law

• Individual cores will not get any faster; there will only be more of them
• Performance gains now require developing new parallel algorithms
• Data parallelism suited to the new many-core environment is essential
Common ground between game techniques and supercomputing algorithms
1. PARTICLE SIMULATION

Particle simulation — in games
Example: hair simulation
(NVIDIA Hair Demo)
Particle simulation — in games
Cloth simulation, with vertical, horizontal, and shear constraints
(Samaritan demo © Epic Games / NVIDIA APEX Clothing)
Particle simulation — in molecular biology

Ribosome simulation (simulated by NAMD, visualized by VMD): atoms, bonds, and the forces between them
Used in new drug discovery and to validate existing biological theory
Common ground #1 between game and supercomputing workloads

• Huge numbers of independent operations (millions to hundreds of millions)
• Parallelizing the logic for maximum performance
• Dependencies that break that independence:
  • e.g., in particle simulation, particles depend on one another
  • multiple operations touching the same memory address at the same time
• Responses:
  • new algorithms suited to parallel processing
  • removing dependencies via graph coloring and applying multiple passes
Techniques for avoiding data collisions (see the sketch below)

• Prevent writes from landing on the same node at once
• Atomic operations -> serialized execution -> slow
• Build passes via graph coloring:
  • within each pass, everything is fully independent -> maximum parallelism
  • the key is minimizing the number of passes
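A minimal sketch of the two approaches, assuming a cloth-like system where several constraints pull on shared particles (all names are hypothetical). The atomic version serializes colliding writes; the colored version runs one launch per color, so no two constraints in a pass share a particle.

```cuda
// Hypothetical constraint data: each constraint accumulates a correction
// into the positions of the two particles it connects.
struct Constraint { int a, b; };

// Approach 1: atomics. Correct, but colliding writes are serialized.
__global__ void applyConstraintsAtomic(const Constraint* cons, int n,
                                       float* posX /* one axis, for brevity */,
                                       const float* corr)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    atomicAdd(&posX[cons[i].a],  corr[i]);
    atomicAdd(&posX[cons[i].b], -corr[i]);
}

// Approach 2: graph coloring. Constraints are pre-sorted by color so that
// no two constraints of the same color share a particle; each color is one
// launch over the range [first, first + count), and plain writes are safe.
__global__ void applyConstraintsColored(const Constraint* cons,
                                        int first, int count,
                                        float* posX, const float* corr)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    int c = first + i;                 // index within this color's range
    posX[cons[c].a] += corr[c];        // no atomics needed: no conflicts
    posX[cons[c].b] -= corr[c];
}
```

Fewer colors means fewer launches, which is why minimizing the number of passes is the key.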
2. CONVOLUTION (signal processing, image processing, etc.)

[Figure: a 2D convolution example — a grid of input pixel values, a convolution kernel, and the resulting new (output) pixel values]

The output at each pixel position is the weighted sum of all surrounding input pixel values, with the weights determined by the kernel.
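A minimal sketch — not from the slides — of exactly the operation just described: each output pixel is the weighted sum of its neighborhood, with one thread per output pixel. All names and sizes are illustrative.

```cuda
// Naive 2D convolution: one thread per output pixel.
// K is the kernel radius (K = 1 gives a 3x3 kernel).
#define K 1

__global__ void convolve2D(const float* in, float* out,
                           const float* kernel,   // (2K+1) x (2K+1) weights
                           int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int dy = -K; dy <= K; ++dy) {
        for (int dx = -K; dx <= K; ++dx) {
            // Clamp to the image border so edge pixels stay defined.
            int sx = min(max(x + dx, 0), width  - 1);
            int sy = min(max(y + dy, 0), height - 1);
            sum += in[sy * width + sx] * kernel[(dy + K) * (2 * K + 1) + (dx + K)];
        }
    }
    out[y * width + x] = sum;   // weighted sum of the neighborhood
}
```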
Convolution — in games

Depth of field (camera focus effect): objects at the focal distance stay in focus while the rest is blurred
(Halo 3 © Bungie Studios)
Convolution — in games
Film bloom (soft-glow effect): excess light scattering around bright light sources
(Crysis © Crytek GmbH)
Convolution — in games

Subsurface scattering (rendering translucent skin): unlike an opaque object, a translucent object scatters light internally. Implemented as a convolution in texture space over the direct lighting.
(Jimenez 2008; used in Samaritan, Crysis 2, and other games)
Convolution — in supercomputing

Reverse Time Migration, used in seismic exploration (e.g., locating oil wells). Convolution is used when computing spatial derivatives to raise the accuracy of the wave simulation.
(Petroleum Geo Services; complex wave interaction near a salt tooth, propagated using AxRTM)
Common ground #2 between game and supercomputing workloads

• Huge numbers of independent operations (millions to hundreds of millions)
• Convolution reads data from the pixels near each output pixel
  • In games: the texture cache exploits this locality for performance
  • In general-purpose computing: shared memory and constant memory do the same (see the tiled sketch below)
• Since both run on hardware with the same memory hierarchy, memory-access optimizations carry over between the two
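A sketch of the shared-memory version of the same convolution, assuming a 16x16 thread block and a 3x3 kernel held in constant memory (all sizes illustrative). Each block stages its input tile, plus a one-pixel apron, into on-chip shared memory so that neighboring threads reuse each other's loads instead of re-reading global memory.

```cuda
#define TILE 16
#define R 1                                    // kernel radius (3x3 kernel)
__constant__ float d_kernel[(2*R+1)*(2*R+1)];  // weights in constant memory

__global__ void convolve2DTiled(const float* in, float* out, int w, int h)
{
    // Shared tile: the block's pixels plus an R-wide apron on every side.
    __shared__ float tile[TILE + 2*R][TILE + 2*R];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Stage the tile; each thread may load more than one apron texel.
    for (int ty = threadIdx.y; ty < TILE + 2*R; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2*R; tx += TILE) {
            int sx = min(max((int)(blockIdx.x * TILE) + tx - R, 0), w - 1);
            int sy = min(max((int)(blockIdx.y * TILE) + ty - R, 0), h - 1);
            tile[ty][tx] = in[sy * w + sx];
        }
    __syncthreads();                           // tile fully loaded

    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -R; dy <= R; ++dy)
        for (int dx = -R; dx <= R; ++dx)
            sum += tile[threadIdx.y + R + dy][threadIdx.x + R + dx]
                 * d_kernel[(dy + R) * (2*R + 1) + (dx + R)];
    out[y * w + x] = sum;
}
```

This plays the same role in compute code that the texture cache plays in graphics: serving the many overlapping neighborhood reads from fast on-chip memory.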
3. PARTIAL DIFFERENTIAL EQUATIONS (PDEs)

Solving PDEs in fluid dynamics: evolving the velocity and pressure fields
Physics computation for fluid effects in games
Fluid dynamics on supercomputers: weather prediction, product design, aerodynamic design, etc.

On the Development of a High-Order, Multi-GPU Enabled, Compressible
Viscous Flow Solver for Mixed Unstructured Grids. P. Castonguay et al.
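To make the discretization concrete, here is a minimal sketch — my example, not the speaker's — of one explicit time step of the 2D diffusion equation du/dt = alpha * laplacian(u) on a regular grid, the same grid-stencil pattern that full fluid solvers build on. All names and the step size are illustrative.

```cuda
// One explicit Euler step of du/dt = alpha * laplacian(u) on a 2D grid.
// Each thread updates one interior grid cell from its 4 neighbors.
__global__ void diffuseStep(const float* u, float* uNext,
                            int nx, int ny, float alpha, float dt, float dx)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return; // skip boundary

    int c = j * nx + i;
    // Five-point Laplacian stencil.
    float lap = (u[c - 1] + u[c + 1] + u[c - nx] + u[c + nx] - 4.0f * u[c])
              / (dx * dx);
    uNext[c] = u[c] + dt * alpha * lap;   // stable if dt <= dx*dx/(4*alpha)
}
```

Every cell's update is independent within a step, which is exactly why these solvers parallelize so well on both game and HPC hardware.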
Common ground #3 between game and supercomputing workloads

• Huge numbers of independent operations (millions to hundreds of millions)
• Used wherever heavy math — a high FLOP count — is required
• Games:
  • the maximum effect within a fixed time budget
  • speed > accuracy
  • optimized for single-precision floating point / the GeForce line
• Supercomputing:
  • maximum data volume and a high level of accuracy
  • accuracy > speed
  • optimized for double-precision floating point / the Tesla line
4. FOURIER TRANSFORM (FFT)

[Figure: a signal decomposed into the sum of its individual frequency components]
FFT — in games

Post-processing effects such as lens flare:

• Input: high-intensity pixels, e.g., from an HDR image
• FFT -> frequency-domain image
• Multiply by a frequency-domain kernel
• Inverse FFT -> the image with the effect applied

(3DMark 11 © Futuremark)
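A minimal sketch of that pipeline using cuFFT, the CUDA FFT library mentioned later in the talk: forward FFT, pointwise multiply by the frequency-domain kernel, inverse FFT. The plan setup and multiply kernel are illustrative (a real lens-flare pass would also pad and center the kernel); the 1/(W*H) factor is needed because cuFFT's inverse transform is unnormalized.

```cuda
#include <cufft.h>

// Pointwise complex multiply: image spectrum *= kernel spectrum.
// Also applies the 1/(W*H) normalization that cuFFT's C2C inverse omits.
__global__ void multiplySpectra(cufftComplex* img, const cufftComplex* ker,
                                int n, float norm)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    cufftComplex a = img[i], b = ker[i];
    img[i].x = (a.x * b.x - a.y * b.y) * norm;   // real part
    img[i].y = (a.x * b.y + a.y * b.x) * norm;   // imaginary part
}

// Host-side outline (error checking omitted for brevity).
void fftBloom(cufftComplex* d_image, cufftComplex* d_kernel, int W, int H)
{
    cufftHandle plan;
    cufftPlan2d(&plan, H, W, CUFFT_C2C);

    cufftExecC2C(plan, d_image,  d_image,  CUFFT_FORWARD);  // image -> spectrum
    cufftExecC2C(plan, d_kernel, d_kernel, CUFFT_FORWARD);  // kernel -> spectrum

    int n = W * H, block = 256;
    multiplySpectra<<<(n + block - 1) / block, block>>>(
        d_image, d_kernel, n, 1.0f / n);

    cufftExecC2C(plan, d_image, d_image, CUFFT_INVERSE);    // back to pixels
    cufftDestroy(plan);
}
```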
FFT — in games
Ocean wave simulation
(NVIDIA Ocean Demo)
FFT — in supercomputing
Turbulence simulation, protein synthesis, molecular dynamics, image processing, cryptography
Common ground #4 between game and supercomputing workloads

• Libraries exist for the frequently used core primitives (FFT, linear algebra, etc.)
• Games:
  • mostly DirectX / DirectCompute, the game development stack
  • game engines and standalone middleware
• Supercomputing:
  • the core CUDA libraries: CUFFT, CUBLAS, etc. (see the sketch below)
• The core kernels of these accelerated libraries are largely the same, since they target the same hardware architecture
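As a sketch of what using such a library looks like, here is a single-precision matrix multiply through cuBLAS (illustrative sizes; error handling omitted). Note that cuBLAS assumes column-major storage, a FORTRAN convention.

```cuda
#include <cublas_v2.h>

// Compute C = A * B for square N x N matrices already resident on the GPU.
// cuBLAS uses column-major layout, so the leading dimensions are all N here.
void gemmExample(const float* d_A, const float* d_B, float* d_C, int N)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;   // C = alpha*A*B + beta*C
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N,
                &alpha, d_A, N, d_B, N,
                &beta,  d_C, N);

    cublasDestroy(handle);
}
```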
A fundamental similarity between supercomputing and gaming applications
Cases bottlenecked by memory operations (memory traffic > computation):

• Gaming: ambient occlusion
• HPC: sparse matrix-vector multiply
A fundamental similarity between supercomputing and gaming applications
Cases bottlenecked by math operations (computation > memory traffic):

• Gaming: complex lighting calculations (Team Fortress 2 © Valve)
• HPC: protein and lipid simulation — e.g., a blood-coagulation simulation using AMBER
The last common ground between game and supercomputing workloads

• Improving parallel performance in software:
  • Find the bottleneck (see the sketch below)
  • Memory bound (bandwidth bound) vs. compute bound?
  • Requires an understanding of the hardware
  • Use tools such as Parallel Nsight
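A back-of-the-envelope sketch of that classification, using SAXPY as the example kernel (my example, not the speaker's): count the FLOPs and bytes per element and compare the resulting arithmetic intensity against the hardware's compute-to-bandwidth ratio.

```cuda
// SAXPY: y = a*x + y. Per element: 2 FLOPs (multiply + add) versus
// 12 bytes of traffic (read x, read y, write y) = ~0.17 FLOP/byte.
// A GPU whose peak-compute-to-bandwidth ratio is, say, >10 FLOP/byte will
// therefore run this kernel entirely limited by memory bandwidth.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```

A convolution with a large kernel, by contrast, reuses each loaded pixel many times, pushing the intensity up toward the compute-bound regime.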
GPU Computing – Game On!

Growth of GPU computing, 2008 -> 2013:

• CUDA-capable GPUs: 100M -> 430M
• CUDA downloads: 150K -> 1.6M
• Supercomputers: 1 -> 50
• University courses: 60 -> 640
• Academic papers: 4,000 -> 37,000
2008 Supercomputing Exhibit Floor
2012 Supercomputing Exhibit Floor
ADVANCING HEALTHCARE

The Chinese Academy of Sciences used
a GPU-powered supercomputer to
model a complete H1N1 virus for the
first time.
ENHANCING FINANCE

With NVIDIA Tesla GPUs, J.P. Morgan achieved a 40x speedup of its risk calculations.
MEDICAL BREAKTHROUGHS

GPUs enable doctors to perform beating
heart surgery with robotic arms that predict
and adjust for movement.

Laboratoire d’Informatique de Robotique et
de Microelectronique de Montpellier
Information search

[Chart: tweets per day, 2007–2012, growing from about 1M to 500M]

500M tweets per day, scanned for about 1M expressions, in under 5 minutes
Audio search

SHAZAM

[Chart: user queries per month, 2008–2013, growing to 300M]
Pattern search
The future of game effects? (today -> the near future)

The future of supercomputing? (today -> the future?)

Another future for supercomputing? (today -> the future?)
Mobile GPUs are coming

[Chart: relative graphics horsepower, 2009–2014 — from iPhone 4 and Galaxy S2 near 1X, through iPad 4 approaching 100X, to PS3 and the 8800 GTX around 200X, with mobile Kepler heading toward 300X]
Gesture, augmented reality, beautification, speech recognition, facial recognition —

a mobile supercomputer?
Deview2013 – The Symbiotic Relationship Between Game Technology and Supercomputing
http://nvidiakoreapsc.com

Editor's notes

  1. No matter whether you are doing large-scale computing on thousands of nodes or computing on a workstation, you need to do parallel computing if you want to increase performance. This is because, roughly speaking, computers are no longer growing faster; they are only growing wider and more parallel. And the most important problem that all of these computers – supercomputers, workstations, mobile computers – have to deal with is power.
  3. Parallel computing enjoyed a heyday in the late '80s and early '90s. The first distributed-memory parallel computer, consisting of 32 Intel CPUs, was delivered to ORNL in 1985. For the next 10 years, proponents of parallel computing built massively parallel machines – exotic, expensive, and accessible only to a relatively small technical elite. Since then, these machines have been replaced by clusters of commodity hardware. One example: last year the Titan supercomputer came online at Oak Ridge National Labs. It's powered by more than 18 thousand nodes consisting of AMD Opteron CPUs and NVIDIA Kepler GPUs.
  5. To get a sense of the importance of this issue, let's look at an example. The Jaguar supercomputer last year delivered 2.3 petaflops using 7 megawatts, which is the power consumed by a small city. What if we want to scale this system to deliver 120 petaflops?
  6. It turns out that to deliver 120 petaflops we need more than 370 megawatts of power, which is enough to power all of San Francisco. So clearly, this approach is unsustainable – we need to be able to deliver more and more flops for HPC, but we need to do it with significantly less power.
  7. Fortunately there is good news – high performance computing has an unlikely hero – the video game industry…
  8. and the GPUs that power it.
  9. As you all know, GPUs have a day job, which is rendering computer graphics for the massive and growing video game industry. This day job is pretty demanding. Gamers demand higher and higher fidelity, instantaneous response, and immersive experiences. The desire for better computer graphics is insatiable and drives the industry forward. As I mentioned, this is a huge industry. Just as a data point, it is expected to reach $82 billion by 2016, which is about the size of the movie industry. The economies of scale derived from such a large base have enabled GPUs to fund R&D into parallel computing, which has benefited and continues to benefit the HPC community. Photo: 275,000 people attended Gamescom – the world's largest gaming event – held in Cologne, Germany, in 2011. Data: gaming market $82B (does not include hardware): PriceWaterhouseCoopers, "Global entertainment and media outlook: 2012–2016"
  10. So why is it that computer graphics requires so much parallel computing? To answer that, let's take a look at just one example – one frame rendered from Unreal Engine 4.
  11. To calculate the color of each of the millions of pixels on the screen, we need to start with the 3D representation of the scene: millions of independent triangles. For each triangle we need to transform it, tessellate it, clip it, project it to 2D space, rasterize it to pixels, and finally shade each pixel – and for each location on the screen we are shading multiple overlapping pixels. The shading itself has to take into account complex effects such as the interaction of light with different materials, and then we need to apply a host of environmental and post-processing effects. All of these calculations over all the triangles and all the pixels are done in parallel, and they have to be done in less than a 30th of a second.
  13. It’s also easy to see why HPC requires quadrillions of parallel computations -- we are trying to simulate and compute massive systems, whether it’s climate over large regions of the earth, or a detailed simulation over the wing of an airplane.As the most energy-efficient parallel processors, GPUs accelerate energy simulation, protein folding, DNA alignment, climate modeling, astrophysics, seismic exploration, computational finance, radioastronomy, heart surgery…the list goes on and on.
  14. To get an idea of why GPUs are better at delivering performance per watt, let's look at an architectural representation of the two. CPUs have a few cores and are optimized for delivering fast single-threaded performance. This means they run each core at a higher frequency, which costs a lot more power, and they spend a lot more power on scheduling an instruction than on executing it. As an example, an out-of-order core spends 2nJ to schedule a 25pJ FMUL (80x more energy). So a lot of power goes into things like speculative execution, out-of-order execution, and branch prediction, and only a tiny fraction goes into flops. In contrast, GPUs have lots of tiny cores that run at a lower clock, and each core devotes most of its area to flops rather than instruction scheduling. This means slow single-threaded performance with longer latencies, but a lot more cores to improve overall throughput. CPUs: ~50x the energy to schedule an instruction than to execute it. Todo: get the energy to schedule instructions for the GPU; add an equivalent bullet point for memory bandwidth.
  15. GPUs have always been massively parallel throughput machines, but both their processing power and their programmability evolved over time. Let's take a quick look at the evolution of the GPU. Starting from 1995 with the RIVA 128 and its 3M transistors, we have come a long way in 15 years, increasing the transistor count by more than 3 orders of magnitude. As the computing power of GPUs has gone up, so has their programmability. The first couple of generations of GPUs were fixed-function, so writing any new algorithms on them meant tricks with multitexture, the stencil buffer, the depth buffer, blending, etc. Today writing HPC applications that take advantage of the GPU is extremely straightforward using CUDA, which is NVIDIA's parallel programming platform, but this wasn't always the case. In 2001, DirectX 8 introduced programmable pixel and vertex shaders, which made computing on GPUs a bit simpler – now people could write computing applications on GPUs using tricks like render-to-texture, and have each pixel be a unit of computation. Finally, in 2006 the first CUDA-capable card, the GeForce 8, was introduced, and this truly enabled GPUs to be used in a seamless and transparent way for any parallel computing workload without having to go through any graphics abstractions.
  20. In HPC, particle simulation is used, among other things, for molecular dynamics. Molecular dynamics is a computer simulation of the physical movements of atoms and molecules. NAMD is a powerful and widely used tool for simulating complex molecular systems and biomolecular processes on GPUs, and has applications in quickening the pace of drug discovery and other vital research in unraveling biological processes.
  21. So let's talk a bit about what HPC and gaming workloads and algorithms have in common such that they are able to effectively utilize the same piece of hardware. It turns out they have more similarities than you might think! It is easy to imagine GPUs being good at things like medical imaging, but there are many more application domains that run very well on GPUs.
  22. Perhaps the simplest and most popular approach to solve this problem is the Gauss-Seidel method. The idea is very simple: we apply the constraint correction for each constraint, and iterate through the constraints one by one until things converge. This simple approach works pretty well in practice, but may not converge very well for a large number of constraints.
  23. The next example we have is convolution. I am going to explain it with reference to an image in 2D, but really it can be done in any number of dimensions. We have a kernel, and it is translated over each pixel in the source image. At each location we do a pairwise multiplication of the kernel and the source values, sum these together, and this is the value we place in the destination image. Source left: http://www.biomachina.org/courses/structures/01.html Source right: http://www.westworld.be
  24. Convolution is useful for many effects in video games, for example depth of field. This is a cinematic effect which tries to mimic the properties of a real camera lens, where objects away from the focal point appear blurred. For example, in this image the character in the front is in focus and everything else is out of focus. This effect helps the game look more cinematic and also helps direct the user's attention to the areas the game developer would like them to focus on.
  25. Seismic imaging is a method of exploration geophysics that estimates the properties of the earth's subsurface from reflected seismic waves. RTM is the current state of the art in seismic imaging. It is a computationally expensive technique that uses the full two-way acoustic wave equation instead of a simple one-way propagation in order to better analyze complex situations, particularly subsalt. The image on the right, for example, shows complex wave interaction near a salt tooth, which can be useful for understanding the oil content beneath the surface of the earth. The different colors denote different velocities and the concentric half circles show pressure waves. RTM in the time domain uses convolution to compute discrete spatial derivatives in 3D by doing 1D convolutions in all dimensions of interest. Source left: http://www.pgs.com/ja/Pressroom/News/PGS_Releases_the_Ultimate_Dep/ Source right: http://www.acceleware.com/rtm
  27. This PDE basically states that the rate at which some quantity x is changing over time is given by some function f, which may itself depend on x and t. To use this equation we will discretize our space (x) and time (t). After that there are many approaches to solving the equations at each position and time step.
  28. This equation is specifying the evolution of the velocity field u, over time, t. We compute the solution to this equation by discretizing the space over a regular grid, initializing the velocity and pressure fields, and then evolving them based on the three steps that we discussed earlier for each timestep. The first term is the advection term, which states that the velocity field pushes itself forward. The second term states that the velocity is affected by external forces f, like gravity. Finally, the projection operator P projects the field onto its divergence-free part. This ensures that the fluid satisfies important properties, like the amount of fluid flowing into a region being the same as the amount of fluid flowing out of it. In order to do this we need to calculate and use the pressure field.
  29. The Navier-Stokes equations are used for a number of important applications in HPC, including modeling the weather, ocean currents, water flow in a pipe, blood flow, and air flow around a wing. The particular example above is simulating the flow of air around the deformable wings of an insect or bird, and is done on multiple GPUs. P. Castonguay, D.M. Williams, P.E. Vincent, A. Jameson, On the Development of a High-Order, Multi-GPU Enabled, Compressible Viscous Flow Solver for Mixed Unstructured Grids, 20th AIAA Computational Fluid Dynamics conference, AIAA 2011-3229
  31. A Fourier transform can decompose a signal into individual frequencies.
  32. The Fast Fourier Transform (FFT) is an important computational scheme commonly used to reduce the amount of overall calculation by transforming operations into spectral space. In particular, 3-D FFT is used in high-performance computing applications such as direct numerical simulations and protein docking simulations, as well as cryptography, large polynomial multiplications, image and audio processing, and molecular dynamics. One particular example of the use of FFT is to simulate turbulent flows, which is central to studying everything from the formation of hurricanes to the mixing process in the chemical industry. An example of such work has been carried out by researchers at Peking University using the Tianhe-1A GPU supercomputer.
  34. These use cases were just some examples. There are similarities between HPC and Gaming workloads at a very fundamental level, for example
  35. Image on left: Team Fortress 2. Image on right: blood coagulation factor IX simulated by AMBER (90K atoms including water). AMBER: a molecular dynamics package primarily designed for biomolecular systems such as proteins and nucleic acids.
  39. The Chinese Academy of Sciences uses GPU supercomputers to conduct some of the largest scale molecular simulations in the world. Recently the Academy became the first to model a complete H1N1 virus. The approach marked a new supercomputer-centric way of dealing with the problems of epidemiology and virology that wasn’t possible just a few years ago.
  40. For investment banks, the ability to calculate risks across a range of complex variables quickly is critical to success. With NVIDIA Tesla GPUs, J.P. Morgan achieved a 40x speedup of its risk calculations. Data: NV press release: http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&prid=784689&releasejsp=release_157&xhtml=true
  41. This is truly a medical breakthrough… GPUs are helping doctors perform beating-heart surgery. Performing beating-heart surgery is extremely risky and can be done by only 2% of surgeons. Medical researchers at France's LIRMM use GPUs to 'virtually' still a beating heart. This enables the surgeons to treat patients by guiding robotic arms that predict and adjust for movement.
  42. The video game industry is driven by an insatiable demand for more realistic images and effects. The images that we are able to simulate and render today, for example the dust cloud on the left, are still far away from where we would like them to be, for example the image on the right. This improvement in fidelity is going to come both from an increase in computational throughput and an improvement in the algorithms, where gaming will be borrowing more and more from HPC.
  43. HPC requires a similar increase in computational throughput per watt in order to be able to solve the greatest challenges the world faces. For example, today, computer simulations provide great insight into weather. But to tackle climate change, we need a more accurate view for which we need systems that improve both throughput and energy efficiency.
  45. Device: GFLOPS, relative (year)
      iPhone 4: 1.6, 1x (2010)
      Galaxy S2: 10.8, 6.75x (2011)
      iPad 4: 76, 47.5x (2012)
      [T4: 81, 50.6x (2013)]
      PS3: 234, 146.25x
      GT 550M: 284, 177.5x
      8800 GTX: 346, 216.25x
      Mobile Kepler: 384, 240x