Graphics processors (GPUs), which evolved for graphics work and gaming, have recently become an indispensable core component of supercomputers. General-purpose development tools such as CUDA are being applied, in a variety of ways, to almost every kind of programming problem.
But are game technology development and HPC programming really entirely different domains?
This session discusses the similarities between the techniques used in games and graphics and those used in supercomputing, from the perspective of algorithmic approaches to parallel processing for GPU speed optimization.
It also explains how various GPU techniques acquired in game development can be applied and extended to other problems.
4. A Brief History of Parallel Computing
1985 — The first distributed-memory parallel system (Intel iPSC/1, 32 CPUs), used at Oak Ridge National Laboratory (ORNL)
2012 — The Titan system at ORNL: 18,688 nodes of AMD Opteron CPUs and NVIDIA Tesla GPUs
(Timeline figure, 1980s to today. Image source: The Far Side, Gary Larson)
5. The New Walls of Computing
• The limits of single-core CPU performance that depends on clock speed
• Heat
• Energy
"The party isn't over yet. But the police have arrived, and the music has already stopped."
– Peter Kogge –
6. The Problem Is Power!
Jaguar (Nov. '11): 2.3 petaflops @ 7 megawatts (224,256 x86 CPU cores)
7 megawatts = 7,000 homes — the total power consumption of a small city!
14. Scientific computation on supercomputers requires quadrillions of parallel operations per second
e.g., Why is today's weather forecast so often wrong?
15. GPU = graphics is the day job, but inside it's a supercomputer
CPU = a handful of complex processors specialized for computation
GPU = a large number of processors optimized for simple operations
CPU: designed to extract maximum single-threaded performance from general-purpose work
GPU: slower at general-purpose work, but designed for total throughput; optimized for performance per watt
24. Particle Simulation — Molecular Biology
Ribosome simulation (simulated by NAMD, visualized by VMD)
Used in new drug discovery research and in validating existing biological theories
(Figure labels: Atom, Bond, Forces)
25. The First Thing Game Computing and Supercomputing Have in Common
• Huge numbers of independent operations (millions to hundreds of millions)
• Parallelizing logic for maximum performance
• Dependency problems that prevent operations from being independent
  • e.g., in particle simulation, inter-particle dependencies
  • Multiple operations accessing the same memory address at the same time
• Countermeasures
  • New algorithms suited to parallel processing
  • Removing dependencies via graph coloring; applying multiple passes
26. Techniques for Avoiding Data Collisions
• Prevent write operations from targeting the same node
• Atomic operations → serialize the operations → slow
• Generating passes via graph coloring
  • Within each pass the work is completely independent → maximum parallelism
  • The key is minimizing the number of passes
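The pass generation via graph coloring described above can be sketched as a simple greedy coloring: nodes that share an edge (a data dependency) get different colors, so all nodes of one color form one fully independent pass. The function name and adjacency format here are illustrative, not from the talk:

```python
# Greedy graph coloring: partition work items into independent passes.
# Neighboring nodes (data dependencies) must get different colors; every
# node of a single color can then be processed in parallel in one pass.

def color_graph(adjacency):
    """adjacency: dict mapping node -> set of neighboring nodes."""
    colors = {}
    for node in adjacency:                       # iteration order is the heuristic
        used = {colors[n] for n in adjacency[node] if n in colors}
        color = 0
        while color in used:                     # smallest color unused by neighbors
            color += 1
        colors[node] = color
    return colors

# Example: constraints 0-1-2 form a dependency chain; 3 is independent.
adj = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
passes = color_graph(adj)
# Neighbors never share a color, so each color is one conflict-free pass.
assert all(passes[a] != passes[b] for a in adj for b in adj[a])
```

A greedy coloring is not optimal in general, but in practice it keeps the number of passes small, which is exactly the goal stated on the slide.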
30. Convolution — The Game Case
Rendering translucent skin (subsurface scattering)
(Figure: the opaque-object case vs. light scattering inside a translucent object. Labels: direct lighting, texture convolution)
Jimenez 2008; used in Samaritan, Crysis 2, and other games
31. Convolution — The Supercomputing Case
Reverse Time Migration, used in geological surveying (e.g., locating oil wells)
Convolution is used when computing spatial derivatives, to improve the accuracy of the wave simulation
Petroleum Geo Services
Complex wave interaction near a salt tooth, propagated using AxRTM
32. The Second Thing Game Computing and Supercomputing Have in Common
• Huge numbers of independent operations (millions to hundreds of millions)
• Convolution reads data from the pixels near each pixel's position
  • Games – performance gains from the texture cache
  • General computing – performance gains from shared memory and constant memory
• Since both run on hardware with the same memory architecture, optimization techniques based on improved memory access patterns carry over between them
36. Fluid Dynamics on Supercomputers
Weather prediction, product design, aerodynamic design, etc.
On the Development of a High-Order, Multi-GPU Enabled, Compressible Viscous Flow Solver for Mixed Unstructured Grids. P. Castonguay et al.
37. The Third Thing Game Computing and Supercomputing Have in Common
• Huge numbers of independent operations (millions to hundreds of millions)
• Used in fields that need high arithmetic throughput (FLOPS), such as numerical mathematics
• Games
  • Maximum effect within a fixed time budget
  • Speed > accuracy
  • Optimized for single-precision floating point / GeForce line
• Supercomputing
  • Maximum data sizes and high accuracy requirements
  • Accuracy > speed
  • Optimized for double-precision floating point / Tesla line
39. FFT — The Game Case
Post-processing effects such as lens flare
(Pipeline figure: high-intensity pixel input from HDRI → FFT → frequency-domain image → multiplied (×) by a frequency-domain kernel → FFT⁻¹ → image with the effect applied)
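The pipeline above (FFT, pointwise multiply by a frequency-domain kernel, inverse FFT) can be sketched in 1-D with NumPy; a real lens-flare pass does the same thing in 2-D on the frame's bright pixels. The names here are illustrative:

```python
# FFT-based convolution, the structure of the lens-flare pipeline:
# transform the image and the kernel to the frequency domain, multiply
# pointwise, and transform back with the inverse FFT.
import numpy as np

def fft_convolve(signal, kernel):
    n = len(signal) + len(kernel) - 1            # pad to avoid circular wrap-around
    F_signal = np.fft.rfft(signal, n)            # FFT of the "image"
    F_kernel = np.fft.rfft(kernel, n)            # FFT of the flare kernel
    return np.fft.irfft(F_signal * F_kernel, n)  # pointwise product, then FFT^-1

signal = np.array([0.0, 0.0, 1.0, 0.0, 0.0])     # a single bright pixel
kernel = np.array([0.25, 0.5, 0.25])             # a small blur kernel
out = fft_convolve(signal, kernel)
# Matches direct convolution up to floating-point error.
assert np.allclose(out, np.convolve(signal, kernel))
```

The payoff is asymptotic: direct convolution with a large kernel costs O(n·k) per pixel row, while the FFT route costs O(n log n), which is why large-radius flare and bloom kernels go through the frequency domain.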
44. FFT — The Supercomputing Case
Turbulence simulation, protein synthesis, molecular dynamics, image processing, cryptography
45. The Fourth Thing Game Computing and Supercomputing Have in Common
• Libraries exist for frequently used core techniques (FFT, linear algebra, etc.)
• Games
  • Mainly use DirectX/DirectCompute, the game-oriented development environment
  • Game engines and standalone middleware
• Supercomputing
  • Core libraries for CUDA
  • CUFFT, CUBLAS, etc.
• The core kernels of these accelerated libraries are largely the same (same hardware architecture)
48. The Last Thing Game Computing and Supercomputing Have in Common
• Software-side ways to improve parallel performance
• Find the bottleneck
  • Memory-bound (bandwidth bound) vs. compute-bound?
• Requires an understanding of the hardware
• Use tools such as Parallel Nsight
55. MEDICAL BREAKTHROUGHS
GPUs enable doctors to perform beating-heart surgery with robotic arms that predict and adjust for movement.
Laboratoire d'Informatique de Robotique et de Microelectronique de Montpellier
56. Information Retrieval
(Chart: tweets per day, 2007–2012, rising to roughly 500 million. Callouts: 500M Tweets, 1M Expressions, <5 Minutes)
But no matter whether you are doing large-scale computing on thousands of nodes or computing on a workstation, you need to do parallel computing if you want to increase performance. This is because, roughly speaking, computers are no longer growing faster; they are only growing wider and more parallel. And the most important problem that all these computers – supercomputers, workstations, mobile computers – have to deal with is power.
Parallel computing enjoyed a heyday in the late 80s and early 90s. The first distributed-memory parallel computer, consisting of 32 Intel CPUs, was delivered to ORNL in 1985. For the next 10 years, proponents of parallel computing built massively parallel machines – exotic, expensive, and accessible only to a relatively small technical elite. Since then, these machines have been replaced by clusters of commodity hardware. As one example, last year the Titan supercomputer came online at Oak Ridge National Labs. It's powered by more than 18 thousand nodes consisting of AMD Opteron CPUs and NVIDIA Kepler GPUs.
To get a sense of the importance of this issue, let's look at an example. The Jaguar supercomputer last year delivered 2.3 petaflops using 7 megawatts, which is the power consumed by a small city. What if we want to scale this system to deliver 120 petaflops?
It turns out that to deliver 120 petaflops we would need more than 370 megawatts of power, which is enough to power all of San Francisco. So clearly, this approach is unsustainable – we need to deliver more and more flops for HPC, but we need to do it with significantly less power.
Fortunately there is good news – high performance computing has an unlikely hero – the video game industry…
and the GPUs that power it.
As you all know, GPUs have a day job, which is rendering computer graphics for the massive and growing video game industry. This day job is pretty demanding. Gamers demand higher and higher fidelity, instantaneous response, and immersive experiences. The desire for better computer graphics is insatiable and drives the industry forward. As I mentioned, this is a huge industry. Just as a data point, it is expected to reach $82 billion by 2016, which is about the size of the movie industry. The economies of scale derived from such a large base have enabled GPUs to fund R&D into parallel computing, which has benefited and continues to benefit the HPC community. Photo: 275,000 people attended Gamescom – the world's largest gaming event – held in Cologne, Germany, in 2011. Data: Gaming market $82B (does not include hardware): PricewaterhouseCoopers, "Global entertainment and media outlook: 2012–2016"
So why is it that computer graphics requires so much parallel computing? To answer that, let's take a look at just one example – one frame rendered with Unreal Engine 4.
To calculate the color of each of the millions of pixels on the screen, we start with the 3D representation of the scene – millions of independent triangles. For each triangle we need to transform it, tessellate it, clip it, project it to 2D space, rasterize it to pixels, and finally shade each pixel; and for each location on the screen we are shading multiple overlapping pixels. The shading itself has to take into account complex effects such as the interaction of light with different materials, and then we need to apply a host of environmental and post-processing effects. All of these calculations over all the triangles and all the pixels are done in parallel, and they have to be done in less than a 30th of a second.
It's also easy to see why HPC requires quadrillions of parallel computations – we are trying to simulate and compute massive systems, whether it's the climate over large regions of the earth or a detailed simulation over the wing of an airplane. As the most energy-efficient parallel processors, GPUs accelerate energy simulation, protein folding, DNA alignment, climate modeling, astrophysics, seismic exploration, computational finance, radioastronomy, heart surgery… the list goes on and on.
To get an idea of why GPUs are better at delivering performance per watt, let's look at an architectural representation of the two. CPUs have a few cores and are optimized for delivering fast single-threaded performance. This means they run each core at a higher frequency, which costs a lot more power, and they spend far more power on scheduling an instruction than on executing it. As an example, an out-of-order core spends 2 nJ to schedule a 25 pJ FMUL (80x more energy). So a lot of power goes into things like speculative execution, out-of-order execution, and branch prediction, and only a tiny fraction goes into flops. In contrast, GPUs have lots of tiny cores running at a lower clock, and each core devotes most of its area to flops rather than to instruction scheduling. This means slow single-threaded performance with longer latencies, but many more cores to improve the overall throughput.
GPUs have always been massively parallel throughput machines, but both their processing power and their programmability have evolved over time. Let's take a quick look at the evolution of the GPU. Starting in 1995 with the RIVA 128 and its 3M transistors, we have come a long way in 15 years, increasing the transistor count by more than 3 orders of magnitude. As the computing power of GPUs has gone up, so has their programmability. The first couple of generations of GPUs were fixed-function, so writing any new algorithm on them meant tricks with multitexture, the stencil buffer, the depth buffer, blending, etc. In 2001, DirectX 8 introduced programmable pixel and vertex shaders, which made computing on GPUs a bit simpler – now people could write computing applications on GPUs using tricks like render-to-texture, with each pixel as a unit of computation. Finally, in 2006 the first CUDA-capable card, the GeForce 8, was introduced, and this truly enabled GPUs to be used in a seamless and transparent way for any parallel computing workload, without having to go through any graphics abstractions. Today, writing HPC applications that take advantage of the GPU is extremely straightforward using CUDA, NVIDIA's parallel programming platform.
In HPC, particle simulation is used, among other things, for molecular dynamics. Molecular dynamics is a computer simulation of the physical movements of atoms and molecules. NAMD is a powerful and widely used tool for simulating complex molecular systems and biomolecular processes on GPUs, with applications in quickening the pace of drug discovery and other vital research into unraveling biological processes.
So let's talk a bit about what HPC and gaming workloads and algorithms have in common such that they are able to effectively utilize the same piece of hardware. It turns out they have more similarities than you might think! It is easy to imagine GPUs being good at things like medical imaging, but there are many more application domains that run very well on GPUs.
Perhaps the simplest and most popular approach to solving this problem is the Gauss-Seidel method. The idea is very simple: we apply the constraint correction for each constraint, and iterate through the constraints one by one until things converge. This simple approach works pretty well in practice, but may not converge very well for a large number of constraints.
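A minimal sketch of that one-by-one sweep, shown here on a small linear system rather than a game constraint solver; the function name and the example system are assumptions for the illustration:

```python
# Gauss-Seidel iteration for A x = b: sweep the unknowns one by one,
# immediately reusing freshly updated values, and repeat until converged.
# A game constraint solver applies the same sweep over constraints.

def gauss_seidel(A, b, iterations=50):
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        for i in range(n):                       # one-by-one, updated in place
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

# Diagonally dominant system with exact solution x = [1, 2].
A = [[4.0, 1.0], [2.0, 5.0]]
b = [6.0, 12.0]
x = gauss_seidel(A, b)
assert abs(x[0] - 1.0) < 1e-6 and abs(x[1] - 2.0) < 1e-6
```

Note the data dependency: each update reads values written earlier in the same sweep, which is exactly what makes a naive parallelization unsafe and motivates the graph-coloring passes discussed earlier.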
The next example we have is convolution. I am going to explain it with reference to a 2D image, but really it can be done in any number of dimensions. We have a kernel, and it is translated over each pixel in the source image. At each location we do a pairwise multiplication of the kernel and the source values, sum these together, and this is the value we place in the destination image. Source left: http://www.biomachina.org/courses/structures/01.html Source right: http://www.westworld.be
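The description above can be sketched as a direct 2-D convolution in plain Python; the helper name and the "valid" boundary handling (no padding) are assumptions for the example:

```python
# Direct 2-D convolution: slide the kernel over each pixel, multiply
# kernel and source values pairwise, and sum the products into the
# destination pixel.

def convolve2d(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1            # "valid" output size, no padding
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            out[y][x] = sum(kernel[ky][kx] * image[y + ky][x + kx]
                            for ky in range(kh) for kx in range(kw))
    return out

image = [[0, 0, 0],
         [0, 9, 0],
         [0, 0, 0]]
box = [[1.0] * 3 for _ in range(3)]              # 3x3 summing kernel
assert convolve2d(image, box) == [[9.0]]         # all nine neighbors summed
```

Every output pixel is independent of every other output pixel, which is why this maps so cleanly onto the GPU, and why the neighborhood reads reward the texture cache in games and shared memory in compute.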
Convolution is useful for many effects in video games, for example depth of field. This is a cinematic effect that tries to mimic the properties of a real camera lens, where objects away from the focal point appear blurred. In this image, for example, the character in front is in focus and everything else is out of focus. This effect helps the game look more cinematic and also helps direct the user's attention to the areas the game developer would like them to focus on.
Seismic imaging is a method of exploration geophysics that estimates the properties of the earth's subsurface from reflected seismic waves. RTM is the current state of the art in seismic imaging. It is a computationally expensive technique that uses the full two-way acoustic wave equation instead of a simple one-way propagation in order to better analyze complex situations, particularly sub-salt. The image on the right, for example, shows complex wave interaction near a salt tooth, which can be useful for trying to understand the oil content beneath the surface of the earth. The different colors denote different velocities, and the concentric half circles show pressure waves. RTM in the time domain uses convolution to compute discrete spatial derivatives in 3D by doing 1D convolutions in all dimensions of interest. Source left: http://www.pgs.com/ja/Pressroom/News/PGS_Releases_the_Ultimate_Dep/ Source right: http://www.acceleware.com/rtm
This PDE basically states that the rate at which some quantity x changes over time is given by some function f, which may itself depend on x and t. To use this equation we discretize our space (x) and time (t). After that, there are many approaches to solving the equations at each position and time step.
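As one concrete instance of such a discretization, here is a forward-Euler sketch. It is the simplest scheme of the family described above; the talk does not name a specific method, so this choice and the names are assumptions:

```python
# Forward-Euler discretization of dx/dt = f(x, t): replace the derivative
# with a finite difference and step forward in time.
import math

def euler(f, x0, t0, dt, steps):
    x, t = x0, t0
    for _ in range(steps):
        x = x + dt * f(x, t)                     # x_{n+1} = x_n + dt * f(x_n, t_n)
        t += dt
    return x

# dx/dt = -x has the exact solution x(t) = x0 * e^{-t}.
x = euler(lambda x, t: -x, x0=1.0, t0=0.0, dt=0.001, steps=1000)
assert abs(x - math.exp(-1.0)) < 1e-3            # error shrinks with smaller dt
```

In a parallel setting each grid position runs this update independently per timestep, reading only its neighbors from the previous step, which is the pattern GPUs excel at.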
This equation specifies the evolution of the velocity field u over time t. We compute the solution by discretizing the space over a regular grid, initializing the velocity and pressure fields, and then evolving them, for each timestep, using the three steps we discussed earlier. The first term is the advection term, which states that the velocity field pushes itself forward. The second term states that the velocity is affected by external forces f, like gravity. Finally, the projection operator P projects the field onto its divergence-free part. This ensures that the fluid satisfies important properties, like the amount of fluid flowing into a region being the same as the amount flowing out. To do this we need to calculate and use the pressure field.
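The three terms described above can be written compactly as follows. This is the stable-fluids form of the incompressible equation, which matches the description (advection, forces, projection) but is an assumption about the exact notation used on the slide:

```latex
% Velocity-field evolution: advection, external forces, then projection
% onto the divergence-free part. P denotes the projection operator.
\[
  \frac{\partial \mathbf{u}}{\partial t}
    = \mathbf{P}\!\left( -(\mathbf{u}\cdot\nabla)\,\mathbf{u} + \mathbf{f} \right),
  \qquad \nabla \cdot \mathbf{u} = 0
\]
```

The divergence-free condition on the right is what the pressure solve enforces: flow into any region equals flow out.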
The Navier-Stokes equations are used for a number of important applications in HPC, including modeling weather, ocean currents, water flow in a pipe, blood flow, and air flow around a wing. The particular example above simulates the flow of air around the deformable wings of an insect or bird, and runs on multiple GPUs. P. Castonguay, D.M. Williams, P.E. Vincent, A. Jameson, "On the Development of a High-Order, Multi-GPU Enabled, Compressible Viscous Flow Solver for Mixed Unstructured Grids," 20th AIAA Computational Fluid Dynamics Conference, AIAA 2011-3229.
A Fourier transform can decompose a signal into its individual frequencies.
The Fast Fourier Transform (FFT) is an important computational scheme commonly used to reduce the overall amount of calculation by transforming operations into spectral space. In particular, 3-D FFT is used in high-performance computing applications such as direct numerical simulation, protein docking, cryptography, large polynomial multiplication, image and audio processing, and molecular dynamics. One particular use of the FFT is simulating turbulent flows, which is central to studying everything from the formation of hurricanes to mixing processes in the chemical industry. An example of such work has been carried out by researchers at Peking University using the Tianhe-1A GPU supercomputer.
These use cases were just some examples. There are similarities between HPC and gaming workloads at a very fundamental level, for example:
Image on left: Team Fortress. Image on right: blood coagulation factor IX simulated by AMBER (90K atoms including water). AMBER: a molecular dynamics package primarily designed for biomolecular systems such as proteins and nucleic acids.
The Chinese Academy of Sciences uses GPU supercomputers to conduct some of the largest scale molecular simulations in the world. Recently the Academy became the first to model a complete H1N1 virus. The approach marked a new supercomputer-centric way of dealing with the problems of epidemiology and virology that wasn’t possible just a few years ago.
For investment banks, the ability to calculate risks across a range of complex variables quickly is critical to success. With NVIDIA Tesla GPUs, J.P. Morgan achieved a 40x speedup of its risk calculations. Data: NV press release: http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&prid=784689&releasejsp=release_157&xhtml=true
This is truly a medical breakthrough… GPUs are helping doctors perform beating-heart surgery. Performing beating-heart surgery is extremely risky and can be done by only 2% of surgeons. Medical researchers at France's LIRMM use GPUs to 'virtually' still a beating heart. This enables surgeons to treat patients by guiding robotic arms that predict and adjust for movement.
The video game industry is driven by an insatiable demand for more realistic images and effects. The images that we are able to simulate and render today, for example the dust cloud on the left, are still far away from where we would like them to be, for example the image on the right. This improvement in fidelity is going to come both from an increase in computational throughput and an improvement in the algorithms, where gaming will be borrowing more and more from HPC.
HPC requires a similar increase in computational throughput per watt in order to be able to solve the greatest challenges the world faces. For example, today, computer simulations provide great insight into weather. But to tackle climate change, we need a more accurate view for which we need systems that improve both throughput and energy efficiency.