HPC Essentials Prequel: From 0 to
HPC in one hour
OR
five ways to do Kriging
Bill Brouwer
Research Computing and Cyberinfrastructure
(RCC), PSU
wjb19@psu.edu
Outline
● Step 0 – Navigating RCC resources
● Step 1 – Ordinary Kriging in Octave
● Step 2 – Vectorized Octave
● Step 3 – Compiled code
● Digression on Profiling & Amdahl's Law
● Step 4 – Accelerating using GPU
● Step 5 – Shared Memory
● Step 6 – Distributed Memory
● Scenarios & Summary
Step 0
● Get an account on our systems
● Check out the system details, or let us help pick one for you
● They are Linux systems; you'll need some basic command-line knowledge
– You may want to check out the HPC Essentials I seminar, a Unix/C overview
● We use the modules system for software; you'll need to load whatever you use. To see a list of everything available:
module av
To load Octave, for example:
module load octave
To see which modules you have in your environment:
module list
Step 0
● There are two main types of systems:
– Interactive: you share a single machine (its memory and CPUs) with one or more users; used for
● Debugging
● Benchmarking
● Using a program with a graphical user interface
– You'll need to log in using Exceed onDemand
● Running for short periods of time
Step 0
● Batch systems
– You get dedicated memory and CPUs for a period of time
● Maximum time is generally 24 hours
● Maximum memory and CPUs depend on the cluster
– You log in to a head node, from which you submit a request, eg., an interactive session for 1 node, 1 processor per node (ppn) and 4gb total memory:
qsub -I -l walltime=24:00:00 -l mem=4gb -l nodes=1:ppn=1
● To check the status of your request:
qstat -u <your_psu_id>
Step 0
● Other notes on clusters:
– Please never run anything significant on head nodes, use PBS to
submit a job instead
– If you request more than 1 CPU, remember your code/workflow needs to be able to do one of the following:
● Use multiple CPUs on a single node (set the ppn parameter) using some form of shared memory parallelism
● Use multiple CPUs on multiple nodes (set a combination of the nodes & ppn parameters) using some form of distributed memory parallelism
● A combination of the above
● Parallelism applied in an optimal way is high performance computing
High Performance Computing
● Using one or more forms of parallelism to
improve the performance and scaling of your
code
– Vector architecture eg., SSE/AVX in Intel CPU
– Shared memory parallelism eg., using multiple
cores of CPU
– Distributed memory parallelism eg., using
Message Passing Interface (MPI) to communicate
between CPUs or GPUs
– Accelerators eg., Graphics Processing Units
Typical Compute Node
[Figure: block diagram of a typical compute node — the CPU connects to RAM (volatile storage) over the memory bus and to the IOH over the QuickPath Interconnect; the IOH provides PCI-express to the GPU and other PCI-e cards; the ICH, reached over the Direct Media Interface, handles SATA/USB non-volatile storage, the BIOS, and ethernet out to the network.]
CPU Architecture
● Composed of several complex processing cores, control elements
and high speed memory areas (eg., registers, L3 cache), as well as
vector elements including special registers
[Figure: simplified CPU block diagram — four cores sharing a cache, a memory controller, and I/O / PCIe interfaces.]
Shared + Distributed Memory
Parallelism
● Shared memory parallelism:
– is usually implemented with pThreads or directive-based programming (OpenMP)
– uses one or more cores in a CPU
● Distributed memory parallelism:
– involves one or more nodes (composed of CPUs + possibly GPUs) communicating with each other using a high speed network eg., InfiniBand
– network topology and fabric are critical to ensuring optimal communication
Nvidia GPU Streaming Multiprocessor
● GPUs run many light-weight threads at once; device composed of many more (simpler) cores than CPU
[Figure: streaming multiprocessor block diagram — 32768 x 32-bit registers; two warp schedulers, each with a dispatch unit; 2 x 16 CUDA cores (each with a dispatch port, operand collector, FPU, integer unit and result queue); 16 load/store units; 4 special function units; 64 kB shared memory / L1 cache; interconnect.]
Step 1: Prototype your problem
● Pick a numerical scripting language eg., Octave, a free Matlab-like environment
– Solid, well established, linear algebra based
● Code up a solution (eg., we'll consider ordinary kriging)
● Time all scopes/sections of your code to get a feel for
bottlenecks
● You can use the keyboard statement to set
breakpoints in your code for debugging purposes
Step 1: Prototype your problem
● Kriging is a geospatial statistical method, eg., predicting rainfall at locations where no measurements exist, based on surrounding measurements
● The solution involves:
– constructing the Gamma matrix
– solving a system of equations for every desired prediction location (sketched below)
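In symbols, a brief sketch using the same exponential semivariogram that appears in the code below:
gamma(h) = 10*(1 - exp(-h/3.33)) (the semivariogram model)
G(i,j) = gamma(dist(s_i, s_j)) (m x m matrix over the measured locations s_1..s_m, with data z)
g_p(i) = gamma(dist(s_i, p)) (right-hand side for a prediction location p)
solve G w = g_p, then pred(p) = sum_i w(i) z(i)
Full ordinary kriging additionally appends a row and column of ones plus a Lagrange multiplier to enforce sum_i w(i) = 1 (the N = m+1 in the later C++ listing hints at this); the Octave prototype below skips that detail.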
Step 1: Prototype your problem
function [w,G,g,pred] = krige()
% load input data & output prediction grid
load input.csv; load output.csv;
% init
…
% Gamma; m is size of input space, x,y are coordinates for available data z
for i=1:m
for j=1:m
G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))^2+(y(i)-y(j))^2)/3.33));
end
end
% matrix inversion
Ginv = inv(G);
% predictions; n is size of output space, xp,yp are prediction coordinates
% z is available data for x,y coordinates
for i=1:n
g(1:m) = 10.*(1-exp(-sqrt((xp(i)-x).^2+(yp(i)-y).^2)/3.33));
w=Ginv * g';
pred(i) = sum(w(1:m).*z);
end
Results 1
● Use tic/toc statements around code blocks for timing; the following times are for:
– Initialization
– Gamma construction
– Matrix inversion
– Solution
Octave:1> [a b c d]=krige();
Elapsed time is 0.079224 seconds.
Elapsed time is 40.9722 seconds.
Elapsed time is 0.742576 seconds.
Elapsed time is 10.6134 seconds.
● 80% of the time is spent in constructing the matrix → need to vectorize
● Interpreted languages like Octave benefit from removing loops and replacing
with array operations
– Loops are parsed every iteration by the interpreter
– Vectorizing code by using array operations may take advantage of vector
architecture in CPU
Step 2: Vectorize your Prototype
function [w,G,g,pred] = krige()
% load input data & output prediction grid
load input.csv; load output.csv;
% init
…
% Gamma
XI = (ones(m,1)*x)'; YI = (ones(m,1)*y)';
G(1:m,1:m) = 10.*(1-exp(-sqrt((XI-XI').^2+(YI-YI').^2)/3.33));
% matrix inversion
Ginv = inv(G);
% predictions
XP = (ones(m,1)*xp); YP = (ones(m,1)*yp);
XI = (ones(n,1)*x)'; YI = (ones(n,1)*y)';
ZI = (ones(n,1)*z)';
g(1:m,:) = 10.*(1-exp(-sqrt((XP-XI).^2+(YP-YI).^2)/3.33));
w=Ginv * g;
pred = sum(w(1:m,:).*ZI);
Results 2
octave:2> [a b c d]=krige();
Elapsed time is 0.0765891 seconds.
Elapsed time is 0.195605 seconds.
Elapsed time is 0.758174 seconds.
Elapsed time is 3.24861 seconds.
● Code is more than 15x faster, for a relatively small investment
● Vectorized code will have a higher memory overhead, due to the creation of temporary arrays; it's harder to read too :)
● When memory or compute time become unacceptable, there is no choice but to move to compiled code
● C/C++ are logical choices in a Linux environment
– Very stable, heavily used, Linux OS itself is written in C
– Expressive languages containing many innovations, algorithms and data
structures
– C++ is object oriented, allows for design of large sophisticated projects
Step 3 : Compiled Code
● Unlike a scripted language, C/C++ must be compiled to run on the CPU,
converting a human readable language into machine code
● Several compilers are available on the clusters including Intel, PGI and
the GNU compiler collection
● In the compilation and linking steps we must specify headers (with interfaces) and libraries (with functions) needed by our application
● Try to avoid reinventing the wheel; always use available libraries if you can, instead of reimplementing algorithms and data structures
● As opposed to scripting, you are now responsible for memory management eg., allocating on the heap (dynamically, at runtime) or on the stack (with sizes fixed at compile time); see the short sketch below
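A minimal, hypothetical fragment illustrating the difference (not part of the kriging code):

#include <cstddef>

void buffers(std::size_t m){
    double scratch[256];                 // stack: size fixed at compile time, freed automatically on return
    double *G = new double[m * m];       // heap: size chosen at runtime, must be released explicitly
    // ... fill and use G and scratch ...
    delete [] G;                         // forgetting this line is a memory leak (valgrind will report it)
}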
Step 3 : Compiled Code
● In porting Octave/Matlab code to C/C++ you should always consider using
these libraries at least:
– Armadillo, C++ wrappers for BLAS/LAPACK, syntax very similar to
Octave/Matlab
– BLAS/LAPACK itself
● BLAS == Basic Linear Algebra Subprograms
● LAPACK == Linear Algebra PACKage
● Both come in many optimized flavors eg., Intel MKL
● If you want to know more about Linux basics including writing/compiling C
code, you could check out HPC Essentials I
● If you want to know more about C++, you could check out HPC Essentials V
Step 3 : Compiled Code
#include "armadillo"
#include <mkl.h>
#include <iostream>
using namespace std;
using namespace arma;
int main(){
mat G; vec g;
//load data, initialize variables, calculate Gamma
for (int i=0; i<m; i++)
for (int j=0; j<m; j++){
G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
+(y(i)-y(j))*(y(i)-y(j)))/3.33));
}
char uplo = 'U'; int N = m+1; int info;
int * ipiv = new int[N]; double * work = new double[3*N];
// factorize using the LU decomp. routine from LAPACK
dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);
//solve
int nrhs=1; char trans='N';
for (int i=0; i<n; i++){
g.rows(0,m-1) = ...
dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
pred(i,0)=dot(z,g.rows(0,m-1));
…
}
Results 3
● Compiled code is comparable in speed to vectorized code, although we could
make some algorithmic changes to improve further:
– The Gamma matrix is symmetric, so there is no need to calculate values for j > i (ie., just calculate/store a triangular matrix)
– Calculating the inverse is expensive and inaccurate; it is better to factorize the matrix and use a direct solve, eg., forward/backward substitution (we did do this, but using the full matrix and an LU decomposition)
– Armadillo uses operator overloading & expression templates to allow a vectorized approach to programming, although we leave the loops in for the moment, to allow parallelization later
● If you have bugs in your code, use gdb to debug
● Always profile completely in order to solve all issues and get a complete
handle on your code
Important Code Profiling Methods
● Finding memory leaks: use valgrind (its default tool, memcheck)
● Poor memory access patterns/cache usage
– Use valgrind --tool=cachegrind to assess cache hits +
misses
● Heap memory usage
– Memory management has performance impact, assess with
valgrind --tool=massif
● And before you consider moving to parallel, develop a call
profile for your code eg., in terms of total instructions executed
for each scope, using valgrind --tool=callgrind
Amdahl's Law
● The problems in science we seek to solve are becoming increasingly large, as we go down in scale (eg., quantum chemistry) or up (eg., astrophysics)
● As a natural consequence, we seek both performance and scaling in our scientific applications; thus we parallelize as we run out of resources using a single processor
● We are limited by Amdahl's law, an expression of the maximum improvement of parallel code over serial:
speedup = 1 / ((1 - P) + P/N)
where P is the portion of application code we parallelize, and N is the number of processors; ie., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking
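As a quick worked example: with P = 0.9 and N = 16 processors, the best possible speedup is 1/(0.1 + 0.9/16) = 6.4x, and no matter how many processors we add it can never exceed 1/(1 - P) = 10x.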
Amdahl's Law
● Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors
Step 4 : Accelerate
● In general, not all algorithms are amenable to acceleration, and there is the communication bottleneck between CPU and GPU to overcome
● However, linear algebra operations are extremely efficient on a GPU; you can expect 2-10x over a whole CPU socket (ie., running all cores) for many operations
● The language for programming Nvidia GPUs is CUDA; much like C, but you need to know the architecture well and/or:
– Use libraries like cuBLAS (what we'll try)
– Use directive based programming in the form of OpenACC
– Use the OpenCL language (cross platform, but not supported by Nvidia as heavily as CUDA)
Step 4 : Accelerate
#include "armadillo"
#include <mkl.h>
#include <iostream>
#include <cuda.h>
using namespace std;
using namespace arma;
int main(){
mat G; vec g;
//load data, initialize variables, calculate Gamma as before
//factorize using the LU decomp. routine from LAPACK, as before
//allocate memory on GPU and transfer data
//solve on gpu; two steps, solve two triangular systems
cublasDtrsm(...);
cublasDtrsm(...);
//free memory on GPU and transfer data back
}
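To make those elided steps more concrete, here is a minimal sketch using the legacy cuBLAS API; it is an illustration only, not the code behind the timings below. It reuses N, G, g and ipiv from the compiled-code listing, applies the dgetrf row interchanges on the host first, and omits cublasInit() and all error checking (it also needs #include <cublas.h>):

// apply the row interchanges recorded by dgetrf to the right-hand side (ipiv is 1-based)
for (int i = 0; i < N; i++){
    double tmp = g(i); g(i) = g(ipiv[i]-1); g(ipiv[i]-1) = tmp;
}
// allocate device memory and copy the LU factors and the right-hand side over
double *dG, *dg;
cudaMalloc((void**)&dG, N * N * sizeof(double));
cudaMalloc((void**)&dg, N * sizeof(double));
cublasSetMatrix(N, N, sizeof(double), G.memptr(), N, dG, N);
cublasSetVector(N, sizeof(double), g.memptr(), 1, dg, 1);
// two triangular solves: L*y = P*g (unit lower triangle), then U*x = y (upper triangle)
cublasDtrsm('L', 'L', 'N', 'U', N, 1, 1.0, dG, N, dg, N);
cublasDtrsm('L', 'U', 'N', 'N', N, 1, 1.0, dG, N, dg, N);
// copy the solution back and release device memory
cublasGetVector(N, sizeof(double), dg, 1, g.memptr(), 1);
cudaFree(dG); cudaFree(dg);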
Results 4
● Minimal code changes, recompilation using nvcc compiler, available by loading
any CUDA module on lion-GA (where you'll also need to run)
● We still perform the matrix factorization on the CPU side, then move data to the GPU to perform the solve in two steps
● This overall solution is roughly 6x faster than the single-CPU-thread solution presented previously, for larger data sizes
● General rule of thumb → minimize communication between CPU + GPU, use the GPU when you can occupy all SMs on the device, and don't bother for small problems, where the cost of communication outweighs the benefits
● There is ongoing work on porting LAPACK routines to the GPU eg., check out our LU/QR work, or the significant MAGMA project from UT/ORNL
● If you're interested in trying CUDA and GPUs further, you could check out HPC
Essentials IV
Step 5: Shared memory
● We've determined through profiling that it's worthwhile parallelizing our loops
● By linking against Intel MKL we also have access to threaded functions
● We will simply use OpenMP directive-based programming for this example
● We are generally responsible for deciding which variables need to be shared by threads, and which variables should be privately owned by threads
● If we fail to make these distinctions where needed, we end up with race conditions
– Threads operate on data in an uncoordinated fashion, and data elements have unpredictable/erroneous values (see the sketch after this slide)
● Outside the scope of this talk, but just as pernicious, is deadlock, when threads (and indeed whole programs) hang due to improper coordination
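As a tiny, self-contained illustration of such a race and the OpenMP fix (the array pred and length n here are placeholders, not the kriging variables):

#include <omp.h>

// RACE: 'total' is shared by default, so threads update it unsynchronized
double sum_race(const double *pred, int n){
    double total = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        total += pred[i];           // result is unpredictable
    return total;
}

// FIX: reduction(+:total) gives each thread a private copy, combined at the end
double sum_fixed(const double *pred, int n){
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += pred[i];
    return total;
}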
Step 5 : Shared Memory
#include "armadillo"
#include <mkl.h>
#include <iostream>
#include <omp.h>
...
int main(){
...
//load data, initialize variables, calculate Gamma
#pragma omp parallel for
for (int i=0; i<m; i++)
for (int j=0; j<m; j++){
G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
+(y(i)-y(j))*(y(i)-y(j)))/3.33));
}
// factorize using the LU decomp. routine from LAPACK
dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);
//initialize data for solve, for all right hand sides
#pragma omp parallel for
for (int i=0; i<n; i++)
for (int j=0; j<m; j++)
g(i,j) = ...
//multithreaded solve for all RHS
dgetrs(&trans,&N,&n,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
//assemble predictions
Results 5
● When compiling and linking, you must specify -fopenmp for the GNU compiler, or -openmp for Intel
● At runtime, you need to export the environment variable OMP_NUM_THREADS, set to the desired number of threads
● Setting this to more than the total number of cores you have access to will result in severe performance degradation
● Outside the scope of this talk, but often need to tune CPU
affinity for best performance
● For more information, please check out HPC Essentials II
Step 6 : Distributed Memory
● A good motivation for moving to distributed memory is, in a simple case,
a shortage of memory on a single node
● From a practical perspective, scheduling distributed CPU cores is easier
than shared memory cores ie., your PBS queuing time is shorter :)
● We will use the Message Passing Interface (MPI), a venerable standard developed over the last 20 years or so, with language bindings for C and Fortran
● On the clusters, we use OpenMPI (not to be confused with OpenMP);
once you load the module, by using the wrapper compilers, compilation
and linking paths are taken care of for you
● Aside from needing to link with other libraries like Intel MKL, compiling and linking a C++ MPI program can be as simple as:
module load openmpi
mpic++ my_program.cpp
Step 6 : Distributed Memory
#include "armadillo"
#include <mkl.h>
#include <iostream>
#include <mpi.h>
int main(int argc, char * argv[]){
int rank, size;
MPI_Status status;
MPI_Init(&argc, &argv);
// size== total processes in this MPI_COMM_WORLD pool
MPI_Comm_size(MPI_COMM_WORLD, &size);
// rank== my identifier in pool
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
// load data, initialize variables, calculate Gamma, perform factorization
// solve just for my portion of predictions
int lower = (rank * n) / size;
int upper = ((rank + 1) * n) / size;
for (int i=lower; i<upper; i++){
g.rows(0,m-1) = ...
dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
pred(i,0)=dot(z,g.rows(0,m-1));
…
}
//gather results back to root process
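One possible form of that final gather, as a hedged sketch rather than the original code: it assumes pred holds n doubles contiguously on every rank (via Armadillo's memptr()), reuses rank, size, n, lower and upper from above, and needs #include <vector>.

// every rank computed pred(lower..upper-1); collect the pieces on rank 0
std::vector<int> counts(size), displs(size);
for (int r = 0; r < size; r++){
    int lo = (r * n) / size, hi = ((r + 1) * n) / size;
    counts[r] = hi - lo;           // how many predictions rank r sends
    displs[r] = lo;                // where they land in the gathered array
}
arma::vec all_pred;
if (rank == 0) all_pred.set_size(n);
MPI_Gatherv(pred.memptr() + lower, upper - lower, MPI_DOUBLE,
            rank == 0 ? all_pred.memptr() : NULL,
            counts.data(), displs.data(), MPI_DOUBLE,
            0, MPI_COMM_WORLD);
MPI_Finalize();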
Results 6
● When you run an MPI job using PBS, you need to use the mpirun script to set up your environment and spawn processes on the different CPUs allocated to you by the scheduler:
mpirun my_application.x
● Here we simply divided the output space between different processors ie., each processor in the pool calculated a portion of the predictions
● However, a collective call was needed after the solve steps (sketched after the listing above), a gather statement to bring all the results to the root process (with rank 0)
● This was the only communication between different processes throughout the calculation ie., this was close to embarrassingly parallel → no communication, great scaling with processors
● Despite the high bandwidths available on modern networks, the cost of latency is generally the limiting factor in using distributed memory parallelism
● For more on MPI you could check out HPC Essentials III
Review
● Let's review some of the things we've
discussed
● I'll splash up several scenarios and we'll attempt to score them
Score Card
Score | What this feels like    | Your HPC vehicle
+5    | Civilized society       | Something German
+4    | Evening with friends    | American Muscle
+3    | Favorite show renewed   | A Honda
+2    | Twinkies are back       | Sonata
+1    | A fairy gets its wings  | Camry
 0    | meh                     | Corolla
-1    | A fairy dies            | Neon
-2    | Twinkies are gone       | Pinto
-3    | Favorite show canceled  | Le Barron
-4    | Evening with Facebook   | Yugo
-5    | Zombie Apocalypse       | Abrams tank
Scenario 1
● You get an account for hammer, maybe install
and use Exceed onDemand, load and use the
Matlab/Octave module after logging in
Scenario 1
● Score : 0
● Meh. You'll run a little faster, probably have
more memory. But this isn't HPC and you
could almost do this on your laptop. You're
driving a Corolla, doing 45 mph in the fast
lane.
Scenario 2
● You vectorize your loops and/or create a
compiled MEX (Matlab) or OCT (Octave)
function
Scenario 2
● A fairy gets its wings! You move up to the Camry!
● By vectorizing loops you use internal functions that are
interpreted once at runtime, and under the hood may even
get to utilize the vector architecture of the CPU.
● Tricky loops, eg., those with conditionals, are best converted to MEX/OCT functions; for Octave you want the mkoctfile utility (see the sketch after this slide)
● If compiling new functions, don't forget to link with HPC
libraries eg., Intel MKL or AMD ACML where possible.
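For example, a hypothetical OCT-file (call it gamma_fill.cc; built with mkoctfile gamma_fill.cc and called from Octave as G = gamma_fill(x, y)) that moves the Gamma loop of the Step 1 prototype into C++ might look like this sketch:

#include <octave/oct.h>
#include <cmath>

DEFUN_DLD (gamma_fill, args, , "G = gamma_fill (x, y): Gamma matrix for the kriging example")
{
  // pull the coordinate vectors out of the argument list
  NDArray x = args(0).array_value ();
  NDArray y = args(1).array_value ();
  octave_idx_type m = x.numel ();
  Matrix G (m, m);
  for (octave_idx_type i = 0; i < m; i++)
    for (octave_idx_type j = 0; j < m; j++)
      {
        double dx = x(i) - x(j), dy = y(i) - y(j);
        G(i,j) = 10.0 * (1.0 - std::exp (-std::sqrt (dx*dx + dy*dy) / 3.33));
      }
  return octave_value (G);
}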
Scenario 3
● Instead of submitting a PBS job you do all this
on the head node of a batch cluster
Scenario 3
● A fairy dies! You drive a Neon at 35 mph in the HPC fast lane!
● Things could be worse for you, but using memory and CPU on head nodes can grind processes like parallel filesystems to a halt, making other users and the sys admins feel downright melancholy. Screens freeze, commands return at the speed of pitchblende.
● If you need dedicated resources and/or need to run for more than a few minutes, please use an interactive cluster or PBS:
https://rcc.its.psu.edu/user_guides/system_utilities/pbs/
Scenario 4
● You use Armadillo to port your Matlab/Octave
code to C++, and use version control to
manage your project (eg., SVN, git/github)
Scenario 4
● Twinkies are back! You think Hyundai finally have it
together and splash out on the Sonata!
● Vectorized Octave/Matlab code is hard to beat. However, you may wish to scale outside the node someday, integrate into an existing C++ project, or perhaps use rich C++ objects (found in Boost, for example), so this is the way to go. Actually there are myriad reasons.
● Don't forget to compile first with '-Wall -g' options,
then when it's working and you get the right answer,
optimize!
Scenario 5
● You port your Matlab/Octave code to C++
without use of libraries or version control
Scenario 5
● No Twinkies! You drive a Pinto that bursts into flames immediately!
● Reinventing the wheel is a very bad, time-consuming idea. Armadillo uses expression templates to create very efficient code at compile time; without it you could end up with an inefficient mess.
● Neglect to use version control and you will surely regret it, probably right around a publication deadline too. And while we're on the topic, please back up your data.
Scenario 6
● You target sections of your version controlled
C++ code for acceleration, after understanding
it better by profiling using valgrind
--tool=callgrind
Scenario 6
● Score : +3
● Futurama is back! You get a new Civic!
● Believe the hype, GPUs are here to stay and will
accelerate many algorithms, especially linear algebra.
● Take advantage of libraries like cuBLAS before rolling your own code; check in at the CUDA Zone to see what applications and code examples exist already. Get familiar with CUDA, we are an Nvidia CUDA Research Center:
https://research.nvidia.com/content/penn-state-crc-summary
Scenario 7
● Your non-version-controlled C++ code has bad memory access patterns and memory leaks, and creates many temporaries.
● Score : -3
● Bye-Bye Futurama ! Hello Le Barron!
● Ignore good memory and cache access patterns at
your peril
● Use valgrind (default) and valgrind --tool=cachegrind
to learn more. Avoid temporaries by using libraries
like Armadillo, or learning and using expression
templates.
Scenario 8
● Scenario 6 and you introduce shared memory parallelism using
OpenMP. You look into and tune CPU affinity.
● Score : +4
● You provide Babette's feast for your friends and elicit a penchant for the Ford Mustang.
● OpenMP is relatively easy eg., a pragma around a for loop.
● Don't forget to check thread performance with valgrind
--tool=helgrind
● Now your code is a thing of beauty, properly version controlled,
profiled completely (well you could run massif as well) and
you're able to use all the compute hardware in a single
heterogeneous node.
Scenario 9
● Scenario 7 AND you decide to thrash disk. Plus you try to
write >= 1M files
● Score : -4
● Yugo is only cool in that Portlandia bit, and Facebook was
only good for a brief period in 2006.
● Disk I/O kills in an HPC context; plus, the maximum file limit at the time of writing is 1M
– You give control to the kernel and your application ceases to
execute for some time (a voluntary context switch)
– You might be contending for disk with other processes
– You introduce the lower memory bandwidth (BW) and higher
latency (Delta) of disk versus system memory
– Parallel filesystems → all of the above plus network BW and Delta
Scenario 10
● Scenario 8 AND you decide to scale outside the node with MPI. You look into design patterns; GOF (the Gang of Four book) is on the nightstand.
● Score : +5
● You are a cultured individual and you drive a
German vehicle. You care about engineering.
● Don't forget Amdahl's law
● Even with InfiniBand (IB) networks, minimize communication, and consider new paradigms in distributed memory parallelism (check out MPI revision 3).
Scenario 11
● Scenario 9 and you do it all on the head node,
including OpenMP for 1% of your loops. You also
export OMP_NUM_THREADS=20 and you have
10 cores. There's no coordination between
threads, races all over the place. You have
about 40 MPI processes trying to read the same
file as well, without parallel file I/O.
Scenario 11
● Score : -5
● The end is nigh and you're taking out zombies and HPC infrastructure in your Abrams tank, moving at 1 mph, getting 0.2 miles to the gallon
● You ignored all the other advice, and now you throw
out Amdahl's law too.
● AND you have no coordination between any of your
threads or processes.
● AND you're trying to run more threads and processes
than the system can support concurrently, so context
switching takes place furiously.
● Expect a not-so-rosy email from the sys admin :-)
Summary
● High performance computing is leveraging one or more
forms of parallelism in a performant way
● Often the best gains come from writing vectorized Octave code, or making algorithmic changes
● Before you parallelize, fully profile your code and keep
Amdahl's law in mind
● All forms of parallelism have their limitations, but in
general:
– GPU accelerators are excellent for linear algebra
– Shared memory using OpenMP works well for simple, nested
loops
– Consider using MPI (distributed memory) for 'big data', but limit communication
Contenu connexe

Tendances

Конверсия управляемых языков в неуправляемые
Конверсия управляемых языков в неуправляемыеКонверсия управляемых языков в неуправляемые
Конверсия управляемых языков в неуправляемыеPlatonov Sergey
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to ChainerSeiya Tokui
 
Hpx runtime system
Hpx runtime systemHpx runtime system
Hpx runtime systemCOMAQA.BY
 
Matrix transposition
Matrix transpositionMatrix transposition
Matrix transposition동호 이
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerPlatonov Sergey
 
Programming at Compile Time
Programming at Compile TimeProgramming at Compile Time
Programming at Compile TimeemBO_Conference
 
HPX: C++11 runtime система для параллельных и распределённых вычислений
HPX: C++11 runtime система для параллельных и распределённых вычисленийHPX: C++11 runtime система для параллельных и распределённых вычислений
HPX: C++11 runtime система для параллельных и распределённых вычисленийPlatonov Sergey
 
Linuxconf 2011 parallel languages talk
Linuxconf 2011 parallel languages talkLinuxconf 2011 parallel languages talk
Linuxconf 2011 parallel languages talkLenz Gschwendtner
 
DUSK - Develop at Userland Install into Kernel
DUSK - Develop at Userland Install into KernelDUSK - Develop at Userland Install into Kernel
DUSK - Develop at Userland Install into KernelAlexey Smirnov
 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerLinaro
 
Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindingsDmitriy Lyubimov
 
DConf 2016: Bitpacking Like a Madman by Amaury Sechet
DConf 2016: Bitpacking Like a Madman by Amaury SechetDConf 2016: Bitpacking Like a Madman by Amaury Sechet
DConf 2016: Bitpacking Like a Madman by Amaury SechetAndrei Alexandrescu
 
Example uses of gpu compute models
Example uses of gpu compute modelsExample uses of gpu compute models
Example uses of gpu compute modelsPedram Mazloom
 
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiA look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiCysinfo Cyber Security Community
 

Tendances (20)

Конверсия управляемых языков в неуправляемые
Конверсия управляемых языков в неуправляемыеКонверсия управляемых языков в неуправляемые
Конверсия управляемых языков в неуправляемые
 
Introduction to Chainer
Introduction to ChainerIntroduction to Chainer
Introduction to Chainer
 
Hpx runtime system
Hpx runtime systemHpx runtime system
Hpx runtime system
 
3DD 1e Linux
3DD 1e Linux3DD 1e Linux
3DD 1e Linux
 
Matrix transposition
Matrix transpositionMatrix transposition
Matrix transposition
 
Address/Thread/Memory Sanitizer
Address/Thread/Memory SanitizerAddress/Thread/Memory Sanitizer
Address/Thread/Memory Sanitizer
 
Programming at Compile Time
Programming at Compile TimeProgramming at Compile Time
Programming at Compile Time
 
HPX: C++11 runtime система для параллельных и распределённых вычислений
HPX: C++11 runtime система для параллельных и распределённых вычисленийHPX: C++11 runtime система для параллельных и распределённых вычислений
HPX: C++11 runtime система для параллельных и распределённых вычислений
 
Hadoop classes in mumbai
Hadoop classes in mumbaiHadoop classes in mumbai
Hadoop classes in mumbai
 
Linuxconf 2011 parallel languages talk
Linuxconf 2011 parallel languages talkLinuxconf 2011 parallel languages talk
Linuxconf 2011 parallel languages talk
 
Survey onhpcs languages
Survey onhpcs languagesSurvey onhpcs languages
Survey onhpcs languages
 
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
Deep Learning for Computer Vision: Software Frameworks (UPC 2016)
 
DUSK - Develop at Userland Install into Kernel
DUSK - Develop at Userland Install into KernelDUSK - Develop at Userland Install into Kernel
DUSK - Develop at Userland Install into Kernel
 
Coding style for good synthesis
Coding style for good synthesisCoding style for good synthesis
Coding style for good synthesis
 
Q4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-VectorizerQ4.11: Using GCC Auto-Vectorizer
Q4.11: Using GCC Auto-Vectorizer
 
Dma
DmaDma
Dma
 
Mahout scala and spark bindings
Mahout scala and spark bindingsMahout scala and spark bindings
Mahout scala and spark bindings
 
DConf 2016: Bitpacking Like a Madman by Amaury Sechet
DConf 2016: Bitpacking Like a Madman by Amaury SechetDConf 2016: Bitpacking Like a Madman by Amaury Sechet
DConf 2016: Bitpacking Like a Madman by Amaury Sechet
 
Example uses of gpu compute models
Example uses of gpu compute modelsExample uses of gpu compute models
Example uses of gpu compute models
 
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul PillaiA look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
A look into the sanitizer family (ASAN & UBSAN) by Akul Pillai
 

Similaire à HPC Essentials 0

NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA Japan
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Sparksamthemonad
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleSri Ambati
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLCommand Prompt., Inc
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLMark Wong
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiDatabricks
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdfTigabu Yaya
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin敬倫 林
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesJeff Larkin
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Preferred Networks
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...David Walker
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APISri Ambati
 
Introduction to MPI
Introduction to MPIIntroduction to MPI
Introduction to MPIyaman dua
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with SparkRoger Rafanell Mas
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine ParallelismSri Prasanna
 
Andriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tipsAndriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tipsOWASP Kyiv
 
Compiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual MachinesCompiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual MachinesEelco Visser
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaFerdinand Jamitzky
 

Similaire à HPC Essentials 0 (20)

NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読み
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
H2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt DowleH2O Design and Infrastructure with Matt Dowle
H2O Design and Infrastructure with Matt Dowle
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
 
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiNatural Language Processing with CNTK and Apache Spark with Ali Zaidi
Natural Language Processing with CNTK and Apache Spark with Ali Zaidi
 
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdflecture_GPUArchCUDA04-OpenMPHOMP.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
Week1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC BeginWeek1 Electronic System-level ESL Design and SystemC Begin
Week1 Electronic System-level ESL Design and SystemC Begin
 
Cray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best PracticesCray XT Porting, Scaling, and Optimization Best Practices
Cray XT Porting, Scaling, and Optimization Best Practices
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
The Effect of Hierarchical Memory on the Design of Parallel Algorithms and th...
 
GBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O APIGBM in H2O with Cliff Click: H2O API
GBM in H2O with Cliff Click: H2O API
 
hybrid-programming.pptx
hybrid-programming.pptxhybrid-programming.pptx
hybrid-programming.pptx
 
Introduction to MPI
Introduction to MPIIntroduction to MPI
Introduction to MPI
 
Profiling & Testing with Spark
Profiling & Testing with SparkProfiling & Testing with Spark
Profiling & Testing with Spark
 
Intermachine Parallelism
Intermachine ParallelismIntermachine Parallelism
Intermachine Parallelism
 
Andriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tipsAndriy Shalaenko - GO security tips
Andriy Shalaenko - GO security tips
 
Compiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual MachinesCompiler Construction | Lecture 12 | Virtual Machines
Compiler Construction | Lecture 12 | Virtual Machines
 
Gpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cudaGpu workshop cluster universe: scripting cuda
Gpu workshop cluster universe: scripting cuda
 

Dernier

COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?IES VE
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024D Cloud Solutions
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 

Dernier (20)

COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 

HPC Essentials 0

  • 1. HPC Essentials Prequel: From 0 to HPC in one hour OR five ways to do Kriging Bill Brouwer Research Computing and Cyberinfrastructure (RCC), PSU wjb19@psu.edu
  • 2. Outline ● Step 0 – Navigating RCC resources ● Step 1 – Ordinary Kriging in Octave ● Step 2 – Vectorized octave ● Step 3 – Compiled code ● Digression on Profiling &Amdahl's Law ● Step 4 – Accelerating using GPU ● Step 5 – Shared Memory ● Step 6 – Distributed Memory ● Scenarios & Summary wjb19@psu.edu
  • 3. Step 0 ● Get an account on our systems ● Check out the system details, or let us help pick one for you ● They are Linux systems, you'll need some basic commandline knowledge – You may want to check out HPC Essentials I seminar, Unix/C overview ● We use the modules system for software, you'll need to load what you use eg., to see a list of everything available: module av eg., load octave: module load octave To see which modules you have in your environment module list wjb19@psu.edu
  • 4. Step 0 ● There are two main types of systems: – Interactive, share a single machine with one or more users, including memory and CPUs, used for ● Debugging ● Benchmarking ● Using a program with a graphical user interface – You'll need to log in using Exceed onDemand ● Running for short periods of time wjb19@psu.edu
  • 5. Step 0 ● Batch systems – Get dedicated memory and CPUs for period of time ● Maximum time is generally 24 hours ● Maximum memory and CPUs depends on the cluster – You log in to a head node, from which you submit a request eg., an interactive session for 1 node, 1 processor per node (ppn) and 4gb total memory: qsub -I -l walltime=24:00:00 -l mem=4gb -l nodes=1:ppn=1 ● To check the status of your request: qstat -u <your_psu_id> wjb19@psu.edu
  • 6. Step 0 ● Other notes on clusters: – Please never run anything significant on head nodes, use PBS to submit a job instead – If you request more than 1 CPU, remember your code/workflow needs to be able to either ● Use multiple CPUs on a single node (set ppn parameter) using some form of shared memory parallelism ● Use multiple CPUs on multiple nodes (set combination node &ppn parameters) using some form of distributed memory parallelism ● A combination of the above ● Parallelism applied in an optimal way is high performance computing wjb19@psu.edu
  • 7. High Performance Computing ● Using one or more forms of parallelism to improve the performance and scaling of your code – Vector architecture eg., SSE/AVX in Intel CPU – Shared memory parallelism eg., using multiple cores of CPU – Distributed memory parallelism eg., using Message Passing Interface (MPI) to communicate between CPUs or GPUs – Accelerators eg., Graphics Processing Units wjb19@psu.edu
  • 8. Typical Compute Node CPU IOH ICH QuickPath Interconnect memory bus RAM PCI-express GPU PCI-e cards SATA/USB Direct Media Interface non-volatile storage BIOS ethernet NETWORK volatile storage wjb19@psu.edu
  • 9. CPU Architecture ● Composed of several complex processing cores, control elements and high speed memory areas (eg., registers, L3 cache), as well as vector elements including special registers wjb19@psu.edu Core Core Core Core Cache Memory Controller I/O PCIe
  • 10. Shared + Distributed Memory Parallelism ● Shared memory parallelism is : – usually implemented with pThreads or directive based programming (OpenMP) – uses one or more cores in CPU ● Distributed memory parallelism is: – one or more nodes (composed of CPUs + possibly GPUs) communicating with each other using high speed network eg., Infiniband – network topology and fabric critical to ensuring optimal communication wjb19@psu.edu
  • 11. Nvidia GPU Streaming Multiprocessor CUDA core wjb19@psu.edu 32768x32 bit registers interconnect 64kB shared mem/L1 Cache Dispatch unit Dispatch unit Warp scheduler Warp Scheduler Special Function Unit x4 Load/Store Unit x16 Core x 16 x 2 Dispatch Port FPU Int U Operand Collector Result Queue ● GPUs run many light-weight threads at once; device composed of many more (simpler) cores than CPU
  • 12. Step 1: Prototype your problem ● Pick a numerical scripting language eg., Octave, free version of matlab – Solid, well established, linear algebra based ● Code up a solution (eg., we'll consider ordinary kriging) ● Time all scopes/sections of your code to get a feel for bottlenecks ● You can use the keyboard statement to set breakpoints in your code for debugging purposes wjb19@psu.edu
  • 13. Step 1: Prototype your problem ● Kriging is a geospatial statistical method eg., predicting rainfall for locations where no measurements exist, based on surrounding measurements ● Solution involves: – constructing Gamma matrix – solve system of equations for every desired prediction location wjb19@psu.edu
  • 14. Step 1: Prototype your problem function [w,G,g,pred] = krige() % load input data &output prediction grid load input.csv; load output.csv; % init … % Gamma; m is size of input space, x,y are coordinates for available data z for i=1:m for j=1:m G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))^2+(y(i)-y(j))^2)/3.33)); end end % matrix inversion Ginv = inv(G); % predictions; n is size of output space, xp,yp are prediction coordinates % z is available data for x,y coordinates for i=1:n g(1:m) = 10.*(1-exp(-sqrt((xp(i)-x).^2+(yp(i)-y).^2)/3.33)); w=Ginv * g'; pred(i) = sum(w(1:m).*z); end wjb19@psu.edu
  • 15. Results 1 ● Use tic/toc statements around code blocks for timing; following times are for: – Initialization – Gamma construction – Matrix inversion – Solution Octave:1> [a b c d]=krige(); Elapsed time is 0.079224 seconds. Elapsed time is 40.9722 seconds. Elapsed time is 0.742576 seconds. Elapsed time is 10.6134 seconds. ● 80% of the time is spent in constructing the matrix → need to vectorize ● Interpreted languages like Octave benefit from removing loops and replacing with array operations – Loops are parsed every iteration by the interpreter – Vectorizing code by using array operations may take advantage of vector architecture in CPU wjb19@psu.edu
  • 16. Step 2: Vectorize your Prototype function [w,G,g,pred] = krige() % load input data &output prediction grid load input.csv; load output.csv; % init … % Gamma XI = (ones(m,1)*x)'; YI = (ones(m,1)*y)'; G(1:m,1:m) = 10.*(1-exp(-sqrt((XI-XI').^2+(YI-YI').^2)/3.33)); % matrix inversion Ginv = inv(G); % predictions XP = (ones(m,1)*xp); YP = (ones(m,1)*yp); XI = (ones(n,1)*x)'; YI = (ones(n,1)*y)'; ZI = (ones(n,1)*z)'; g(1:m,:) = 10.*(1-exp(-sqrt((XP-XI).^2+(YP-YI).^2)/3.33)); w=Ginv * g; pred = sum(w(1:m,:).*ZI); wjb19@psu.edu
  • 17. Results 2 octave:2> [a b c d]=krige(); Elapsed time is 0.0765891 seconds. Elapsed time is 0.195605 seconds. Elapsed time is 0.758174 seconds. Elapsed time is 3.24861 seconds. ● Code is more than 15x times faster, for a relatively small investment ● Vectorized code will have a higher memory overhead, due to the creation of temporary arrays; harder to read too :) ● When memory or compute time become unacceptable, no choice but to move to compiled code ● C/C++ are logical choices in a Linux environment – Very stable, heavily used, Linux OS itself is written in C – Expressive languages containing many innovations, algorithms and data structures – C++ is object oriented, allows for design of large sophisticated projects wjb19@psu.edu
  • 18. Step 3 : Compiled Code ● Unlike a scripted language, C/C++ must be compiled to run on the CPU, converting a human readable language into machine code ● Several compilers are available on the clusters including Intel, PGI and the GNU compiler collection ● In compilation and linking steps we must specify headers (with interfaces) and libraries (with functions) need by our application ● Try to avoid reinventing the wheel, always use available libraries if you can instead of reimplementing algorithms, data structures ● As opposed to scripting, now responsible for memory management eg., allocating on the heap (dynamically at runtime) or on the stack (statically at compile time) wjb19@psu.edu
  • 19. Step 3 : Compiled Code ● In porting Octave/Matlab code to C/C++ you should always consider using these libraries at least: – Armadillo, C++ wrappers for BLAS/LAPACK, syntax very similar to Octave/Matlab – BLAS/LAPACK itself ● BLAS==Basic Linear Algebra ● LAPACK==Linear Algebra PACKage ● Both come in many optimized flavors eg., Intel MKL ● If you want to know more about Linux basics including writing/compiling C code, you could check out HPC Essentials I ● If you want to know more about C++, you could check out HPC Essentials V wjb19@psu.edu
  • 20. Step 3 : Compiled Code #include "armadillo" #include <mkl.h> #include <iostream> using namespace std; using namespace arma; int main(){ mat G; vec g; //load data, initialize variables, calculate Gamma for (int i=0; i<m; i++) for (int j=0; j<m; j++){ G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j)) +(y(i)-y(j))*(y(i)-y(j)))/3.33)); } char uplo = 'U'; int N = m+1; int info; int * ipiv = new int[N]; double * work = new double[3*N]; // factorize using the LU decomp. routine from LAPACK dgetrf(&N, &N, G.memptr(), &N, ipiv, &info); //solve int nrhs=1; char trans='N'; for (int i=0; i<n; i++){ g.rows(0,m-1) = ... dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info); pred(i,0)=dot(z,g.rows(0,m-1)); … } wjb19@psu.edu
  • 21. Results 3 ● Compiled code is comparable in speed to vectorized code, although we could make some algorithmic changes to improve further: – The Gamma matrix is symmetric, no need to calculate values for j >= i (ie., just calculate/store a triangular matrix) – Calculating the inverse is expensive and inaccurate, better to (for eg.,) factorize a matrix and use direct solve eg., using forward/backward substitution (we did do this, but using full matrix/LU decomp.) – Armadillo uses operator overloading &expression templates to allow a vectorized approach to programming, although we leave loops in for the moment, to allow parallelization later ● If you have bugs in your code, use gdb to debug ● Always profile completely in order to solve all issues and get a complete handle on your code wjb19@psu.edu
Important Code Profiling Methods
● To find memory leaks, use valgrind
● Poor memory access patterns/cache usage
– Use valgrind --tool=cachegrind to assess cache hits + misses
● Heap memory usage
– Memory management has a performance impact; assess it with valgrind --tool=massif
● And before you consider moving to parallel, develop a call profile for your code eg., in terms of total instructions executed for each scope, using valgrind --tool=callgrind
wjb19@psu.edu
Amdahl's Law
● The problems in science we seek to solve are becoming increasingly large, as we go down in scale (eg., quantum chemistry) or up (eg., astrophysics)
● As a natural consequence, we seek both performance and scaling in our scientific applications, thus we parallelize as we run out of resources using a single processor
● We are limited by Amdahl's law, an expression of the maximum speedup of parallel code over serial:
speedup = 1 / ((1 - P) + P/N)
where P is the portion of application code we parallelize, and N is the number of processors ie., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking
wjb19@psu.edu
Amdahl's Law
● Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors
wjb19@psu.edu
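To make the diminishing returns concrete, here is a worked example using the formula above (the numbers are purely illustrative, not measurements from the kriging code): with P = 0.95, 16 processors give a speedup of 1/((1 - 0.95) + 0.95/16) ≈ 9.1, 64 processors give ≈ 15.4, and even in the limit N → ∞ the speedup is capped at 1/(1 - P) = 20.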
Step 4 : Accelerate
● In general not all algorithms are amenable, and there is the communication bottleneck between CPU and GPU to overcome
● However, linear algebra operations are extremely efficient on the GPU; you can expect 2-10x over a whole CPU socket (ie., running all cores) for many operations
● The language for programming Nvidia series GPUs is CUDA; much like C, but you need to know the architecture well and/or:
– Use libraries like cuBLAS (what we'll try)
– Use directive based programming in the form of OpenACC
– Use the OpenCL language (cross platform, but not as heavily supported by Nvidia as CUDA)
wjb19@psu.edu
Step 4 : Accelerate
#include "armadillo"
#include <mkl.h>
#include <iostream>
#include <cuda.h>

using namespace std;
using namespace arma;

int main(){
    mat G;
    vec g;

    //load data, initialize variables, calculate Gamma as before

    //factorize using the LU decomp. routine from LAPACK, as before

    //allocate memory on GPU and transfer data

    //solve on gpu; two steps, solve two triangular systems
    cublasDtrsm(...);
    cublasDtrsm(...);

    //free memory on GPU and transfer data back
}
wjb19@psu.edu
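The elided calls above follow the standard cuBLAS host-side pattern: create a handle, copy the factorized matrix and right-hand sides to the device, solve the two triangular systems (L then U from the LU factorization), and copy the result back. A sketch using the cuBLAS v2 API is shown below; the helper name, sizes, variable names and the omitted row-pivot handling are assumptions for illustration, not the exact code used on the cluster.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// sketch: solve (L*U)*X = B on the GPU, where A holds the LU factors
// (column major, leading dimension N) and B holds nrhs right-hand sides;
// the row permutation from ipiv is ignored here for brevity
void gpu_lu_solve(const double *A, double *B, int N, int nrhs){
    cublasHandle_t handle;
    cublasCreate(&handle);

    double *dA, *dB;
    cudaMalloc((void**)&dA, (size_t)N*N*sizeof(double));
    cudaMalloc((void**)&dB, (size_t)N*nrhs*sizeof(double));

    // host -> device transfers
    cublasSetMatrix(N, N, sizeof(double), A, N, dA, N);
    cublasSetMatrix(N, nrhs, sizeof(double), B, N, dB, N);

    const double one = 1.0;
    // forward solve with the unit lower triangular factor L
    cublasDtrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                CUBLAS_OP_N, CUBLAS_DIAG_UNIT, N, nrhs, &one, dA, N, dB, N);
    // back solve with the upper triangular factor U
    cublasDtrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER,
                CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT, N, nrhs, &one, dA, N, dB, N);

    // device -> host transfer of the solution
    cublasGetMatrix(N, nrhs, sizeof(double), dB, N, B, N);

    cudaFree(dA);
    cudaFree(dB);
    cublasDestroy(handle);
}

Such a file would typically be compiled with nvcc and linked against -lcublas; the slide's original calls use the older (handle-free) cuBLAS interface, but the overall transfer/solve/transfer-back structure is the same.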
Results 4
● Minimal code changes; recompilation using the nvcc compiler, available by loading any CUDA module on lion-GA (where you'll also need to run)
● We still perform the matrix factorization on the CPU side, and move data to the GPU to perform the solve in two steps
● This overall solution is roughly 6x faster than the single-CPU-thread solution presented previously, for larger data sizes
● General rule of thumb → minimize communication between CPU and GPU, use the GPU when you can occupy all SMPs per device, and don't bother for small problems, where the cost of communication outweighs the benefits
● There is ongoing work in porting LAPACK routines to the GPU eg., check out our LU/QR work, or the significant MAGMA project from UT/ORNL
● If you're interested in trying CUDA and GPUs further, you could check out HPC Essentials IV
wjb19@psu.edu
Step 5 : Shared Memory
● We've determined through profiling that it's worthwhile parallelizing our loops
● By linking against Intel MKL we also have access to threaded functions
● We will simply use OpenMP directive based programming for this example
● We are generally responsible for deciding which variables need to be shared by threads, and which variables should be privately owned by threads
● If we fail to make these distinctions where needed, we end up with race conditions (see the small illustration below)
– Threads operate on data in an uncoordinated fashion, and data elements end up with unpredictable/erroneous values
● Outside the scope of this talk, but just as pernicious, is deadlock, where threads (and indeed whole programs) hang due to improper coordination
wjb19@psu.edu
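As a toy illustration of a race condition and one way to fix it (not part of the kriging code, purely for exposition): accumulating into a shared variable without coordination gives an unpredictable result, whereas declaring the accumulation a reduction keeps a private partial sum per thread and combines them correctly at the end. Compile with -fopenmp (GNU) or -openmp (Intel).

#include <omp.h>
#include <cstdio>

int main(){
    const int n = 1000000;
    double sum_racy = 0.0, sum_ok = 0.0;

    // WRONG: every thread updates sum_racy without coordination -> race condition
    #pragma omp parallel for
    for (int i=0; i<n; i++)
        sum_racy += 1.0;

    // RIGHT: each thread accumulates privately; partial sums are combined at the end
    #pragma omp parallel for reduction(+:sum_ok)
    for (int i=0; i<n; i++)
        sum_ok += 1.0;

    printf("racy: %f  correct: %f  (expected %d)\n", sum_racy, sum_ok, n);
    return 0;
}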
Step 5 : Shared Memory
#include "armadillo"
#include <mkl.h>
#include <iostream>
#include <omp.h>
...
int main(){
    ...
    //load data, initialize variables, calculate Gamma
    #pragma omp parallel for
    for (int i=0; i<m; i++)
        for (int j=0; j<m; j++){
            G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
                     +(y(i)-y(j))*(y(i)-y(j)))/3.33));
        }

    // factorize using the LU decomp. routine from LAPACK
    dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);

    //initialize data for solve, for all right hand sides
    #pragma omp parallel for
    for (int i=0; i<n; i++)
        for (int j=0; j<m; j++)
            g(i,j) = ...

    //multithreaded solve for all RHS
    dgetrs(&trans,&N,&n,G.memptr(),&N,ipiv,g.memptr(),&N,&info);

    //assemble predictions
wjb19@psu.edu
Results 5
● When compiling and linking, you must specify -fopenmp if using the GNU compiler, or -openmp for Intel
● At runtime, you need to export the environment variable OMP_NUM_THREADS, set to the desired number of threads
● Exporting this to a number beyond the total number of cores you have access to will result in severe performance degradation
● Outside the scope of this talk, but you often need to tune CPU affinity for best performance
● For more information, please check out HPC Essentials II
wjb19@psu.edu
Step 6 : Distributed Memory
● A good motivation for moving to distributed memory is, in the simplest case, a shortage of memory on a single node
● From a practical perspective, scheduling distributed CPU cores is easier than shared memory cores ie., your PBS queuing time is shorter :)
● We will use the Message Passing Interface (MPI), a venerable standard developed over the last 20 years or so, with language bindings for C and Fortran
● On the clusters we use OpenMPI (not to be confused with OpenMP); once you load the module and use the wrapper compilers, compilation and linking paths are taken care of for you
● Aside from needing to link with other libraries like Intel MKL, compiling and linking a C++ MPI program can be as simple as:
module load openmpi
mpic++ my_program.cpp
wjb19@psu.edu
Step 6 : Distributed Memory
#include "armadillo"
#include <mkl.h>
#include <iostream>
#include <mpi.h>

int main(int argc, char * argv[]){
    int rank, size;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    // size == total processes in this MPI_COMM_WORLD pool
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // rank == my identifier in the pool
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // load data, initialize variables, calculate Gamma, perform factorization

    // solve just for my portion of the predictions
    int lower = (rank * n) / size;
    int upper = ((rank+1) * n) / size;
    for (int i=lower; i<upper; i++){
        g.rows(0,m-1) = ...
        dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
        pred(i,0)=dot(z,g.rows(0,m-1));
        …
    }

    //gather results back to root process
wjb19@psu.edu
Results 6
● When you run an MPI job under PBS, you need to use the mpirun script to set up your environment and spawn processes on the different CPUs allocated to you by the scheduler:
mpirun my_application.x
● Here we simply divided the output space between the different processes ie., each process in the pool calculated a portion of the predictions
● However a collective call was needed (not shown) after the solve steps: a gather statement to bring all the results back to the root process (with rank 0); a sketch of such a call follows below
● This was the only communication between the different processes throughout the calculation ie., this was close to embarrassingly parallel → no communication, great scaling with processors
● Despite the high bandwidths available on modern networks, the cost of latency is generally the limiting factor in using distributed memory parallelism
● For more on MPI you could check out HPC Essentials III
wjb19@psu.edu
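For completeness, the omitted gather might look roughly like the fragment below. The variable names (pred, lower, upper, n, size) follow the listing two slides back; the receive buffer pred_all is a hypothetical n x 1 matrix allocated on rank 0, and the use of MPI_Gatherv (rather than MPI_Gather) is an assumption to handle the case where n does not divide evenly among the processes. It also requires <vector>.

// gather each process' slice of pred back to rank 0; counts/displs come from
// the same block decomposition used for the solve loop
std::vector<int> counts(size), displs(size);
for (int r=0; r<size; r++){
    int lo = (r * n) / size;
    int hi = ((r+1) * n) / size;
    counts[r] = hi - lo;
    displs[r] = lo;
}

MPI_Gatherv(pred.memptr() + lower,     // my slice starts at row 'lower'
            upper - lower, MPI_DOUBLE,
            pred_all.memptr(),         // only significant on rank 0
            counts.data(), displs.data(), MPI_DOUBLE,
            0, MPI_COMM_WORLD);

MPI_Finalize();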
Review
● Let's review some of the things we've discussed
● I'll splash up several scenarios and we'll attempt to score them
wjb19@psu.edu
Score Card
Score | What this feels like    | Your HPC vehicle
  +5  | Civilized society       | Something German
  +4  | Evening with friends    | American Muscle
  +3  | Favorite show renewed   | A Honda
  +2  | Twinkies are back       | Sonata
  +1  | A fairy gets its wings  | Camry
   0  | meh                     | Corolla
  -1  | A fairy dies            | Neon
  -2  | Twinkies are gone       | Pinto
  -3  | Favorite show canceled  | Le Barron
  -4  | Evening with Facebook   | Yugo
  -5  | Zombie Apocalypse       | Abrams tank
wjb19@psu.edu
Scenario 1
● You get an account for hammer, maybe install and use Exceed onDemand, load and use the Matlab/Octave module after logging in
wjb19@psu.edu
Scenario 1
● Score : 0
● Meh. You'll run a little faster, and probably have more memory. But this isn't HPC, and you could almost do this on your laptop. You're driving a Corolla, doing 45 mph in the fast lane.
wjb19@psu.edu
Scenario 2
● You vectorize your loops and/or create a compiled MEX (Matlab) or OCT (Octave) function
wjb19@psu.edu
Scenario 2
● Score : +1
● A fairy gets its wings! You move up to the Camry!
● By vectorizing loops you use internal functions that are interpreted once at runtime, and under the hood you may even get to utilize the vector architecture of the CPU.
● Tricky loops eg., those with conditionals are best converted to MEX/OCT functions eg., for Octave you want the mkoctfile utility (a minimal OCT function sketch follows)
● If compiling new functions, don't forget to link with HPC libraries eg., Intel MKL or AMD ACML where possible.
wjb19@psu.edu
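For reference, a minimal sketch of what an OCT function looks like; the function name and body here are made up for illustration. It is built with mkoctfile and then called from Octave like any other function:

#include <octave/oct.h>

// trivial OCT function: element-wise square of its argument
// build with:  mkoctfile sqr.cc   and call from Octave as  y = sqr(x);
DEFUN_DLD (sqr, args, nargout, "y = sqr (x): element-wise square of x")
{
    if (args.length () != 1)
        print_usage ();

    Matrix x = args(0).matrix_value ();
    Matrix y (x.rows (), x.columns ());

    for (octave_idx_type i = 0; i < x.rows (); i++)
        for (octave_idx_type j = 0; j < x.columns (); j++)
            y(i,j) = x(i,j) * x(i,j);

    return octave_value (y);
}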
Scenario 3
● Instead of submitting a PBS job you do all this on the head node of a batch cluster
wjb19@psu.edu
Scenario 3
● Score : -1
● A fairy dies! You drive a Neon at 35 mph in the HPC fast lane!
● Things could be worse for you, but using memory and CPU on head nodes can grind processes like parallel filesystems to a halt, making other users and the sys admins feel downright melancholy. Screens freeze, commands return at the speed of pitchblende.
● If you need dedicated resources and/or to run for more than a few minutes, please use an interactive cluster or PBS : https://rcc.its.psu.edu/user_guides/system_utilities/pbs/
wjb19@psu.edu
Scenario 4
● You use Armadillo to port your Matlab/Octave code to C++, and use version control to manage your project (eg., SVN, git/github)
wjb19@psu.edu
Scenario 4
● Score : +2
● Twinkies are back! You think Hyundai finally have it together and splash out on the Sonata!
● Vectorized Octave/Matlab code is hard to beat. However you may wish to scale outside the node someday, integrate into an existing C++ project, or perhaps use rich C++ objects (found in Boost, for eg.,), so this is the way to go. Actually there are myriad reasons.
● Don't forget to compile first with the '-Wall -g' options; then, when it's working and you get the right answer, optimize!
wjb19@psu.edu
Scenario 5
● You port your Matlab/Octave code to C++ without use of libraries or version control
wjb19@psu.edu
Scenario 5
● Score : -2
● No Twinkies! You drive a Pinto that bursts into flames immediately!
● Reinventing the wheel is a very bad, time-consuming idea. Armadillo uses expression templates to create very efficient code at compile time; without it you could end up with an inefficient mess (see the small example below).
● Neglect to use version control and you will surely regret it. Probably right around a publication deadline too. And while we're on the topic, please back up your data.
wjb19@psu.edu
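To illustrate the point about temporaries, a toy example (not from the kriging code): Armadillo's expression templates typically fuse an expression like the one below into a single pass over the data, whereas a naive hand-rolled matrix class would usually allocate an intermediate temporary for each operator.

#include <armadillo>

using namespace arma;

int main(){
    vec a = randu<vec>(1000000);
    vec b = randu<vec>(1000000);
    vec c = randu<vec>(1000000);

    // typically evaluated in one loop over the data, without intermediate
    // temporaries, thanks to expression templates
    vec d = a + b + 2.0*c;

    return 0;
}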
Scenario 6
● You target sections of your version controlled C++ code for acceleration, after understanding it better by profiling using valgrind --tool=callgrind
wjb19@psu.edu
Scenario 6
● Score : +3
● Futurama is back! You get a new Civic!
● Believe the hype: GPUs are here to stay and will accelerate many algorithms, especially linear algebra.
● Take advantage of libraries like cuBLAS before rolling your own code, and check in at the CUDA Zone to see what applications and code examples exist already. Get familiar with CUDA; we are an Nvidia CUDA Research Center : https://research.nvidia.com/content/penn-state-crc-summary
wjb19@psu.edu
Scenario 7
● Your non-version-controlled C++ code has bad memory access patterns, memory leaks, and creates many temporaries.
● Score : -3
● Bye-bye Futurama! Hello Le Barron!
● Ignore good memory and cache access patterns at your peril
● Use valgrind (default) and valgrind --tool=cachegrind to learn more. Avoid temporaries by using libraries like Armadillo, or by learning and using expression templates.
wjb19@psu.edu
Scenario 8
● Scenario 6, and you introduce shared memory parallelism using OpenMP. You look into and tune CPU affinity.
● Score : +4
● You provide Babette's feast for your friends and develop a penchant for the Ford Mustang.
● OpenMP is relatively easy eg., a pragma around a for loop.
● Don't forget to check for threading errors (races, misused locks) with valgrind --tool=helgrind
● Now your code is a thing of beauty: properly version controlled, profiled completely (well, you could run massif as well), and you're able to use all the compute hardware in a single heterogeneous node.
wjb19@psu.edu
Scenario 9
● Scenario 7, AND you decide to thrash disk. Plus you try to write >= 1M files
● Score : -4
● The Yugo is only cool in that Portlandia bit, and Facebook was only good for a brief period in 2006.
● Disk I/O kills in an HPC context, plus the maximum file limit at the time of writing is 1M
– You give control to the kernel and your application ceases to execute for some time (a voluntary context switch)
– You might be contending for disk with other processes
– You introduce the lower memory bandwidth (BW) and higher latency (Delta) of disk versus system memory
– Parallel filesystems → all of the above, plus network BW and Delta
wjb19@psu.edu
Scenario 10
● Scenario 8, AND you decide to scale outside the node with MPI. You look into design patterns; GoF (the Gang of Four book) is on the nightstand.
● Score : +5
● You are a cultured individual and you drive a German vehicle. You care about engineering.
● Don't forget Amdahl's law
● Even with IB networks, minimize communication, and consider new paradigms in distributed memory parallelism (check out MPI revision 3).
wjb19@psu.edu
Scenario 11
● Scenario 9, and you do it all on the head node, including OpenMP for 1% of your loops. You also export OMP_NUM_THREADS=20 and you have 10 cores. There's no coordination between threads, so there are races all over the place. You have about 40 MPI processes trying to read the same file as well, without parallel file I/O.
wjb19@psu.edu
Scenario 11
● Score : -5
● The end is nigh and you're taking out zombies and HPC infrastructure in your Abrams tank, moving at 1 mph and getting 0.2 miles to the gallon
● You ignored all the other advice, and now you throw out Amdahl's law too.
● AND you have no coordination between any of your threads or processes.
● AND you're trying to run more threads and processes than the system can support concurrently, so context switching takes place furiously.
● Expect a not-so-rosy email from the sys admins :-)
wjb19@psu.edu
Summary
● High performance computing is leveraging one or more forms of parallelism in a performant way
● Often the best gains come from writing vectorized Octave code, or from making algorithmic changes
● Before you parallelize, fully profile your code and keep Amdahl's law in mind
● All forms of parallelism have their limitations, but in general:
– GPU accelerators are excellent for linear algebra
– Shared memory using OpenMP works well for simple, nested loops
– Consider using MPI (distributed memory) for 'big data', but limit communication
wjb19@psu.edu