HPC Essentials
Elements of High Performance Computing, including shared and distributed memory parallelism, using Kriging as an example.

  1. HPC Essentials Prequel: From 0 to HPC in one hour OR five ways to do Kriging
     Bill Brouwer, Research Computing and Cyberinfrastructure (RCC), PSU
     wjb19@psu.edu
  2. Outline
     ● Step 0 – Navigating RCC resources
     ● Step 1 – Ordinary Kriging in Octave
     ● Step 2 – Vectorized Octave
     ● Step 3 – Compiled code
     ● Digression on Profiling & Amdahl's Law
     ● Step 4 – Accelerating using GPU
     ● Step 5 – Shared Memory
     ● Step 6 – Distributed Memory
     ● Scenarios & Summary
  3. Step 0
     ● Get an account on our systems
     ● Check out the system details, or let us help pick one for you
     ● They are Linux systems; you'll need some basic command-line knowledge
       – You may want to check out the HPC Essentials I seminar, a Unix/C overview
     ● We use the modules system for software; you'll need to load what you use, e.g.:
           module av            (see a list of everything available)
           module load octave   (load Octave)
           module list          (see which modules you have in your environment)
  4. Step 0
     ● There are two main types of systems:
       – Interactive: share a single machine with one or more users, including memory and CPUs, used for
         ● Debugging
         ● Benchmarking
         ● Using a program with a graphical user interface
           – You'll need to log in using Exceed onDemand
         ● Running for short periods of time
  5. Step 0
     ● Batch systems
       – Get dedicated memory and CPUs for a period of time
         ● Maximum time is generally 24 hours
         ● Maximum memory and CPUs depend on the cluster
       – You log in to a head node, from which you submit a request, e.g., an interactive session for 1 node, 1 processor per node (ppn) and 4 GB total memory:
           qsub -I -l walltime=24:00:00 -l mem=4gb -l nodes=1:ppn=1
       – To check the status of your request:
           qstat -u <your_psu_id>
  6. Step 0
     ● Other notes on clusters:
       – Please never run anything significant on head nodes; use PBS to submit a job instead
       – If you request more than 1 CPU, remember your code/workflow needs to be able to either
         ● Use multiple CPUs on a single node (set the ppn parameter) using some form of shared memory parallelism
         ● Use multiple CPUs on multiple nodes (set a combination of the nodes & ppn parameters) using some form of distributed memory parallelism
         ● A combination of the above
     ● Parallelism applied in an optimal way is high performance computing
  7. High Performance Computing
     ● Using one or more forms of parallelism to improve the performance and scaling of your code:
       – Vector architecture, e.g., SSE/AVX in Intel CPUs
       – Shared memory parallelism, e.g., using multiple cores of a CPU
       – Distributed memory parallelism, e.g., using the Message Passing Interface (MPI) to communicate between CPUs or GPUs
       – Accelerators, e.g., Graphics Processing Units
  8. Typical Compute Node
     [Block diagram of a typical compute node: the CPU sits on the memory bus with RAM (volatile storage), and connects over the QuickPath Interconnect to the I/O hub (IOH), which provides PCI-express for the GPU and other PCI-e cards; the Direct Media Interface links to the I/O controller hub (ICH), which provides SATA/USB for non-volatile storage, the BIOS, and ethernet to the network.]
  9. CPU Architecture
     ● Composed of several complex processing cores, control elements and high speed memory areas (e.g., registers, L3 cache), as well as vector elements including special registers
     [Diagram: several cores sharing a cache, memory controller, I/O and a PCIe interface]
  10. Shared + Distributed Memory Parallelism
     ● Shared memory parallelism:
       – usually implemented with pThreads or directive-based programming (OpenMP)
       – uses one or more cores of a CPU
     ● Distributed memory parallelism:
       – one or more nodes (composed of CPUs + possibly GPUs) communicating with each other over a high speed network, e.g., Infiniband
       – network topology and fabric are critical to ensuring optimal communication
  11. Nvidia GPU
     ● GPUs run many light-weight threads at once; the device is composed of many more (simpler) cores than a CPU
     [Diagram of a streaming multiprocessor: 32768 x 32-bit registers, interconnect, 64 kB shared memory/L1 cache, two warp schedulers with dispatch units, 4 special function units, 16 load/store units, and 2 groups of 16 CUDA cores, each core with a dispatch port, operand collector, FPU, integer unit and result queue]
  12. Step 1: Prototype your problem
     ● Pick a numerical scripting language, e.g., Octave, a free alternative to Matlab
       – Solid, well established, linear algebra based
     ● Code up a solution (e.g., we'll consider ordinary kriging)
     ● Time all scopes/sections of your code to get a feel for bottlenecks
     ● You can use the keyboard statement to set breakpoints in your code for debugging purposes
  13. Step 1: Prototype your problem
     ● Kriging is a geospatial statistical method, e.g., predicting rainfall for locations where no measurements exist, based on surrounding measurements
     ● The solution involves:
       – constructing the Gamma matrix
       – solving a system of equations for every desired prediction location
  14. Step 1: Prototype your problem

       function [w,G,g,pred] = krige()
         % load input data & output prediction grid
         load input.csv;
         load output.csv;
         % init …
         % Gamma; m is size of input space, x,y are coordinates for available data z
         for i=1:m
           for j=1:m
             G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))^2+(y(i)-y(j))^2)/3.33));
           end
         end
         % matrix inversion
         Ginv = inv(G);
         % predictions; n is size of output space, xp,yp are prediction coordinates
         % z is available data for x,y coordinates
         for i=1:n
           g(1:m) = 10.*(1-exp(-sqrt((xp(i)-x).^2+(yp(i)-y).^2)/3.33));
           w = Ginv * g';
           pred(i) = sum(w(1:m).*z);
         end
  15. Results 1
     ● Use tic/toc statements around code blocks for timing; the following times are for:
       – Initialization
       – Gamma construction
       – Matrix inversion
       – Solution

       octave:1> [a b c d]=krige();
       Elapsed time is 0.079224 seconds.
       Elapsed time is 40.9722 seconds.
       Elapsed time is 0.742576 seconds.
       Elapsed time is 10.6134 seconds.

     ● 80% of the time is spent constructing the matrix → need to vectorize
     ● Interpreted languages like Octave benefit from removing loops and replacing them with array operations
       – Loops are parsed every iteration by the interpreter
       – Vectorizing code by using array operations may take advantage of the vector architecture in the CPU
  16. Step 2: Vectorize your Prototype

       function [w,G,g,pred] = krige()
         % load input data & output prediction grid
         load input.csv;
         load output.csv;
         % init …
         % Gamma
         XI = (ones(m,1)*x)';
         YI = (ones(m,1)*y)';
         G(1:m,1:m) = 10.*(1-exp(-sqrt((XI-XI').^2+(YI-YI').^2)/3.33));
         % matrix inversion
         Ginv = inv(G);
         % predictions
         XP = (ones(m,1)*xp);
         YP = (ones(m,1)*yp);
         XI = (ones(n,1)*x)';
         YI = (ones(n,1)*y)';
         ZI = (ones(n,1)*z)';
         g(1:m,:) = 10.*(1-exp(-sqrt((XP-XI).^2+(YP-YI).^2)/3.33));
         w = Ginv * g;
         pred = sum(w(1:m,:).*ZI);
  17. Results 2

       octave:2> [a b c d]=krige();
       Elapsed time is 0.0765891 seconds.
       Elapsed time is 0.195605 seconds.
       Elapsed time is 0.758174 seconds.
       Elapsed time is 3.24861 seconds.

     ● The code is more than 15x faster, for a relatively small investment
     ● Vectorized code will have a higher memory overhead, due to the creation of temporary arrays; it's harder to read too :)
     ● When memory use or compute time becomes unacceptable, there is no choice but to move to compiled code
     ● C/C++ are the logical choices in a Linux environment
       – Very stable, heavily used; the Linux OS itself is written in C
       – Expressive languages containing many innovations, algorithms and data structures
       – C++ is object oriented, and allows for the design of large, sophisticated projects
  18. Step 3: Compiled Code
     ● Unlike a scripted language, C/C++ must be compiled to run on the CPU, converting a human readable language into machine code
     ● Several compilers are available on the clusters, including Intel, PGI and the GNU compiler collection
     ● In the compilation and linking steps we must specify the headers (with interfaces) and libraries (with functions) needed by our application
     ● Try to avoid reinventing the wheel; always use available libraries if you can instead of reimplementing algorithms and data structures
     ● As opposed to scripting, we are now responsible for memory management, e.g., allocating on the heap (dynamically at runtime) or on the stack (statically at compile time)
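     As a minimal sketch of the stack-versus-heap point above (the file name and compile line are illustrative, not from the slides):

       #include <iostream>
       #include <vector>

       int main() {
           // stack allocation: size fixed at compile time, freed automatically on return
           double buffer[16];
           buffer[0] = 1.0;

           // heap allocation: size chosen at runtime, we must free it ourselves
           int m = 1000;
           double *G = new double[m * m];
           G[0] = buffer[0];
           delete [] G;

           // RAII containers such as std::vector manage their heap memory for us
           std::vector<double> g(m, 0.0);
           std::cout << "g holds " << g.size() << " doubles\n";
           return 0;
       }
       // a typical build with the GNU compiler collection:
       //   g++ -Wall -g -o krige krige.cpp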
  19. Step 3: Compiled Code
     ● In porting Octave/Matlab code to C/C++ you should always consider using at least these libraries:
       – Armadillo, C++ wrappers for BLAS/LAPACK, with syntax very similar to Octave/Matlab
       – BLAS/LAPACK itself
         ● BLAS == Basic Linear Algebra Subprograms
         ● LAPACK == Linear Algebra PACKage
         ● Both come in many optimized flavors, e.g., Intel MKL
     ● If you want to know more about Linux basics, including writing/compiling C code, you could check out HPC Essentials I
     ● If you want to know more about C++, you could check out HPC Essentials V
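     To give a feel for how close Armadillo syntax is to Octave, here is a sketch of a vectorized Gamma construction (not the code from the next slide, which deliberately keeps its loops); the sizes and the randu stand-ins for the coordinate data are illustrative:

       #include <armadillo>
       #include <iostream>
       using namespace arma;

       int main() {
           int m = 100;
           vec x = randu<vec>(m), y = randu<vec>(m);   // stand-ins for the input coordinates

           // XI(i,j) = x(i), YI(i,j) = y(i), as in the vectorized Octave version
           mat XI = repmat(x, 1, m);
           mat YI = repmat(y, 1, m);

           // element-wise exp, sqrt and square mirror Octave's exp, sqrt and .^
           mat G = 10.0 * (ones<mat>(m, m)
                           - exp(-sqrt(square(XI - XI.t()) + square(YI - YI.t())) / 3.33));

           std::cout << "G is " << G.n_rows << " x " << G.n_cols << "\n";
           return 0;
       }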
  20. Step 3: Compiled Code

       #include "armadillo"
       #include <mkl.h>
       #include <iostream>

       using namespace std;
       using namespace arma;

       int main(){

         mat G;
         vec g;

         // load data, initialize variables, calculate Gamma
         for (int i=0; i<m; i++)
           for (int j=0; j<m; j++){
             G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
                                      +(y(i)-y(j))*(y(i)-y(j)))/3.33));
           }

         char uplo = 'U';
         int N = m+1;
         int info;
         int * ipiv = new int[N];
         double * work = new double[3*N];

         // factorize using the LU decomp. routine from LAPACK
         dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);

         // solve
         int nrhs=1;
         char trans='N';
         for (int i=0; i<n; i++){
           g.rows(0,m-1) = ...
           dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
           pred(i,0)=dot(z,g.rows(0,m-1));
           …
         }
  21. Results 3
     ● Compiled code is comparable in speed to the vectorized code, although we could make some algorithmic changes to improve further:
       – The Gamma matrix is symmetric, so there is no need to calculate values for j >= i (i.e., just calculate/store a triangular matrix)
       – Calculating the inverse is expensive and inaccurate; it is better to factorize the matrix and use a direct solve, e.g., using forward/backward substitution (we did do this, but using the full matrix and an LU decomposition)
       – Armadillo uses operator overloading & expression templates to allow a vectorized approach to programming, although we leave the loops in for the moment, to allow parallelization later
     ● If you have bugs in your code, use gdb to debug
     ● Always profile completely in order to solve all issues and get a complete handle on your code
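     To make the factorize-and-solve point concrete, a small illustrative comparison in Armadillo (the test matrix below is a hypothetical well-conditioned system, not the kriging Gamma matrix):

       #include <armadillo>
       using namespace arma;

       int main() {
           int m = 500;
           mat G = randu<mat>(m, m) + double(m) * eye<mat>(m, m);
           vec g = randu<vec>(m);

           vec w_inv   = inv(G) * g;    // explicit inverse: more work, less accurate
           vec w_solve = solve(G, g);   // factorization + substitution, as dgetrf/dgetrs do

           return norm(w_inv - w_solve) < 1e-6 ? 0 : 1;
       }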
  22. Important Code Profiling Methods
     ● Solving memory leaks: use valgrind
     ● Poor memory access patterns/cache usage
       – Use valgrind --tool=cachegrind to assess cache hits + misses
     ● Heap memory usage
       – Memory management has a performance impact; assess with valgrind --tool=massif
     ● And before you consider moving to parallel, develop a call profile for your code, e.g., in terms of total instructions executed for each scope, using valgrind --tool=callgrind
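     As a tiny, hypothetical example of the kind of leak the default valgrind tool (memcheck) reports:

       #include <cstring>

       int main() {
           // never freed: valgrind reports this allocation as "definitely lost"
           // in its leak summary when the program is run under: valgrind ./leaky
           double *data = new double[1024];
           std::memset(data, 0, 1024 * sizeof(double));
           return 0;   // missing: delete [] data;
       }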
  23. Amdahl's Law
     ● The problems in science we seek to solve are becoming increasingly large, as we go down in scale (e.g., quantum chemistry) or up (e.g., astrophysics)
     ● As a natural consequence, we seek both performance and scaling in our scientific applications, thus we parallelize as we run out of resources on a single processor
     ● We are limited by Amdahl's law, an expression of the maximum improvement of parallel code over serial:
           1/((1-P) + P/N)
       where P is the portion of application code we parallelize, and N is the number of processors; i.e., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking
  24. Amdahl's Law
     ● Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors
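     Written out, with illustrative numbers:

       S(N) = \frac{1}{(1-P) + \frac{P}{N}}, \qquad
       \lim_{N \to \infty} S(N) = \frac{1}{1-P}

     For example, parallelizing P = 0.95 of the runtime across N = 64 processors gives S(64) = 1/(0.05 + 0.95/64) ≈ 15.4, far short of 64x, and no number of processors can push the speedup beyond 1/(1-P) = 20.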
  25. Step 4: Accelerate
     ● In general not all algorithms are amenable, and there is the communication bottleneck between CPU and GPU to overcome
     ● However, linear algebra operations are extremely efficient on the GPU; you can expect 2-10x over a whole CPU socket (i.e., running all cores) for many operations
     ● The language for programming Nvidia GPUs is CUDA; it is much like C, but you need to know the architecture well and/or:
       – Use libraries like cuBLAS (what we'll try)
       – Use directive-based programming in the form of OpenACC
       – Use the OpenCL language (cross platform, but not supported by Nvidia as heavily as CUDA)
  26. Step 4: Accelerate

       #include "armadillo"
       #include <mkl.h>
       #include <iostream>
       #include <cuda.h>

       using namespace std;
       using namespace arma;

       int main(){

         mat G;
         vec g;

         // load data, initialize variables, calculate Gamma as before

         // factorize using the LU decomp. routine from LAPACK, as before

         // allocate memory on GPU and transfer data

         // solve on gpu; two steps, solve two triangular systems
         cublasDtrsm(...);
         cublasDtrsm(...);

         // free memory on GPU and transfer data back
       }
  27. Results 4
     ● Minimal code changes; recompilation using the nvcc compiler, available by loading any CUDA module on lion-GA (where you'll also need to run)
     ● We still perform the matrix factorization on the CPU side, and move data to the GPU to perform the solve in two steps
     ● This overall solution is roughly 6x the single CPU thread solution presented previously, for larger data sizes
     ● General rule of thumb → minimize communication between CPU + GPU, use the GPU when you can occupy all SMPs per device, and don't bother for small problems, where the cost of communication outweighs the benefits
     ● There is ongoing work in porting LAPACK routines to the GPU, e.g., check out our LU/QR work, or the significant MAGMA project from UT/ORNL
     ● If you're interested in trying CUDA and GPUs further, you could check out HPC Essentials IV
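     A rough sketch (not from the slides) of the allocate / copy / solve / copy-back pattern described above, using the CUDA runtime and the cuBLAS v2 API; the function, variable names and sizes are placeholders, and for brevity the sketch ignores the row pivoting that dgetrf produces:

       #include <cublas_v2.h>
       #include <cuda_runtime.h>

       // Move an already LU-factorized m x m matrix A and a right-hand side b to the
       // GPU, do the forward and backward triangular solves there, and copy b back.
       // NOTE: the permutation (ipiv) from dgetrf is ignored in this sketch; it only
       // illustrates the data movement around the two cublasDtrsm calls.
       void gpu_triangular_solves(const double *A, double *b, int m) {
           double *dA, *db;
           cudaMalloc((void**)&dA, sizeof(double) * m * m);
           cudaMalloc((void**)&db, sizeof(double) * m);
           cudaMemcpy(dA, A, sizeof(double) * m * m, cudaMemcpyHostToDevice);
           cudaMemcpy(db, b, sizeof(double) * m, cudaMemcpyHostToDevice);

           cublasHandle_t handle;
           cublasCreate(&handle);
           const double one = 1.0;
           // L y = b (unit lower triangle), then U x = y (upper triangle)
           cublasDtrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                       CUBLAS_DIAG_UNIT, m, 1, &one, dA, m, db, m);
           cublasDtrsm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N,
                       CUBLAS_DIAG_NON_UNIT, m, 1, &one, dA, m, db, m);
           cublasDestroy(handle);

           cudaMemcpy(b, db, sizeof(double) * m, cudaMemcpyDeviceToHost);
           cudaFree(dA);
           cudaFree(db);
       }
       // compile with nvcc on a GPU node and link against -lcublas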
  28. Step 5: Shared Memory
     ● We've determined through profiling that it's worthwhile parallelizing our loops
     ● By linking against Intel MKL we also have access to threaded functions
     ● We will simply use OpenMP directive-based programming for this example
     ● We are generally responsible for deciding which variables need to be shared by threads, and which variables should be privately owned by threads
     ● If we fail to make these distinctions where needed, we end up with race conditions
       – Threads operate on data in an uncoordinated fashion, and data elements have unpredictable/erroneous values
     ● Outside the scope of this talk, but just as pernicious, is deadlock, when threads (and indeed whole programs) hang due to improper coordination
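     A small, self-contained illustration (not from the slides) of the shared-versus-private distinction, and of removing a race with a reduction:

       #include <omp.h>
       #include <cstdio>

       int main() {
           const int n = 1000000;
           double sum = 0.0;

           // RACE: every thread would read and write 'sum' without coordination,
           // so the result becomes unpredictable from run to run:
           //   #pragma omp parallel for
           //   for (int i = 0; i < n; i++) sum += 1.0 / (i + 1);

           // FIX: each thread accumulates a private partial sum, combined at the end.
           // The loop index 'i' is private to each thread automatically; 'n' is shared
           // and read-only, which is safe.
           #pragma omp parallel for reduction(+:sum)
           for (int i = 0; i < n; i++)
               sum += 1.0 / (i + 1);

           printf("harmonic sum of %d terms = %f\n", n, sum);
           return 0;
       }
       // build: g++ -fopenmp race_demo.cpp   (or icpc -openmp race_demo.cpp for Intel)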
  29. Step 5: Shared Memory

       #include "armadillo"
       #include <mkl.h>
       #include <iostream>
       #include <omp.h>
       ...

       int main(){
         ...
         // load data, initialize variables, calculate Gamma
         #pragma omp parallel for
         for (int i=0; i<m; i++)
           for (int j=0; j<m; j++){
             G(i,j) = 10.*(1-exp(-sqrt((x(i)-x(j))*(x(i)-x(j))
                                      +(y(i)-y(j))*(y(i)-y(j)))/3.33));
           }

         // factorize using the LU decomp. routine from LAPACK
         dgetrf(&N, &N, G.memptr(), &N, ipiv, &info);

         // initialize data for solve, for all right hand sides
         #pragma omp parallel for
         for (int i=0; i<n; i++)
           for (int j=0; j<m; j++)
             g(i,j) = ...

         // multithreaded solve for all RHS
         dgetrs(&trans,&N,&n,G.memptr(),&N,ipiv,g.memptr(),&N,&info);

         // assemble predictions
  30. Results 5
     ● In compiling and linking, you must specify -fopenmp if using the GNU compiler, or -openmp for Intel
     ● At runtime, you need to export the environment variable OMP_NUM_THREADS set to the desired number of threads
     ● Exporting this number to something beyond the total number of cores you have access to will result in severe performance degradation
     ● Outside the scope of this talk, but you often need to tune CPU affinity for best performance
     ● For more information, please check out HPC Essentials II
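     A quick way to confirm how many threads the OpenMP runtime will actually use (the file name in the build comment is illustrative):

       #include <omp.h>
       #include <cstdio>

       int main() {
           // controlled at runtime by OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=8;
           // keep this at or below the number of cores you were allocated
           printf("max threads: %d\n", omp_get_max_threads());
           #pragma omp parallel
           {
               #pragma omp single
               printf("threads in this parallel region: %d\n", omp_get_num_threads());
           }
           return 0;
       }
       // build: g++ -fopenmp check_threads.cpp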
  31. Step 6: Distributed Memory
     ● A good motivation for moving to distributed memory is, in the simplest case, a shortage of memory on a single node
     ● From a practical perspective, scheduling distributed CPU cores is easier than shared memory cores, i.e., your PBS queuing time is shorter :)
     ● We will use the Message Passing Interface (MPI), a venerable standard developed over the last 20 years or so, with language bindings for C and Fortran
     ● On the clusters we use OpenMPI (not to be confused with OpenMP); once you load the module and use the wrapper compilers, compilation and linking paths are taken care of for you
     ● Aside from needing to link with other libraries like Intel MKL, compiling and linking a C++ MPI program can be as simple as:
           module load openmpi
           mpic++ my_program.cpp
  32. Step 6: Distributed Memory

       #include "armadillo"
       #include <mkl.h>
       #include <iostream>
       #include <mpi.h>

       int main(int argc, char * argv[]){

         int rank, size;
         MPI_Status status;

         MPI_Init(&argc, &argv);
         // size == total processes in this MPI_COMM_WORLD pool
         MPI_Comm_size(MPI_COMM_WORLD, &size);
         // rank == my identifier in the pool
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);

         // load data, initialize variables, calculate Gamma, perform factorization

         // solve just for my portion of predictions
         int lower = (rank * n) / size;
         int upper = ((rank+1) * n) / size;
         for (int i=lower; i<upper; i++){
           g.rows(0,m-1) = ...
           dgetrs(&trans,&N,&nrhs,G.memptr(),&N,ipiv,g.memptr(),&N,&info);
           pred(i,0)=dot(z,g.rows(0,m-1));
           …
         }

         // gather results back to root process
  33. Results 6
     ● When you run an MPI job using PBS, you need to use the mpirun script to set up your environment and spawn processes on the different CPUs allocated to you by the scheduler:
           mpirun my_application.x
     ● Here we simply divided the output space between the different processes, i.e., each process in the pool calculated a portion of the predictions
     ● However, a collective call was needed (not shown) after the solve steps: a gather statement to bring all the results to the root process (with rank 0)
     ● This was the only communication between different processes throughout the calculation, i.e., this was close to embarrassingly parallel → almost no communication, great scaling with processors
     ● Despite the high bandwidths available on modern networks, the cost of latency is generally the limiting factor in using distributed memory parallelism
     ● For more on MPI you could check out HPC Essentials III
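     The gather mentioned above might look something like this sketch; the function and variable names are hypothetical, and it assumes MPI_Init has already been called (as on the previous slide) and that each rank holds its portion of the predictions in local_pred:

       #include <mpi.h>
       #include <vector>

       // Gather variable-sized blocks of predictions onto rank 0.
       void gather_predictions(const std::vector<double> &local_pred,
                               int n, int rank, int size) {
           int local_n = static_cast<int>(local_pred.size());

           // rank 0 needs to know how much each process sends, and where it goes
           std::vector<int> counts(size), displs(size);
           MPI_Gather(&local_n, 1, MPI_INT, counts.data(), 1, MPI_INT, 0, MPI_COMM_WORLD);

           std::vector<double> pred;
           if (rank == 0) {
               pred.resize(n);
               displs[0] = 0;
               for (int r = 1; r < size; r++) displs[r] = displs[r - 1] + counts[r - 1];
           }

           // bring every rank's partial results back to the root process (rank 0)
           MPI_Gatherv(local_pred.data(), local_n, MPI_DOUBLE,
                       pred.data(), counts.data(), displs.data(), MPI_DOUBLE,
                       0, MPI_COMM_WORLD);
       }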
  34. Review
     ● Let's review some of the things we've discussed
     ● I'll splash up several scenarios and we'll attempt to score them
  35. Score Card

       Score   What this feels like       Your HPC vehicle
       +5      Civilized society          Something German
       +4      Evening with friends       American Muscle
       +3      Favorite show renewed      A Honda
       +2      Twinkies are back          Sonata
       +1      A fairy gets its wings     Camry
        0      meh                        Corolla
       -1      A fairy dies               Neon
       -2      Twinkies are gone          Pinto
       -3      Favorite show canceled     Le Barron
       -4      Evening with Facebook      Yugo
       -5      Zombie Apocalypse          Abrams tank
  36. Scenario 1
     ● You get an account for hammer, maybe install and use Exceed onDemand, load and use the Matlab/Octave module after logging in
  37. Scenario 1
     ● Score: 0
     ● Meh. You'll run a little faster, and probably have more memory. But this isn't HPC and you could almost do this on your laptop. You're driving a Corolla, doing 45 mph in the fast lane.
  38. Scenario 2
     ● You vectorize your loops and/or create a compiled MEX (Matlab) or OCT (Octave) function
  39. Scenario 2
     ● Score: +1
     ● A fairy gets its wings! You move up to the Camry!
     ● By vectorizing loops you use internal functions that are interpreted once at runtime, and under the hood you may even get to utilize the vector architecture of the CPU.
     ● Tricky loops, e.g., those with conditionals, are best converted to MEX/OCT functions; for Octave you want the mkoctfile utility (see the sketch below)
     ● If compiling new functions, don't forget to link with HPC libraries, e.g., Intel MKL or AMD ACML, where possible.
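     For reference, a minimal OCT function sketch compiled with mkoctfile; the function name and body are hypothetical, but the DEFUN_DLD pattern and the mkoctfile invocation are the standard Octave mechanism:

       #include <octave/oct.h>

       // call from Octave as: y = double_it(x)
       DEFUN_DLD (double_it, args, , "Return 2*x for a numeric matrix x")
       {
         Matrix x = args(0).matrix_value ();   // convert the first argument
         return octave_value (2.0 * x);        // loops, conditionals etc. run at C++ speed here
       }
       // build: mkoctfile double_it.cc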
  40. Scenario 3
     ● Instead of submitting a PBS job, you do all of this on the head node of a batch cluster
  41. Scenario 3
     ● Score: -1
     ● A fairy dies! You drive a Neon at 35 mph in the HPC fast lane!
     ● Things could be worse for you, but using memory and CPU on head nodes can grind processes like parallel filesystems to a halt, making other users and the sys admins feel downright melancholy. Screens freeze, and commands return at the speed of pitchblende.
     ● If you need dedicated resources and/or to run for more than a few minutes, please use an interactive cluster or PBS: https://rcc.its.psu.edu/user_guides/system_utilities/pbs/
  42. Scenario 4
     ● You use Armadillo to port your Matlab/Octave code to C++, and use version control to manage your project (e.g., SVN, git/GitHub)
  43. Scenario 4
     ● Score: +2
     ● Twinkies are back! You think Hyundai finally have it together and splash out on the Sonata!
     ● Vectorized Octave/Matlab code is hard to beat. However, you may wish to scale outside the node someday, integrate into an existing C++ project, or perhaps use rich C++ objects (found in Boost, for example), so this is the way to go. Actually there are myriad reasons.
     ● Don't forget to compile first with the '-Wall -g' options; then, when it's working and you get the right answer, optimize!
  44. Scenario 5
     ● You port your Matlab/Octave code to C++ without the use of libraries or version control
  45. Scenario 5
     ● Score: -2
     ● No Twinkies! You drive a Pinto that bursts into flames immediately!
     ● Reinventing the wheel is a very bad, time consuming idea. Armadillo uses expression templates to create very efficient code at compile time; without it you could end up with an inefficient mess.
     ● Neglect to use version control and you will surely regret it. Probably right around a publication deadline, too. And while we're on the topic, please back up your data.
  46. Scenario 6
     ● You target sections of your version-controlled C++ code for acceleration, after understanding it better by profiling using valgrind --tool=callgrind
  47. Scenario 6
     ● Score: +3
     ● Futurama is back! You get a new Civic!
     ● Believe the hype: GPUs are here to stay and will accelerate many algorithms, especially linear algebra.
     ● Take advantage of libraries like cuBLAS before rolling your own code, and check in at the CUDA Zone to see what applications and code examples already exist. Get familiar with CUDA; we are an Nvidia CUDA Research Center: https://research.nvidia.com/content/penn-state-crc-summary
  48. Scenario 7
     ● Your non-version-controlled C++ code has bad memory access patterns, memory leaks, and creates many temporaries.
     ● Score: -3
     ● Bye-bye Futurama! Hello Le Barron!
     ● Ignore good memory and cache access patterns at your peril
     ● Use valgrind (default) and valgrind --tool=cachegrind to learn more. Avoid temporaries by using libraries like Armadillo, or by learning and using expression templates.
  49. Scenario 8
     ● Scenario 6, and you introduce shared memory parallelism using OpenMP. You look into and tune CPU affinity.
     ● Score: +4
     ● You provide Babette's feast for your friends and elicit a penchant for the Ford Mustang.
     ● OpenMP is relatively easy, e.g., a pragma around a for loop.
     ● Don't forget to check for threading errors with valgrind --tool=helgrind
     ● Now your code is a thing of beauty, properly version controlled, profiled completely (well, you could run massif as well) and you're able to use all the compute hardware in a single heterogeneous node.
  50. Scenario 9
     ● Scenario 7, AND you decide to thrash disk. Plus you try to write >= 1M files
     ● Score: -4
     ● The Yugo is only cool in that Portlandia bit, and Facebook was only good for a brief period in 2006.
     ● Disk I/O kills in an HPC context, plus the maximum file limit at the time of writing is 1M
       – You give control to the kernel and your application ceases to execute for some time (a voluntary context switch)
       – You might be contending for disk with other processes
       – You introduce the lower bandwidth and higher latency of disk versus system memory
       – Parallel filesystems → all of the above plus network bandwidth and latency
  51. Scenario 10
     ● Scenario 8, AND you decide to scale outside the node with MPI. You look into Patterns. GoF (the Gang of Four design patterns book) is on the nightstand.
     ● Score: +5
     ● You are a cultured individual and you drive a German vehicle. You care about engineering.
     ● Don't forget Amdahl's law
     ● Even with Infiniband networks, minimize communication, and consider new paradigms in distributed memory parallelism (check out MPI revision 3).
  52. Scenario 11
     ● Scenario 9, and you do it all on the head node, including OpenMP for 1% of your loops. You also export OMP_NUM_THREADS=20 and you have 10 cores. There's no coordination between threads, races all over the place. You have about 40 MPI processes trying to read the same file as well, without parallel file I/O.
  53. Scenario 11
     ● Score: -5
     ● The end is nigh and you're taking out zombies and HPC infrastructure in your Abrams tank, moving at 1 mph and getting 0.2 miles to the gallon
     ● You ignored all the other advice, and now you throw out Amdahl's law too.
     ● AND you have no coordination between any of your threads or processes.
     ● AND you're trying to run more threads and processes than the system can support concurrently, so context switching takes place furiously.
     ● Expect a not-so-rosy email from the sys admin :-)
  54. Summary
     ● High performance computing is leveraging one or more forms of parallelism in a performant way
     ● Often the best gains come from writing vectorized Octave code, or from making algorithmic changes
     ● Before you parallelize, fully profile your code and keep Amdahl's law in mind
     ● All forms of parallelism have their limitations, but in general:
       – GPU accelerators are excellent for linear algebra
       – Shared memory using OpenMP works well for simple, nested loops
       – Consider using MPI (distributed memory) for 'big data', but limit communication
