3. HPC Introduction
HPC systems are composed of :
● Software
● Hardware
● Devices (eg., disks)
● Compute elements (eg., CPU)
● Shared and/or distributed memory
● Communication (eg., Infiniband network)
●An HPC system ...isn't... high performance unless the hardware is configured correctly and the software leverages all resources made available to it, in an optimal manner
●An operating system controls the execution of software on the hardware;
HPC clusters almost exclusively use UNIX/Linux
●In the computational sciences, we pass data and/or abstractions through
a pipelined workflow; UNIX is the natural analogue to this
solving/discovery process
4. UNIX
●UNIX is a multi-user/tasking OS created by Dennis Ritchie and Ken
Thompson at AT&T Bell Labs 1969-1970, written primarily in C language
(also developed by Ritchie)
UNIX is composed of :
● Kernel
  ● The OS itself, which handles scheduling, memory management, I/O etc
● Shell (eg., Bash)
  ● Interacts with kernel; command line interpreter
● Utilities
  ● Programs run by the shell, tools for file manipulation, interaction with the system
● Files
  ● Everything but process(es), composed of data...
5. Data-Related Definitions
●Binary
● Most fundamental data representation in computing, base 2 number system
(others; hex → base 16, oct → base 8)
●Byte
● 8 bits = 8b = 1Byte = 1B; 1kB = 1024 B; 1MB = 1024 kB etc
●ASCII
● American Standard Code for Information Interchange; a 7-bit character encoding scheme (traditional), preserved as the first 128 code points of UTF-8, a Unicode encoding that uses 8-bit units per character
●Stream
● A flow of bytes; source → stdout (& stderr), sink → stdin
●Bus
● Communication channel over which data flows, connects elements within a
machine
●Process
● Fundamental unit of computational work performed by a processor; CPU
executes application or OS instructions
●Node
● Single computer, composed of many elements, various architectures for
CPU, eg., x86, RISC
6. Typical Compute Node (Intel i7)
[Diagram: typical Intel i7 compute node — the CPU connects to RAM (volatile storage) over the memory bus, and to the IOH via the QuickPath Interconnect; the GPU and other PCI-e cards attach to the IOH over PCI-express; the ICH hangs off the IOH via the Direct Media Interface and serves the ethernet NETWORK, SATA/USB (non-volatile storage) and the BIOS]
7. More Definitions
●Cluster
● Many nodes connected together via network
●Network
● Communication channel, inter-node; connects machines
●Shared Memory
● Memory region shared within node
●Distributed Memory
● Memory region across two or more nodes
●Direct Memory Access (DMA)
● Access memory independently of programmed I/O ie., independent of the
CPU
●Bandwidth
● Rate of data transfer across serial or parallel communication channel,
expressed as bits (b) or Bytes (B) per second (s)
● Beware quotations of bandwidth; many factors eg., simplex/duplex,
peak/sustained, no. of lanes etc
● Latency or the time to create a communication channel is often more
important
8. Bandwidths
●Devices
● USB : 60MB/s (version 2.0)
● Hard Disk : 100MB/s - 500MB/s
● PCIe : 32GB/s (x8, version 2.0)
●Networks
● 10/100Base T : 10/100 Mbit/s
● 1000BaseT (1GigE) : 1000 Mbit/s
● 10 GigE : 10 Gbit/s
● Infiniband QDR 4X: 40 Gbit/s
●Memory
● CPU : ~ 35 GB/s (Nehalem, 3x 1.3GHz DIMM/socket)*
● GPU : ~ 180 GB/s (GeForce GTX 480)
●AVOID devices, keep data resident in memory, minimize communication
btwn processes
●MANY subtleties to CPU memory management eg., with 8x CPU cores,
total bandwidth may be > 300 GB/s or as little as 10 GB/s, will discuss
further
*http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations?t=anon#fbid=XZRzflqVZ6J
10. UNIX Permissions & Files
●At the highest level, UNIX objects are either files or processes, and both
are protected by permissions (processes next time)
●Every file object has two ID's, the user and group, both are assigned on
creation; only the root user has unrestricted access to everything
●Files also have bits which specify read (r), write (w) and execute (x) permissions for the user, group and others eg., output of ls command:
-rw-r--r-- 1 root root 0 Jun 11 1976 /usr/local/foo.txt
(permission bits for user/group/others, then User ID, Group ID and filename)
●We can manipulate files using myriad utilities, these utilities are commands
interpreted by the shell and executed by the kernel
●To learn more, check man pages ie., from the command line 'man
<command>'
11. File Manipulation I
Working from the command line in a Bash shell:
●List directory foo_dir contents, human readable :
[wjb19@lionga scratch] $ ls -lah foo_dir
●Change ownership of foo.xyz to wjb19; group and user:
[wjb19@lionga scratch] $ chown wjb19:wjb19 foo.xyz
●Add execute permission to foo.xyz:
[wjb19@lionga scratch] $ chmod +x foo.xyz
●Determine filetype for foo.xyz:
[wjb19@lionga scratch] $ file foo.xyz
●Peruse text file foo.xyz:
[wjb19@lionga scratch] $ more foo.xyz
12. File Manipulation II
●Copy foo.txt from lionga to file /home/bill/foo.txt on dirac :
[wjb19@lionga scratch] $ scp foo.txt wjb19@dirac.rcc.psu.edu:/home/bill/foo.txt
●Create gzip compressed file archive of directory foo and contents :
[wjb19@lionga scratch] $ tar cfz foo_archive.tgz foo/*
●Create bzip2 compressed file archive of directory foo and contents :
[wjb19@lionga scratch] $ tar cfj foo_archive.tbz foo/*
●Unpack compressed file archive :
[wjb19@lionga scratch] $ tar xvf foo_archive.tgz
●Edit a text file using VIM:
[wjb19@lionga scratch] $ vim foo.txt
●VIM is a venerable and powerful command line editor with a rich set of commands
13. Text File Edit w/ VIM
●Two main modes of operation; editing or command. From command, switch to edit by issuing 'a' (insert after cursor) or 'i' (before), switch back to command via <ESC>
Save w/o quitting                                      :w<ENTER>
Save and quit (ie., <SHIFT>+'z'+'z')                   :wq<ENTER>
Quit w/o saving                                        :q!<ENTER>
Delete x lines eg., x=10 (also stored in clipboard)    d10d
Yank (copy) x lines eg., x=10                          y10y
Split screen/buffer                                    :split<ENTER>
Switch window/buffer                                   <CNTRL>ww
Go to line x eg., x=10                                 :10<ENTER>
Find matching construct (eg., from { to })             %
● Paste: 'p' undo: 'u' redo: '<CNTRL>r'
● Move up/down one line : '-' and '+'
● Search for expression exp forward with '/exp<ENTER>' or backward with '?exp<ENTER>' ('n' or 'N' navigate down/up highlighted matches)
14. Text File Compare w/ VIMDIFF
●Same commands as VIM, but highlights differences in files, allows transfer of
text btwn buffers/files; launch with 'vimdiff foo.txt foo2.txt'
●Push text from right to left (when right window active and cursor in relevant
region) using command 'dp'
●Pull text from right to left (when left window active and cursor in relevant
region) using command 'do'
15. Bash Scripting
●File and other utilities can be assembled into scripts, interpreted by the
shell eg., Bash
●The scripts can be collections of commands/utilities & fundamental
programming constructs
Code Comment                               #this is a comment
Pipe stdout of procA to stdin of procB     procA | procB
Redirect stdout of procA to file foo.txt*  procA > foo.txt
Command separator                          procA; procB
If block                                   if [ condition ]; then procA; fi
Display on stdout                          echo "hello"
Variable assignment & literal value        a="foo"; echo $a
Concatenate strings                        b="${a}foo2"; echo $b
Text Processing utilities                  sed, gawk
Search utilities                           find, grep
*Streams have file descriptors (numbers) associated with them; eg., to redirect stderr
from procA to foo.txt → procA 2> foo.txt
16. Text Processing
●Text documents are composed of records (roughly speaking, lines
separated by carriage returns) and fields (separated by spaces)
●Text processing using sed & gawk involves coupling patterns with
actions eg., print field 1 in document foo.txt when encountering word
image:
[wjb19@lionga scratch] $ gawk '/image/ {print $1;}' foo.txt
(pattern, action, input)
●Parse, without case sensitivity, change from default space field separator (FS) to equals sign, print field 2:
[wjb19@lionga scratch] $ gawk 'BEGIN{IGNORECASE=1; FS="="} /image/ {print $2;}' foo.txt
● Putting it all together → create a Bash script w/ VIM or other (eg., Pico)...
17. Bash Example I
#!/bin/bash                                                #run using bash
#set source and destination paths
DIR_PATH=~/scratch/espressoPRACE/PW
BAK_PATH=~/scratch/PW_BAK
declare -a file_list                                       #declare an array
#filenames to array
file_list=$(ls -l ${BAK_PATH} | gawk '/f90/ {print $9}')   #command output
cnt=0;
#parse files & pretty up
for x in $file_list
do
    let "cnt+=1"
    sed 's/,&/, &/g' $BAK_PATH/$x |
    sed 's/)/) /g' |
    sed 's/call/ call /g' |                                #search & replace
    sed 's/CALL/ call /g' > $DIR_PATH/$x
    echo cleaned file no. $cnt $x
done
exit
18. Bash Example II
#!/bin/bash
if [ $# -lt 6 ]                          #total arguments
then
    echo usage: fitCPCPMG.sh '[/path/and/filename.csv]
    [desired number of gaussians in mixture (2-10)]
    [no. random samples (1000-10000)]
    [mcmc steps (1000-30000)]
    [percent noise level (0-10)]
    [percent step size (0.01-20)]
    [/path/to/restart/filename.csv; optional]'
    exit
fi
ext=${1##*.}                             #file extension
if [ "$ext" != "csv" ]
then
    echo ERROR: file must be *.csv
    exit
fi
base=$(basename $1 .csv)                 #file basename
if [[ $2 -lt 2 ]] || [[ $2 -gt 10 ]]
then
    echo "ERROR: must specify 2<=x<=10 gaussians in mixture"
    exit
fi
20. The C Language
●Utilities, user applications and indeed the UNIX OS itself are executed by the
CPU, when expressed as machine code eg., store/load from memory, addition
etc
●Fundamental operations like memory allocation, I/O etc are laborious to
express at this level, most frequently we begin from a high-level language like C
●The process of creating an executable consists of at least 3 fundamental steps; creation of source code text file containing all desired objects and operations, compilation and linking eg., using the GNU tool gcc to create executable foo.x from source file foo.c (-std=c99 selects the C99 standard):
[wjb19@tesla2 scratch]$ gcc -std=c99 foo.c -o foo.x
[Diagram: Source file (*.c) → compile → Object code (*.o) → link, together with library objects → Executable]
21. C Code Elements I
●Composed of primitive datatypes (eg., int, float, long), which
have different sizes in memory, multiples of 1 byte
●May be composed of statically allocated memory (compile time),
dynamically allocated memory (runtime), or both
●Pointers (eg., float *) are primitives with 4 or 8 byte lengths (32bit or
64bit machines) which contain an address to a contiguous region of
dynamically allocated memory
●More complicated objects can be constructed from primitives and arrays
eg., a struct
22. C Code Elements II
●Common operations are gathered into functions, the most common
being main(), which must be present in executable
●Functions have a distinct name, take arguments, and return output; this
information comprises the prototype, expressed separately to the
implementation details, former often in header file
●Important system functions include read,write,printf (I/O) and
malloc,free (Memory)
●The operating system executes compiled code; a running program is a
process (more next time)
23. C Code Example
#include <stdio.h>
#include <stdlib.h>       //tells preprocessor to include these headers;
#include "allDefines.h"   //system functions etc

//Kirchoff Migration function in psktmCPU.c
//function prototype; must give arguments, their types and return type;
//implementation elsewhere
void ktmMigrationCPU(struct imageGrid* imageX,
                     struct imageGrid* imageY,
                     struct imageGrid* imageZ,
                     struct jobParams* config,
                     float* midX,
                     float* midY,
                     float* offX,
                     float* offY,
                     float* traces,
                     float* slowness,
                     float* image);

int main()
{
    int IMAGE_SIZE = 10;
    float* image = (float*) malloc (IMAGE_SIZE*sizeof(float));
    printf("size of image = %i\n",IMAGE_SIZE);
    for (int i=0; i<IMAGE_SIZE; i++)
        printf("image point %i = %f\n",i,image[i]);
    free(image);
    return 0;
}
24. UNIX C Good Practice I
●Use three streams, with file descriptors 0,1,2 respectively, allows
assembly of operations into pipeline and these data streams are
'cheap' to use
●Only hand simple command line options to main() using
argc,argv[]; in general we wish to handle short and long options
(eg., see GNU coding standards) and the use of getopt_long()
is preferable.
●Utilize the environment variables of the host shell, particularly in
setting runtime conditions in executed code via getenv() eg., in
Bash set in .bashrc config file or via command line:
[wjb19@lionga scratch] $ export MY_STRING=hello
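A minimal sketch reading that variable back from C with getenv() (MY_STRING as exported above; the NULL check matters since the variable may be unset):

#include <stdio.h>
#include <stdlib.h>

int main(){
    //getenv returns a pointer to the value string, or NULL if unset
    char* val = getenv("MY_STRING");
    if (val != NULL)
        printf("MY_STRING = %s\n", val);
    else
        fprintf(stderr, "MY_STRING not set\n");
    return 0;
}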
●If your project/program requires a) sophisticated objects b) many
developers c) would benefit from object oriented design principles, you
should consider writing in C++ (although being a higher-level language it is
harder to optimize)
25. UNIX C Good Practice II
●In high performance applications, avoid system calls eg.,
read/write where control is given over to the kernel and processes
can be blocked until the resource is ready eg., disk
● IF system calls must be used, handle errors and report to stderr
● IF temporary files must be written, use mkstemp which sets permissions, followed by unlink; the file descriptor is closed by the kernel when the program exits and the file removed
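A minimal sketch of this mkstemp/unlink pattern (the template name is hypothetical):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(){
    //trailing XXXXXX is replaced by mkstemp; file created with mode 0600
    char tmpl[] = "/tmp/myapp-XXXXXX";
    int fd = mkstemp(tmpl);
    if (fd == -1){
        perror("mkstemp");
        exit(EXIT_FAILURE);
    }
    //unlink now; the file persists until fd is closed at exit,
    //then the kernel removes it
    unlink(tmpl);
    write(fd, "scratch data\n", 13);
    return 0;
}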
●Use assert to test validity of function arguments, statements etc;
will introduce performance hit, but asserts can be removed at compile
time with NDEBUG macro (C standard)
●Debug with gdb, profile with gprof, valgrind; target most
expensive functions for optimization
●Put common functions in/use libraries wherever possible....
26. Key HPC Libraries
●BLAS/LAPACK/ScaLAPACK
● Original basic and extended linear algebra routines
● http://www.netlib.org/
●Intel Math Kernel Library (MKL)
● implementation of above routines, w/ solvers, fft etc
● http://software.intel.com/en-us/articles/intel-mkl/
●AMD Core Math Library (ACML)
● Ditto
● http://developer.amd.com/libraries/acml/pages/default.aspx
●OpenMPI
● Open source MPI implementation
● http://www.open-mpi.org/
●PETSc
● Data structures and routines for parallel scientific applications based on PDE's
● http://www.mcs.anl.gov/petsc/petsc-as/
27. UNIX C Compilation I
●In general the creation and use of shared libraries (*.so) is preferable to static (*.a), for space reasons and ease of software updates
●Program in modules and link separate objects
●Use the -fPIC flag in shared library compilation; PIC == position independent, code in the shared object does not depend on the address/location at which it is loaded
●Use the make utility to manage builds (more next time)
●Don't forget to update your PATH and LD_LIBRARY_PATH env vars w/ your binary executable path & any libraries you need/created for the application, respectively
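For instance, a minimal sketch of building a shared library this way (file and library names hypothetical):

[wjb19@lionga scratch] $ gcc -std=c99 -fPIC -c myLib.c
[wjb19@lionga scratch] $ gcc -shared -o libmyLib.so myLib.o
[wjb19@lionga scratch] $ export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH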
28. UNIX C Compilation II
●Remember in compilation steps to -I/set/header/paths and keep interface (in headers) separate from implementation as much as possible
●Remember in linking steps for shared libs to:
● -L/set/path/to/library AND
● set flag -lmyLib, where
● /set/path/to/library/libmyLib.so must exist
otherwise you will have undefined references and/or 'can't find -lmyLib' etc
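e.g., a sketch of the compile and link steps against the hypothetical libmyLib.so built on the previous slide:

[wjb19@lionga scratch] $ gcc -std=c99 -I/set/header/paths -c foo.c
[wjb19@lionga scratch] $ gcc foo.o -L/set/path/to/library -lmyLib -o foo.x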
●Compile with -Wall or similar and fix all warnings
●Read the manual :)
29. Conclusions
●High Performance Computing Systems are an assembly of hardware and
software working together, usually based on the UNIX OS; multiple compute
nodes are connected together
●The UNIX kernel is surrounded by a shell eg., Bash; commands and constructs may be assembled into scripts
●UNIX, associated utilities and user applications are traditionally written in high-
level languages like C
●HPC user applications may take advantage of shared or distributed memory
compute models, or both
●Regardless, good code minimizes I/O, keeps data resident in memory for as
long as possible and minimizes communication between processes
●User applications should take advantage of existing high performance libraries,
and tools like gdb, gprof and valgrind
31. Exercises
●Take supplied code and compile using gcc, creating executable
foo.x; attempt to run as './foo.x'
●Code has a segmentation fault, an error in memory allocation which is
handled via the malloc function
●Recompile with debug flag -g, run through gdb and correct the source of the segmentation fault
●Load the valgrind module ie., 'module load valgrind' and
then run as 'valgrind ./foo.x'; this powerful profiling tool will
help identify memory leaks, or memory on the heap* which has not been
freed
●Write a Bash script that stores your home directory file contents in an
array and :
● Uses sed to swap vowels (eg., 'a' and 'e') in names
● Parses the array of names and returns only a single match, if it exists,
else echo NOMATCH
*heap== region of dynamically allocated memory
32. GDB quick start
●Launch :
[wjb19@tesla1 scratch]$ gdb ./foo.x
●Run w/ command line argument '100' :
(gdb) run 100
●Set breakpoint at line 10 in source file :
(gdb) b foo.c:10
Breakpoint 1 at 0x400594: file foo.c, line 10.
(gdb) run
Starting program: /gpfs/scratch/wjb19/foo.x
Breakpoint 1, main () at foo.c:22
22        int IMAGE_SIZE = 10;
●Step to next instruction (issuing 'continue' will resume execution) :
(gdb) step
23        float * image = (float*) malloc (IMAGE_SIZE*sizeof(float));
●Print element 2 of array 'image' :
(gdb) p image[2]
$4 = 0
●Display full backtrace :
(gdb) bt full
#0  main () at foo.c:27
        i = 0
        IMAGE_SIZE = 10
        image = 0x601010
33. HPC Essentials
Part II : Elements of Parallelism
Bill Brouwer
Research Computing and Cyberinfrastructure
(RCC), PSU
wjb19@psu.edu
35. Motivation
●The problems in science we seek to solve are becoming increasingly large, as we go down in scale (eg., quantum chemistry) or up (eg., astrophysics)
●As a natural consequence, we seek both performance and scaling in our
scientific applications
●Therefore we want to increase floating point operations performed and memory
bandwidth and thus seek parallelization as we run out of resources using a
single processor
●We are limited by Amdahl's law, an expression of the maximum improvement of parallel code over serial:
speedup = 1/((1-P) + P/N)
where P is the portion of application code we parallelize, and N is the number of processors ie., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking; eg., with P = 0.9 and N = 16, the speedup is 1/(0.1 + 0.9/16) ≈ 6.4
36. Motivation
●Unless the portion of code we can parallelize approaches 100%, we see
rapidly diminishing returns with increasing numbers of processors
[Plot: improvement factor vs. number of processors (0-256) for P = 10%, 30%, 60% and 90%; each curve plateaus, at a higher level the larger P]
●Nonetheless, for many applications we have a good chance of
parallelizing the vast majority of the code...
37. Example : Kirchhoff Time Migration
●KTM is a technique used widely in oil+gas exploration, providing images
into the earth's interior, used to identify resources
●Seismic trace data acquired over 2D geometry is integrated to give
image of earth's interior, using ~ Green's method
●Input is generally 10^4 – 10^6 traces, 10^3 – 10^4 data points each, ie.,
lots of data to process; output image is also very large
●This is an integral technique (ie., summation, easy to parallelize), just
one of many popular algorithms performed in HPC
[Equation: image point value = Σ over traces of weight × trace data evaluated at the computed traveltime; x == image space, t == traveltime; the seismic-space coordinate symbol was lost in extraction]
38. Common Operations in HPC
● Integration
● Load/store, add & multiply
● eg., transforms
● Derivatives (Finite differences)
● Load/store, subtract & divide
● eg., PDE
● Linear Algebra
● Load/store, subtract/add/multiply/divide
● chemistry & physics, solvers
● sparse (classical physics) & dense (quantum)
●Regardless of the operations performed, after compilation into machine code,
when executed by the CPU, instructions are clocked through a pipeline into
registers for execution
●Instruction execution generally takes place in four steps, and multiple
instruction groups are concurrent within the pipeline; execution rate is a direct
function of the clock rate
39. Execution Pipeline
●This is the most fine-grained form of parallelism; its efficiency is a strong function of branch prediction hardware, or the prediction of which instruction in a program is the next to execute*
●At a similar level, present in more recent devices are so-called streaming SIMD
extension (SSE) registers and associated compute hardware
[Diagram: four-stage pipeline — 1. Fetch, 2. Decode, 3. Execute, 4. Write-back — shown across clock cycles 0-7; instructions progress from pending through executing to completed]
*assisted by compiler hints
40. SSE
●Streaming SIMD (Single instruction, multiple Data) computation exploits special
registers and instructions to increase computation many-fold in certain cases,
since several data elements are operated on simultaneously
●Each of the 8 SSE registers (labeled xmm0 through xmm7) is 128 bits long, storing 4 x 32-bit floating-point numbers; SSE2 and SSE3 specifications have expanded the allowed datatypes to include doubles, ints etc
[Register layout: bit 127 ... bit 0 = float3 | float2 | float1 | float0]
●Operations may be 'scalar' or 'packed' (ie., vector), expressed using intrinsics or in an __asm block within C code eg.,
addps xmm0,xmm1
(operation, dst operand, src operand)
●One can either code the intrinsics explicitly, or rely on the compiler eg., icc with optimization (-O3)
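As a sketch, the same packed add written with compiler intrinsics rather than an __asm block (xmmintrin.h declares the SSE intrinsics; _mm_add_ps compiles to addps):

#include <stdio.h>
#include <xmmintrin.h>

int main(){
    //load 4 packed single-precision floats into each 128-bit register
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f);
    //packed add; all four lanes computed simultaneously
    __m128 c = _mm_add_ps(a, b);
    float out[4];
    _mm_storeu_ps(out, c);
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}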
● The next level up of parallelization is the multiprocessor...
41. Multiprocessor Overview
●Multiprocessors or multiple core CPU's are becoming ubiquitous; better scaling
(cf Moore's law) but limited by contention for shared resources, especially
memory
●Most commonly we deal with Symmetric Multiprocessors (SMP), with unique
cache and registers, as well as shared memory region(s); more on cache in a
moment
●Memory is not necessarily next to processors → Non-uniform Memory Access (NUMA); try to ensure memory access is as local to CPU core(s) as possible
●The proc directory on UNIX machines is a special directory written and updated by the kernel, containing information on CPU (/proc/cpuinfo) and memory (/proc/meminfo)
[Diagram: SMP — CPU0 and CPU1, each with private registers and cache, connected to shared main memory]
●The fundamental unit of work on the cores is a process...
42. Processes
●Application processes are launched on the CPU by the kernel using the
fork() system call; every process has a process ID pid, available on UNIX
systems via the getpid() system call
●The kernel manages many processes concurrently; all information required to
run a process is contained in the process control block (PCB) data structure,
containing (among other things):
● The pid
● The address space
● I/O information eg., open files/streams
● Pointer to next PCB
●Processes may spawn children using the fork() system call; children are
initially a copy of the parent, but may take on different attributes via the exec()
call
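A minimal sketch of the fork()/exec() pattern (the command run by the child is arbitrary):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(){
    pid_t pid = fork();       //child is a copy of the parent from here
    if (pid == 0){
        //child replaces its image via an exec call
        execlp("ls", "ls", "-lah", (char*) NULL);
        perror("execlp");     //only reached if exec fails
        exit(EXIT_FAILURE);
    }
    waitpid(pid, NULL, 0);    //parent waits on the child pid
    printf("child %d done\n", pid);
    return 0;
}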
43. Processes
●A child process takes the id of the parent (ppid), and additionally has a unique
pid eg., output from ps command, describing itself :
[wjb19@tesla1 ~]$ ps -eHo "%P %p %c %t %C"
PPID PID COMMAND ELAPSED %CPU
12608 1719 sshd 01:07:54 0.0
1719 1724 sshd 01:07:49 0.0
1724 1725 bash 01:07:48 0.0
1725 1986 ps 00:00 0.0
●During a context switch, kernel will swap one process control block for another;
context switches are detrimental to HPC and have one or more triggers,
including:
● I/O requests
● Timer interrupts
●Context switching is a very fine-grained form of scheduling; on compute
clusters we also have coarse grained scheduling in the form of job scheduling
software (more next time)
●The unique address space from the perspective of the process is referred to as
virtual memory
44. Virtual Memory
●A running process is given memory by the kernel, referred to as virtual memory
(VM); address space does not correspond to physical memory address space
●The Memory Management Unit (MMU) on CPU translates between the two
address spaces, for requests made between process and OS
●Virtual Memory for every process has the same structure, below left; virtual
address space is divided into units called pages
●The MMU is assisted in address translation by the Translation Lookaside Buffer (TLB), which stores page details in a cache
●Cache is high speed memory immediately adjacent to the CPU and its registers, connected via bus(es)
[Diagram: process virtual address space — environment variables and function arguments at the high addresses, then the stack (growing down), an unused region, the heap (growing up), and instructions at the low addresses]
45. Cache : Introduction
●In HPC, we talk about problems being compute or memory bound
● In the former case, we are limited by the rate at which instructions can be executed by the CPU
● In the latter, we are limited by the rate at which data can be delivered to the CPU
●Both instructions and data are loaded into cache; cache memory is laid
out in lines
●Cache memory is intermediate in the overall hierarchy, lying between CPU registers and main memory
● If the executing process requests an address corresponding to data or
instructions in cache, we have a 'hit', else 'miss', and a much slower
retrieval of instruction or data from main memory must take place
46. Cache : Introduction
●Modern architectures have various levels of cache and divisions of
responsibilities, we will follow valgrind-cachegrind convention, from the
manual:
... It simulates a machine with independent first-level instruction and data caches
(I1 and D1), backed by a unified second-level cache (L2). This exactly matches
the configuration of many modern machines.
However, some modern machines have three levels of cache. For these
machines (in the cases where Cachegrind can auto-detect the cache
configuration) Cachegrind simulates the first-level and third-level caches. The
reason for this choice is that the L3 cache has the most influence on runtime, as it
masks accesses to main memory. Furthermore, the L1 caches often have low
associativity, so simulating them can detect cases where the code interacts badly
with this cache (eg. traversing a matrix column-wise with the row length being a
power of 2)
47. Cache Example
●The distribution of data to cache levels is largely set by compiler,
hardware and kernel, however the programmer is still responsible for the
best data access patterns in his/her code possible
●Use cachegrind to optimize data alignment & cache usage eg.,
#include <stdlib.h>
#include <stdio.h>
int main(){
    int SIZE_X,SIZE_Y;
    SIZE_X=2048;
    SIZE_Y=2048;
    float * data = (float*) malloc(SIZE_X*SIZE_Y*sizeof(float));
    for (int i=0; i<SIZE_X; i++)
        for (int j=0; j<SIZE_Y; j++)
            data[j+SIZE_Y*i] = 10.0f * 3.14f;
            //bad data access
            //data[i+SIZE_Y*j] = 10.0f * 3.14f;
    free(data);
    return 0;
}
48. Cache : Bad Access
bill@billHPEliteBook6930p:~$ valgrind --tool=cachegrind ./foo.x
==3088== Cachegrind, a cache and branch-prediction profiler
==3088== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==3088== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==3088== Command: ./foo.x
==3088==
==3088==
==3088== I refs: 50,503,275
==3088== I1 misses: 734
==3088== LLi misses: 733 instructions
==3088== I1 miss rate: 0.00%
==3088== LLi miss rate: 0.00%
==3088== READ Ops WRITE Ops
==3088== D refs: 33,617,678 (29,410,213 rd + 4,207,465 wr)
==3088== D1 misses: 4,197,161 ( 2,335 rd + 4,194,826 wr)
==3088== LLd misses: 4,196,772 ( 1,985 rd + 4,194,787 wr) data
==3088== D1 miss rate: 12.4% ( 0.0% + 99.6% )
==3088== LLd miss rate: 12.4% ( 0.0% + 99.6% )
==3088==
==3088== LL refs: 4,197,895 ( 3,069 rd + 4,194,826 wr)
==3088== LL misses: 4,197,505 ( 2,718 rd + 4,194,787 wr)
==3088== LL miss rate: 4.9% ( 0.0% + 99.6% )
lowest level
49. Cache : Good Access
bill@billHPEliteBook6930p:~$ valgrind --tool=cachegrind ./foo.x
==4410== Cachegrind, a cache and branch-prediction profiler
==4410== Copyright (C) 2002-2010, and GNU GPL'd, by Nicholas Nethercote et al.
==4410== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==4410== Command: ./foo.x
==4410==
==4410==
==4410== I refs: 50,503,275
==4410== I1 misses: 734
==4410== LLi misses: 733
==4410== I1 miss rate: 0.00%
==4410== LLi miss rate: 0.00%
==4410==
==4410== D refs: 33,617,678 (29,410,213 rd + 4,207,465 wr)
==4410== D1 misses: 265,002 ( 2,335 rd + 262,667 wr)
==4410== LLd misses: 264,613 ( 1,985 rd + 262,628 wr)
==4410== D1 miss rate: 0.7% ( 0.0% + 6.2% )
==4410== LLd miss rate: 0.7% ( 0.0% + 6.2% )
==4410==
==4410== LL refs: 265,736 ( 3,069 rd + 262,667 wr)
==4410== LL misses: 265,346 ( 2,718 rd + 262,628 wr)
==4410== LL miss rate: 0.3% ( 0.0% + 6.2% )
50. Cache Performance
●For large data problems, any speedup introduced by parallelization can easily
be negated by poor cache utilization
●In this case, memory bandwidth is an order of magnitude worse for problem
size (2^14)^2 (cf earlier note on widely variable memory bandwidths; we have to
work hard to approach peak)
● In many cases we are limited also by random access patterns
[Plot: execution time (s) vs. log2 SIZE_X from 10 to 14, for high-%-miss and low-%-miss access patterns; the high-miss curve climbs far more steeply]
52. POSIX Threads I
●A process may spawn one or more threads; on a multiprocessor, the
OS can schedule these threads across a variety of cores, providing
parallelism in the form of 'light-weight processes' (LWP)
●Whereas a child process receives a copy of the parent's virtual memory
and executes independently thereafter, a thread shares the memory of
the parent including instructions, and also has private data
Using threads we perform shared memory processing (cf distributed
●
memory, next time)
●We are at liberty to launch as many threads as we wish, although as you
might expect, performance takes a hit as more threads are launched
than can be scheduled simultaneously across available cores
53. POSIX Threads II
●Pthreads refers to the POSIX standard, which is just a specification;
implementations exist for various systems
●Each pthread has:
● An ID
● Attributes :
● Stack size
● Schedule information
●Much like processes, we can monitor thread execution using utilities
such as top and ps
●The memory shared among threads must be used carefully in order to
prevent race conditions, or threads seeing incorrect data during
execution, due to more than one thread performing operations on said
data, in an uncoordinated fashion
54. POSIX Threads III
●Race conditions may be ameliorated through careful coding, but also
through explicit constructs eg., locks, whereby a single thread gains and
relinquishes control→ implies serialization and computational overhead
●Multi-threaded programs must also avoid deadlock, a highly undesirable state where one or more threads await resources, and in turn are unable to offer up resources required by others
●Deadlocks can also be avoided through good coding, as well as the use
of communication techniques based around semaphores, for example
●Threads awaiting resources may sleep (context switch by kernel, slow,
saves cycles) or busy wait (executes while loop or similar checking
semaphore, fast, wastes cycles)
55. Pthreads Example
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;                        //global (shared) variable
void *worker(void *param);

int main(int argc, char *argv[]){               //main thread
    pthread_t tid;                              //thread id & attributes
    pthread_attr_t attr;
    if (argc!=2 || atoi(argv[1])<0){
        printf("usage : a.out <int value>, where int value > 0\n");
        return 1;
    }
    pthread_attr_init(&attr);
    pthread_create(&tid,&attr,worker,argv[1]);  //worker thread creation
    pthread_join(tid,NULL);                     //& join after completion
    printf("sum = %d\n",sum);
}

void * worker(void *total){
    int upper=atoi(total);      //local (private) variable
    sum = 0;
    for (int i=0; i<upper; i++)
        sum += i;
    pthread_exit(0);
}
56. Valgrind-helgrind output
[wjb19@hammer16 scratch]$ valgrind --tool=helgrind -v ./foo.x 100
==5185== Helgrind, a thread error detector
==5185== Copyright (C) 2007-2009, and GNU GPL'd, by OpenWorks LLP et al.
==5185== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==5185== Command: ./foo.x 100
==5185==
--5185-- Valgrind options:
--5185--    --tool=helgrind
--5185--    -v
--5185-- Contents of /proc/version:
--5185--   Linux version 2.6.18-274.7.1.el5 (mockbuild@x86-004.build.bos.redhat.com) (gcc version
(system calls establishing the thread ie., there is a COST to create and destroy threads:)
--5185-- REDIR: 0x3a97e7c240 (memcpy) redirected to 0x4a09e3c (memcpy)
--5185-- REDIR: 0x3a97e79420 (index) redirected to 0x4a09bc9 (index)
--5185-- REDIR: 0x3a98a069a0 (pthread_create@@GLIBC_2.2.5) redirected to 0x4a0b2a5 (pthread_create@*)
--5185-- REDIR: 0x3a97e749e0 (calloc) redirected to 0x4a05942 (calloc)
--5185-- REDIR: 0x3a98a08ca0 (pthread_mutex_lock) redirected to 0x4a076c2 (pthread_mutex_lock)
--5185-- REDIR: 0x3a97e74dc0 (malloc) redirected to 0x4a0664a (malloc)
--5185-- REDIR: 0x3a98a0a020 (pthread_mutex_unlock) redirected to 0x4a07b66 (pthread_mutex_unlock)
--5185-- REDIR: 0x3a97e79b50 (strlen) redirected to 0x4a09cbb (strlen)
--5185-- REDIR: 0x3a98a07a10 (pthread_join) redirected to 0x4a07431 (pthread_join)
sum = 4950
==5185==
==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)
--5185--
--5185-- used_suppression:      1 helgrind-glibc2X-101
--5185-- used_suppression:      1 helgrind-glibc2X-112
--5185-- used_suppression:      1 helgrind-glibc2X-102
==5185==
==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)
58. Helgrind output w/ race
[wjb19@hammer16 scratch]$ valgrind --tool=helgrind ./foo.x 100
(build foo.x with debug on (-g) to find source file line(s) w/ error(s))
==5384== Helgrind, a thread error detector
==5384== Copyright (C) 2007-2009, and GNU GPL'd, by OpenWorks LLP et al.
==5384== Using Valgrind-3.5.0 and LibVEX; rerun with -h for copyright info
==5384== Command: ./foo.x 100
==5384==
==5384== Thread #1 is the program's root thread
==5384==
==5384== Thread #2 was created
==5384==    at 0x3A97ED447E: clone (in /lib64/libc-2.5.so)
==5384==    by 0x3A98A06D87: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread-2.5.so)
==5384==    by 0x4A0B206: pthread_create_WRK (hg_intercepts.c:229)
==5384==    by 0x4A0B2AD: pthread_create@* (hg_intercepts.c:256)
==5384==    by 0x400748: main (fooThread2.c:18)
==5384==
==5384== Possible data race during write of size 4 at 0x600cdc by thread #1
==5384==    at 0x400764: main (fooThread2.c:20)
==5384==  This conflicts with a previous write of size 4 by thread #2
==5384==    at 0x4007E3: worker (fooThread2.c:31)
==5384==    by 0x4A0B330: mythread_wrapper (hg_intercepts.c:201)
==5384==    by 0x3A98A0673C: start_thread (in /lib64/libpthread-2.5.so)
==5384==    by 0x3A97ED44BC: clone (in /lib64/libc-2.5.so)
==5384==
●Pthreads is a versatile albeit large and inherently complicated interface
●We are primarily concerned with 'simply' dividing a workload among
available cores; OpenMP proves much less unwieldy to use
59. OpenMP Introduction
●OpenMP is a set of multi-platform/OS compiler directives, libraries and
environment variables for readily creating multi-threaded applications
●The OpenMP standard is managed by a review board, and is defined by a large
number of hardware vendors
●Applications written using OpenMP employ pragmas, or statements interpreted
by the preprocessor (before compilation), representing functionality like fork &
join that would take considerably more effort and care to implement otherwise
●OpenMP pragmas or directives indicate parallel sections of code ie., after
compilation, at runtime, threads are each given a portion of work eg., in this
case, loop iterations will be divided evenly among running threads :
#pragma omp parallel for
for (int i=0; i<SIZE; i++)
y[i]=x[i]*10.0f;
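e.g., a minimal sketch of building and running such a loop (the flag is compiler-dependent; -fopenmp for GNU, -openmp for older Intel compilers):

[wjb19@lionga scratch] $ gcc -std=c99 -fopenmp foo.c -o foo.x
[wjb19@lionga scratch] $ export OMP_NUM_THREADS=8
[wjb19@lionga scratch] $ ./foo.x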
60. OpenMP Clauses I
●The number of threads launched during parallel blocks may be set via function
calls or by setting the OMP_NUM_THREADS environment variable
●Data objects are generally by default shared (loop counters are private by
default), a number of pragma clauses are available, which are valid for the
scope of the parallel section eg., :
● private
● shared
● firstprivate -initialized to value before parallel block
● lastprivate -variable keeps value after parallel block
● reduction -thread safe way of combining data at conclusion of parallel
block
●Thread synchronization is implicit to parallel sections; there are a variety of clauses available for controlling this behavior also, including :
● critical - one thread at a time works in this section eg., in order to avoid race (expensive, design your code to avoid at all costs)
● atomic - safe memory updates performed using eg., mutual exclusion (cost)
● barrier - threads wait at this point for others to arrive
61. OpenMP Clauses II
●OpenMP has default thread scheduling behavior handled via the runtime library, which may be modified through use of the schedule(type,chunk) clause, with types :
● static - loop iterations are divided among threads equally by default; specifying an integer for the parameter chunk will allocate a number of contiguous iterations to a thread
● dynamic - total iterations form a pool, from which threads work on small contiguous subsets until all are complete, with subset size given again by chunk
● guided - a large section of contiguous iterations are allocated to each thread dynamically. The section size decreases exponentially with each successive allocation to a minimum size specified by chunk
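e.g., a minimal sketch requesting dynamic scheduling with a chunk of 4 iterations:

#pragma omp parallel for schedule(dynamic, 4)
for (int i=0; i<SIZE; i++)
    y[i]=x[i]*10.0f;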
62. OpenMP Example : KTM
●In our first attempt at parallelization shortly, we simply add an OpenMP pragma
before the computational loops in worker function:
#pragma omp parallel for
//loop over trace records
for (int k=0; k<config->traceNo; k++){
    //loop over imageX
    for(int i=0; i<Li; i++){
        tempC = (midX[k] - imageXX[i] - offX[k]) * (midX[k] - imageXX[i] - offX[k]);
        tempD = (midX[k] - imageXX[i] + offX[k]) * (midX[k] - imageXX[i] + offX[k]);
        //loop over imageY
        for(int j=0; j<Lj; j++){
            tempA = tempC + (midY[k] - imageYY[j] - offY[k]) * (midY[k] - imageYY[j] - offY[k]);
            tempB = tempD + (midY[k] - imageYY[j] + offY[k]) * (midY[k] - imageYY[j] + offY[k]);
            //loop over imageZ
            for (int l=0; l<Ll; l++){
                temp = sqrtf(tauS[l] + tempA * slownessS[l]);
                temp += sqrtf(tauS[l] + tempB * slownessS[l]);
                timeIndex = (int) (temp / sRate);
                if ((timeIndex < config->tracePts) && (timeIndex > 0)){
                    image[i*Lj*Ll + j*Ll + l] +=
                        traces[timeIndex + k * config->tracePts] * temp * sqrtf(tauS[l] / temp);
                }
            } //imageZ
        } //imageY
    } //imageX
}//input trace records
63. OpenMP KTM Results
●Scales well up to eight cores, then drops off; SMP model has deficiencies due
to a number of factors, including :
● Coverage (Amdahl's law); as we increase processors, relative cost of serial
code portion increases
● Hardware limitations
● Locality...
[Plot: execution time vs. CPU cores (1, 2, 4, 8, 16); time drops steadily up to 8 cores, with little further gain at 16]
64. CPU Affinity (Intel*)
●Recall that the OS schedules processes and threads using context
switches; can be detrimental → threads may resume on different core,
destroying locality
●We can change this by restricting threads to execute on a subset of
processors, by setting processor affinity
●Simplest approach is to set environment variable KMP_AFFINITY to:
● determine the machine topology,
● assign threads to processors
●Usage:
KMP_AFFINITY=[<modifier>]<type>[<permute>][<offset>]
*For GNU, the ~ equivalent env var is GOMP_CPU_AFFINITY
65. CPU Affinity Settings
●The modifier may take settings corresponding to granularity (with specifiers:
fine, thread, and core), as well as a processor list (proclist={<proc
list>}), verbose, warnings and others
● The type settings refer to the nature of the affinity, and may take values :
● compact-try to assign thread n+1 context as close as possible to n
● disabled
● explicit-force assign of threads to processors in proclist
● none-just return the topology w/ verbose modifier
● scatter-distribute as evenly as possible
●fine & thread refer to the same thing, namely that threads only resume in
the same context; the core modifier implies that they may resume within a
different context, but the same physical core
●CPU affinity can affect application performance significantly and is worth tuning, based on your application and the machine topology...
66. CPU Topology Map
●For any given computational node, we have several different physical devices
(packages in sockets), comprised of cores (eg., two here), which run one or two
thread contexts
●Without hyperthreading, there is only a single context per core ie., modifiers
thread/fine, core are indistinguishable
[Diagram: a node contains packageA and packageB; each package has core0 and core1; each core supports thread contexts 0 and 1]
67. CPU Affinity Examples
●Display machine topology map eg., Hammer :
[wjb19@hammer16 scratch] $ export KMP_AFFINITY=verbose,none
[wjb19@hammer16 scratch] $ ./psktm.x
OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #156: KMP_AFFINITY: 12 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
68. CPU Affinity Examples
●Set affinity with compact setting, fine granularity :
[wjb19@hammer5 scratch]$ export KMP_AFFINITY=verbose,granularity=fine,compact
[wjb19@hammer5 scratch]$ ./psktm.x
OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #156: KMP_AFFINITY: 12 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 10
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 1
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 8
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 9
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 10
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {10}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {6}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {9}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {3}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {11}
69. Conclusions
●Scientific research is supported by computational scaling and performance,
both provided by parallelism, limited to some extent by Amdahl's law
●Parallelism has various levels of granularity; at the finest level is the instruction
pipeline and vectorized registers eg., SSE
●The next level up in parallel granularity is the multiprocessor; we may run many
concurrent threads using the pthreads API or the OpenMP standard for instance
●Threads must be coded and handled with care, to avoid race and deadlock
conditions
●Performance is a strong function of cache utilization; benefits introduced
through parallelization can easily be negated by sloppy use of memory
bandwidth
●Scaling across cores is limited by hardware and Amdahl's law, but also by locality; we have some control over the latter using KMP_AFFINITY for instance
71. Exercises
●Take the supplied code and parallelize using OpenMP
pragma around the worker function
●Create a makefile which builds the code, compare timings
btwn serial & parallel by varying OMP_NUM_THREADS
●Examine effect of various settings for KMP_AFFINITY
72. Build w/ Confidence : make
#Makefile for basic Kirchhoff Time Migration example
#set compiler
CC=icc -openmp
#set build options
CFLAGS=-std=c99 -c
#main executable
all: psktm.x
#objects and dependencies (indent rules with tab only!)
psktm.x: psktmCPU.o demoA.o
	$(CC) psktmCPU.o demoA.o -o psktm.x
psktmCPU.o: psktmCPU.c
	$(CC) $(CFLAGS) psktmCPU.c
demoA.o: demoA.c
	$(CC) $(CFLAGS) demoA.c
clean:
	rm -rf *.o psktm.x
73. HPC Essentials
Part III : Message Passing Interface
Bill Brouwer
Research Computing and Cyberinfrastructure
(RCC), PSU
wjb19@psu.edu
75. Motivation
●We saw last time that Amdahl's law implies an asymptotic limit to performance gains from parallelism, where parallel P and serial (1-P) code portions have fixed relative cost
●We looked at threads (“light-weight processes”) and also saw that
performance depends on a variety of things, including good cache
utilization and affinity
●For the problem size investigated, ultimately the limiting factor was disk
I/O, there was no sense going beyond a single compute node; in a
machine with 16 cores or more, there is no point when P < 60%, should
the process have sufficient memory
●However, as we increase our problem size, the relative parallel/serial
cost changes and P can approach 1
76. Motivation
●In the limit as processors N → ∞ we find the maximum performance improvement :
1/(1-P)
●It is helpful to see the 3dB points for this limit ie., the number of processors N½ required to achieve (1/√2)*max = 1/(√2*(1-P)); equating with Amdahl's law & after some algebra :
N½ = 1/((1-P)*(√2-1))
[Plot: N½ vs. parallel code fraction P from 0.90 to 0.99; N½ rises from roughly 25 at P = 0.90 towards 250 at P = 0.99]
77. Motivation
●Points to note from the graph :
● P ~ 0.90, we can benefit from ~ 20 cores
● P ~ 0.99, we can benefit from a cluster size of ~ 256 cores
● P → 1, we approach the “embarrassingly parallel” limit
● P ~ 1, performance improvement directly proportional to cores
● P ~ 1 implies independent or batch processes
●Quite aside from considerations of Amdahl's law, as the problem size
grows, we may simply exceed the memory available on a single node
●In this case, must move to a distributed memory processing
model/multiple nodes (unless P ~ 1 of course)
●How do we determine P? → PROFILING
78. Profiling w/ Valgrind
[wjb19@lionxf scratch]$ valgrind --tool=callgrind ./psktm.x
[wjb19@lionxf scratch]$ callgrind_annotate --inclusive=yes callgrind.out.3853
Profile data file 'callgrind.out.3853' (creator: callgrind-3.5.0)
I1 cache:
D1 cache:
L2 cache:
Timerange: Basic block 0 - 2628034011
Trigger: Program termination
Profiled target: ./psktm.x (PID 3853, part 1)
(the parallelizable worker function is 99.5% of total instructions executed)

20,043,133,545 PROGRAM TOTALS

Ir file:function
20,043,133,545 ???:0x0000003128400a70 [/lib64/ld-2.5.so]
20,042,523,959 ???:0x0000000000401330 [/gpfs/scratch/wjb19/psktm.x]
20,042,522,144 ???:(below main) [/lib64/libc-2.5.so]
20,042,473,687 /gpfs/scratch/wjb19/demoA.c:main
20,042,473,687 demoA.c:main [/gpfs/scratch/wjb19/psktm.x]
19,934,044,644 psktmCPU.c:ktmMigrationCPU [/gpfs/scratch/wjb19/psktm.x]
19,934,044,644 /gpfs/scratch/wjb19/psktmCPU.c:ktmMigrationCPU
 6,359,083,826 ???:sqrtf [/gpfs/scratch/wjb19/psktm.x]
 4,402,442,574 ???:sqrtf.L [/gpfs/scratch/wjb19/psktm.x]
   104,966,265 demoA.c:fileSizeFourBytes [/gpfs/scratch/wjb19/psktm.x]
●If we wish to scale outside a single node, we must use some form of interprocess communication
79. Inter-Process Communication
● There are a variety of ways for processes to exchange information, including:
● Memory (~last week)
● Files
● Pipes (named/anonymous)
● Signals
● Sockets
● Message Passing
● File I/O is too slow, and read/writes liable to race conditions
● Anonymous & named pipes are highly efficient but FIFO (first in, first out)
buffers, allowing only unidirectional communication, and between processes on
the same node
●Signals are a very limited form of communication, sent to the process after an
interrupt by the kernel, and handled using a default handler or one specified
using signal() system call
●Signals may come from a variety of sources eg., segmentation fault (SIGSEGV),
keyboard interrupt Ctrl-C (SIGINT) etc
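A minimal sketch installing a handler for SIGINT with the signal() system call mentioned above:

#include <signal.h>
#include <unistd.h>

void handler(int sig){
    //little is async-signal-safe inside a handler; write() is
    write(2, "caught SIGINT\n", 14);
}

int main(){
    signal(SIGINT, handler);  //replace the default Ctrl-C action
    for (;;)
        pause();              //sleep until a signal arrives
    return 0;
}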
80. Signals
●strace is a powerful utility in UNIX which shows the interaction between a
running process and kernel in the form of system calls and signals; here, a
partial output showing mapping of signals to defaults with system call
sigaction(), from ./psktm.x :
UNIX signals
rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGILL, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGABRT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGFPE, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGBUS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSEGV, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSYS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_DFL, [], 0}, 8) = 0
●Signals are crude and restricted to local communication; to communicate
remotely, we can establish a socket between processes, and communicate over
the network
81. Sockets & Networks
●Davies/Baran first devised packet switching, an efficient means of
communication over a channel; a computer was conceived to realize their
design and ARPANET went online Oct 1969 between UCLA and Stanford
●TCP/IP became the communication protocol of ARPANET 1 Jan 1983; ARPANET was retired in 1990 and NSFNET established; university networks in the US and Europe joined
●TCP/IP is just one of many protocols, which describes the format of data
packets, and the nature of the communication; an analogous connection method
is used by Infiniband networks in conjunction with Remote Direct Memory
Access (RDMA)
●User Datagram Protocol (UDP) is analogous to a connectionless method of communication used by Infiniband high performance networks
82. Sockets : UDP host example
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <unistd.h> /* for close() for socket */
#include <stdlib.h>
int main(void)
{
//creates an endpoint & returns file descriptor
//uses IPv4 domain, datagram type, UDP transport
int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
//socket address object (sa) and memory buffer
struct sockaddr_in sa;
char buffer[1024];
ssize_t recsize;
socklen_t fromlen;
//specify same domain type, any input address and port 7654 to listen on
memset(&sa, 0, sizeof sa);
sa.sin_family = AF_INET;
sa.sin_addr.s_addr = INADDR_ANY;
sa.sin_port = htons(7654);
fromlen = sizeof(sa);
83. Sockets : host example cont.
//we bind an address (sa) to the socket using fd sock
if (-1 == bind(sock,(struct sockaddr *)&sa, sizeof(sa)))
{
    perror("error bind failed");
    close(sock);
    exit(EXIT_FAILURE);
}
for (;;)
{
    //listen and dump buffer to stdout where applicable
    printf("recv test....\n");
    recsize = recvfrom(sock, (void *)buffer, 1024, 0, (struct sockaddr *)&sa, &fromlen);
    if (recsize < 0) {
        fprintf(stderr, "%s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    printf("recsize: %zd\n", recsize);
    sleep(1);
    printf("datagram: %.*s\n", (int)recsize, buffer);
}
}
84. Sockets : client example
int main(int argc, char *argv[])
{
    //create a buffer with character data
    int sock;
    struct sockaddr_in sa;
    int bytes_sent;
    char buffer[200];
    strcpy(buffer, "hello world!");
    //create a socket, same IP and transport as before, address of host 127.0.0.1
    sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (-1 == sock) /* if socket failed to initialize, exit */
    {
        printf("Error Creating Socket");
        exit(EXIT_FAILURE);
    }
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_addr.s_addr = inet_addr("127.0.0.1");
    sa.sin_port = htons(7654);
    bytes_sent = sendto(sock, buffer, strlen(buffer), 0,(struct sockaddr*)&sa, sizeof sa);
    if (bytes_sent < 0) {
        printf("Error sending packet: %s\n", strerror(errno));
        exit(EXIT_FAILURE);
    }
    close(sock); /* close the socket */
    return 0;
}
●You can monitor sockets by using the netstat facility, which takes its data from /proc/net
85. Outline
●Motivation
●Interprocess Communication
● Signals
● Sockets & Networks
●procfs Digression
●Message Passing
● Send/Receive
● Communication
● Parallel Constructs
● Grouping Data
● Communicators & Topologies
86. procfs
●We mentioned the /proc directory previously in the context of cpu and
memory information, which is frequently referred to as the proc filesystem or
procfs
●It is a veritable treasure trove of information, written periodically by the kernel,
and is used by a variety of tools eg., ps
● Each running process is assigned a directory, whose name is the process id
●Each directory contains text files and subdirectories with every detail of a
running process, including context switching statistics, memory management,
open file descriptors and much more
●Much like the ptrace() system call, procfs also gives user applications the
ability to directly manipulate running processes, given sufficient permission; you
can explore that on your own :)
87. procfs : examples
● Some of the more useful files :
● /proc/PID/cmdline : command used to launch process
● /proc/PID/cwd : current working directory
● /proc/PID/environ : environment variables for the process
● /proc/PID/fd : directory w/ symbolic link for each open file descriptor eg., streams
● /proc/PID/status : information including signals, state, memory usage
● /proc/PID/maps : memory map between virtual and physical addresses
● eg., contents of the fd directory for running process ./psktm.x :
[wjb19@hammer1 fd]$ ls -lah
total 0
dr-x------ 2 wjb19 wjb19  0 Dec 7 12:13 .
dr-xr-xr-x 6 wjb19 wjb19  0 Dec 7 12:10 ..
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 0 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 1 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 2 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 3 -> /gpfs/scratch/wjb19/inputDataSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 4 -> /gpfs/scratch/wjb19/inputSrcXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 5 -> /gpfs/scratch/wjb19/inputSrcYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 6 -> /gpfs/scratch/wjb19/inputRecXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 7 -> /gpfs/scratch/wjb19/inputRecYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec 7 12:13 8 -> /gpfs/scratch/wjb19/velModel.bin
90. Message Passing Interface (MPI)
●Classical von Neumann machine has single instruction/data stream (SISD) →
single process & memory
●Multiple Instruction, multiple data (MIMD) system → connected processes are
asynchronous, generally distributed memory (may also be shared where
processes on single node)
●MIMD processors are connected in some network topology; we don't have to worry about the details, MPI abstracts this away
●MPI is a standard for parallel programming first established in 1991, updated
occasionally, by academics and industry
●It comprises routines for point-to-point and collective communication, with
bindings to C/C++ and fortran
● Depending on underlying network fabric, communication may be TCP or UDP-like in Infiniband networks
91. MPI : Basic communication
●Multiple, distributed processes are spawned at initialization, each process
assigned a unique rank 0,1,...,p-1
● One may send information referencing process rank eg.,:
MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
(&x == buffer address; the fourth argument == rank of receiver)
● This function has a receive analogue; both routines are blocking by default
●Send/receive statements generally occur in same code, processors execute appropriate statement according to rank & code branch
●Non-blocking functions available, allowing communicating processes to continue with execution where able
92. MPI : Requisite functions
●Bare minimum → initialize, get rank for process, total processes and
finalize when done
MPI_Init(&argc, &argv); //Start up
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank); //My rank
MPI_Comm_size(MPI_COMM_WORLD, &p); //No. processors
MPI_Finalize(); //close up shop
●MPI_COMM_WORLD is a communicator parameter, a collection of
processes that can send messages to each other.
●Messages are sent with tags to identify them, allowing specificity beyond
using just a source/destination parameter
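● Putting the requisite functions together w/ a blocking send/receive pair gives
a minimal complete program (an illustrative sketch; run w/ at least 2 processes):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int my_rank, p;
  float x;
  MPI_Status status;

  MPI_Init(&argc, &argv); //Start up
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank); //My rank
  MPI_Comm_size(MPI_COMM_WORLD, &p); //No. processes
  if (my_rank == 0) {
    x = 1.0f;
    MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD); //to rank 1, tag 0
  } else if (my_rank == 1) {
    MPI_Recv(&x, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status); //from rank 0
    printf("rank %d of %d received %f\n", my_rank, p, x);
  }
  MPI_Finalize(); //close up shop
  return 0;
}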
wjb19@psu.edu
93. MPI : Datatypes
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE (untyped bytes)
MPI_PACKED (data packed w/ MPI_Pack)
wjb19@psu.edu
95. MPI : Collective Communication
● A communication pattern involving all processes in a communicator is
a collective communication eg., a broadcast
● Same data sent to every process in communicator, more efficient
than using multiple p2p routines, optimized :
MPI_Bcast(void* message, int count, MPI_Datatype type,
int root, MPI_Comm comm)
● Sends a copy of the data in message from the root process to every
process in comm, a one-to-all (map-like) operation
● Collective communication is at the heart of efficient parallel
operations
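● A sketch of broadcast in use (illustrative; the params array is hypothetical):
every process calls MPI_Bcast, and on return all ranks hold root's data:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int rank;
  float params[3] = {0.0f, 0.0f, 0.0f}; //hypothetical model parameters

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) { //only root fills the message
    params[0] = 0.1f; params[1] = 2.5f; params[2] = 42.0f;
  }
  MPI_Bcast(params, 3, MPI_FLOAT, 0, MPI_COMM_WORLD); //all ranks call this
  printf("rank %d sees params[2] = %f\n", rank, params[2]);
  MPI_Finalize();
  return 0;
}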
wjb19@psu.edu
96. Parallel Operations : Reduction
● Data may be gathered/reduced after computation via :
MPI_Reduce(void* operand, void* result, int count,
MPI_Datatype type, MPI_Op operator, int root, MPI_Comm
comm)
● Combines all operands, using operator, and stores the result on
process root, in result
● A tree-structured reduce whose result lands on all nodes == MPI_Allreduce
ie., every process in comm gets a copy of the result
[Diagram: tree-structured reduction; operands from processes 1, 2, 3, ..., p-1
are combined down to root process 0]
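● As an illustrative sketch, summing one operand per process onto root (swap in
MPI_Allreduce, dropping the root argument, to leave the result everywhere):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int rank;
  float local, total = 0.0f;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  local = (float)(rank + 1); //each process contributes one operand
  MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (rank == 0) //only root holds the combined result
    printf("sum over all ranks = %f\n", total);
  MPI_Finalize();
  return 0;
}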
wjb19@psu.edu
97. Reduction Ops
MPI_MAX Maximum
MPI_MIN Minimum
MPI_SUM Sum
MPI_PROD Product
MPI_LAND Logical and
MPI_BAND Bitwise and
MPI_LOR Logical or
MPI_BOR Bitwise or
MPI_LXOR Logical XOR
MPI_BXOR Bitwise XOR
MPI_MAXLOC Max w/ location
MPI_MINLOC Min w/ location
wjb19@psu.edu
98. Parallel Operations : Scatter/Gather
● Bulk transfers of many-to-one and one-to-many are accomplished by
gather and scatter operations respectively
● These operations form the kernel of matrix/vector operations for
example; they are useful for distributing and reassembling arrays
[Diagram: gather collects elements x0, x1, x2, x3 from processes 0-3 into a
single array (eg., a matrix row a00 a01 a02 a03) on one process; scatter is
the reverse, distributing segments to processes 0-3]
wjb19@psu.edu
99. Scatter/Gather Syntax
● MPI_Gather(void* send_data, int send_count, MPI_Datatype
send_type, void* recv_data, int recv_count, MPI_Datatype
recv_type, int root, MPI_Comm comm)
● Collects data referenced by send_data from each process in comm and
stores data in process rank order on process w/ rank root, in memory
referenced by recv_data
● MPI_Scatter(void* send_data, int send_count,
MPI_Datatype send_type, void* recv_data, int recv_count,
MPI_Datatype recv_type, int root, MPI_Comm comm)
● Splits data referenced by send_data on process w/ rank root into
segments of send_count elements each, w/ type send_type, distributed
in rank order to processes
● For gather result to ALL processes → MPI_Allgather
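● A sketch of the round trip (illustrative; one float per process): root
scatters an array, each rank works on its element, root gathers the results in
rank order:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char** argv) {
  int rank, p;
  float *vec = NULL, x;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  if (rank == 0) { //root owns the full array
    vec = malloc(p * sizeof(float));
    for (int i = 0; i < p; i++) vec[i] = (float)i;
  }
  MPI_Scatter(vec, 1, MPI_FLOAT, &x, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
  x *= 2.0f; //local work on this rank's segment
  MPI_Gather(&x, 1, MPI_FLOAT, vec, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);
  if (rank == 0) {
    printf("vec[p-1] = %f\n", vec[p - 1]);
    free(vec);
  }
  MPI_Finalize();
  return 0;
}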
wjb19@psu.edu
100. Grouping Data I
● Communication is expensive → bundle variables into single message
● We must define a derived type that can describe the heterogeneous
contents of a message using type and displacement pairs
● Several ways to build this MPI_Datatype eg.,
MPI_Type_struct(int count,
int block_lengths[], //contains no. entries in each block
MPI_Aint displacements[], //element offset from msg start; MPI_Aint allows for addresses > int
MPI_Datatype typelist[], //exactly that
MPI_Datatype* new_mpi_t) //a pointer to this new type
● A very general derived type, although arrays to struct must be constructed
explicitly using other MPI commands
● Simpler when less heterogeneous eg., MPI_Type_vector,
MPI_Type_contiguous, MPI_Type_indexed
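● An illustrative sketch using the modern spelling MPI_Type_create_struct (the
MPI-2 replacement for MPI_Type_struct); the params_t struct is hypothetical:

#include <mpi.h>
#include <stdio.h>
#include <stddef.h> //offsetof

typedef struct { float a; float b; int n; } params_t; //hypothetical payload

int main(int argc, char** argv) {
  int rank;
  params_t msg = {0.0f, 0.0f, 0};
  int block_lengths[3] = {1, 1, 1}; //no. entries in each block
  MPI_Aint displacements[3] = {offsetof(params_t, a),
                               offsetof(params_t, b),
                               offsetof(params_t, n)}; //offsets from msg start
  MPI_Datatype typelist[3] = {MPI_FLOAT, MPI_FLOAT, MPI_INT};
  MPI_Datatype mpi_params;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Type_create_struct(3, block_lengths, displacements, typelist, &mpi_params);
  MPI_Type_commit(&mpi_params); //required before use; see next slide
  if (rank == 0) { msg.a = 0.1f; msg.b = 2.5f; msg.n = 1024; }
  MPI_Bcast(&msg, 1, mpi_params, 0, MPI_COMM_WORLD); //one message, three fields
  printf("rank %d: n = %d\n", rank, msg.n);
  MPI_Type_free(&mpi_params);
  MPI_Finalize();
  return 0;
}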
wjb19@psu.edu
101. Grouping Data II
● Before these derived types can be used by a communication function,
they must be committed with an MPI_Type_commit function call
● In order for message to be received, type signatures at send and
receive must be compatible; if a collective communication, signatures
must be identical
● MPI_Pack & MPI_Unpack are useful when messages of heterogeneous
data are infrequent, and the cost of constructing a derived type
outweighs the benefit; see the sketch after this list
● These methods also allow buffering in user versus system memory,
and the number of items transmitted can be carried in the message itself
● Grouped data allows for sophisticated objects; we can also create more
fine-grained communication objects
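● The sketch below (illustrative; run w/ at least 2 processes) packs a float and
an int into a user buffer, sends them as MPI_PACKED, and unpacks in the same
order:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int rank, position, n = 0;
  float a = 0.0f;
  char buffer[64]; //buffering in user memory

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    a = 2.5f; n = 1024;
    position = 0; //pack heterogeneous data contiguously
    MPI_Pack(&a, 1, MPI_FLOAT, buffer, sizeof buffer, &position, MPI_COMM_WORLD);
    MPI_Pack(&n, 1, MPI_INT, buffer, sizeof buffer, &position, MPI_COMM_WORLD);
    MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Status status;
    MPI_Recv(buffer, sizeof buffer, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
    position = 0; //unpack in the order packed
    MPI_Unpack(buffer, sizeof buffer, &position, &a, 1, MPI_FLOAT, MPI_COMM_WORLD);
    MPI_Unpack(buffer, sizeof buffer, &position, &n, 1, MPI_INT, MPI_COMM_WORLD);
    printf("rank 1: a = %f, n = %d\n", a, n);
  }
  MPI_Finalize();
  return 0;
}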
wjb19@psu.edu
102. Communicators
● Process subsets or groups expand communication beyond simple
p2p and broadcast communication, to create :
● Intra-communicators → communicate among one another and
participate in collective communication, composed of :
– an ordered collection of processes (group)
– a context
● Inter-communicators → communicate between different groups
● Communicators/groups are opaque, internals not directly accessible;
these objects are referenced by a handle
wjb19@psu.edu
103. Communicators Cont.
● Internal contents manipulated by methods, much like private data in C++
class objects eg.,
● int MPI_Group_incl(MPI_Group old_group,int
new_group_size, int ranks_in_old_group[], MPI_Group*
new_group) → create a new_group from old_group, using
ranks_in_old_group[] etc
● int MPI_Comm_create(MPI_Comm old_comm, MPI_Group
new_group, MPI_Comm* new_comm) → create a new communicator
from the old, with context
● MPI_Comm_group and MPI_Group_incl are local methods without
communication; MPI_Comm_create is a collective communication implying
synchronization ie., to establish a single context
● Multiple communicators may be created simultaneously using
MPI_Comm_split
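● As an illustrative sketch (the even-ranked subset is a hypothetical choice),
building a communicator over the even ranks of MPI_COMM_WORLD:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int rank, p;
  MPI_Group world_group, even_group;
  MPI_Comm even_comm;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &p);
  int n_even = (p + 1) / 2;
  int ranks[n_even]; //even ranks drawn from the old group
  for (int i = 0; i < n_even; i++) ranks[i] = 2 * i;
  MPI_Comm_group(MPI_COMM_WORLD, &world_group); //local, no communication
  MPI_Group_incl(world_group, n_even, ranks, &even_group); //local
  MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm); //collective
  if (even_comm != MPI_COMM_NULL) { //non-members get a null handle
    int new_rank;
    MPI_Comm_rank(even_comm, &new_rank);
    printf("world rank %d -> even_comm rank %d\n", rank, new_rank);
    MPI_Comm_free(&even_comm);
  }
  MPI_Group_free(&even_group);
  MPI_Group_free(&world_group);
  MPI_Finalize();
  return 0;
}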
wjb19@psu.edu
104. Topologies I
● MPI allows one to associate different addressing schemes to
processes within a group
● This is a virtual versus real or physical topology, and is either a graph
structure or a (Cartesian) grid; properties:
● Dimensions, w/
– Size of each
– Period of each
● Option to have processes reordered optimally within grid
● Method to establish Cartesian grid cart_comm :
int MPI_Cart_create(MPI_Comm old_comm, int
number_of_dims, int dim_sizes[], int wrap_around[],
int reorder, MPI_Comm* cart_comm)
● old_comm is typically just MPI_COMM_WORLD created at init
wjb19@psu.edu
105. Topologies II
● cart_comm will contain the processes from old_comm with
associated coordinates, available from MPI_Cart_coords:
int coordinates[2];
int my_grid_rank;
MPI_Comm_rank(cart_comm, &my_grid_rank);
MPI_Cart_coords(cart_comm,
my_grid_rank,2,coordinates);
● Call to MPI_Comm_rank is necessary because of process rank
reordering (optimization)
● Processes in cart_comm are stored in row major order
● Can also partition into sub-grid(s) using MPI_Cart_sub eg., for row:
int free_coords[2];
MPI_Comm row_comm; //new subgrid
free_coords[0]=0; //bool; first coordinate fixed
free_coords[1]=1; //bool; second coordinate free
MPI_Cart_sub(cart_comm,free_coords,&row_comm);
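● Tying the two topology slides together, a hedged sketch (assumes exactly 4
processes, eg., mpirun -np 4, to fill the 2x2 grid):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  int my_grid_rank, coordinates[2];
  int dim_sizes[2] = {2, 2}; //2x2 grid; assumes 4 processes
  int wrap_around[2] = {1, 1}; //periodic in both dimensions
  MPI_Comm cart_comm;

  MPI_Init(&argc, &argv);
  MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, wrap_around,
                  1 /*reorder*/, &cart_comm);
  MPI_Comm_rank(cart_comm, &my_grid_rank); //rank may differ after reordering
  MPI_Cart_coords(cart_comm, my_grid_rank, 2, coordinates);
  printf("grid rank %d at (%d,%d)\n", my_grid_rank,
         coordinates[0], coordinates[1]);
  MPI_Comm_free(&cart_comm);
  MPI_Finalize();
  return 0;
}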
wjb19@psu.edu
106. Writing Parallel Code
● Assuming we've profiled our code and decided to parallelize,
equipped with MPI routines, we must decide whether to take a :
● Task parallel (divide tasks, similar data) or
● Data parallel (divide data, similar tasks) approach
● Data parallel in general scales much better, implies lower
communication overhead
● Regardless, easiest to begin by selecting or designing data
structures, and subsequently their distribution using a constructed
topology or scatter/gather routines, for example
● Program in modules, beginning with easiest/essential functions (eg.,
I/O), relegating 'hard' functionality to stubs initially
● Time code sections, look at targets for optimization & redesign
● Only concern yourself with the highest levels of abstraction germane
to your problem, use parallel constructs wherever possible
wjb19@psu.edu
107. A Note on the OSI Model
●We've been playing fast and loose with a variety of communication entities;
sockets, networks, protocols like UDP, TCP etc
●The Open Systems Interconnection model separates these entities into 7 layers
of abstraction, each layer providing services to the layer immediately above
●Data becomes increasingly fine-grained going down from layer 7 to 1
●As application developers and/or scientists, we need only be concerned with
layers 4 and above
Layer            Granularity  Function                          Example
7. Application   data         process accessing network        MPI
6. Presentation  data         encrypt/decrypt, data conversion  MPI
5. Session       data         session management                MPI
4. Transport     segments     reliability & flow control        IB verbs
3. Network       packets      path determination                Infiniband
2. Data Link     frames       addressing                        Infiniband
1. Physical      bits         signals/electrical                Infiniband
wjb19@psu.edu