HPC Essentials
           Part I : UNIX/C Overview




             Bill Brouwer
Research Computing and Cyberinfrastructure
             (RCC), PSU




                                      wjb19@psu.edu
Outline

●Introduction
  ● Hardware

  ● Definitions

  ● UNIX

    ● Kernel & shell

●Files

  ● Permissions

  ● Utilities

  ● Bash Scripting

●C programming




                                 wjb19@psu.edu
HPC Introduction
●HPC systems are composed of :

 ● Software

 ● Hardware

   ● Devices (eg., disks)

   ● Compute elements (eg., CPU)

   ● Shared and/or distributed memory

   ● Communication (eg., Infiniband network)



●An HPC system ...isn't... unless the hardware is configured correctly and the
software leverages all resources made available to it, in an optimal
manner
●An operating system controls the execution of software on the hardware;

HPC clusters almost exclusively use UNIX/Linux

●In the computational sciences, we pass data and/or abstractions through
a pipelined workflow; UNIX is the natural analogue to this
solving/discovery process

                                                            wjb19@psu.edu
UNIX
●UNIX is a multi-user/tasking OS created by Dennis Ritchie and Ken
Thompson at AT&T Bell Labs 1969-1970, written primarily in C language
(also developed by Ritchie)

●UNIX is composed of :

 ● Kernel

   ● OS itself which handles scheduling, memory management, I/O etc

 ● Shell (eg., Bash)

   ● Interacts with kernel, command line interpreter

 ● Utilities

   ● Programs run by the shell, tools for file manipulation, interaction

     with the system
 ● Files

   ● Everything but process(es), composed of data...




                                                              wjb19@psu.edu
Data-Related Definitions
●Binary
  ● Most fundamental data representation in computing, base 2 number system

    (others; hex → base 16, oct → base 8)
●Byte

  ● 8 bits = 8b = 1Byte = 1B; 1kB = 1024 B; 1MB = 1024 kB etc

●ASCII

  ● American Standard Code for Information Interchange; character encoding
    scheme, 7 bits per character (traditional); UTF-8, a Unicode encoding,
    is backwards compatible with ASCII and uses 8 bits for these characters
●Stream

  ● A flow of bytes; source → stdout (& stderr), sink → stdin

●Bus

  ● Communication channel over which data flows, connects elements within a

    machine
●Process

  ● Fundamental unit of computational work performed by a processor; CPU

    executes application or OS instructions
●Node

  ● Single computer, composed of many elements, various architectures for

    CPU, eg., x86, RISC
                                                                wjb19@psu.edu
Typical Compute Node (Intel i7)
                 [diagram: CPU connected to RAM (volatile storage) over the memory bus;
                 GPU and other PCI-e cards attached over PCI-express; the CPU links to the
                 IOH via the QuickPath Interconnect, and the IOH to the ICH via the Direct
                 Media Interface; the ICH provides ethernet (NETWORK), SATA/USB
                 (non-volatile storage) and the BIOS]
                                                                             wjb19@psu.edu
More Definitions
●Cluster
  ● Many nodes connected together via network

●Network

  ● Communication channel, inter-node; connects machines

●Shared Memory

  ● Memory region shared within node

●Distributed Memory

  ● Memory region across two or more nodes

●Direct Memory Access (DMA)

  ● Access memory independently of programmed I/O ie., independent of the

    CPU
●Bandwidth

  ● Rate of data transfer across serial or parallel communication channel,

    expressed as bits (b) or Bytes (B) per second (s)
  ● Beware quoted bandwidth figures; many factors apply eg., simplex/duplex,
    peak/sustained, no. of lanes etc
  ● Latency, the time required to establish a communication channel, is often
    more important

                                                                 wjb19@psu.edu
Bandwidths
●Devices
  ● USB : 60MB/s (version 2.0)

  ● Hard Disk : 100MBs-500MB/s

  ● PCIe : 32GB/s (x8, version 2.0)

●Networks

  ● 10/100Base T : 10/100 Mbit/s

  ● 1000BaseT (1GigE) : 1000 Mbit/s

  ● 10 GigE : 10 Gbit/s

  ● Infiniband QDR 4X: 40 Gbit/s

●Memory

  ● CPU : ~ 35 GB/s (Nehalem, 3x 1.3GHz DIMM/socket)*

  ● GPU : ~ 180 GB/s (GeForce GTX 480)

●AVOID devices, keep data resident in memory, minimize communication

btwn processes
●MANY subtleties to CPU memory management eg., with 8x CPU cores,

total bandwidth may be > 300 GB/s or as little as 10 GB/s, will discuss
further
*http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations?t=anon#fbid=XZRzflqVZ6J

                                                                                                    wjb19@psu.edu
Outline

●Introduction
  ● HPC hardware

  ● Definitions

  ● UNIX

    ● Kernel & shell

●Files

  ● Permissions

  ● Utilities

  ● Bash Scripting

●C programming




                                 wjb19@psu.edu
UNIX Permissions & Files

●At the highest level, UNIX objects are either files or processes, and both
are protected by permissions (processes next time)
●Every file object has two ID's, the user and group, both are assigned on

creation; only the root user has unrestricted access to everything
●Files also have bits which specify read (r), write (w) and execute (x)

permissions for the user, group and others eg., output of ls command:

    -rw-r--r-- 1 root root 0 Jun 11 1976 /usr/local/foo.txt

    user/group/others     User ID Group ID                       filename

●We can manipulate files using myriad utilities, these utilities are commands
interpreted by the shell and executed by the kernel
●To learn more, check the man pages ie., from the command line 'man <command>'

                                                                     wjb19@psu.edu
File Manipulation I
●Working from the command line in a Bash shell:




●List directory foo_dir contents, human readable :
[wjb19@lionga scratch] $ ls -lah foo_dir

●Change ownership of foo.xyz to wjb19; group and user:

[wjb19@lionga scratch] $ chown wjb19:wjb19 foo.xyz

●Add execute permission to foo.xyz:
[wjb19@lionga scratch] $ chmod +x foo.xyz

●Determine filetype for foo.xyz:
[wjb19@lionga scratch] $ file foo.xyz
●Peruse text file foo.xyz:
[wjb19@lionga scratch] $ more foo.xyz


                                                        wjb19@psu.edu
File Manipulation II
●Copy foo.txt from lionga to file /home/bill/foo.txt on dirac :
[wjb19@lionga scratch] $ scp foo.txt wjb19@dirac.rcc.psu.edu:/home/bill/foo.txt

●Create gzip compressed file archive of directory foo and contents :
[wjb19@lionga scratch] $ tar -czf foo_archive.tgz foo/*

●Create bzip2 compressed file archive of directory foo and contents :
[wjb19@lionga scratch] $ tar -cjf foo_archive.tbz foo/*

●Unpack compressed file archive :
[wjb19@lionga scratch] $ tar -xvf foo_archive.tgz

●Edit a text file using VIM:

[wjb19@lionga scratch] $ vim foo.txt

●VIM is a venerable and powerful command line editor with a rich set of
commands

                                                             wjb19@psu.edu
Text File Edit w/ VIM
●Two main modes of operation; editing or command. From command, switch to edit by
issuing 'a' (insert after cursor) or 'i' (before), switch back to command via <ESC>
        Save w/o quitting                                     :w<ENTER>
        Save and quit (ie., <shift> AND 'z' AND 'z')          :wq<ENTER>
        Quit w/o saving                                       :q!<ENTER>
        Delete x lines eg,. x=10 (also stored in clipboard)   d10d
        Yank (copy) x lines eg., x=10                         y10y
        Split screen/buffer                                   :split<ENTER>
        Switch window/buffer                                  <CNTRL>-w-w
        Go to line x eg., x=10                                :10<ENTER>
        Find matching construct (eg., from { to })            %
    ●   Paste: 'p' undo: 'u' redo: '<CNTRL>-r'

    ●   Move up/down one screen line : '­' and '+'

    ●   Search for expression exp, forward ('n' or 'N' navigate up/down highlighted
        matches) '/exp<ENTER>' or backward '?exp<ENTER>' 
                                                                       wjb19@psu.edu
Text File Compare w/ VIMDIFF
●Same commands as VIM, but highlights differences in files, allows transfer of
text btwn buffers/files; launch with 'vimdiff foo.txt foo2.txt'




●Push text from right to left (when right window active and cursor in relevant
region) using command 'dp'
●Pull text from right to left (when left window active and cursor in relevant

region) using command 'do'
                                                                              wjb19@psu.edu
Bash Scripting
●File and other utilities can be assembled into scripts, interpreted by the
shell eg., Bash
●The scripts can be collections of commands/utilities & fundamental

programming constructs
Code Comment                                   #this is a comment
Pipe stdout of procA to stdin of procB         procA | procB
Redirect stdout of procA to file foo.txt*      procA > foo.txt
Command separator                              procA; procB
If block                                       if [ condition ]; then procA; fi
Display on stdout                              echo "hello"
Variable assignment & literal value            a="foo"; echo $a
Concatenate strings                            b="${a}foo2"
Text Processing utilities                      sed,gawk
Search utilities                               find,grep

*Streams have file descriptors (numbers) associated with them; eg., to redirect stderr
from procA to foo.txt → procA 2> foo.txt
                                                                            wjb19@psu.edu
Text Processing
●Text documents are composed of records (roughly speaking, lines
separated by newlines) and fields (separated by whitespace)

●Text processing using sed & gawk involves coupling patterns with
actions eg., print field 1 in document foo.txt when encountering word
image:

[wjb19@lionga scratch] $ gawk '/image/ {print $1;}' "foo.txt" 

                                    pattern action           input

●Parse, without case sensitivity, change from default space field
separator (FS) to equals sign, print field 2:

[wjb19@lionga scratch] $ gawk 'BEGIN{IGNORECASE=1; FS="="}  
/image/ {print $2;}' "foo.txt"

●   Putting it all together → create a Bash script w/ VIM or other (eg,. Pico)...

                                                                          wjb19@psu.edu
Bash Example I
#!/bin/bash                                                Run using bash
#set source and destination paths
DIR_PATH=~/scratch/espresso-PRACE/PW
BAK_PATH=~/scratch/PW_BAK

declare -a file_list                                       Declare an array
#filenames to array
file_list=$(ls -l ${BAK_PATH} | gawk '/f90/ {print $9}')   Command output
cnt=0;

#parse files & pretty up
for x in $file_list
do
    let "cnt+=1"
    sed 's/,&/, &/g' $BAK_PATH/$x | 
    sed 's/)/) /g' | 
    sed 's/call/ call /g' |                             Search & replace
    sed 's/CALL/ call /g' > $DIR_PATH/$x
     echo cleaned file no. $cnt $x
done

exit



                                                                    wjb19@psu.edu
Bash Example II
#!/bin/bash


if [ $# -lt 6 ]                                                Total arguments
then
     echo usage: fitCPCPMG.sh '[/path/and/filename.csv] 
     [desired number of gaussians in mixture (2-10)]  
     [no. random samples (1000-10000)]
     [mcmc steps (1000-30000)] 
     [percent noise level (0-10)]
     [percent step size (0.01-20)]
     [/path/to/restart/filename.csv; optional]'
    exit
fi

ext=${1##*.}                                                   File extension
if [ "$ext" != "csv" ]
then
        echo ERROR: file must be *.csv
        exit
fi

base=$(basename $1 .csv)                                       File basename
if [[ $2 -lt 2 ]] || [[ $2 -gt 10 ]]
then 
    echo "ERROR: must specify 2<=x<=10 gaussians in mixture"
    exit
fi
                                                                        wjb19@psu.edu
Outline

●Introduction
  ● HPC hardware

  ● Definitions

  ● UNIX

    ● Kernel & shell

●Files

  ● Permissions

  ● Utilities

  ● Bash Scripting

●C programming




                                 wjb19@psu.edu
The C Language
●Utilities, user applications and indeed the UNIX OS itself are executed by the
CPU, when expressed as machine code eg., store/load from memory, addition
etc
●Fundamental operations like memory allocation, I/O etc are laborious to

express at this level, most frequently we begin from a high-level language like C
●The process of creating an executable consists of at least 3 fundamental steps;

creation of source code text file containing all desired objects and operations,
compilation and linking eg., using the GNU tool gcc to create executable foo.x
from source file foo.c:
[wjb19@tesla2 scratch]$ gcc -std=c99 foo.c -o foo.x
                         *C99 standard
                 [diagram: source file (*.c) --compile--> object code (*.o) --link-->
                 executable, with library objects brought in at the link step]
                                                                       wjb19@psu.edu
C Code Elements I
●Composed of primitive datatypes (eg., int, float, long), which
have different sizes in memory, multiples of 1 byte

●May be composed of statically allocated memory (compile time),
dynamically allocated memory (runtime), or both

●Pointers (eg., float *) are primitives with 4 or 8 byte lengths (32bit or
64bit machines) which contain an address to a contiguous region of
dynamically allocated memory

●More complicated objects can be constructed from primitives and arrays
eg., a struct
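
●For example, a minimal sketch of a struct assembled from primitives (the field
names here are illustrative only, loosely modeled on the imageGrid used later):

#include <stdlib.h>

//a hypothetical 1D image grid descriptor built from primitives
struct imageGrid {
    int   num;      //number of points along this axis
    float origin;   //coordinate of the first point
    float delta;    //spacing between points
};

int main(){
    struct imageGrid x = { 128, 0.0f, 0.5f };                 //static allocation
    float* samples = (float*) malloc (x.num*sizeof(float));   //dynamic allocation
    //... use x.num, x.origin, x.delta and samples ...
    free(samples);
    return 0;
}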




                                                               wjb19@psu.edu
C Code Elements II
●Common operations are gathered into functions, the most common
being main(), which must be present in executable

●Functions have a distinct name, take arguments, and return output; this
information comprises the prototype, expressed separately to the
implementation details, former often in header file

●Important system functions include read,write,printf (I/O) and
malloc,free (Memory)

●The operating system executes compiled code; a running program is a
process (more next time)




                                                             wjb19@psu.edu
C Code Example
#include <stdio.h>
#include <stdlib.h>                                    Tells preprocessor to
#include "allDefines.h"                                include these headers;
//Kirchoff Migration function in psktmCPU.c            system functions etc
void ktmMigrationCPU(struct imageGrid* imageX,
        struct imageGrid* imageY,
        struct imageGrid* imageZ,
        struct jobParams* config,
        float* midX,
                                                       Function prototype;
        float* midY,                                   must give arguments,
        float* offX,                                   their types and return
        float* offY,                                   type; implementation
        float* traces,                                 elsewhere
        float* slowness,
        float* image);

int main()
{
    int IMAGE_SIZE = 10;
    float* image = (float*) malloc (IMAGE_SIZE*sizeof(float));
    printf("size of image = %i\n",IMAGE_SIZE);

    for (int i=0; i<IMAGE_SIZE; i++)
        printf("image point %i = %f\n",i,image[i]);

    free(image);
    return 0;
}
                                                                   wjb19@psu.edu
UNIX C Good Practice I
●Use the three standard streams (stdin, stdout, stderr), with file descriptors
0,1,2 respectively; this allows assembly of operations into a pipeline and these
data streams are 'cheap' to use

●Only hand simple command line options to main() using
argc,argv[]; in general we wish to handle short and long options
(eg., see GNU coding standards) and the use of getopt_long()
is preferable.
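
●A hedged sketch of getopt_long() usage (the --verbose/--output options are made
up for illustration):

#include <getopt.h>
#include <stdio.h>

int main(int argc, char *argv[]){
    int verbose = 0;
    //long options map onto short equivalents; these names are illustrative
    static struct option long_opts[] = {
        {"verbose", no_argument,       0, 'v'},
        {"output",  required_argument, 0, 'o'},
        {0, 0, 0, 0}
    };

    int c;
    while ((c = getopt_long(argc, argv, "vo:", long_opts, NULL)) != -1){
        switch (c){
            case 'v': verbose = 1;                              break;
            case 'o': printf("output file: %s\n", optarg);      break;
            default : fprintf(stderr, "usage: foo.x [-v|--verbose] [-o|--output file]\n");
                      return -1;
        }
    }
    if (verbose) printf("verbose mode on\n");
    return 0;
}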

●Utilize the environment variables of the host shell, particularly in
setting runtime conditions in executed code via getenv() eg., in
Bash set in .bashrc config file or via command line:
[wjb19@lionga scratch] $ export MY_STRING=hello
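
●A minimal sketch reading that variable back with getenv() (MY_STRING is just the
example name set above):

#include <stdio.h>
#include <stdlib.h>

int main(){
    //getenv returns NULL if the variable is not set in the host shell
    const char* s = getenv("MY_STRING");
    if (s != NULL)
        printf("MY_STRING = %s\n", s);
    else
        printf("MY_STRING is not set\n");
    return 0;
}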

●If your project/program requires a) sophisticated objects b) many
developers c) would benefit from object oriented design principles, you
should consider writing in C++ (although being a higher-level language it is
harder to optimize)
                                                                 wjb19@psu.edu
UNIX C Good Practice II
●In high performance applications, avoid system calls eg.,
read/write where control is given over to the kernel and processes
can be blocked until the resource is ready eg., disk
  ● IF system calls must be used, handle errors and report to stderr

  ● IF temporary files must be written, use mkstemp which sets
    permissions, followed by unlink; the file descriptor is closed by
    the kernel when the program exits and the file removed (see the
    sketch below)
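
●A minimal sketch of the mkstemp/unlink pattern (the /tmp template name is
illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(){
    //template must end in XXXXXX; mkstemp fills it in and opens the file with safe permissions
    char tmpl[] = "/tmp/fooXXXXXX";
    int fd = mkstemp(tmpl);
    if (fd == -1){
        perror("mkstemp");
        return -1;
    }
    unlink(tmpl);   //name removed now; storage is reclaimed once fd is closed or the program exits
    //... write(fd,...) / read(fd,...) as needed ...
    close(fd);
    return 0;
}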

●Use assert to test validity of function arguments, statements etc;
will introduce performance hit, but asserts can be removed at compile
time with NDEBUG macro (C standard)
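
●For instance (a sketch; compile with -DNDEBUG to strip the checks):

#include <assert.h>
#include <stdlib.h>

int main(){
    int n = 10;
    assert(n > 0);                                    //validate before using n
    float* p = (float*) malloc (n*sizeof(float));
    assert(p != NULL);                                //validate the allocation
    free(p);
    return 0;
}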

●Debug with gdb, profile with gprof, valgrind; target most
expensive functions for optimization

●Put common functions in libraries, and use existing libraries, wherever possible....




                                                          wjb19@psu.edu
Key HPC Libraries
●BLAS/LAPACK/ScaLAPACK
 ● Original basic and extended linear algebra routines
 ● http://www.netlib.org/

●Intel Math Kernel Library (MKL)
  ● implementation of above routines, w/ solvers, fft etc
  ● http://software.intel.com/en-us/articles/intel-mkl/

●AMD Core Math Library (ACML)
 ● Ditto
 ● http://developer.amd.com/libraries/acml/pages/default.aspx

●OpenMPI
 ● Open source MPI implementation
 ● http://www.open-mpi.org/

●PETSc
 ● Data structures and routines for parallel scientific applications based on PDE's
 ● http://www.mcs.anl.gov/petsc/petsc-as/




                                                                            wjb19@psu.edu
UNIX C Compilation I
●In general the creation and use of shared libraries (*so) is preferable to
static (*a), for space reasons and ease of software updates

●Program in modules and link separate objects




●Use the -fPIC flag in shared library compilation; PIC == position
independent code, ie., code in the shared object does not depend on the
address/location at which it is loaded.

●Use the make utility to manage builds (more next time)




●Don't forget to update your PATH and LD_LIBRARY_PATH env vars w/
your binary executable path & any libraries you need/created for the
application, respectively



                                                               wjb19@psu.edu
UNIX C Compilation II

●Remember in compilation steps to -I/set/header/paths and keep
interface (in headers) separate from implementation as much as possible

●Remember in linking steps for shared libs to:
  ● -L/set/path/to/library AND

  ● set flag -lmyLib, where

  ● /set/path/to/library/libmyLib.so must exist


otherwise you will have undefined references and/or 'can't find 
-lmyLib' etc

●Compile with -Wall or similar and fix all warnings




●Read the manual :)




                                                            wjb19@psu.edu
Conclusions
●High Performance Computing Systems are an assembly of hardware and
software working together, usually based on the UNIX OS; multiple compute
nodes are connected together

●The UNIX kernel is surrounded by a shell eg., Bash; commands and constructs
may be assembled into scripts

●UNIX, associated utilities and user applications are traditionally written in high-
level languages like C

●HPC user applications may take advantage of shared or distributed memory
compute models, or both

●Regardless, good code minimizes I/O, keeps data resident in memory for as
long as possible and minimizes communication between processes

●User applications should take advantage of existing high performance libraries,
and tools like gdb, gprof and valgrind

                                                                       wjb19@psu.edu
References
●Dennis Ritchie, RIP
  ● http://en.wikipedia.org/wiki/Dennis_Ritchie

●Advanced bash scripting guide

  ● http://tldp.org/LDP/abs/html/

●Text processing w/ GAWK

  ● http://www.gnu.org/s/gawk/manual/gawk.html

●Advanced Linux programming

  ● http://www.advancedlinuxprogramming.com/alp-folder/

●Excellent optimization tips

  ● http://www.lri.fr/~bastoul/local_copies/lee.html

●GNU compiler collection documents

  ● http://gcc.gnu.org/onlinedocs/

●Original RISC design paper

  ● http://www.eecs.berkeley.edu/Pubs/TechRpts/1982/CSD-82-106.pdf

●C++ FAQ

  ● http://www.parashift.com/c++-faq-lite/

●VIM Wiki

  ● http://vim.wikia.com/wiki/Vim_Tips_Wiki


                                                         wjb19@psu.edu
Exercises
●Take supplied code and compile using gcc, creating executable
foo.x; attempt to run as './foo.x'
●The code has a segmentation fault, caused by an error in the use of memory
allocated via the malloc function
●Recompile with debug flag -g, run through gdb and correct the source
of the segmentation fault
●Load the valgrind module ie., 'module load valgrind' and
then run as 'valgrind ./foo.x'; this powerful profiling tool will
help identify memory leaks, or memory on the heap* which has not been
freed

●Write a Bash script that stores your home directory file contents in an
array and :
  ● Uses sed to swap vowels (eg., 'a' and 'e') in names

  ● Parses the array of names and returns only a single match, if it exists,

    else echo NO-MATCH
*heap== region of dynamically allocated memory
                                                                wjb19@psu.edu
GDB quick start
●Launch :

[wjb19@tesla1 scratch]$ gdb ./foo.x

●Run w/ command line argument '100' :

(gdb) run 100  

●Set breakpoint at line 10 in source file :

(gdb) b foo.c:10
Breakpoint 1 at 0x400594: file foo.c, line 10.
(gdb) run
Starting program: /gpfs/scratch/wjb19/foo.x 

Breakpoint 1, main () at foo.c:22
22       int IMAGE_SIZE = 10;

●Step to next instruction (issuing 'continue' will resume execution) :

(gdb) step
23       float * image = (float*) malloc (IMAGE_SIZE*sizeof(float));

●Print the value at index 2 in array 'image' :

(gdb) p image[2]
$4 = 0

●Display full backtrace :

(gdb) bt full
#0  main () at foo.c:27
        i = 0
        IMAGE_SIZE = 10
        image = 0x601010                                                wjb19@psu.edu
HPC Essentials
         Part II : Elements of Parallelism




             Bill Brouwer
Research Computing and Cyberinfrastructure
             (RCC), PSU




                                             wjb19@psu.edu
Outline
●Introduction
  ● Motivation

    ● HPC operations

  ● Multiprocessors

  ● Processes

  ● Memory Digression

    ● Virtual Memory

    ● Cache

●Threads

  ● POSIX

  ● OpenMP

  ● Affinity




                                  wjb19@psu.edu
Motivation
●The problems in science we seek to solve are becoming increasingly large, as
we go down in scale (eg., quantum chemistry) or up (eg., astrophysics)

●As a natural consequence, we seek both performance and scaling in our
scientific applications

●Therefore we want to increase floating point operations performed and memory
bandwidth and thus seek parallelization as we run out of resources using a
single processor

●We are limited by Amdahl's law, an expression of the maximum improvement of
parallel code over serial:

                                 1/((1-P) + P/N)

 where P is the portion of application code we parallelize, and N is the number of
processors ie., as N increases, the portion of remaining serial code becomes
increasingly expensive, relatively speaking; eg., with P = 0.9 and N = 16 the
maximum speedup is 1/(0.1 + 0.9/16) ≈ 6.4

                                                                     wjb19@psu.edu
Motivation
●Unless the portion of code we can parallelize approaches 100%, we see
rapidly diminishing returns with increasing numbers of processors

           [plot: improvement factor (0-12) vs. number of processors (0-256) for
           P = 10%, 30%, 60% and 90%; even at P = 90% the curve flattens out near
           a factor of 10]

●Nonetheless, for many applications we have a good chance of
parallelizing the vast majority of the code...

                                                                                                                    wjb19@psu.edu
Example : Kirchhoff Time Migration
●KTM is a technique used widely in oil+gas exploration, providing images
into the earth's interior, used to identify resources

●Seismic trace data acquired over 2D geometry is integrated to give
image of earth's interior, using ~ Green's method

●Input is generally 10^4 – 10^6 traces, 10^3 – 10^4 data points each, ie.,
lots of data to process; output image is also very large

●This is an integral technique (ie., summation, easy to parallelize), just
one of many popular algorithms performed in HPC


    [equation residue — KTM imaging formula: Image point = Σ (Weight × Trace Data)
    over traces; labels: x == image space, seismic space, t == traveltime]
                                                                 wjb19@psu.edu
Common Operations in HPC
●   Integration
     ● Load/store, add & multiply

     ● eg., transforms



●   Derivatives (Finite differences)
     ● Load/store, subtract & divide

     ● eg., PDE



●   Linear Algebra
     ● Load/store, subtract/add/multiply/divide

     ● chemistry & physics, solvers

     ● sparse (classical physics) & dense (quantum)



●Regardless of the operations performed, after compilation into machine code,
when executed by the CPU, instructions are clocked through a pipeline into
registers for execution

●Instruction execution generally takes place in four steps, and multiple
instruction groups are concurrent within the pipeline; execution rate is a direct
function of the clock rate
                                                                      wjb19@psu.edu
Execution Pipeline
 ●This is the most fine-grained form of parallelism; its efficiency is a strong
 function of branch prediction hardware, or the prediction of which instruction in a
 program is the next to execute*

 ●At a similar level, present in more recent devices are so-called streaming SIMD
 extension (SSE) registers and associated compute hardware


           [diagram: instruction pipeline across clock cycles 0-7; each instruction
           passes through 1. Fetch, 2. Decode, 3. Execute, 4. Write-back, so that at
           any cycle some instructions are pending, some executing and some completed]
*assisted by compiler hints                                                      wjb19@psu.edu
SSE
●Streaming SIMD (Single instruction, multiple Data) computation exploits special
registers and instructions to increase computation many-fold in certain cases,
since several data elements are operated on simultaneously

●Each of the 8 SSE registers (labeled xmm0 through xmm7) is 128 bits long,
storing 4 x 32-bit floating-point numbers; SSE2 and SSE3 specifications have
expanded the allowed datatypes to include doubles, ints etc
                 [diagram: a 128-bit xmm register holding four packed floats,
                 float3 | float2 | float1 | float0, from bit 127 down to bit 0]
●Operations may be 'scalar' or 'packed' (ie., vector), expressed using intrinsics or
an __asm block within C code eg.,

                                     addps   xmm0,xmm1
                                   operation dst operand src operand
●One can either code the intrinsics explicitly, or rely on the compiler eg., icc
with optimization (-O3); a sketch using intrinsics follows
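
●As a sketch, the same packed add written with compiler intrinsics (gcc/icc,
header xmmintrin.h) rather than hand-written assembly:

#include <xmmintrin.h>
#include <stdio.h>

int main(){
    //pack 4 floats per 128-bit register; _mm_add_ps maps to the addps instruction
    __m128 a = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
    __m128 b = _mm_set_ps(30.0f, 20.0f, 10.0f, 0.0f);
    __m128 c = _mm_add_ps(a, b);

    float out[4];
    _mm_storeu_ps(out, c);                            //unaligned store back to memory
    printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);
    return 0;
}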

●   The next level up of parallelization is the multiprocessor...
                                                                         wjb19@psu.edu
Multiprocessor Overview
●Multiprocessors or multiple core CPU's are becoming ubiquitous; better scaling
(cf Moore's law) but limited by contention for shared resources, especially
memory

●Most commonly we deal with Symmetric Multiprocessors (SMP), with unique
cache and registers, as well as shared memory region(s); more on cache in a
moment
      [diagram: SMP node — CPU0 and CPU1, each with private registers and cache,
      sharing main memory]

      ●Memory is not necessarily next to the processors → Non-uniform Memory Access
      (NUMA); try to ensure memory access is as local to the CPU core(s) as possible

      ●The proc directory on UNIX machines is a special directory written and updated
      by the kernel, containing information on CPU (/proc/cpuinfo) and memory
      (/proc/meminfo)

      ●The fundamental unit of work on the cores is a process...
                                                                   wjb19@psu.edu
Processes
●Application processes are launched on the CPU by the kernel using the
fork() system call; every process has a process ID pid, available on UNIX
systems via the getpid() system call

●The kernel manages many processes concurrently; all information required to
run a process is contained in the process control block (PCB) data structure,
containing (among other things):

    ●   The pid
    ●   The address space
    ●   I/O information eg., open files/streams
    ●   Pointer to next PCB

●Processes may spawn children using the fork() system call; children are
initially a copy of the parent, but may take on different attributes via the exec()
call
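
●A minimal sketch of fork()/exec() and the pid/ppid relationship (the ls command is
just an example program to exec):

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(){
    pid_t child = fork();                             //child starts as a copy of the parent

    if (child == 0){
        printf("child  pid=%d ppid=%d\n", (int) getpid(), (int) getppid());
        execlp("ls", "ls", "-l", (char*) NULL);       //replace the child's image
        perror("execlp");                             //only reached if exec fails
        return -1;
    }

    printf("parent pid=%d forked child %d\n", (int) getpid(), (int) child);
    wait(NULL);                                       //reap the child
    return 0;
}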



                                                                      wjb19@psu.edu
Processes
●A child process records the id of its parent (ppid), and additionally has its own
unique pid eg., output from the ps command, describing itself :
[wjb19@tesla1 ~]$ ps -eHo "%P %p %c %t %C" 
 PPID   PID COMMAND             ELAPSED %CPU
12608  1719     sshd           01:07:54  0.0
 1719  1724       sshd         01:07:49  0.0
 1724  1725         bash       01:07:48  0.0
 1725  1986           ps          00:00  0.0

●During a context switch, kernel will swap one process control block for another;
context switches are detrimental to HPC and have one or more triggers,
including:
  ● I/O requests

  ● Timer interrupts



●Context switching is a very fine-grained form of scheduling; on compute
clusters we also have coarse grained scheduling in the form of job scheduling
software (more next time)

●The unique address space from the perspective of the process is referred to as
virtual memory
                                                                    wjb19@psu.edu
Virtual Memory
●A running process is given memory by the kernel, referred to as virtual memory
(VM); address space does not correspond to physical memory address space

●The Memory Management Unit (MMU) on CPU translates between the two
address spaces, for requests made between process and OS

●Virtual Memory for every process has the same structure, below left; virtual
address space is divided into units called pages
    [diagram: process virtual address space — from low address to high address:
    instructions, heap, unused space, stack, function arguments, environment variables]

    ●The MMU is assisted in address translation by the Translation Lookaside Buffer
    (TLB), which stores page details in a cache

    ●Cache is high speed memory immediately adjacent to the CPU and its registers,
    connected via bus(es)

                                                                      wjb19@psu.edu
Cache : Introduction
●In HPC, we talk about problems being compute or memory bound



    ●   In the former case, we are limited by the rate at which instructions
        can be executed by the CPU
    ●   In the latter, we are limited by the rate at which data can be
        delivered to the CPU from memory

●Both instructions and data are loaded into cache; cache memory is laid
out in lines

●Cache memory is intermediate in the overall hierarchy, lying between
CPU registers and main memory

● If the executing process requests an address corresponding to data or
instructions in cache, we have a 'hit', else 'miss', and a much slower
retrieval of instruction or data from main memory must take place
                                                                    wjb19@psu.edu
Cache : Introduction
●Modern architectures have various levels of cache and divisions of
responsibilities, we will follow valgrind-cachegrind convention, from the
manual:

    ... It simulates a machine with independent first-level instruction and data caches
    (I1 and D1), backed by a unified second-level cache (L2). This exactly matches
    the configuration of many modern machines.
    However, some modern machines have three levels of cache. For these
    machines (in the cases where Cachegrind can auto-detect the cache
    configuration) Cachegrind simulates the first-level and third-level caches. The
    reason for this choice is that the L3 cache has the most influence on runtime, as it
    masks accesses to main memory. Furthermore, the L1 caches often have low
    associativity, so simulating them can detect cases where the code interacts badly
    with this cache (eg. traversing a matrix column-wise with the row length being a
    power of 2)




                                                                               wjb19@psu.edu
Cache Example
●The distribution of data to cache levels is largely set by the compiler,
hardware and kernel; however the programmer is still responsible for using
the best data access patterns possible in his/her code
●Use cachegrind to optimize data alignment & cache usage eg.,



#include <stdlib.h>
#include <stdio.h>

int main(){

        int SIZE_X,SIZE_Y;
        SIZE_X=2048;
        SIZE_Y=2048;

        float * data = (float*) malloc(SIZE_X*SIZE_Y*sizeof(float));

        for (int i=0; i<SIZE_X; i++)
                for (int j=0; j<SIZE_Y; j++)
                        data[j+SIZE_Y*i] = 10.0f * 3.14f;
                        //bad data access       
                        //data[i+SIZE_Y*j] = 10.0f * 3.14f;             

        free(data);

        return 0;
}
                                                                           wjb19@psu.edu
Cache : Bad Access
bill@bill-HP-EliteBook-6930p:~$ valgrind --tool=cachegrind ./foo.x
==3088== Cachegrind, a cache and branch­prediction profiler
==3088== Copyright (C) 2002­2010, and GNU GPL'd, by Nicholas Nethercote et al.
==3088== Using Valgrind­3.6.1 and LibVEX; rerun with ­h for copyright info
==3088== Command: ./foo.x
==3088== 
==3088== 
==3088== I   refs:      50,503,275
==3088== I1  misses:           734
==3088== LLi misses:           733                                      instructions
==3088== I1  miss rate:       0.00%
==3088== LLi miss rate:       0.00%
==3088==                                READ Ops        WRITE Ops
==3088== D   refs:      33,617,678  (29,410,213 rd   + 4,207,465 wr)
==3088== D1  misses:     4,197,161  (     2,335 rd   + 4,194,826 wr)
==3088== LLd misses:     4,196,772  (     1,985 rd   + 4,194,787 wr)    data
==3088== D1  miss rate:       12.4% (       0.0%     +      99.6%  )
==3088== LLd miss rate:       12.4% (       0.0%     +      99.6%  )
==3088== 
==3088== LL refs:        4,197,895  (     3,069 rd   + 4,194,826 wr)
==3088== LL misses:      4,197,505  (     2,718 rd   + 4,194,787 wr)
==3088== LL miss rate:         4.9% (       0.0%     +      99.6%  )
                                                                        lowest level




                                                                         wjb19@psu.edu
Cache : Good Access
bill@bill-HP-EliteBook-6930p:~$ valgrind --tool=cachegrind ./foo.x
==4410== Cachegrind, a cache and branch­prediction profiler
==4410== Copyright (C) 2002­2010, and GNU GPL'd, by Nicholas Nethercote et al.
==4410== Using Valgrind­3.6.1 and LibVEX; rerun with ­h for copyright info
==4410== Command: ./foo.x
==4410== 
==4410== 
==4410== I   refs:      50,503,275
==4410== I1  misses:           734
==4410== LLi misses:           733
==4410== I1  miss rate:       0.00%
==4410== LLi miss rate:       0.00%
==4410== 
==4410== D   refs:      33,617,678  (29,410,213 rd   + 4,207,465 wr)
==4410== D1  misses:       265,002  (     2,335 rd   +   262,667 wr)
==4410== LLd misses:       264,613  (     1,985 rd   +   262,628 wr)
==4410== D1  miss rate:        0.7% (       0.0%     +       6.2%  )
==4410== LLd miss rate:        0.7% (       0.0%     +       6.2%  )
==4410== 
==4410== LL refs:          265,736  (     3,069 rd   +   262,667 wr)
==4410== LL misses:        265,346  (     2,718 rd   +   262,628 wr)
==4410== LL miss rate:         0.3% (       0.0%     +       6.2%  )




                                                                        wjb19@psu.edu
Cache Performance
●For large data problems, any speedup introduced by parallelization can easily
be negated by poor cache utilization

●In this case, memory bandwidth is an order of magnitude worse for problem
size (2^14)^2 (cf earlier note on widely variable memory bandwidths; we have to
work hard to approach peak)

●   In many cases we are limited also by random access patterns

                        [plot: execution time (s, 0-12) vs. log2 SIZE_X (10-14); the
                        high-%-miss access pattern grows much faster with problem size
                        than the low-%-miss pattern]
                                                                    wjb19@psu.edu
Outline
●Introduction
  ● Motivation

    ● Computational operations

  ● Multiprocessors

  ● Processes

  ● Memory Digression

    ● Virtual Memory

    ● Cache

●Threads

  ● POSIX

  ● OpenMP

  ● Affinity




                                 wjb19@psu.edu
POSIX Threads I
●A process may spawn one or more threads; on a multiprocessor, the
OS can schedule these threads across a variety of cores, providing
parallelism in the form of 'light-weight processes' (LWP)

●Whereas a child process receives a copy of the parent's virtual memory
and executes independently thereafter, a thread shares the memory of
the parent including instructions, and also has private data

●Using threads we perform shared memory processing (cf distributed
memory, next time)

●We are at liberty to launch as many threads as we wish, although as you
might expect, performance takes a hit as more threads are launched
than can be scheduled simultaneously across available cores



                                                            wjb19@psu.edu
POSIX Threads II
●Pthreads refers to the POSIX standard, which is just a specification;
implementations exist for various systems

●Each pthread has:

 ● An ID

 ● Attributes :

   ● Stack size

   ● Schedule information



●Much like processes, we can monitor thread execution using utilities
such as top and ps

●The memory shared among threads must be used carefully in order to
prevent race conditions, or threads seeing incorrect data during
execution, due to more than one thread performing operations on said
data, in an uncoordinated fashion

                                                               wjb19@psu.edu
POSIX Threads III
●Race conditions may be ameliorated through careful coding, but also
through explicit constructs eg., locks, whereby a single thread gains and
relinquishes control→ implies serialization and computational overhead

●Multi-Threaded programs must also avoid deadlock, a highly undesirable
state where one or more threads await resources, and in turn are unable
to offer up resources required by others

●Deadlocks can also be avoided through good coding, as well as the use
of communication techniques based around semaphores, for example

●Threads awaiting resources may sleep (context switch by kernel, slow,
saves cycles) or busy wait (executes while loop or similar checking
semaphore, fast, wastes cycles)



                                                              wjb19@psu.edu
Pthreads Example
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum; 
void *worker(void *param);
                                                                              global (shared) variable
int main(int argc, char *argv[]){     main thread
        pthread_t tid;                                                         thread id & attributes
        pthread_attr_t attr;

        if (argc!=2 || atoi(argv[1])<0){
                printf("usage : a.out <int value>, where int value > 0n");
                return ­1;
        }  
        pthread_attr_init(&attr);
        pthread_create(&tid,&attr,worker,argv[1]);                               worker thread
        pthread_join(tid,NULL);
        printf("sum = %dn",sum);                                                creation & join
}                                                                                after completion
void * worker(void *total){

        int upper=atoi(total);
        sum = 0;                                                                local (private)
                                                                                variable
        for (int i=0; i<upper; i++)
                sum += i;

        pthread_exit(0);

}

                                                                                    wjb19@psu.edu
Valgrind-helgrind output
[wjb19@hammer16 scratch]$ valgrind --tool=helgrind -v ./foo.x 100 
==5185== Helgrind, a thread error detector
==5185== Copyright (C) 2007­2009, and GNU GPL'd, by OpenWorks LLP et al.
==5185== Using Valgrind­3.5.0 and LibVEX; rerun with ­h for copyright info
==5185== Command: ./foo.x 100
==5185== 
­­5185­­ Valgrind options:                     system calls establishing thread ie., there
­­5185­­    ­­tool=helgrind                    is a COST to create and destroy threads
­­5185­­    ­v
­­5185­­ Contents of /proc/version:
­­5185­­   Linux version 2.6.18­274.7.1.el5 (mockbuild@x86­004.build.bos.redhat.com) (gcc version 

­­5185­­ REDIR: 0x3a97e7c240 (memcpy) redirected to 0x4a09e3c (memcpy)
­­5185­­ REDIR: 0x3a97e79420 (index) redirected to 0x4a09bc9 (index)
­­5185­­ REDIR: 0x3a98a069a0 (pthread_create@@GLIBC_2.2.5) redirected to 0x4a0b2a5 
(pthread_create@*)
­­5185­­ REDIR: 0x3a97e749e0 (calloc) redirected to 0x4a05942 (calloc)
­­5185­­ REDIR: 0x3a98a08ca0 (pthread_mutex_lock) redirected to 0x4a076c2 (pthread_mutex_lock)
­­5185­­ REDIR: 0x3a97e74dc0 (malloc) redirected to 0x4a0664a (malloc)
­­5185­­ REDIR: 0x3a98a0a020 (pthread_mutex_unlock) redirected to 0x4a07b66 (pthread_mutex_unlock)
­­5185­­ REDIR: 0x3a97e79b50 (strlen) redirected to 0x4a09cbb (strlen)
­­5185­­ REDIR: 0x3a98a07a10 (pthread_join) redirected to 0x4a07431 (pthread_join)
sum = 4950
==5185== 
==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)
­­5185­­ 
­­5185­­ used_suppression:      1 helgrind­glibc2X­101
­­5185­­ used_suppression:      1 helgrind­glibc2X­112
­­5185­­ used_suppression:      1 helgrind­glibc2X­102
==5185== 
==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3)



                                                                                    wjb19@psu.edu
Pthreads: Race Condition
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;
void *worker(void *param);

int main(int argc, char *argv[]){

        pthread_t tid;
        pthread_attr_t attr;

        if (argc!=2 || atoi(argv[1])<0){
                printf("usage : a.out <int value>, where int value > 0n");
                return ­1;
        }
        pthread_attr_init(&attr);
        pthread_create(&tid,&attr,worker,argv[1]);
        int upper=atoi(argv[1]);                                         main thread works on
        sum=0;
        for (int i=0; i<upper; i++)                                      global variable as well,
                sum+=i;                                                  without synchronization/
        pthread_join(tid,NULL);                                        coordination
        printf("sum = %dn",sum);
}
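
●One way to remove this race (a sketch, not the fix from the original slides) is to
guard the shared sum with a mutex; both threads accumulate privately and only briefly
hold the lock (compile with -pthread):

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
void *worker(void *param);

int main(int argc, char *argv[]){

        pthread_t tid;

        if (argc!=2 || atoi(argv[1])<0){
                printf("usage : a.out <int value>, where int value > 0\n");
                return -1;
        }
        pthread_create(&tid,NULL,worker,argv[1]);

        int upper=atoi(argv[1]);
        int local=0;
        for (int i=0; i<upper; i++)                   //main thread accumulates privately...
                local+=i;
        pthread_mutex_lock(&sum_lock);                //...and only holds the lock to update sum
        sum+=local;
        pthread_mutex_unlock(&sum_lock);

        pthread_join(tid,NULL);
        printf("sum = %d\n",sum);                     //both contributions, race-free
}

void * worker(void *total){

        int upper=atoi(total);
        int local=0;
        for (int i=0; i<upper; i++)
                local+=i;

        pthread_mutex_lock(&sum_lock);                //same lock protects the shared global
        sum+=local;
        pthread_mutex_unlock(&sum_lock);

        pthread_exit(0);
}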




                                                                                 wjb19@psu.edu
Helgrind output w/ race
[wjb19@hammer16 scratch]$ valgrind --tool=helgrind ./foo.x 100 
==5384== Helgrind, a thread error detector
==5384== Copyright (C) 2007­2009, and GNU GPL'd, by OpenWorks LLP et al.
==5384== Using Valgrind­3.5.0 and LibVEX; rerun with ­h for copyright info
==5384== Command: ./foo.x 100
==5384== 
==5384== Thread #1 is the program's root thread
                                                                  built foo.x with debug     on (-g) to
==5384==                                                          find source file line(s)   w/
==5384== Thread #2 was created
==5384==    at 0x3A97ED447E: clone (in /lib64/libc­2.5.so)
                                                                  error(s)
==5384==    by 0x3A98A06D87: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread­2.5.so)
==5384==    by 0x4A0B206: pthread_create_WRK (hg_intercepts.c:229)
==5384==    by 0x4A0B2AD: pthread_create@* (hg_intercepts.c:256)
==5384==    by 0x400748: main (fooThread2.c:18)
==5384== 
==5384== Possible data race during write of size 4 at 0x600cdc by thread #1
==5384==    at 0x400764: main (fooThread2.c:20)
==5384==  This conflicts with a previous write of size 4 by thread #2
==5384==    at 0x4007E3: worker (fooThread2.c:31)
==5384==    by 0x4A0B330: mythread_wrapper (hg_intercepts.c:201)
==5384==    by 0x3A98A0673C: start_thread (in /lib64/libpthread­2.5.so)
==5384==    by 0x3A97ED44BC: clone (in /lib64/libc­2.5.so)
==5384==


●Pthreads is a versatile albeit large and inherently complicated interface

●We are primarily concerned with 'simply' dividing a workload among
available cores; OpenMP proves much less unwieldy to use
                                                                                    wjb19@psu.edu
OpenMP Introduction
●OpenMP is a set of multi-platform/OS compiler directives, libraries and
environment variables for readily creating multi-threaded applications

●The OpenMP standard is managed by the OpenMP Architecture Review Board, whose
members include the major hardware and software vendors

●Applications written using OpenMP employ pragmas, or statements interpreted
by the preprocessor (before compilation), representing functionality like fork &
join that would take considerably more effort and care to implement otherwise

●OpenMP pragmas or directives indicate parallel sections of code ie., after
compilation, at runtime, threads are each given a portion of work eg., in this
case, loop iterations will be divided evenly among running threads :
#pragma omp parallel for
for (int i=0; i<SIZE; i++)
    y[i]=x[i]*10.0f;



                                                                      wjb19@psu.edu
OpenMP Clauses I
●The number of threads launched during parallel blocks may be set via function
calls or by setting the OMP_NUM_THREADS environment variable

●Data objects are generally by default shared (loop counters are private by
default), a number of pragma clauses are available, which are valid for the
scope of the parallel section eg., :
  ● private

  ● shared

  ● firstprivate -initialized to value before parallel block

  ● lastprivate -variable keeps value after parallel block

  ● reduction -thread safe way of combining data at conclusion of parallel
    block (see the sketch at the end of this slide)

●Thread synchronization is implicit to parallel sections; there are a variety of
clauses available for controlling this behavior also, including :
  ● critical-one thread at a time works in this section eg., in order to avoid


    race (expensive, design your code to avoid at all costs)
  ● atomic- safe memory updates performed using eg., mutual exclusion (cost)

  ● barrier-threads wait at this point for others to arrive
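
●As an illustration of the reduction clause above (a sketch; compare with the earlier
pthreads race example; compile with -fopenmp):

#include <stdio.h>

int main(){
        int upper = 100;
        int sum = 0;

        //each thread keeps a private partial sum; OpenMP combines them safely at the end
        #pragma omp parallel for reduction(+:sum)
        for (int i=0; i<upper; i++)
                sum += i;

        printf("sum = %d\n",sum);                     //4950, with no race and no explicit lock
        return 0;
}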
                                                                       wjb19@psu.edu
OpenMP Clauses II
●OpenMP has default thread scheduling behavior handled via the runtime library,
which may be modified through use of the schedule(type,chunk) clause, with
the types below (a short example follows the list):

    ●   static - loop iterations are divided among threads equally by default;
        specifying an integer for the parameter chunk will allocate a number of
        contiguous iterations to a thread

    ●   dynamic - total iterations form a pool, from which threads work on small
        contiguous subsets until all are complete, with subset size given again by
        chunk

    ●   guided - a large section of contiguous iterations are allocated to each
        thread dynamically. The section size decreases exponentially with each
        successive allocation to a minimum size specified by chunk
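
●For example, a sketch assuming iteration costs that vary with i, so dynamic scheduling
with a modest chunk helps balance the load (compile with -fopenmp -lm):

#include <math.h>
#include <stdio.h>
#define N 100000

int main(){
        static double y[N];

        //iterations cost different amounts, so hand out chunks of 64 iterations
        //from a shared pool as threads become free
        #pragma omp parallel for schedule(dynamic,64)
        for (int i=0; i<N; i++){
                double acc = 0.0;
                for (int j=0; j<i%1000; j++)          //work per iteration varies with i
                        acc += sqrt((double)j);
                y[i] = acc;
        }

        printf("y[N-1] = %f\n", y[N-1]);
        return 0;
}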




                                                                       wjb19@psu.edu
OpenMP Example : KTM
●In our first attempt at parallelization shortly, we simply add an OpenMP pragma
before the computational loops in worker function:
#pragma omp parallel for
//loop over trace records
for (int k=0; k<config->traceNo; k++){

     //loop over imageX
     for(int i=0; i<Li; i++){
          tempC = ( midX[k] - imageXX[i]-offX[k]) * (midX[k]- imageXX[i]-offX[k]);
          tempD = ( midX[k] - imageXX[i]+offX[k]) * (midX[k]- imageXX[i]+offX[k]);

          //loop over imageY
          for(int j=0; j<Lj; j++){
               tempA = tempC + ( midY[k] - imageYY[j]-offY[k]) * (midY[k]- imageYY[j]-offY[k]);
               tempB = tempD + ( midY[k] - imageYY[j]+offY[k]) * (midY[k]- imageYY[j]+offY[k]);

               //loop over imageZ
               for (int l=0; l<Ll; l++){
                    temp = sqrtf(tauS[l] + tempA * slownessS[l]);
                    temp += sqrtf(tauS[l] + tempB * slownessS[l]);
                    timeIndex = (int) (temp / sRate);

                    if ((timeIndex < config->tracePts) && (timeIndex > 0)){
                          image[i*Lj*Ll + j*Ll + l] +=
                          traces[timeIndex + k * config->tracePts] * temp *sqrtf(tauS[l] / temp);
                   }
               } //imageZ
          } //imageY
     } //imageX
}//input trace records


                                                                                     wjb19@psu.edu
OpenMP KTM Results
●Scales well up to eight cores, then drops off; SMP model has deficiencies due
to a number of factors, including :

    ●   Coverage (Amdahl's law); as we increase processors, relative cost of serial
        code portion increases
    ●   Hardware limitations
    ●   Locality...
                     [plot: execution time vs. CPU cores (1, 2, 4, 8, 16); time falls
                     steadily up to 8 cores, then the improvement drops off at 16]

                                                                        wjb19@psu.edu
CPU Affinity (Intel*)
  ●Recall that the OS schedules processes and threads using context
  switches; these can be detrimental → threads may resume on a different core,
  destroying locality

  ●We can change this by restricting threads to execute on a subset of
  processors, by setting processor affinity

  ●Simplest approach is to set environment variable KMP_AFFINITY to:
    ● determine the machine topology,

    ● assign threads to processors



  ●Usage:
      KMP_AFFINITY=[<modifier>]<type>[<permute>][<offset>] 




*For GNU, ~ equivalent env var == GOMP_CPU_AFFINITY            wjb19@psu.edu
CPU Affinity Settings
●The modifier may take settings corresponding to granularity (with specifiers:
fine, thread, and core), as well as a processor list (proclist={<proc-list>}),
verbose, warnings and others

●   The type settings refer to the nature of the affinity, and may take values :
     ● compact-try to assign thread n+1 context as close as possible to n

     ● disabled

     ● explicit-force assign of threads to processors in proclist

     ● none-just return the topology w/ verbose modifier

     ● scatter-distribute as evenly as possible




●fine & thread refer to the same thing, namely that threads only resume in
the same context; the core modifier implies that they may resume within a
different context, but the same physical core

●CPU Affinity can affect application performance significantly and is worth tuning,
based on your application and the machine topology...

                                                                          wjb19@psu.edu
CPU Topology Map
●For any given computational node, we have several different physical devices
(packages in sockets), comprised of cores (eg., two here), which run one or two
thread contexts

●Without hyperthreading, there is only a single context per core ie., modifiers
thread/fine, core are indistinguishable



             [diagram: a node containing two packages (packageA, packageB); each package
             has two cores (core0, core1), and each core provides two thread contexts (0, 1)]

                                                                                      wjb19@psu.edu
CPU Affinity Examples
●Display machine topology map eg,. Hammer :
[wjb19@hammer16 scratch] $ export KMP_AFFINITY=verbose,none
[wjb19@hammer16 scratch] $ ./psktm.x
OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #156: KMP_AFFINITY: 12 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores)
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11}




                                                                                 wjb19@psu.edu
CPU Affinity Examples
●Set affinity with compact setting, fine granularity :
[wjb19@hammer5 scratch]$ export KMP_AFFINITY=verbose,granularity=fine,compact
[wjb19@hammer5 scratch]$ ./psktm.x 
OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids.
OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11}
OMP: Info #156: KMP_AFFINITY: 12 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores)
OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 10 
OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0 
OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 1 
OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2 
OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 8 
OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 9 
OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 10 
OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {2}
OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {10}
OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {6}
OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {1}
OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {9}
OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {5}
OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {3}
OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {11}
                                                                                wjb19@psu.edu
Conclusions
●Scientific research is supported by computational scaling and performance,
both provided by parallelism, limited to some extent by Amdahl's law

●Parallelism has various levels of granularity; at the finest level is the instruction
pipeline and vectorized registers eg., SSE

●The next level up in parallel granularity is the multiprocessor; we may run many
concurrent threads using the pthreads API or the OpenMP standard for instance

●Threads must be coded and handled with care, to avoid race and deadlock
conditions

●Performance is a strong function of cache utilization; benefits introduced
through parallelization can easily be negated by sloppy use of memory
bandwidth

●Scaling across cores is limited by hardware and Amdahl's law, but also by locality; we
have some control over the latter using KMP_AFFINITY for instance


                                                                         wjb19@psu.edu
References
●Valgrind (buy the manual, worth every penny)
  ● http://valgrind.org/

●OpenMP

  ● http://openmp.org/wp/

●GNU OpenMP

  ● http://gcc.gnu.org/projects/gomp/

●Summary of OpenMP 3.0 C/C++ Syntax

  ● http://openmp.org/mp-documents/OpenMP3.1-CCard.pdf

●Summary of OpenMP 3.0 Fortran Syntax

  ● http://www.openmp.org/mp-documents/OpenMP3.0-FortranCard.pdf

●Nice SSE tutorial

  ● http://neilkemp.us/src/sse_tutorial/sse_tutorial.html

●Intel Nehalem

  ● http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29

●GNU Make

  ● http://www.gnu.org/s/make/

●Intel hyperthreading

  ● http://en.wikipedia.org/wiki/Hyper-threading


                                                             wjb19@psu.edu
Exercises



●Take the supplied code and parallelize it using an OpenMP
pragma around the worker function (a sketch follows below)
●Create a makefile which builds the code, compare timings

btwn serial & parallel by varying OMP_NUM_THREADS
●Examine effect of various settings for KMP_AFFINITY
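●A minimal sketch of the intended pattern (the loop body and names here are
illustrative, not taken from the supplied code) :

#include <omp.h>

//illustrative worker: iterations must be independent for this to be safe
void worker(float* out, const float* in, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = in[i] * in[i];
}

●Build with icc -openmp (or gcc -fopenmp) and vary OMP_NUM_THREADS at runtime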




                                                  wjb19@psu.edu
Build w/ Confidence : make
   #Makefile for basic Kirchhoff Time Migration example

   #set compiler
   CC=icc -openmp

   #set build options
   CFLAGS=-std=c99 -c

   #main executable
   all: psktm.x

   #objects and dependencies
   psktm.x: psktmCPU.o demoA.o
           $(CC) psktmCPU.o demoA.o -o psktm.x

   psktmCPU.o: psktmCPU.c
           $(CC) $(CFLAGS) psktmCPU.c

   demoA.o: demoA.c
           $(CC) $(CFLAGS) demoA.c

   clean:
           rm -rf *.o psktm.x


                                                          wjb19@psu.edu
Note: recipe (command) lines in a Makefile must be indented with a tab, not spaces!
HPC Essentials
       Part III : Message Passing Interface




             Bill Brouwer
Research Computing and Cyberinfrastructure
             (RCC), PSU




                                              wjb19@psu.edu
Outline
●Motivation
●Interprocess Communication

    ● Signals

    ● Sockets & Networks

●procfs Digression

●Message Passing Interface

  ● Send/Receive

  ● Communication

  ● Parallel Constructs

  ● Grouping Data

  ● Communicators & Topologies




                                 wjb19@psu.edu
Motivation
●We saw last time that Amdahl's law implies an asymptotic limit to
performance gains from parallelism, where parallel P and serial code (1-
P) portions have fixed relative cost

●We looked at threads (“light-weight processes”) and also saw that
performance depends on a variety of things, including good cache
utilization and affinity

●For the problem size investigated, ultimately the limiting factor was disk
I/O, so there was no sense going beyond a single compute node; in a
machine with 16 cores or more, there is little point in adding cores when
P < 60%, provided the process has sufficient memory

●However, as we increase our problem size, the relative parallel/serial
cost changes and P can approach 1


                                                                wjb19@psu.edu
Motivation
●In the limit as the number of processors N → ∞, we find the maximum performance
improvement :
                                        1/(1-P)
●It is helpful to see the 3dB point for this limit ie., the number of processors N½
required to achieve (1/√2)*max = 1/(√2*(1-P)); equating with Amdahl's law &
after some algebra :
                               N½ ≈ 1/((1-P)*(√2-1))
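●Spelling out the algebra (a short sketch; S(N) denotes the Amdahl speedup on N
processors) :

    S(N) = \frac{1}{(1-P) + P/N}, \qquad
    S_{\max} = \lim_{N \to \infty} S(N) = \frac{1}{1-P}

    % set S(N_{1/2}) = S_{\max}/\sqrt{2} and solve :
    \frac{1}{(1-P) + P/N_{1/2}} = \frac{1}{\sqrt{2}\,(1-P)}
    \;\Rightarrow\; \frac{P}{N_{1/2}} = (\sqrt{2}-1)(1-P)
    \;\Rightarrow\; N_{1/2} = \frac{P}{(\sqrt{2}-1)(1-P)}
    \approx \frac{1}{(\sqrt{2}-1)(1-P)} \quad (P \to 1)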

                 [Figure: N½ (y-axis, 0 to 300) plotted against parallel code
                  fraction P (x-axis, 0.90 to 0.99); N½ rises steeply as P
                  approaches 1]
                                                                                               wjb19@psu.edu
Motivation
●Points to note from the graph :

 ● P ~ 0.90, we can benefit from ~ 20 cores

 ● P ~ 0.99, we can benefit from a cluster size of ~ 256 cores

 ● P → 1, we approach the “embarrassingly parallel” limit

 ● P ~ 1, performance improvement directly proportional to cores

 ● P ~ 1 implies independent or batch processes



●Quite aside from considerations of Amdahl's law, as the problem size
grows, we may simply exceed the memory available on a single node

●In this case, must move to a distributed memory processing
model/multiple nodes (unless P ~ 1 of course)

●How do we determine P? → PROFILING




                                                              wjb19@psu.edu
Profiling w/ Valgrind
 [wjb19@lionxf scratch]$ valgrind --tool=callgrind ./psktm.x
 [wjb19@lionxf scratch]$ callgrind_annotate --inclusive=yes callgrind.out.3853 
 ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
 Profile data file 'callgrind.out.3853' (creator: callgrind­3.5.0)
 ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
 I1 cache: 
 D1 cache: 
 L2 cache: 
 Timerange: Basic block 0 - 2628034011
 Trigger: Program termination
 Profiled target:  ./psktm.x (PID 3853, part 1)

 [Callout: the parallelizable worker function accounts for 99.5% of
  total instructions executed]
 ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
 20,043,133,545  PROGRAM TOTALS
 ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
             Ir  file:function
 ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­
 20,043,133,545  ???:0x0000003128400a70 [/lib64/ld­2.5.so]
 20,042,523,959  ???:0x0000000000401330 [/gpfs/scratch/wjb19/psktm.x]
 20,042,522,144  ???:(below main) [/lib64/libc­2.5.so]
 20,042,473,687  /gpfs/scratch/wjb19/demoA.c:main
 20,042,473,687  demoA.c:main [/gpfs/scratch/wjb19/psktm.x]
 19,934,044,644  psktmCPU.c:ktmMigrationCPU [/gpfs/scratch/wjb19/psktm.x]
 19,934,044,644  /gpfs/scratch/wjb19/psktmCPU.c:ktmMigrationCPU
  6,359,083,826  ???:sqrtf [/gpfs/scratch/wjb19/psktm.x]
  4,402,442,574  ???:sqrtf.L [/gpfs/scratch/wjb19/psktm.x]
    104,966,265  demoA.c:fileSizeFourBytes [/gpfs/scratch/wjb19/psktm.x]

If we wish to scale outside a single node, we must use some form of interprocess
communication
                                                                        wjb19@psu.edu
Inter-Process Communication
●   There are a variety of ways for processes to exchange information, including:
     ● Memory (~last week)

     ● Files

     ● Pipes (named/anonymous)

     ● Signals

     ● Sockets

     ● Message Passing



●   File I/O is too slow, and reads/writes are liable to race conditions

● Anonymous & named pipes are highly efficient but FIFO (first in, first out)
buffers, allowing only unidirectional communication, and between processes on
the same node

●Signals are a very limited form of communication, sent to the process after an
interrupt by the kernel, and handled using a default handler or one specified
using signal() system call

●Signals may come from a variety of sources eg., segmentation fault (SIGSEGV),
keyboard interrupt Ctrl-C (SIGINT) etc
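●A minimal sketch of installing a custom handler with signal() (illustrative only;
production code should prefer sigaction() and keep handlers short) :

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigint = 0;

//the handler just records that the signal arrived
static void handle_sigint(int sig)
{
    got_sigint = 1;
}

int main(void)
{
    signal(SIGINT, handle_sigint);   //replace default Ctrl-C behaviour
    while (!got_sigint)
        sleep(1);                    //pretend to do work
    printf("caught SIGINT, shutting down cleanly\n");
    return 0;
}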
                                                                       wjb19@psu.edu
Signals
●strace is a powerful utility in UNIX which shows the interaction between a
running process and kernel in the form of system calls and signals; here, a
partial output showing mapping of signals to defaults with system call
sigaction(), from ./psktm.x :
                                                            UNIX signals
rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGILL, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGABRT, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGFPE, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGBUS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSEGV, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGSYS, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0
rt_sigaction(SIGPIPE, NULL, {SIG_DFL, [], 0}, 8) = 0

●Signals are crude and restricted to local communication; to communicate
remotely, we can establish a socket between processes, and communicate over
the network

                                                                   wjb19@psu.edu
Sockets & Networks
●Davies/Baran first devised packet switching, an efficient means of
communication over a channel; a computer was conceived to realize their
design and ARPANET went online Oct 1969 between UCLA and the Stanford
Research Institute (SRI)

●TCP/IP became the communication protocol of ARPANET 1 Jan 1983, which
was retired in 1990 and NFSNET established; university networks in the US and
Europe join

●TCP/IP is just one of many protocols, which describes the format of data
packets, and the nature of the communication; an analogous connection method
is used by Infiniband networks in conjunction with Remote Direct Memory
Access (RDMA)

●User Datagram Protocol (UDP) is analogous to a connectionless method
of communication used by Infiniband high performance networks



                                                                 wjb19@psu.edu
Sockets : UDP host example
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <unistd.h> /* for close() for socket */ 
#include <stdlib.h>
 
int main(void)
{
  //creates an endpoint & returns file descriptor
  //uses IPv4 domain, datagram type, UDP transport
  int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
  
  //socket address object (sa) and memory buffer
  struct sockaddr_in sa; 
  char buffer[1024];
  ssize_t recsize;
  socklen_t fromlen;
 
  //specify same domain type, any input address and port 7654 to listen on
  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = INADDR_ANY;
  sa.sin_port = htons(7654);
  fromlen = sizeof(sa);
 

 
   
                                                                             wjb19@psu.edu
Sockets : host example cont.

  //we bind an address (sa) to the socket using fd sock
  if (-1 == bind(sock,(struct sockaddr *)&sa, sizeof(sa)))
  {
    perror("error bind failed");
    close(sock);
    exit(EXIT_FAILURE);
  } 
 
  for (;;) 
  {
    //listen and dump buffer to stdout where applicable
    printf("recv test....\n");
    recsize = recvfrom(sock, (void *)buffer, 1024, 0, (struct sockaddr *)&sa, &fromlen);
    if (recsize < 0) {
      fprintf(stderr, "%s\n", strerror(errno));
      exit(EXIT_FAILURE);
    }
    printf("recsize: %zd\n", recsize);
    sleep(1);
    printf("datagram: %.*s\n", (int)recsize, buffer);
  }
}
 
   




                                                                                      wjb19@psu.edu
Sockets : client example
int main(int argc, char *argv[])
{
  //create a buffer with character data
  int sock;
  struct sockaddr_in sa;
  int bytes_sent;
  char buffer[200];
 
  strcpy(buffer, "hello world!");
 
  //create a socket, same IP and transport as before, address of host 127.0.0.1
  sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
  if (-1 == sock) /* if socket failed to initialize, exit */
    {
      printf("Error Creating Socket");
      exit(EXIT_FAILURE);
    }
 
  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = inet_addr("127.0.0.1");
  sa.sin_port = htons(7654);
 
  bytes_sent = sendto(sock, buffer, strlen(buffer), 0,(struct sockaddr*)&sa, sizeof sa);
  if (bytes_sent < 0) {
    printf("Error sending packet: %sn", strerror(errno));
    exit(EXIT_FAILURE);
  }
 
  close(sock); /* close the socket */
  return 0;
}
●You can monitor sockets by using the netstat facility, which takes its data
from /proc/net                                                     wjb19@psu.edu
Outline
●Motivation
●Interprocess Communication

    ● Signals

    ● Sockets & Networks

●procfs Digression

●Message Passing

  ● Send/Receive

  ● Communication

  ● Parallel Constructs

  ● Grouping Data

  ● Communicators & Topologies




                                 wjb19@psu.edu
procfs
●We mentioned the /proc directory previously in the context of cpu and
memory information, which is frequently referred to as the proc filesystem or
procfs

●It is a veritable treasure trove of information, generated on the fly by the kernel,
and is used by a variety of tools eg., ps

●   Each running process is assigned a directory, whose name is the process id

●Each directory contains text files and subdirectories with every detail of a
running process, including context switching statistics, memory management,
open file descriptors and much more

●Much like the ptrace() system call, procfs also gives user applications the
ability to directly manipulate running processes, given sufficient permission; you
can explore that on your own :)
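●For a quick look from the shell (here $$ expands to the PID of the current shell;
output will of course differ on your system) :

[wjb19@hammer1 ~]$ grep Vm /proc/$$/status
[wjb19@hammer1 ~]$ tr '\0' ' ' < /proc/$$/cmdline; echo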


                                                                         wjb19@psu.edu
procfs : examples
●   Some of the more useful files :
     ●   /proc/PID/cmdline : command used to launch process
     ●   /proc/PID/cwd : current working directory
     ●   /proc/PID/environ : environment variables for the process
     ●   /proc/PID/fd : directory w/ symbolic link for each open file descriptor eg., streams
     ●   /proc/PID/status : information including signals, state, memory usage
     ●   /proc/PID/maps : memory map between virtual and physical addresses
●   eg., contents of the fd directory for running process ./psktm.x :
[wjb19@hammer1 fd]$ ls -lah
total 0
dr-x------ 2 wjb19 wjb19  0 Dec  7 12:13 .
dr-xr-xr-x 6 wjb19 wjb19  0 Dec  7 12:10 ..
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 0 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 1 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 2 -> /dev/pts/28
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 3 -> /gpfs/scratch/wjb19/inputDataSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 4 -> /gpfs/scratch/wjb19/inputSrcXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 5 -> /gpfs/scratch/wjb19/inputSrcYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 6 -> /gpfs/scratch/wjb19/inputRecXSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 7 -> /gpfs/scratch/wjb19/inputRecYSmall.bin
lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 8 -> /gpfs/scratch/wjb19/velModel.bin 



                                                                               wjb19@psu.edu
procfs : status file extract
[wjb19@hammer1 30769]$ more status
Name:     psktm.x
State:    R (running)
SleepAVG:      0%
Tgid:     30769
Pid: 30769
PPid:     30687
TracerPid:     0
Uid: 2511 2511 2511 2511
Gid: 2530 2530 2530 2530
FDSize: 256
Groups: 2472 2530 3835 4933 5505 5732 
VmPeak:    65520 kB
VmSize:    65520 kB
VmLck:           0 kB
VmHWM:       37016 kB
VmRSS:       37016 kB
VmData:    51072 kB
VmStk:          88 kB                      Virtual memory usage
VmExe:          64 kB
VmLib:        2944 kB
VmPTE:         164 kB
StaBrk: 1289a000 kB
Brk: 128bb000 kB
StaStk: 7fffbd0a0300 kB
Threads: 5
SigQ:     0/398335
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000                 signals
SigIgn: 0000000000000000
SigCgt: 0000000180000000
                                                                  wjb19@psu.edu
Outline
●Motivation
●Interprocess Communication

    ● Signals

    ● Sockets & Networks

●procfs Digression

●Message Passing Interface

  ● Send/Receive

  ● Communication

  ● Parallel Constructs

  ● Grouping Data

  ● Communicators & Topologies




                                 wjb19@psu.edu
Message Passing Interface (MPI)
●Classical von Neumann machine has single instruction/data stream (SISD) →
single process & memory

●Multiple Instruction, multiple data (MIMD) system → connected processes are
asynchronous, generally distributed memory (may also be shared where
processes on single node)

●MIMD processors are connected in some network topology; we don't have to
worry about the details, MPI abstracts this away

●MPI is a standard for parallel programming first established in 1991, updated
occasionally, by academics and industry

●It comprises routines for point-to-point and collective communication, with
bindings to C/C++ and fortran

●Depending on the underlying network fabric, communication may be TCP- or UDP-
like, eg., in Infiniband networks

                                                                     wjb19@psu.edu
MPI : Basic communication
●Multiple, distributed processes are spawned at initialization, each process
assigned a unique rank 0,1,...,p-1

●   One may send information referencing process rank eg.,:

          MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);

     Buffer address            Rank of rcv

●   This function has a receive analogue; both routines are blocking by default

●Send/receive statements generally occur in same code, processors execute
appropriate statement according to rank & code branch

●Non-blocking functions are also available, allowing communicating processes to
continue execution where able


                                                                       wjb19@psu.edu
MPI : Requisite functions

●Bare minimum → initialize, get rank for process, total processes and
finalize when done

MPI_Init(&argc, &argv); //Start up
MPI_Comm_rank(MPI_COMM_WORLD,&my_rank); //My rank
MPI_Comm_size(MPI_COMM_WORLD, &p); //No. processors
MPI_Finalize(); //close up shop

●MPI_COMM_WORLD is a communicator parameter, a collection of
processes that can send messages to each other.

●Messages are sent with tags to identify them, allowing specificity beyond
using just a source/destination parameter



                                                              wjb19@psu.edu
MPI : Datatypes

MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float 
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
MPI_BYTE
MPI_PACKED


                                          wjb19@psu.edu
Minimal MPI example
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
        int rank, size;
        int buffer[10];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank > 0)
        {
                //each non-root rank fills its buffer and sends it to rank 0
                for (int i = 0; i < 10; i++)
                        buffer[i] = i * rank;

                MPI_Send(buffer, 10, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else {
                //rank 0 receives from every other rank in turn
                for (int i = 1; i < size; i++){
                        MPI_Recv(buffer, 10, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
                        printf("buffer element 0 : %i from proc : %i \n", buffer[0], i);
                }
        }
        MPI_Finalize();
        return 0;
}
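●To build and launch the example above (assuming an OpenMPI-style compiler wrapper
and launcher; the file name and process count here are arbitrary) :

[wjb19@lionxf scratch]$ mpicc minimal.c -o minimal.x
[wjb19@lionxf scratch]$ mpirun -np 4 ./minimal.x
buffer element 0 : 0 from proc : 1 
buffer element 0 : 0 from proc : 2 
buffer element 0 : 0 from proc : 3 

●Note buffer element 0 is always zero here, since it is computed as 0 * rank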

                                                                                wjb19@psu.edu
MPI : Collective Communication

●   A communication pattern involving all processes in a communicator is
    a collective communication eg., a broadcast
●   Same data sent to every process in communicator, more efficient
    than using multiple p2p routines, optimized :
MPI_Bcast(void* message, int count, MPI_Datatype type, 
               int root, MPI_Comm comm)
●   Sends a copy of the data in message from the root process to every
    process in comm, a one-to-all (map) operation
●   Collective communication is at the heart of efficient parallel
    operations
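●   A sketch of typical use, with the root reading a parameter and broadcasting it
    (names are illustrative) :

int n = 0;
if (my_rank == 0)
    n = 1024;                //eg., problem size read from input on the root
//after this call, every process in MPI_COMM_WORLD holds the same value of n
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);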




                                                                 wjb19@psu.edu
Parallel Operations : Reduction
●   Data may be gathered/reduced after computation via :
MPI_Reduce(void* operand, void* result, int count, 
MPI_Datatype type, MPI_Op operator, int root, MPI_Comm 
comm)
●   Combines all operand, using operator and stores result on
    process root, in result
●   A tree-structured reduce at all nodes == MPI_Allreduce,ie., every
    process in comm gets a copy of the result

            [Diagram: processes 1, 2, 3, ..., p-1 combine their operands in a
             tree structure, with the final result landing on root process 0]
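●   For example, summing per-process partial results onto rank 0 (a sketch;
    local_work() is a hypothetical function) :

float partial = local_work(my_rank);   //hypothetical per-rank computation
float total   = 0.0f;
//sum all partial values; only the root (rank 0) receives the result in total
MPI_Reduce(&partial, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);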
                                                          wjb19@psu.edu
Reduction Ops

MPI_MAX
MPI_MIN
MPI_SUM
MPI_PROD
MPI_LAND         Logical and
MPI_BAND         Bitwise and
MPI_LOR          Logical or
MPI_BOR          Bitwise or
MPI_LXOR         Logical XOR
MPI_BXOR         Bitwise XOR
MPI_MAXLOC       Max w/ location
MPI_MINLOC       Min w/ location


                                   wjb19@psu.edu
Parallel Operations : Scatter/Gather
●   Bulk transfers of many-to-one and one-to-many are accomplished by
    gather and scatter operations respectively
●   These operations form the kernel of matrix/vector operations for
    example; they are useful for distributing and reassembling arrays




       [Diagram: Gather collects elements x0, x1, x2, x3 from processes 0-3 onto a
        single process; Scatter distributes the elements of a row (a00 a01 a02 a03)
        held on one process out to processes 0-3]
                                                                    wjb19@psu.edu
Scatter/Gather Syntax
●   MPI_Gather(void* send_data, int send_count, MPI_Datatype 
    send_type, void* recv_data, int recv_count, MPI_Datatype 
    recv_type, int root, MPI_Comm comm)
●   Collects data referenced by send_data from each process in comm and
    stores data in process rank order on process w/ rank root, in memory
    referenced by recv_data
●   MPI_Scatter(void* send_data, int send_count, 
    MPI_Datatype send_type, void* recv_data, int recv_count, 
    MPI_Datatype recv_type, int root, MPI_Comm comm)
●   Splits data referenced by send_data on process w/ rank root into
    segments, send_count elements each, w/ send_type & distributed in
    order to processes
●   For gather result to ALL processes → MPI_Allgather
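●   A sketch in the spirit of the matrix/vector picture above (names and sizes are
    illustrative; assumes the number of rows equals the number of processes) :

//root holds A (p*n floats); every process receives one row of n floats
MPI_Scatter(A, n, MPI_FLOAT, my_row, n, MPI_FLOAT, 0, MPI_COMM_WORLD);

float y = dot(my_row, x, n);           //hypothetical local computation

//collect one float from each process, in rank order, on the root
MPI_Gather(&y, 1, MPI_FLOAT, results, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);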



                                                                wjb19@psu.edu
Grouping Data I
●   Communication is expensive → bundle variables into single message
●   We must define a derived type that can describe the heterogeneous
    contents of a message using type and displacement pairs
●   Several ways to build this MPI_Datatype eg.,

MPI_Type_struct(int count,
int block_lengths[], //contains no. entries in each block
MPI_Aint displacements[], //element offset from msg start; MPI_Aint allows for addresses > int
MPI_Datatype typelist[], //exactly that
MPI_Datatype* new_mpi_t) //a pointer to this new type
●   A very general derived type, although arrays to struct must be constructed
    explicitly using other MPI commands
●   Simpler types exist for less heterogeneous data eg., MPI_Type_vector,
    MPI_Type_contiguous, MPI_Type_indexed




                                                                    wjb19@psu.edu
Grouping Data II
●   Before these derived types can be used by a communication function, they
    must be committed with an MPI_Type_commit function call
●   In order for message to be received, type signatures at send and
    receive must be compatible; if a collective communication, signatures
    must be identical
●   MPI_Pack & MPI_Unpack are useful for when messages of
    heterogeneous data are infrequent, and cost of constructing derived
    type outweighs benefit
●   These methods also allow buffering in user versus system memory,
    and the number of items transmitted is in the message itself
●   Grouping data allows for sophisticated objects; we can also create more
    fine-grained communication objects
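●   A simple sketch with MPI_Type_contiguous, committed before use (names here
    are illustrative) :

MPI_Datatype row_t;
//a new type describing n contiguous floats, eg., one matrix row
MPI_Type_contiguous(n, MPI_FLOAT, &row_t);
MPI_Type_commit(&row_t);

MPI_Send(&A[i*n], 1, row_t, dest, 0, MPI_COMM_WORLD);   //send row i as one element
MPI_Type_free(&row_t);                                  //release when no longer needed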


                                                              wjb19@psu.edu
Communicators
●   Process subsets or groups expand communication beyond simple
    p2p and broadcast communication, to create :
    ●   Intra-communicators → communicate among one another and
        participate in collective communication, composed of :
         –   an ordered collection of processes (group)
         –   a context
    ●   Inter-communicators → communicate between different groups
●   Communicators/groups are opaque, internals not directly accessible;
    these objects are referenced by a handle




                                                            wjb19@psu.edu
Communicators Cont.
●   Internal contents manipulated by methods, much like private data in C++
    class objects eg.,
     ●  int MPI_Group_incl(MPI_Group old_group,int 
        new_group_size, int ranks_in_old_group[], MPI_Group* 
        new_group) → create a new_group from old_group, using
        ranks_in_old_group[] etc
    ●   int MPI_Comm_create(MPI_Comm old_comm, MPI_Group 
        new_group, MPI_Comm* new_comm) → create a new communicator
        from the old, with context
●   MPI_Comm_group and MPI_Group_incl are local methods without
    communication; MPI_Comm_create is a collective communication implying
    synchronization ie., to establish a single context
●   Multiple communicators may be created simultaneously using
    MPI_Comm_split
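●   Putting these together, a sketch that builds a communicator containing only the
    even ranks of MPI_COMM_WORLD (MAX_P is an illustrative upper bound) :

MPI_Group world_group, even_group;
MPI_Comm  even_comm;
int       p, n_even = 0, even_ranks[MAX_P];

MPI_Comm_size(MPI_COMM_WORLD, &p);
MPI_Comm_group(MPI_COMM_WORLD, &world_group);            //local, no communication
for (int r = 0; r < p; r += 2)
    even_ranks[n_even++] = r;
MPI_Group_incl(world_group, n_even, even_ranks, &even_group);
MPI_Comm_create(MPI_COMM_WORLD, even_group, &even_comm); //collective, all must call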


                                                                 wjb19@psu.edu
Topologies I
●   MPI allows one to associate different addressing schemes to
    processes within a group
●   This is a virtual versus real or physical topology, and is either a graph
    structure or a (Cartesian) grid; properties:
     ●  Dimensions, w/
         – Size of each
         – Period of each
     ●  Option to have processes reordered optimally within grid
●   Method to establish Cartesian grid cart_comm :
int MPI_Cart_create(MPI_Comm old_comm, int 
number_of_dims, int dim_sizes[], int wrap_around[], 
int reorder, MPI_Comm* cart_comm)
●   old_comm is typically just MPI_COMM_WORLD created at init
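●   For instance, a 2 x 3 periodic grid with reordering allowed (dimensions here are
    arbitrary) :

MPI_Comm cart_comm;
int dim_sizes[2]   = {2, 3};   //2 rows x 3 columns of processes
int wrap_around[2] = {1, 1};   //periodic in both dimensions
int reorder        = 1;        //allow MPI to reorder ranks to suit the hardware

MPI_Cart_create(MPI_COMM_WORLD, 2, dim_sizes, wrap_around, reorder, &cart_comm);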



                                                                 wjb19@psu.edu
Topologies II
●  cart_comm will contain the processes from old_comm with
   associated coordinates, available from MPI_Cart_coords:
int coordinates[2];
int my_grid_rank;
MPI_Comm_rank(cart_comm, &my_grid_rank);
MPI_Cart_coords(cart_comm, 
my_grid_rank, 2, coordinates);

●   Call to MPI_Comm_rank is necessary because of process rank
    reordering (optimization)
●   Processes in cart_comm are stored in row major order
●   Can also partition in to sub-grid(s) using MPI_Cart_sub eg., for row:

int free_coords[2];
MPI_Comm row_comm;      //new sub-grid
free_coords[0]=0;       //bool; first coordinate fixed
free_coords[1]=1;       //bool; second coordinate free
MPI_Cart_sub(cart_comm,free_coords,&row_comm);
                                                               wjb19@psu.edu
Writing Parallel Code
●   Assuming we've profiled our code and decided to parallelize,
    equipped with MPI routines, we must decide whether to take a :
     ● Domain parallel (divide tasks, similar data) or
    ●   Data parallel (divide data, similar tasks) approach
●   Data parallel in general scales much better, implies lower
    communication overhead
●   Regardless, easiest to begin by selecting or designing data
    structures, and subsequently their distribution using a constructed
    topology or scatter/gather routines, for example
●   Program in modules, beginning with easiest/essential functions (eg.,
    I/O), relegating 'hard' functionality to stubs initially
●   Time code sections, look at targets for optimization & redesign
●   Only concern yourself with the highest levels of abstraction germane
    to your problem, use parallel constructs wherever possible
                                                                 wjb19@psu.edu
A Note on the OSI Model
●We've been playing fast and loose with a variety of communication entities;
sockets, networks, protocols like UDP, TCP etc
●The Open Systems Interconnection model separates these entities into 7 layers
of abstraction, each layer providing services to the layer immediately above
●Data becomes increasingly fine grained going down from layer 7 to 1



●As application developers and/or scientists, we need only be concerned with
layers 4 and above
     Layer           Granularity   Function                          Example
     7.Application   data          process accessing network         MPI
     6.Presentation  data          encrypt/decrypt, data conversion   MPI
     5.Session       data          management                        MPI
     4.Transport     segments      reliability & flow control        IB verbs
     3.Network       packets       path                              Infiniband
     2.Data Link     frames        addressing                        Infiniband
     1.Physical      bits          signals/electrical                Infiniband
                                                                           wjb19@psu.edu
HPC Essentials
HPC Essentials
HPC Essentials
HPC Essentials
HPC Essentials

Contenu connexe

Tendances

BeagleBone Black: Platform Bring-Up with Upstream Components
BeagleBone Black: Platform Bring-Up with Upstream ComponentsBeagleBone Black: Platform Bring-Up with Upstream Components
BeagleBone Black: Platform Bring-Up with Upstream ComponentsGlobalLogic Ukraine
 
Local file systems update
Local file systems updateLocal file systems update
Local file systems updateLukáš Czerner
 
Linux basics and commands - from lynxbee.com
Linux basics and commands - from lynxbee.comLinux basics and commands - from lynxbee.com
Linux basics and commands - from lynxbee.comGreen Ecosystem
 
The basic concept of Linux FIleSystem
The basic concept of Linux FIleSystemThe basic concept of Linux FIleSystem
The basic concept of Linux FIleSystemHungWei Chiu
 
Linux directory structure by jitu mistry
Linux directory structure by jitu mistryLinux directory structure by jitu mistry
Linux directory structure by jitu mistryJITU MISTRY
 
How to install gentoo distributed
How to install gentoo distributedHow to install gentoo distributed
How to install gentoo distributedSongWang54
 
Files and directories in Linux 6
Files and directories  in Linux 6Files and directories  in Linux 6
Files and directories in Linux 6Meenakshi Paul
 
Lamp ppt
Lamp pptLamp ppt
Lamp pptReka
 
From Drives to URLs
From Drives to URLsFrom Drives to URLs
From Drives to URLsadil raja
 
Open Enea Linux workshop at the Embedded Conference Scandinavia 2014
Open Enea Linux workshop at the Embedded Conference Scandinavia 2014Open Enea Linux workshop at the Embedded Conference Scandinavia 2014
Open Enea Linux workshop at the Embedded Conference Scandinavia 2014EneaSoftware
 
คำสั่งยูนิกส์ Command line
คำสั่งยูนิกส์ Command lineคำสั่งยูนิกส์ Command line
คำสั่งยูนิกส์ Command lineSopit Pairo
 
Linux Shell Scripting Presantion
Linux Shell Scripting PresantionLinux Shell Scripting Presantion
Linux Shell Scripting PresantionSameerNimkar
 
Linux fundamental - Chap 11 boot
Linux fundamental - Chap 11 bootLinux fundamental - Chap 11 boot
Linux fundamental - Chap 11 bootKenny (netman)
 

Tendances (20)

BeagleBone Black: Platform Bring-Up with Upstream Components
BeagleBone Black: Platform Bring-Up with Upstream ComponentsBeagleBone Black: Platform Bring-Up with Upstream Components
BeagleBone Black: Platform Bring-Up with Upstream Components
 
Local file systems update
Local file systems updateLocal file systems update
Local file systems update
 
Linux basics and commands - from lynxbee.com
Linux basics and commands - from lynxbee.comLinux basics and commands - from lynxbee.com
Linux basics and commands - from lynxbee.com
 
Linux filesystemhierarchy
Linux filesystemhierarchyLinux filesystemhierarchy
Linux filesystemhierarchy
 
FUSE Filesystems
FUSE FilesystemsFUSE Filesystems
FUSE Filesystems
 
Linux
LinuxLinux
Linux
 
The basic concept of Linux FIleSystem
The basic concept of Linux FIleSystemThe basic concept of Linux FIleSystem
The basic concept of Linux FIleSystem
 
Linux directory structure by jitu mistry
Linux directory structure by jitu mistryLinux directory structure by jitu mistry
Linux directory structure by jitu mistry
 
How to install gentoo distributed
How to install gentoo distributedHow to install gentoo distributed
How to install gentoo distributed
 
UNIX/Linux training
UNIX/Linux trainingUNIX/Linux training
UNIX/Linux training
 
Files and directories in Linux 6
Files and directories  in Linux 6Files and directories  in Linux 6
Files and directories in Linux 6
 
Linux file system
Linux file systemLinux file system
Linux file system
 
Cli1 Bibalex
Cli1 BibalexCli1 Bibalex
Cli1 Bibalex
 
101 1.2 boot the system
101 1.2 boot the system101 1.2 boot the system
101 1.2 boot the system
 
Lamp ppt
Lamp pptLamp ppt
Lamp ppt
 
From Drives to URLs
From Drives to URLsFrom Drives to URLs
From Drives to URLs
 
Open Enea Linux workshop at the Embedded Conference Scandinavia 2014
Open Enea Linux workshop at the Embedded Conference Scandinavia 2014Open Enea Linux workshop at the Embedded Conference Scandinavia 2014
Open Enea Linux workshop at the Embedded Conference Scandinavia 2014
 
คำสั่งยูนิกส์ Command line
คำสั่งยูนิกส์ Command lineคำสั่งยูนิกส์ Command line
คำสั่งยูนิกส์ Command line
 
Linux Shell Scripting Presantion
Linux Shell Scripting PresantionLinux Shell Scripting Presantion
Linux Shell Scripting Presantion
 
Linux fundamental - Chap 11 boot
Linux fundamental - Chap 11 bootLinux fundamental - Chap 11 boot
Linux fundamental - Chap 11 boot
 

Similaire à HPC Essentials

Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyOlivier Bourgeois
 
Terminals and Shells
Terminals and ShellsTerminals and Shells
Terminals and ShellsHoffman Lab
 
Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)Itzik Kotler
 
2018-Summer-Tutorial-Intro-to-Linux.pdf
2018-Summer-Tutorial-Intro-to-Linux.pdf2018-Summer-Tutorial-Intro-to-Linux.pdf
2018-Summer-Tutorial-Intro-to-Linux.pdfsanjeevkuraganti
 
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...Torsten Seemann
 
[Defcon] Hardware backdooring is practical
[Defcon] Hardware backdooring is practical[Defcon] Hardware backdooring is practical
[Defcon] Hardware backdooring is practicalMoabi.com
 
Lec 49 - stream-files
Lec 49 - stream-filesLec 49 - stream-files
Lec 49 - stream-filesPrincess Sam
 
Talk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopTalk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopQuey-Liang Kao
 
LCU14 302- How to port OP-TEE to another platform
LCU14 302- How to port OP-TEE to another platformLCU14 302- How to port OP-TEE to another platform
LCU14 302- How to port OP-TEE to another platformLinaro
 
Unix fundamentals
Unix fundamentalsUnix fundamentals
Unix fundamentalsBimal Jain
 
Linux Capabilities - eng - v2.1.5, compact
Linux Capabilities - eng - v2.1.5, compactLinux Capabilities - eng - v2.1.5, compact
Linux Capabilities - eng - v2.1.5, compactAlessandro Selli
 
PV-Drivers for SeaBIOS using Upstream Qemu
PV-Drivers for SeaBIOS using Upstream QemuPV-Drivers for SeaBIOS using Upstream Qemu
PV-Drivers for SeaBIOS using Upstream QemuThe Linux Foundation
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
 

Similaire à HPC Essentials (20)

Tips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development EfficiencyTips and Tricks for Increased Development Efficiency
Tips and Tricks for Increased Development Efficiency
 
Terminals and Shells
Terminals and ShellsTerminals and Shells
Terminals and Shells
 
Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)Hack Like It's 2013 (The Workshop)
Hack Like It's 2013 (The Workshop)
 
Hardware hacking
Hardware hackingHardware hacking
Hardware hacking
 
2018-Summer-Tutorial-Intro-to-Linux.pdf
2018-Summer-Tutorial-Intro-to-Linux.pdf2018-Summer-Tutorial-Intro-to-Linux.pdf
2018-Summer-Tutorial-Intro-to-Linux.pdf
 
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...Parallel computing in bioinformatics   t.seemann - balti bioinformatics - wed...
Parallel computing in bioinformatics t.seemann - balti bioinformatics - wed...
 
Linux Workshop , Day 3
Linux Workshop , Day 3Linux Workshop , Day 3
Linux Workshop , Day 3
 
[Defcon] Hardware backdooring is practical
[Defcon] Hardware backdooring is practical[Defcon] Hardware backdooring is practical
[Defcon] Hardware backdooring is practical
 
Lec 49 - stream-files
Lec 49 - stream-filesLec 49 - stream-files
Lec 49 - stream-files
 
U-Boot - An universal bootloader
U-Boot - An universal bootloader U-Boot - An universal bootloader
U-Boot - An universal bootloader
 
Talk 160920 @ Cat System Workshop
Talk 160920 @ Cat System WorkshopTalk 160920 @ Cat System Workshop
Talk 160920 @ Cat System Workshop
 
LCU14 302- How to port OP-TEE to another platform
LCU14 302- How to port OP-TEE to another platformLCU14 302- How to port OP-TEE to another platform
LCU14 302- How to port OP-TEE to another platform
 
Unix fundamentals
Unix fundamentalsUnix fundamentals
Unix fundamentals
 
Linux Capabilities - eng - v2.1.5, compact
Linux Capabilities - eng - v2.1.5, compactLinux Capabilities - eng - v2.1.5, compact
Linux Capabilities - eng - v2.1.5, compact
 
Masters porting linux
Masters porting linuxMasters porting linux
Masters porting linux
 
PV-Drivers for SeaBIOS using Upstream Qemu
PV-Drivers for SeaBIOS using Upstream QemuPV-Drivers for SeaBIOS using Upstream Qemu
PV-Drivers for SeaBIOS using Upstream Qemu
 
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedData Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
 
Hpc4
Hpc4Hpc4
Hpc4
 
Io sys
Io sysIo sys
Io sys
 
A
AA
A
 

HPC Essentials

  • 1. HPC Essentials Part I : UNIX/C Overview Bill Brouwer Research Computing and Cyberinfrastructure (RCC), PSU wjb19@psu.edu
  • 2. Outline ●Introduction ● Hardware ● Definitions ● UNIX ● Kernel & shell ●Files ● Permissions ● Utilities ● Bash Scripting ●C programming wjb19@psu.edu
  • 3. HPC Introduction HPC systems composed of : ● ● Software ● Hardware ● Devices (eg., disks) ● Compute elements (eg., CPU) ● Shared and/or distributed memory ● Communication (eg., Infiniband network) ●A HPC system ...isn't... unless hardware is configured correctly and software leverages all resources made available to it, in an optimal manner ●An operating system controls the execution of software on the hardware; HPC clusters almost exclusively use UNIX/Linux ●In the computational sciences, we pass data and/or abstractions through a pipelined workflow; UNIX is the natural analogue to this solving/discovery process wjb19@psu.edu
  • 4. UNIX ●UNIX is a multi-user/tasking OS created by Dennis Ritchie and Ken Thompson at AT&T Bell Labs 1969-1970, written primarily in C language (also developed by Ritchie) UNIX is composed of : ● ● Kernel ● OS itself which handles scheduling, memory management, I/O etc ● Shell (eg., Bash) ● Interacts with kernel, command line interpreter ● Utilities ● Programs run by the shell, tools for file manipulation, interaction with the system ● Files ● Everything but process(es), composed of data... wjb19@psu.edu
  • 5. Data-Related Definitions ●Binary ● Most fundamental data representation in computing, base 2 number system (others; hex → base 16, oct → base 8) ●Byte ● 8 bits = 8b = 1Byte = 1B; 1kB = 1024 B; 1MB = 1024 kB etc ●ASCII ● American Standard Code for Information Interchange; character encoding scheme, 7bits (traditional) or 8bits (UTF-8) per character, a Unicode encoding ●Stream ● A flow of bytes; source → stdout (& stderr), sink → stdin ●Bus ● Communication channel over which data flows, connects elements within a machine ●Process ● Fundamental unit of computational work performed by a processor; CPU executes application or OS instructions ●Node ● Single computer, composed of many elements, various architectures for CPU, eg., x86, RISC wjb19@psu.edu
  • 6. Typical Compute Node (Intel i7) RAM CPU memory bus QuickPath Interconnect GPU IOH volatile storage PCI-express Direct Media Interface ethernet PCI-e cards ICH NETWORK SATA/USB BIOS non-volatile storage wjb19@psu.edu
  • 7. More Definitions ●Cluster ● Many nodes connected together via network ●Network ● Communication channel, inter-node; connects machines ●Shared Memory ● Memory region shared within node ●Distributed Memory ● Memory region across two or more nodes ●Direct Memory Access (DMA) ● Access memory independently of programmed I/O ie., independent of the CPU ●Bandwidth ● Rate of data transfer across serial or parallel communication channel, expressed as bits (b) or Bytes (B) per second (s) ● Beware quotations of bandwidth; many factors eg., simplex/duplex, peak/sustained, no. of lanes etc ● Latency or the time to create a communication channel is often more important wjb19@psu.edu
  • 8. Bandwidths ●Devices ● USB : 60MB/s (version 2.0) ● Hard Disk : 100MBs-500MB/s ● PCIe : 32GB/s (x8, version 2.0) ●Networks ● 10/100Base T : 10/100 Mbit/s ● 1000BaseT (1GigE) : 1000 Mbit/s ● 10 GigE : 10 Gbit/s ● Infiniband QDR 4X: 40 Gbit/s ●Memory ● CPU : ~ 35 GB/s (Nehalem, 3x 1.3GHz DIMM/socket)* ● GPU : ~ 180 GB/s (GeForce GTX 480) ●AVOID devices, keep data resident in memory, minimize communication btwn processes ●MANY subtleties to CPU memory management eg., with 8x CPU cores, total bandwidth may be > 300 GB/s or as little as 10 GB/s, will discuss further *http://www.delltechcenter.com/page/04-08-2009+-+Nehalem+and+Memory+Configurations?t=anon#fbid=XZRzflqVZ6J wjb19@psu.edu
  • 9. Outline ●Introduction ● HPC hardware ● Definitions ● UNIX ● Kernel & shell ●Files ● Permissions ● Utilities ● Bash Scripting ●C programming wjb19@psu.edu
  • 10. UNIX Permissions & Files ●At the highest level, UNIX objects are either files or processes, and both are protected by permissions (processes next time) ●Every file object has two ID's, the user and group, both are assigned on creation; only the root user has unrestricted access to everything ●Files also have bits which specify read (r), write (w) and execute (x) permissions for the user, group and others eg., output of ls command: ­rw­r­­r­­ 1 root root 0 Jun 11 1976 /usr/local/foo.txt user/group/others User ID Group ID filename ●We can manipulate files using myriad utilities, these utilities are commands interpreted by the shell and executed by the kernel ●To learn more, check man pages ie., from the command line 'man  <command>' wjb19@psu.edu
  • 11. File Manipulation I Working from the command line in a Bash shell: ● List directory foo_dir contents, human readable : ● [wjb19@lionga scratch] $ ls ­lah foo_dir Change ownership of foo.xyz to wjb19; group and user: ● [wjb19@lionga scratch] $ chown wjb19:wjb19 foo.xyz ●Add execute permission to foo.xyz: [wjb19@lionga scratch] $ chmod +x foo.xyz ●Determine filetype for foo.xyz: [wjb19@lionga scratch] $ file foo.xyz ●Peruse text file foo.xyz: [wjb19@lionga scratch] $ more foo.xyz wjb19@psu.edu
  • 12. File Manipulation II ●Copy foo.txt from lionga to file /home/bill/foo.txt on dirac : [wjb19@lionga scratch] $ scp foo.txt   wjb19@dirac.rcc.psu.edu:/home/bill/foo.txt Create gzip compressed file archive of directory foo and contents : ● [wjb19@lionga scratch] $ tar ­cfz foo_archive.tgz foo/* Create bzip2 compressed file archive of directory foo and contents : ● [wjb19@lionga scratch] $ tar ­cfj foo_archive.tbz foo/* Unpack compressed file archive : ● [wjb19@lionga scratch] $ tar ­xvf foo_archive.tgz Edit a text file using VIM: ● [wjb19@lionga scratch] $ vim foo.txt ●VIM is a venerable and powerful command line editor with a rich set of commands wjb19@psu.edu
  • 13. Text File Edit w/ VIM ●Two main modes of operation; editing or command. From command, switch to edit by issuing 'a' (insert after cursor) or 'i' (before), switch back to command via <ESC> Save w/o quitting :w<ENTER> Save and quit (ie., <shift> AND 'z' AND 'z') :wq<ENTER> Quit w/o saving :q!<ENTER> Delete x lines eg,. x=10 (also stored in clipboard) d10d Yank (copy) x lines eg., x=10 y10y Split screen/buffer :split<ENTER> Switch window/buffer <CNTRL>­w­w Go to line x eg., x=10 :10<ENTER> Find matching construct (eg., from { to }) % ● Paste: 'p' undo: 'u' redo: '<CNTRL>­r' ● Move up/down one screen line : '­' and '+' ● Search for expression exp, forward ('n' or 'N' navigate up/down highlighted matches) '/exp<ENTER>' or backward '?exp<ENTER>'  wjb19@psu.edu
  • 14. Text File Compare w/ VIMDIFF ●Same commands as VIM, but highlights differences in files, allows transfer of text btwn buffers/files; launch with 'vimdiff foo.txt foo2.txt' ●Push text from right to left (when right window active and cursor in relevant region) using command 'dp' ●Pull text from right to left (when left window active and cursor in relevant region) using command 'do' wjb19@psu.edu
  • 15. Bash Scripting ●File and other utilities can be assembled into scripts, interpreted by the shell eg., Bash ●The scripts can be collections of commands/utilities & fundamental programming constructs Code Comment #this is a comment Pipe stdout of procA to stdin of procB procA | procB Redirect stdout of procA to file foo.txt* procA > foo.txt Command separator procA; procB If block if [condition] then procA fi Display on stdout echo “hello” Variable assignment & literal value a = “foo”; echo $a Concatenate strings b=a.“foo2”; Text Processing utilities sed,gawk Search utilities find,grep *Streams have file descriptors (numbers) associated with them; eg., to redirect stderr from procA to foo.txt → procA 2> foo.txt wjb19@psu.edu
  • 16. Text Processing ●Text documents are composed of records (roughly speaking, lines separated by carriage returns) and fields (separated by spaces) ●Text processing using sed & gawk involves coupling patterns with actions eg., print field 1 in document foo.txt when encountering word image: [wjb19@lionga scratch] $ gawk '/image/ {print $1;}' “foo.txt”  pattern action input ●Parse, without case sensitivity, change from default space field separator (FS) to equals sign, print field 2: [wjb19@lionga scratch] $ gawk 'BEGIN{IGNORECASE=1; FS=”=”}   /image/ {print $2;}' “foo.txt” ● Putting it all together → create a Bash script w/ VIM or other (eg,. Pico)... wjb19@psu.edu
  • 17. Bash Example I #!/bin/bash Run using bash #set source and destination paths DIR_PATH=~/scratch/espresso­PRACE/PW BAK_PATH=~/scratch/PW_BAK declare ­a file_list Declare an array #filenames to array file_list=$(ls ­l ${BAK_PATH} | gawk '/f90/ {print $9}') Command output cnt=0; #parse files & pretty up for x in $file_list do     let "cnt+=1"     sed 's/,&/, &/g' $BAK_PATH/$x |      sed 's/)/) /g' |      sed 's/call/ call /g' |  Search & replace     sed 's/CALL/ call /g' > $DIR_PATH/$x echo cleaned file no. $cnt $x done exit wjb19@psu.edu
  • 18. Bash Example II #!/bin/bash if [ $# ­lt 6 ] Total arguments then echo usage: fitCPCPMG.sh '[/path/and/filename.csv]  [desired number of gaussians in mixture (2­10)]   [no. random samples (1000­10000)] [mcmc steps (1000­30000)]  [percent noise level (0­10)] [percent step size (0.01­20)] [/path/to/restart/filename.csv; optional]'     exit fi ext=${1##*.} File extension if [ "$ext" != "csv" ] then         echo ERROR: file must be *.csv         exit fi base=$(basename $1 .csv) File basename if [[ $2 ­lt 2 ]] || [[ $2 ­gt 10 ]] then  echo "ERROR: must specify 2<=x<=10 gaussians in mixture" exit fi wjb19@psu.edu
  • 19. Outline ●Introduction ● HPC hardware ● Definitions ● UNIX ● Kernel & shell ●Files ● Permissions ● Utilities ● Bash Scripting ●C programming wjb19@psu.edu
  • 20. The C Language ●Utilities, user applications and indeed the UNIX OS itself are executed by the CPU, when expressed as machine code eg., store/load from memory, addition etc ●Fundamental operations like memory allocation, I/O etc are laborious to express at this level, most frequently we begin from a high-level language like C ●The process of creating an executable consists of at least 3 fundamental steps; creation of source code text file containing all desired objects and operations, compilation and linking eg,. using the GNU tool gcc to create executable foo.x from source file foo.c: [wjb19@tesla2 scratch]$ gcc ­std=c99 foo.c ­o foo.x *C99 standard Executable compile link Source *c Object *o file code Library objects wjb19@psu.edu
  • 21. C Code Elements I ●Composed of primitive datatypes (eg., int, float, long), which have different sizes in memory, multiples of 1 byte ●May be composed of statically allocated memory (compile time), dynamically allocated memory (runtime), or both ●Pointers (eg., float *) are primitives with 4 or 8 byte lengths (32bit or 64bit machines) which contain an address to a contiguous region of dynamically allocated memory ●More complicated objects can be constructed from primitives and arrays eg., a struct wjb19@psu.edu
  • 22. C Code Elements II ●Common operations are gathered into functions, the most common being main(), which must be present in executable ●Functions have a distinct name, take arguments, and return output; this information comprises the prototype, expressed separately to the implementation details, former often in header file ●Important system functions include read,write,printf (I/O) and malloc,free (Memory) ●The operating system executes compiled code; a running program is a process (more next time) wjb19@psu.edu
  • 23. C Code Example #include <stdio.h> #include <stdlib.h> Tells preprocessor to #include "allDefines.h" include these headers; //Kirchoff Migration function in psktmCPU.c system functions etc void ktmMigrationCPU(struct imageGrid* imageX,         struct imageGrid* imageY,         struct imageGrid* imageZ,         struct jobParams* config,         float* midX, Function prototype;         float* midY, must give arguments,         float* offX, their types and return         float* offY, type; implementation         float* traces, elsewhere         float* slowness,         float* image); int main() { int IMAGE_SIZE = 10; float* image = (float*) malloc (IMAGE_SIZE*sizeof(float)); printf(“size of image = %in”,IMAGE_SIZE); for (int i=0; i<IMAGE_SIZE; i++) printf(“image point %i = %fn”,i,image[i]); free(image); return 0; } wjb19@psu.edu
  • 24. UNIX C Good Practice I ●Use three streams, with file descriptors 0,1,2 respectively, allows assembly of operations into pipeline and these data streams are 'cheap' to use ●Only hand simple command line options to main() using argc,argv[]; in general we wish to handle short and long options (eg., see GNU coding standards) and the use of getopt_long() is preferable. ●Utilize the environment variables of the host shell, particularly in setting runtime conditions in executed code via getenv() eg., in Bash set in .bashrc config file or via command line: [wjb19@lionga scratch] $ export MY_STRING=hello ●If your project/program requires a) sophisticated objects b) many developers c) would benefit from object oriented design principles, you should consider writing in C++ (although being a higher-level language it is harder to optimize) wjb19@psu.edu
  • 25. UNIX C Good Practice II ●In high performance applications, avoid system calls eg., read/write where control is given over to the kernel and processes can be blocked until the resource is ready eg., disk ● IF system calls must be used, handle errors and report to stderr ● IF temporary files must be written, use mkstemp which sets permissions , followed by unlink; the file descriptor is closed by the kernel when the program exists and the file removed ●Use assert to test validity of function arguments, statements etc; will introduce performance hit, but asserts can be removed at compile time with NDEBUG macro (C standard) ●Debug with gdb, profile with gprof, valgrind; target most expensive functions for optimization Put common functions in/use libraries wherever possible.... ● wjb19@psu.edu
  • 26. Key HPC Libraries BLAS/LAPACK/ScaLAPACK ● ● Original basic and extended linear algebra routines ● http://www.netlib.org/ Intel Math Kernel Library (MKL) ● ● implementation of above routines, w/ solvers, fft etc ● http://software.intel.com/en-us/articles/intel-mkl/ AMD Core Math Library (ACML) ● ● Ditto ● http://developer.amd.com/libraries/acml/pages/default.aspx OpenMPI ● ● Open source MPI implementation ● http://www.open-mpi.org/ PETSc ● ● Data structures and routines for parallel scientific applications based on PDE's ● http://www.mcs.anl.gov/petsc/petsc-as/ wjb19@psu.edu
  • 27. UNIX C Compilation I ●In general the creation and use of shared libraries (*so) is preferable to static (*a), for space reasons and ease of software updates Program in modules and link separate objects ● ●Use ­fPIC flag in shared library compilation; PIC==position independent, code in shared object does not depend on address/location at which it is loaded. Use the make utility to manage builds (more next time) ● ●Don't forget to update your PATH and LD_LIBRARY_PATH env vars w/ your binary executable path & any libraries you need/created for the application, respectively wjb19@psu.edu
• 28. UNIX C Compilation II ●Remember in compilation steps to -I/set/header/paths and keep interface (in headers) separate from implementation as much as possible ●Remember in linking steps for shared libs to: ● -L/set/path/to/library AND ● set flag -lmyLib, where ● /set/path/to/library/libmyLib.so must exist otherwise you will have undefined references and/or 'can't find -lmyLib' etc ●Compile with -Wall or similar and fix all warnings ●Read the manual :) wjb19@psu.edu
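A minimal sketch of the build sequence described on the last two slides, assuming a hypothetical library source myLib.c, header directory ./include and application main.c; the paths and names are placeholders, not from the course material.

  # compile position-independent objects and create the shared library
  gcc -std=c99 -fPIC -c myLib.c -o myLib.o
  gcc -shared -o libmyLib.so myLib.o

  # compile the application against the headers, then link against libmyLib.so
  gcc -std=c99 -I./include -c main.c -o main.o
  gcc main.o -L. -lmyLib -o app.x

  # make sure the loader can find the library at runtime
  export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
  ./app.x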
  • 29. Conclusions ●High Performance Computing Systems are an assembly of hardware and software working together, usually based on the UNIX OS; multiple compute nodes are connected together The UNIX kernel is surrounded by a shell eg., Bash; commands and constructs ● may be assembled into scripts ●UNIX, associated utilities and user applications are traditionally written in high- level languages like C ●HPC user applications may take advantage of shared or distributed memory compute models, or both ●Regardless, good code minimizes I/O, keeps data resident in memory for as long as possible and minimizes communication between processes ●User applications should take advantage of existing high performance libraries, and tools like gdb, gprof and valgrind wjb19@psu.edu
  • 30. References ●Dennis Ritchie, RIP ● http://en.wikipedia.org/wiki/Dennis_Ritchie ●Advanced bash scripting guide ● http://tldp.org/LDP/abs/html/ ●Text processing w/ GAWK ● http://www.gnu.org/s/gawk/manual/gawk.html ●Advanced Linux programming ● http://www.advancedlinuxprogramming.com/alp-folder/ ●Excellent optimization tips ● http://www.lri.fr/~bastoul/local_copies/lee.html ●GNU compiler collection documents ● http://gcc.gnu.org/onlinedocs/ ●Original RISC design paper ● http://www.eecs.berkeley.edu/Pubs/TechRpts/1982/CSD-82-106.pdf ●C++ FAQ ● http://www.parashift.com/c++-faq-lite/ ●VIM Wiki ● http://vim.wikia.com/wiki/Vim_Tips_Wiki wjb19@psu.edu
• 31. Exercises ●Take supplied code and compile using gcc, creating executable foo.x; attempt to run as './foo.x' ●Code has a segmentation fault, an error in memory allocation which is handled via the malloc function ●Recompile with debug flag -g, run through gdb and correct the source of the segmentation fault ●Load the valgrind module ie., 'module load valgrind' and then run as 'valgrind ./foo.x'; this powerful profiling tool will help identify memory leaks, or memory on the heap* which has not been freed ●Write a Bash script that stores your home directory file contents in an array and : ● Uses sed to swap vowels (eg., 'a' and 'e') in names ● Parses the array of names and returns only a single match, if it exists, else echo NO-MATCH *heap == region of dynamically allocated memory wjb19@psu.edu
  • 32. GDB quick start Launch : ● [wjb19@tesla1 scratch]$ gdb ./foo.x Run w/ command line argument '100' : ● (gdb) run 100   Set breakpoint at line 10 in source file : ● (gdb) b foo.c:10 Breakpoint 1 at 0x400594: file foo.c, line 10. (gdb) run Starting program: /gpfs/scratch/wjb19/foo.x  Breakpoint 1, main () at foo.c:22 22 int IMAGE_SIZE = 10; Step to next instruction (issuing 'continue' will resume execution) : ● (gdb) step 23 float * image = (float*) malloc (IMAGE_SIZE*sizeof(float)); Print second value in array 'image' : ● (gdb) p image[2] $4 = 0 Display full backtrace : ● (gdb) bt full #0  main () at foo.c:27         i = 0         IMAGE_SIZE = 10         image = 0x601010 wjb19@psu.edu
  • 33. HPC Essentials Part II : Elements of Parallelism Bill Brouwer Research Computing and Cyberinfrastructure (RCC), PSU wjb19@psu.edu
  • 34. Outline ●Introduction ● Motivation ● HPC operations ● Multiprocessors ● Processes ● Memory Digression ● Virtual Memory ● Cache ●Threads ● POSIX ● OpenMP ● Affinity wjb19@psu.edu
• 35. Motivation ●The problems in science we seek to solve are becoming increasingly large, as we go down in scale (eg., quantum chemistry) or up (eg., astrophysics) ●As a natural consequence, we seek both performance and scaling in our scientific applications ●Therefore we want to increase floating point throughput and memory bandwidth, and thus seek parallelization as we run out of resources using a single processor ●We are limited by Amdahl's law, an expression of the maximum speedup of parallel code over serial: speedup = 1/((1-P) + P/N), where P is the portion of application code we parallelize, and N is the number of processors ie., as N increases, the portion of remaining serial code becomes increasingly expensive, relatively speaking wjb19@psu.edu
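As a quick check of the formula, the toy calculation below (an illustrative sketch, not part of the original slides) evaluates Amdahl's law for a few processor counts at an assumed P = 0.9; already at N = 16 the speedup is only about 6.4x.

  #include <stdio.h>

  //maximum speedup predicted by Amdahl's law for parallel fraction p on n processors
  static double amdahl(double p, int n) {
          return 1.0 / ((1.0 - p) + p / (double) n);
  }

  int main() {
          double p = 0.9;   //assumed parallel fraction
          int procs[] = {1, 2, 4, 8, 16, 64, 256};
          for (int i = 0; i < 7; i++)
                  printf("N = %3i  speedup = %5.2f\n", procs[i], amdahl(p, procs[i]));
          return 0;
  }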
• 36. Motivation ●Unless the portion of code we can parallelize approaches 100%, we see rapidly diminishing returns with increasing numbers of processors [plot: improvement factor versus number of processors from 0 to 256, for P = 10%, 30%, 60% and 90%] ●Nonetheless, for many applications we have a good chance of parallelizing the vast majority of the code... wjb19@psu.edu
• 37. Example : Kirchhoff Time Migration ●KTM is a technique used widely in oil+gas exploration, providing images into the earth's interior, used to identify resources ●Seismic trace data acquired over 2D geometry is integrated to give an image of the earth's interior, using ~ Green's method ●Input is generally 10^4 – 10^6 traces, 10^3 – 10^4 data points each, ie., lots of data to process; the output image is also very large ●This is an integral technique (ie., summation, easy to parallelize), just one of many popular algorithms performed in HPC [equation figure: each image point is a weighted sum of trace data evaluated at traveltime t; x == image space, t == traveltime] wjb19@psu.edu
  • 38. Common Operations in HPC ● Integration ● Load/store, add & multiply ● eg., transforms ● Derivatives (Finite differences) ● Load/store, subtract & divide ● eg., PDE ● Linear Algebra ● Load/store, subtract/add/multiply/divide ● chemistry & physics, solvers ● sparse (classical physics) & dense (quantum) ●Regardless of the operations performed, after compilation into machine code, when executed by the CPU, instructions are clocked through a pipeline into registers for execution ●Instruction execution generally takes place in four steps, and multiple instruction groups are concurrent within the pipeline; execution rate is a direct function of the clock rate wjb19@psu.edu
• 39. Execution Pipeline ●This is the most fine-grained form of parallelism; its efficiency is a strong function of branch prediction hardware, or the prediction of which instruction in a program is the next to execute* ●At a similar level, present in more recent devices are so-called streaming SIMD extension (SSE) registers and associated compute hardware [diagram: pipeline over clock cycles 0-7, with instructions pending, executing and completed as they pass through the four stages 1.Fetch 2.Decode 3.Execute 4.Write-back] *assisted by compiler hints wjb19@psu.edu
• 40. SSE ●Streaming SIMD (Single Instruction, Multiple Data) computation exploits special registers and instructions to increase computation many-fold in certain cases, since several data elements are operated on simultaneously ●Each of the 8 SSE registers (labeled xmm0 through xmm7) is 128 bits long, storing 4 x 32-bit floating-point numbers; the SSE2 and SSE3 specifications have expanded the allowed datatypes to include doubles, ints etc [diagram: one 128-bit register holding float3 float2 float1 float0, bit 127 down to bit 0] ●Operations may be 'scalar' or 'packed' (ie., vector), expressed using instructions in an __asm block within C code eg., addps xmm0,xmm1 (operation, destination operand, source operand) ●One can either code the intrinsics explicitly, or rely on the compiler eg., icc with optimization (-O3) ●The next level up of parallelization is the multiprocessor... wjb19@psu.edu
• 41. Multiprocessor Overview ●Multiprocessors or multiple core CPUs are becoming ubiquitous; better scaling (cf Moore's law) but limited by contention for shared resources, especially memory ●Most commonly we deal with Symmetric Multiprocessors (SMP), with unique cache and registers, as well as shared memory region(s); more on cache in a moment ●Memory is not necessarily next to processors → Non-uniform Memory Access (NUMA); try to ensure memory access is as local to the CPU core(s) as possible ●The proc directory on UNIX machines is a special directory written and updated by the kernel, containing information on CPU (/proc/cpuinfo) and memory (/proc/meminfo) [diagram: CPU0 and CPU1, each with its own registers and cache, attached to main memory] ●The fundamental unit of work on the cores is a process... wjb19@psu.edu
  • 42. Processes ●Application processes are launched on the CPU by the kernel using the fork() system call; every process has a process ID pid, available on UNIX systems via the getpid() system call ●The kernel manages many processes concurrently; all information required to run a process is contained in the process control block (PCB) data structure, containing (among other things): ● The pid ● The address space ● I/O information eg., open files/streams ● Pointer to next PCB ●Processes may spawn children using the fork() system call; children are initially a copy of the parent, but may take on different attributes via the exec() call wjb19@psu.edu
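A minimal sketch of fork()/getpid()/getppid() as described above (illustrative only, not taken from the course code); the child reports its identifiers and exits, while the parent waits for it.

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int main() {
          pid_t pid = fork();                 //kernel creates a copy of this process
          if (pid < 0) {
                  perror("fork");
                  return EXIT_FAILURE;
          } else if (pid == 0) {
                  //child branch: initially a copy of the parent; could call exec() here
                  printf("child : pid = %i, ppid = %i\n", (int) getpid(), (int) getppid());
                  return 0;
          }
          //parent branch: wait for the child to finish
          waitpid(pid, NULL, 0);
          printf("parent: pid = %i, child was %i\n", (int) getpid(), (int) pid);
          return 0;
  }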
• 43. Processes ●A child process takes the id of the parent (ppid), and additionally has a unique pid eg., output from the ps command, describing itself : [wjb19@tesla1 ~]$ ps -eHo "%P %p %c %t %C"   PPID   PID COMMAND             ELAPSED %CPU 12608  1719     sshd           01:07:54  0.0  1719  1724       sshd         01:07:49  0.0  1724  1725         bash       01:07:48  0.0  1725  1986           ps          00:00  0.0 ●During a context switch, kernel will swap one process control block for another; context switches are detrimental to HPC and have one or more triggers, including: ● I/O requests ● Timer interrupts ●Context switching is a very fine-grained form of scheduling; on compute clusters we also have coarse grained scheduling in the form of job scheduling software (more next time) ●The unique address space from the perspective of the process is referred to as virtual memory wjb19@psu.edu
• 44. Virtual Memory ●A running process is given memory by the kernel, referred to as virtual memory (VM); this address space does not correspond to the physical memory address space ●The Memory Management Unit (MMU) on the CPU translates between the two address spaces, for requests made between process and OS ●Virtual memory for every process has the same structure; the virtual address space is divided into units called pages [diagram: layout from high to low addresses: environment variables and function arguments, stack, unused space, heap, instructions] ●The MMU is assisted in address translation by the Translation Lookaside Buffer (TLB), which stores page details in a cache ●Cache is high speed memory immediately adjacent to the CPU and its registers, connected via bus(es) wjb19@psu.edu
  • 45. Cache : Introduction In HPC, we talk about problems being compute or memory bound ● ● In the former case, we are limited by the rate at which instructions can be executed by the CPU ● In the latter, we are limited by the rate at which data can be processed by the CPU ●Both instructions and data are loaded into cache; cache memory is laid out in lines Cache memory is intermediate in the overall hierarchy, lying between ● CPU registers and main memory ● If the executing process requests an address corresponding to data or instructions in cache, we have a 'hit', else 'miss', and a much slower retrieval of instruction or data from main memory must take place wjb19@psu.edu
  • 46. Cache : Introduction ●Modern architectures have various levels of cache and divisions of responsibilities, we will follow valgrind-cachegrind convention, from the manual: ... It simulates a machine with independent first-level instruction and data caches (I1 and D1), backed by a unified second-level cache (L2). This exactly matches the configuration of many modern machines. However, some modern machines have three levels of cache. For these machines (in the cases where Cachegrind can auto-detect the cache configuration) Cachegrind simulates the first-level and third-level caches. The reason for this choice is that the L3 cache has the most influence on runtime, as it masks accesses to main memory. Furthermore, the L1 caches often have low associativity, so simulating them can detect cases where the code interacts badly with this cache (eg. traversing a matrix column-wise with the row length being a power of 2) wjb19@psu.edu
• 47. Cache Example ●The distribution of data to cache levels is largely set by compiler, hardware and kernel, however the programmer is still responsible for the best data access patterns possible in his/her code ●Use cachegrind to optimize data alignment & cache usage eg.,
#include <stdlib.h>
#include <stdio.h>
int main(){
        int SIZE_X,SIZE_Y;
        SIZE_X=2048;
        SIZE_Y=2048;
        float * data = (float*) malloc(SIZE_X*SIZE_Y*sizeof(float));
        for (int i=0; i<SIZE_X; i++)
                for (int j=0; j<SIZE_Y; j++)
                        data[j+SIZE_Y*i] = 10.0f * 3.14f;
                        //bad data access
                        //data[i+SIZE_Y*j] = 10.0f * 3.14f;
        free(data);
        return 0;
}
wjb19@psu.edu
  • 48. Cache : Bad Access bill@bill­HP­EliteBook­6930p:~$ valgrind ­­tool=cachegrind ./foo.x ==3088== Cachegrind, a cache and branch­prediction profiler ==3088== Copyright (C) 2002­2010, and GNU GPL'd, by Nicholas Nethercote et al. ==3088== Using Valgrind­3.6.1 and LibVEX; rerun with ­h for copyright info ==3088== Command: ./foo.x ==3088==  ==3088==  ==3088== I   refs:      50,503,275 ==3088== I1  misses:           734 ==3088== LLi misses:           733 instructions ==3088== I1  miss rate:       0.00% ==3088== LLi miss rate:       0.00% ==3088==  READ Ops WRITE Ops ==3088== D   refs:      33,617,678  (29,410,213 rd   + 4,207,465 wr) ==3088== D1  misses:     4,197,161  (     2,335 rd   + 4,194,826 wr) ==3088== LLd misses:     4,196,772  (     1,985 rd   + 4,194,787 wr) data ==3088== D1  miss rate:       12.4% (       0.0%     +      99.6%  ) ==3088== LLd miss rate:       12.4% (       0.0%     +      99.6%  ) ==3088==  ==3088== LL refs:        4,197,895  (     3,069 rd   + 4,194,826 wr) ==3088== LL misses:      4,197,505  (     2,718 rd   + 4,194,787 wr) ==3088== LL miss rate:         4.9% (       0.0%     +      99.6%  ) lowest level wjb19@psu.edu
  • 49. Cache : Good Access bill@bill­HP­EliteBook­6930p:~$ valgrind ­­tool=cachegrind ./foo.x ==4410== Cachegrind, a cache and branch­prediction profiler ==4410== Copyright (C) 2002­2010, and GNU GPL'd, by Nicholas Nethercote et al. ==4410== Using Valgrind­3.6.1 and LibVEX; rerun with ­h for copyright info ==4410== Command: ./foo.x ==4410==  ==4410==  ==4410== I   refs:      50,503,275 ==4410== I1  misses:           734 ==4410== LLi misses:           733 ==4410== I1  miss rate:       0.00% ==4410== LLi miss rate:       0.00% ==4410==  ==4410== D   refs:      33,617,678  (29,410,213 rd   + 4,207,465 wr) ==4410== D1  misses:       265,002  (     2,335 rd   +   262,667 wr) ==4410== LLd misses:       264,613  (     1,985 rd   +   262,628 wr) ==4410== D1  miss rate:        0.7% (       0.0%     +       6.2%  ) ==4410== LLd miss rate:        0.7% (       0.0%     +       6.2%  ) ==4410==  ==4410== LL refs:          265,736  (     3,069 rd   +   262,667 wr) ==4410== LL misses:        265,346  (     2,718 rd   +   262,628 wr) ==4410== LL miss rate:         0.3% (       0.0%     +       6.2%  ) wjb19@psu.edu
• 50. Cache Performance ●For large data problems, any speedup introduced by parallelization can easily be negated by poor cache utilization ●In this case, memory bandwidth is an order of magnitude worse for problem size (2^14)^2 (cf earlier note on widely variable memory bandwidths; we have to work hard to approach peak) ●In many cases we are limited also by random access patterns [plot: run time (s) versus log2 SIZE_X from 10 to 14, for the high %-miss and low %-miss access patterns] wjb19@psu.edu
  • 51. Outline ●Introduction ● Motivation ● Computational operations ● Multiprocessors ● Processes ● Memory Digression ● Virtual Memory ● Cache ●Threads ● POSIX ● OpenMP ● Affinity wjb19@psu.edu
  • 52. POSIX Threads I ●A process may spawn one or more threads; on a multiprocessor, the OS can schedule these threads across a variety of cores, providing parallelism in the form of 'light-weight processes' (LWP) ●Whereas a child process receives a copy of the parent's virtual memory and executes independently thereafter, a thread shares the memory of the parent including instructions, and also has private data Using threads we perform shared memory processing (cf distributed ● memory, next time) ●We are at liberty to launch as many threads as we wish, although as you might expect, performance takes a hit as more threads are launched than can be scheduled simultaneously across available cores wjb19@psu.edu
  • 53. POSIX Threads II ●Pthreads refers to the POSIX standard, which is just a specification; implementations exist for various systems Each pthread has: ● ● An ID ● Attributes : ● Stack size ● Schedule information ●Much like processes, we can monitor thread execution using utilities such as top and ps ●The memory shared among threads must be used carefully in order to prevent race conditions, or threads seeing incorrect data during execution, due to more than one thread performing operations on said data, in an uncoordinated fashion wjb19@psu.edu
• 54. POSIX Threads III ●Race conditions may be ameliorated through careful coding, but also through explicit constructs eg., locks, whereby a single thread gains and relinquishes control → implies serialization and computational overhead ●Multi-threaded programs must also avoid deadlock, a highly undesirable state where one or more threads await resources, and in turn are unable to offer up resources required by others ●Deadlocks can also be avoided through good coding, as well as the use of communication techniques based around semaphores, for example ●Threads awaiting resources may sleep (context switch by kernel, slow, saves cycles) or busy wait (executes a while loop or similar checking a semaphore, fast, wastes cycles) wjb19@psu.edu
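As a minimal sketch of the lock construct mentioned above (not the course's example code), the fragment below protects a shared counter with a pthread mutex so that concurrent increments cannot race; the thread count and iteration count are arbitrary.

  #include <pthread.h>
  #include <stdio.h>

  static int sum = 0;                                        //shared (global) data
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;   //protects sum

  void* worker(void* arg) {
          for (int i = 0; i < 100000; i++) {
                  pthread_mutex_lock(&lock);     //only one thread at a time past this point
                  sum += 1;
                  pthread_mutex_unlock(&lock);   //release so other threads may proceed
          }
          return NULL;
  }

  int main() {
          pthread_t tid[4];
          for (int i = 0; i < 4; i++)
                  pthread_create(&tid[i], NULL, worker, NULL);
          for (int i = 0; i < 4; i++)
                  pthread_join(tid[i], NULL);
          printf("sum = %d\n", sum);             //always 400000 with the mutex in place
          return 0;
  }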
• 55. Pthreads Example
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;                        //global (shared) variable
void *worker(void *param);

int main(int argc, char *argv[]){      //main thread
        pthread_t tid;                 //thread id & attributes
        pthread_attr_t attr;
        if (argc!=2 || atoi(argv[1])<0){
                printf("usage : a.out <int value>, where int value > 0\n");
                return -1;
        }
        pthread_attr_init(&attr);
        pthread_create(&tid,&attr,worker,argv[1]);   //worker thread creation
        pthread_join(tid,NULL);                      //& join after completion
        printf("sum = %d\n",sum);
}

void * worker(void *total){
        int upper=atoi(total);         //local (private) variable
        sum = 0;
        for (int i=0; i<upper; i++)
                sum += i;
        pthread_exit(0);
}
wjb19@psu.edu
  • 56. Valgrind-helgrind output [wjb19@hammer16 scratch]$ valgrind ­­tool=helgrind ­v ./foo.x 100  ==5185== Helgrind, a thread error detector ==5185== Copyright (C) 2007­2009, and GNU GPL'd, by OpenWorks LLP et al. ==5185== Using Valgrind­3.5.0 and LibVEX; rerun with ­h for copyright info ==5185== Command: ./foo.x 100 ==5185==  ­­5185­­ Valgrind options: system calls establishing thread ie., there ­­5185­­    ­­tool=helgrind is a COST to create and destroy threads ­­5185­­    ­v ­­5185­­ Contents of /proc/version: ­­5185­­   Linux version 2.6.18­274.7.1.el5 (mockbuild@x86­004.build.bos.redhat.com) (gcc version  ­­5185­­ REDIR: 0x3a97e7c240 (memcpy) redirected to 0x4a09e3c (memcpy) ­­5185­­ REDIR: 0x3a97e79420 (index) redirected to 0x4a09bc9 (index) ­­5185­­ REDIR: 0x3a98a069a0 (pthread_create@@GLIBC_2.2.5) redirected to 0x4a0b2a5  (pthread_create@*) ­­5185­­ REDIR: 0x3a97e749e0 (calloc) redirected to 0x4a05942 (calloc) ­­5185­­ REDIR: 0x3a98a08ca0 (pthread_mutex_lock) redirected to 0x4a076c2 (pthread_mutex_lock) ­­5185­­ REDIR: 0x3a97e74dc0 (malloc) redirected to 0x4a0664a (malloc) ­­5185­­ REDIR: 0x3a98a0a020 (pthread_mutex_unlock) redirected to 0x4a07b66 (pthread_mutex_unlock) ­­5185­­ REDIR: 0x3a97e79b50 (strlen) redirected to 0x4a09cbb (strlen) ­­5185­­ REDIR: 0x3a98a07a10 (pthread_join) redirected to 0x4a07431 (pthread_join) sum = 4950 ==5185==  ==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3) ­­5185­­  ­­5185­­ used_suppression:      1 helgrind­glibc2X­101 ­­5185­­ used_suppression:      1 helgrind­glibc2X­112 ­­5185­­ used_suppression:      1 helgrind­glibc2X­102 ==5185==  ==5185== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 3 from 3) wjb19@psu.edu
• 57. Pthreads: Race Condition
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

int sum;
void *worker(void *param);

int main(int argc, char *argv[]){
        pthread_t tid;
        pthread_attr_t attr;
        if (argc!=2 || atoi(argv[1])<0){
                printf("usage : a.out <int value>, where int value > 0\n");
                return -1;
        }
        pthread_attr_init(&attr);
        pthread_create(&tid,&attr,worker,argv[1]);
        int upper=atoi(argv[1]);
        sum=0;                          //main thread works on the global variable as well,
        for (int i=0; i<upper; i++)     //without synchronization/coordination
                sum+=i;
        pthread_join(tid,NULL);
        printf("sum = %d\n",sum);
}
wjb19@psu.edu
  • 58. Helgrind output w/ race [wjb19@hammer16 scratch]$ valgrind ­­tool=helgrind ./foo.x 100  ==5384== Helgrind, a thread error detector ==5384== Copyright (C) 2007­2009, and GNU GPL'd, by OpenWorks LLP et al. ==5384== Using Valgrind­3.5.0 and LibVEX; rerun with ­h for copyright info ==5384== Command: ./foo.x 100 ==5384==  ==5384== Thread #1 is the program's root thread built foo.x with debug on (-g) to ==5384==  find source file line(s) w/ ==5384== Thread #2 was created ==5384==    at 0x3A97ED447E: clone (in /lib64/libc­2.5.so) error(s) ==5384==    by 0x3A98A06D87: pthread_create@@GLIBC_2.2.5 (in /lib64/libpthread­2.5.so) ==5384==    by 0x4A0B206: pthread_create_WRK (hg_intercepts.c:229) ==5384==    by 0x4A0B2AD: pthread_create@* (hg_intercepts.c:256) ==5384==    by 0x400748: main (fooThread2.c:18) ==5384==  ==5384== Possible data race during write of size 4 at 0x600cdc by thread #1 ==5384==    at 0x400764: main (fooThread2.c:20) ==5384==  This conflicts with a previous write of size 4 by thread #2 ==5384==    at 0x4007E3: worker (fooThread2.c:31) ==5384==    by 0x4A0B330: mythread_wrapper (hg_intercepts.c:201) ==5384==    by 0x3A98A0673C: start_thread (in /lib64/libpthread­2.5.so) ==5384==    by 0x3A97ED44BC: clone (in /lib64/libc­2.5.so) ==5384== ●Pthreads is a versatile albeit large and inherently complicated interface ●We are primarily concerned with 'simply' dividing a workload among available cores; OpenMP proves much less unwieldy to use wjb19@psu.edu
  • 59. OpenMP Introduction ●OpenMP is a set of multi-platform/OS compiler directives, libraries and environment variables for readily creating multi-threaded applications ●The OpenMP standard is managed by a review board, and is defined by a large number of hardware vendors ●Applications written using OpenMP employ pragmas, or statements interpreted by the preprocessor (before compilation), representing functionality like fork & join that would take considerably more effort and care to implement otherwise ●OpenMP pragmas or directives indicate parallel sections of code ie., after compilation, at runtime, threads are each given a portion of work eg., in this case, loop iterations will be divided evenly among running threads : #pragma omp parallel for for (int i=0; i<SIZE; i++) y[i]=x[i]*10.0f; wjb19@psu.edu
• 60. OpenMP Clauses I ●The number of threads launched during parallel blocks may be set via function calls or by setting the OMP_NUM_THREADS environment variable ●Data objects are generally shared by default (loop counters are private by default); a number of pragma clauses are available, which are valid for the scope of the parallel section eg., : ● private ● shared ● firstprivate - initialized to value before parallel block ● lastprivate - variable keeps value after parallel block ● reduction - thread safe way of combining data at conclusion of parallel block (see the sketch below) ●Thread synchronization is implicit to parallel sections; there are a variety of clauses available for controlling this behavior also, including : ● critical - one thread at a time works in this section eg., in order to avoid a race (expensive, design your code to avoid at all costs) ● atomic - safe memory updates performed using eg., mutual exclusion (cost) ● barrier - threads wait at this point for others to arrive wjb19@psu.edu
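A minimal sketch of the reduction clause, assuming compilation with an OpenMP-capable compiler (eg., gcc -fopenmp or icc -openmp); each thread accumulates a private partial sum which OpenMP combines safely at the end of the loop.

  #include <omp.h>
  #include <stdio.h>

  int main() {
          const int N = 1000000;
          double sum = 0.0;

          //each thread gets a private copy of sum; the copies are added together at the end
          #pragma omp parallel for reduction(+:sum)
          for (int i = 0; i < N; i++)
                  sum += 1.0 / (double)(i + 1);

          printf("harmonic sum = %f (threads available : %i)\n", sum, omp_get_max_threads());
          return 0;
  }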
• 61. OpenMP Clauses II ●OpenMP has default thread scheduling behavior handled via the runtime library, which may be modified through use of the schedule(type,chunk) clause, with types : ● static - loop iterations are divided among threads equally by default; specifying an integer for the parameter chunk will allocate a number of contiguous iterations to a thread ● dynamic - total iterations form a pool, from which threads work on small contiguous subsets until all are complete, with subset size given again by chunk ● guided - a large section of contiguous iterations is allocated to each thread dynamically. The section size decreases exponentially with each successive allocation to a minimum size specified by chunk wjb19@psu.edu
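The fragment below sketches the schedule clause on a loop whose iterations vary in cost; the work() function is a hypothetical stand-in, not from the course material. Dynamic scheduling with a modest chunk keeps the threads evenly loaded.

  #include <math.h>
  #include <stdio.h>

  //hypothetical per-iteration workload whose cost grows with i
  static double work(int i) {
          double x = 0.0;
          for (int j = 0; j < i; j++)
                  x += sqrt((double) j);
          return x;
  }

  int main() {
          const int N = 10000;
          double total = 0.0;

          //iterations handed out 64 at a time as threads become free
          #pragma omp parallel for schedule(dynamic,64) reduction(+:total)
          for (int i = 0; i < N; i++)
                  total += work(i);

          printf("total = %f\n", total);
          return 0;
  }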
• 62. OpenMP Example : KTM ●In our first attempt at parallelization shortly, we simply add an OpenMP pragma before the computational loops in the worker function:
#pragma omp parallel for
//loop over trace records
for (int k=0; k<config->traceNo; k++){
     //loop over imageX
     for(int i=0; i<Li; i++){
          tempC = (midX[k] - imageXX[i]-offX[k]) * (midX[k] - imageXX[i]-offX[k]);
          tempD = (midX[k] - imageXX[i]+offX[k]) * (midX[k] - imageXX[i]+offX[k]);
          //loop over imageY
          for(int j=0; j<Lj; j++){
               tempA = tempC + (midY[k] - imageYY[j]-offY[k]) * (midY[k] - imageYY[j]-offY[k]);
               tempB = tempD + (midY[k] - imageYY[j]+offY[k]) * (midY[k] - imageYY[j]+offY[k]);
               //loop over imageZ
               for (int l=0; l<Ll; l++){
                    temp = sqrtf(tauS[l] + tempA * slownessS[l]);
                    temp += sqrtf(tauS[l] + tempB * slownessS[l]);
                    timeIndex = (int) (temp / sRate);
                    if ((timeIndex < config->tracePts) && (timeIndex > 0)){
                         image[i*Lj*Ll + j*Ll + l] += traces[timeIndex + k * config->tracePts] * temp * sqrtf(tauS[l] / temp);
                    }
               } //imageZ
          } //imageY
     } //imageX
}//input trace records
wjb19@psu.edu
• 63. OpenMP KTM Results ●Scales well up to eight cores, then drops off; the SMP model has deficiencies due to a number of factors, including : ● Coverage (Amdahl's law); as we increase processors, the relative cost of the serial code portion increases ● Hardware limitations ● Locality... [plot: execution time versus CPU cores 1, 2, 4, 8, 16] wjb19@psu.edu
  • 64. CPU Affinity (Intel*) ●Recall that the OS schedules processes and threads using context switches; can be detrimental → threads may resume on different core, destroying locality ●We can change this by restricting threads to execute on a subset of processors, by setting processor affinity ●Simplest approach is to set environment variable KMP_AFFINITY to: ● determine the machine topology, ● assign threads to processors ●Usage: KMP_AFFINITY=[<modifier>]<type>[<permute>][<offset>]  *For GNU, ~ equivalent env var == GOMP_CPU_AFFINITY wjb19@psu.edu
• 65. CPU Affinity Settings ●The modifier may take settings corresponding to granularity (with specifiers: fine, thread, and core), as well as a processor list (proclist={<proc-list>}), verbose, warnings and others ●The type settings refer to the nature of the affinity, and may take values : ● compact - try to assign thread n+1 a context as close as possible to that of thread n ● disabled ● explicit - force assignment of threads to the processors in proclist ● none - just return the topology w/ the verbose modifier ● scatter - distribute as evenly as possible ●fine & thread refer to the same thing, namely that threads only resume in the same context; the core modifier implies that they may resume within a different context, but on the same physical core ●CPU affinity can affect application performance significantly and is worth tuning, based on your application and the machine topology... wjb19@psu.edu
• 66. CPU Topology Map ●For any given computational node, we have several different physical devices (packages in sockets), comprised of cores (eg., two per package here), which run one or two thread contexts ●Without hyperthreading, there is only a single context per core ie., the modifiers thread/fine and core are indistinguishable [diagram: a node with packageA and packageB, each containing core0 and core1, each core offering thread contexts 0 and 1] wjb19@psu.edu
  • 67. CPU Affinity Examples ●Display machine topology map eg,. Hammer : [wjb19@hammer16 scratch] $ export KMP_AFFINITY=verbose,none [wjb19@hammer16 scratch] $ ./psktm.x OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids. OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #156: KMP_AFFINITY: 12 available OS procs OMP: Info #157: KMP_AFFINITY: Uniform topology OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores) OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {0,1,2,3,4,5,6,7,8,9,10,11} wjb19@psu.edu
  • 68. CPU Affinity Examples ●Set affinity with compact setting, fine granularity : [wjb19@hammer5 scratch]$ export KMP_AFFINITY=verbose,granularity=fine,compact [wjb19@hammer5 scratch]$ ./psktm.x  OMP: Info #204: KMP_AFFINITY: decoding cpuid leaf 11 APIC ids. OMP: Info #202: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {0,1,2,3,4,5,6,7,8,9,10,11} OMP: Info #156: KMP_AFFINITY: 12 available OS procs OMP: Info #157: KMP_AFFINITY: Uniform topology OMP: Info #179: KMP_AFFINITY: 2 packages x 6 cores/pkg x 1 threads/core (12 total cores) OMP: Info #206: KMP_AFFINITY: OS proc to physical thread map: OMP: Info #171: KMP_AFFINITY: OS proc 0 maps to package 0 core 0  OMP: Info #171: KMP_AFFINITY: OS proc 8 maps to package 0 core 1  OMP: Info #171: KMP_AFFINITY: OS proc 4 maps to package 0 core 2  OMP: Info #171: KMP_AFFINITY: OS proc 2 maps to package 0 core 8  OMP: Info #171: KMP_AFFINITY: OS proc 10 maps to package 0 core 9  OMP: Info #171: KMP_AFFINITY: OS proc 6 maps to package 0 core 10  OMP: Info #171: KMP_AFFINITY: OS proc 1 maps to package 1 core 0  OMP: Info #171: KMP_AFFINITY: OS proc 9 maps to package 1 core 1  OMP: Info #171: KMP_AFFINITY: OS proc 5 maps to package 1 core 2  OMP: Info #171: KMP_AFFINITY: OS proc 3 maps to package 1 core 8  OMP: Info #171: KMP_AFFINITY: OS proc 11 maps to package 1 core 9  OMP: Info #171: KMP_AFFINITY: OS proc 7 maps to package 1 core 10  OMP: Info #147: KMP_AFFINITY: Internal thread 0 bound to OS proc set {2} OMP: Info #147: KMP_AFFINITY: Internal thread 1 bound to OS proc set {10} OMP: Info #147: KMP_AFFINITY: Internal thread 2 bound to OS proc set {6} OMP: Info #147: KMP_AFFINITY: Internal thread 3 bound to OS proc set {1} OMP: Info #147: KMP_AFFINITY: Internal thread 4 bound to OS proc set {9} OMP: Info #147: KMP_AFFINITY: Internal thread 5 bound to OS proc set {5} OMP: Info #147: KMP_AFFINITY: Internal thread 6 bound to OS proc set {3} OMP: Info #147: KMP_AFFINITY: Internal thread 7 bound to OS proc set {11} wjb19@psu.edu
  • 69. Conclusions ●Scientific research is supported by computational scaling and performance, both provided by parallelism, limited to some extent by Amdahl's law ●Parallelism has various levels of granularity; at the finest level is the instruction pipeline and vectorized registers eg., SSE ●The next level up in parallel granularity is the multiprocessor; we may run many concurrent threads using the pthreads API or the OpenMP standard for instance ●Threads must be coded and handled with care, to avoid race and deadlock conditions ●Performance is a strong function of cache utilization; benefits introduced through parallelization can easily be negated by sloppy use of memory bandwidth ●Scaling across cores is limited by hardware, Amdahl's law but also locality; we have some control over the latter using  KMP_AFFINITY for instance wjb19@psu.edu
  • 70. References ●Valgrind (buy the manual, worth every penny) ● http://valgrind.org/ ●OpenMP ● http://openmp.org/wp/ ●GNU OpenMP ● http://gcc.gnu.org/projects/gomp/ ●Summary of OpenMP 3.0 C/C++ Syntax ● http://openmp.org/mp-documents/OpenMP3.1-CCard.pdf ●Summary of OpenMP 3.0 Fortran Syntax ● http://www.openmp.org/mp-documents/OpenMP3.0-FortranCard.pdf ●Nice SSE tutorial ● http://neilkemp.us/src/sse_tutorial/sse_tutorial.html ●Intel Nehalem ● http://en.wikipedia.org/wiki/Nehalem_%28microarchitecture%29 ●GNU Make ● http://www.gnu.org/s/make/ ●Intel hyperthreading ● http://en.wikipedia.org/wiki/Hyper-threading wjb19@psu.edu
  • 71. Exercises ●Take the supplied code and parallelize using OpenMP pragma around the worker function ●Create a makefile which builds the code, compare timings btwn serial & parallel by varying OMP_NUM_THREADS ●Examine effect of various settings for KMP_AFFINITY wjb19@psu.edu
• 72. Build w/ Confidence : make
#Makefile for basic Kirchhoff Time Migration example
#set compiler
CC=icc -openmp
#set build options
CFLAGS=-std=c99 -c
#main executable
all: psktm.x
#objects and dependencies
psktm.x: psktmCPU.o demoA.o
        $(CC) psktmCPU.o demoA.o -o psktm.x
psktmCPU.o: psktmCPU.c
        $(CC) $(CFLAGS) psktmCPU.c
demoA.o: demoA.c
        $(CC) $(CFLAGS) demoA.c
clean:
        rm -rf *o psktm.x
(indent recipe lines with a tab only!)
wjb19@psu.edu
  • 73. HPC Essentials Part III : Message Passing Interface Bill Brouwer Research Computing and Cyberinfrastructure (RCC), PSU wjb19@psu.edu
  • 74. Outline ●Motivation ●Interprocess Communication ● Signals ● Sockets & Networks ●procfs Digression ●Message Passing Interface ● Send/Receive ● Communication ● Parallel Constructs ● Grouping Data ● Communicators & Topologies wjb19@psu.edu
• 75. Motivation ●We saw last time that Amdahl's law implies an asymptotic limit to performance gains from parallelism, where the parallel (P) and serial (1-P) code portions have fixed relative cost ●We looked at threads ("light-weight processes") and also saw that performance depends on a variety of things, including good cache utilization and affinity ●For the problem size investigated, the limiting factor was ultimately disk I/O, and there was no sense going beyond a single compute node; on a machine with 16 cores or more, there is little point when P < 60%, provided the process has sufficient memory ●However, as we increase our problem size, the relative parallel/serial cost changes and P can approach 1 wjb19@psu.edu
• 76. Motivation ●In the limit as the number of processors N → ∞, we find the maximum performance improvement : 1/(1-P) ●It is helpful to see the 3dB point for this limit ie., the number of processors N_1/2 required to achieve (1/√2)*max = 1/(√2*(1-P)); equating with Amdahl's law & after some algebra : N_1/2 = P/((√2-1)*(1-P)) ≈ 1/((√2-1)*(1-P)) for P close to 1 [plot: N_1/2 versus parallel code fraction P from 0.9 to 0.99] wjb19@psu.edu
  • 77. Motivation Points to note from the graph : ● ● P ~ 0.90, we can benefit from ~ 20 cores ● P ~ 0.99, we can benefit from a cluster size of ~ 256 cores ● P → 1, we approach the “embarrassingly parallel” limit ● P ~ 1, performance improvement directly proportional to cores ● P ~ 1 implies independent or batch processes ●Quite aside from considerations of Amdahl's law, as the problem size grows, we may simply exceed the memory available on a single node ●In this case, must move to a distributed memory processing model/multiple nodes (unless P ~ 1 of course) How do we determine P? → PROFILING ● wjb19@psu.edu
  • 78. Profiling w/ Valgrind [wjb19@lionxf scratch]$ valgrind ­­tool=callgrind ./psktm.x [wjb19@lionxf scratch]$ callgrind_annotate ­­inclusive=yes callgrind.out.3853  ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­ Profile data file 'callgrind.out.3853' (creator: callgrind­3.5.0) ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­ I1 cache:  D1 cache:  L2 cache:  Parallelizable worker Timerange: Basic block 0 ­ 2628034011 function is 99.5% of Trigger: Program termination Profiled target:  ./psktm.x (PID 3853, part 1) total instructions executed ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­ 20,043,133,545  PROGRAM TOTALS ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­             Ir  file:function ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­ 20,043,133,545  ???:0x0000003128400a70 [/lib64/ld­2.5.so] 20,042,523,959  ???:0x0000000000401330 [/gpfs/scratch/wjb19/psktm.x] 20,042,522,144  ???:(below main) [/lib64/libc­2.5.so] 20,042,473,687  /gpfs/scratch/wjb19/demoA.c:main 20,042,473,687  demoA.c:main [/gpfs/scratch/wjb19/psktm.x] 19,934,044,644  psktmCPU.c:ktmMigrationCPU [/gpfs/scratch/wjb19/psktm.x] 19,934,044,644  /gpfs/scratch/wjb19/psktmCPU.c:ktmMigrationCPU  6,359,083,826  ???:sqrtf [/gpfs/scratch/wjb19/psktm.x]  4,402,442,574  ???:sqrtf.L [/gpfs/scratch/wjb19/psktm.x]    104,966,265  demoA.c:fileSizeFourBytes [/gpfs/scratch/wjb19/psktm.x] If we wish to scale outside a single node, we must use some form of interprocess communication wjb19@psu.edu
  • 79. Inter-Process Communication ● There are a variety of ways for processes to exchange information, including: ● Memory (~last week) ● Files ● Pipes (named/anonymous) ● Signals ● Sockets ● Message Passing ● File I/O is too slow, and read/writes liable to race conditions ● Anonymous & named pipes are highly efficient but FIFO (first in, first out) buffers, allowing only unidirectional communication, and between processes on the same node ●Signals are a very limited form of communication, sent to the process after an interrupt by the kernel, and handled using a default handler or one specified using signal() system call ●Signals may come from a variety of sources eg., segmentation fault (SIGSEGV), keyboard interrupt Ctrl-C (SIGINT) etc wjb19@psu.edu
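As a minimal sketch of installing a handler with the signal() system call mentioned above (illustrative, not from the course material), the program below catches SIGINT so that Ctrl-C sets a flag instead of killing the process; for production code sigaction() offers more portable semantics.

  #include <signal.h>
  #include <stdio.h>
  #include <unistd.h>

  static volatile sig_atomic_t got_sigint = 0;

  //handler runs asynchronously when the kernel delivers SIGINT (eg., Ctrl-C)
  static void handler(int sig) {
          (void) sig;
          got_sigint = 1;
  }

  int main() {
          signal(SIGINT, handler);      //replace the default action for SIGINT
          printf("running; press Ctrl-C to stop\n");
          while (!got_sigint)
                  sleep(1);             //sleep is interrupted when the signal arrives
          printf("caught SIGINT, exiting cleanly\n");
          return 0;
  }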
  • 80. Signals ●strace is a powerful utility in UNIX which shows the interaction between a running process and kernel in the form of system calls and signals; here, a partial output showing mapping of signals to defaults with system call sigaction(), from ./psktm.x : UNIX signals rt_sigaction(SIGHUP, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGINT, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGQUIT, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGILL, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGABRT, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGFPE, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGBUS, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGSEGV, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGSYS, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGTERM, NULL, {SIG_DFL, [], 0}, 8) = 0 rt_sigaction(SIGPIPE, NULL, {SIG_DFL, [], 0}, 8) = 0 ●Signals are crude and restricted to local communication; to communicate remotely, we can establish a socket between processes, and communicate over the network wjb19@psu.edu
• 81. Sockets & Networks ●Davies/Baran first devised packet switching, an efficient means of communication over a channel; a computer was conceived to realize their design and ARPANET went online Oct 1969 between UCLA and Stanford ●TCP/IP became the communication protocol of ARPANET on 1 Jan 1983; ARPANET was retired in 1990, with NSFNET established as its successor and university networks in the US and Europe joining ●TCP/IP is just one of many protocols; a protocol describes the format of data packets and the nature of the communication; an analogous connection-oriented method is used by Infiniband networks in conjunction with Remote Direct Memory Access (RDMA) ●The User Datagram Protocol (UDP) is connectionless, analogous to the Unreliable Datagram transport used by Infiniband high performance networks wjb19@psu.edu
• 82. Sockets : UDP host example
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <unistd.h> /* for close() for socket */
#include <stdlib.h>

int main(void)
{
  //creates an endpoint & returns file descriptor
  //uses IPv4 domain, datagram type, UDP transport
  int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);

  //socket address object (sa) and memory buffer
  struct sockaddr_in sa;
  char buffer[1024];
  ssize_t recsize;
  socklen_t fromlen;

  //specify same domain type, any input address and port 7654 to listen on
  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = INADDR_ANY;
  sa.sin_port = htons(7654);
  fromlen = sizeof(sa);
wjb19@psu.edu
• 83. Sockets : host example cont.
  //we bind an address (sa) to the socket using fd sock
  if (-1 == bind(sock,(struct sockaddr *)&sa, sizeof(sa)))
  {
    perror("error bind failed");
    close(sock);
    exit(EXIT_FAILURE);
  }

  for (;;)
  {
    //listen and dump buffer to stdout where applicable
    printf ("recv test....\n");
    recsize = recvfrom(sock, (void *)buffer, 1024, 0, (struct sockaddr *)&sa, &fromlen);
    if (recsize < 0) {
      fprintf(stderr, "%s\n", strerror(errno));
      exit(EXIT_FAILURE);
    }
    printf("recsize: %zi\n", recsize);
    sleep(1);
    printf("datagram: %.*s\n", (int)recsize, buffer);
  }
}
wjb19@psu.edu
• 84. Sockets : client example
int main(int argc, char *argv[])
{
  //create a buffer with character data
  int sock;
  struct sockaddr_in sa;
  int bytes_sent;
  char buffer[200];

  strcpy(buffer, "hello world!");

  //create a socket, same IP and transport as before, address of host 127.0.0.1
  sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP);
  if (-1 == sock) /* if socket failed to initialize, exit */
  {
    printf("Error Creating Socket");
    exit(EXIT_FAILURE);
  }

  memset(&sa, 0, sizeof sa);
  sa.sin_family = AF_INET;
  sa.sin_addr.s_addr = inet_addr("127.0.0.1");
  sa.sin_port = htons(7654);

  bytes_sent = sendto(sock, buffer, strlen(buffer), 0,(struct sockaddr*)&sa, sizeof sa);
  if (bytes_sent < 0) {
    printf("Error sending packet: %s\n", strerror(errno));
    exit(EXIT_FAILURE);
  }

  close(sock); /* close the socket */
  return 0;
}
●You can monitor sockets by using the netstat facility, which takes its data from /proc/net wjb19@psu.edu
  • 85. Outline ●Motivation ●Interprocess Communication ● Signals ● Sockets & Networks ●procfs Digression ●Message Passing ● Send/Receive ● Communication ● Parallel Constructs ● Grouping Data ● Communicators & Topologies wjb19@psu.edu
  • 86. procfs ●We mentioned the /proc directory previously in the context of cpu and memory information, which is frequently referred to as the proc filesystem or procfs ●It is a veritable treasure trove of information, written periodically by the kernel, and is used by a variety of tools eg., ps ● Each running process is assigned a directory, whose name is the process id ●Each directory contains text files and subdirectories with every detail of a running process, including context switching statistics, memory management, open file descriptors and much more ●Much like the ptrace() system call, procfs also gives user applications the ability to directly manipulate running processes, given sufficient permission; you can explore that on your own :) wjb19@psu.edu
• 87. procfs : examples ●Some of the more useful files : ● /proc/PID/cmdline : command used to launch the process ● /proc/PID/cwd : current working directory ● /proc/PID/environ : environment variables for the process ● /proc/PID/fd : directory w/ a symbolic link for each open file descriptor eg., streams ● /proc/PID/status : information including signals, state, memory usage ● /proc/PID/maps : memory map between virtual and physical addresses ●eg., contents of the fd directory for running process ./psktm.x : [wjb19@hammer1 fd]$ ls -lah total 0 dr-x------ 2 wjb19 wjb19  0 Dec  7 12:13 . dr-xr-xr-x 6 wjb19 wjb19  0 Dec  7 12:10 .. lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 0 -> /dev/pts/28 lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 1 -> /dev/pts/28 lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 2 -> /dev/pts/28 lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 3 -> /gpfs/scratch/wjb19/inputDataSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 4 -> /gpfs/scratch/wjb19/inputSrcXSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 5 -> /gpfs/scratch/wjb19/inputSrcYSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 6 -> /gpfs/scratch/wjb19/inputRecXSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 7 -> /gpfs/scratch/wjb19/inputRecYSmall.bin lrwx------ 1 wjb19 wjb19 64 Dec  7 12:13 8 -> /gpfs/scratch/wjb19/velModel.bin wjb19@psu.edu
  • 88. procfs : status file extract [wjb19@hammer1 30769]$ more status Name: psktm.x State: R (running) SleepAVG: 0% Tgid: 30769 Pid: 30769 PPid: 30687 TracerPid: 0 Uid: 2511 2511 2511 2511 Gid: 2530 2530 2530 2530 FDSize: 256 Groups: 2472 2530 3835 4933 5505 5732  VmPeak:    65520 kB VmSize:    65520 kB VmLck:        0 kB VmHWM:    37016 kB VmRSS:    37016 kB VmData:    51072 kB VmStk:       88 kB Virtual memory usage VmExe:       64 kB VmLib:     2944 kB VmPTE:      164 kB StaBrk: 1289a000 kB Brk: 128bb000 kB StaStk: 7fffbd0a0300 kB Threads: 5 SigQ: 0/398335 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000000000000 signals SigIgn: 0000000000000000 SigCgt: 0000000180000000 wjb19@psu.edu
  • 89. Outline ●Motivation ●Interprocess Communication ● Signals ● Sockets & Networks ●procfs Digression ●Message Passing Interface ● Send/Receive ● Communication ● Parallel Constructs ● Grouping Data ● Communicators & Topologies wjb19@psu.edu
• 90. Message Passing Interface (MPI) ●A classical von Neumann machine has a single instruction/data stream (SISD) → single process & memory ●A Multiple Instruction, Multiple Data (MIMD) system → connected processes are asynchronous, generally distributed memory (may also be shared where processes are on a single node) ●MIMD processors are connected in some network topology; we don't have to worry about the details, MPI abstracts this away ●MPI is a standard for parallel programming first established in 1991, updated occasionally, by academics and industry ●It comprises routines for point-to-point and collective communication, with bindings to C/C++ and Fortran ●Depending on the underlying network fabric, communication may be TCP-like or UDP-like as in Infiniband networks wjb19@psu.edu
• 91. MPI : Basic communication ●Multiple, distributed processes are spawned at initialization, each process assigned a unique rank 0,1,...,p-1 ●One may send information referencing process rank eg.,: MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD); here the first argument is the buffer address and the fourth is the rank of the receiver ●This function has a receive analogue; both routines are blocking by default ●Send/receive statements generally occur in the same code; processors execute the appropriate statement according to rank & code branch ●Non-blocking functions are available, allowing communicating processes to continue with execution where able wjb19@psu.edu
  • 92. MPI : Requisite functions ●Bare minimum → initialize, get rank for process, total processes and finalize when done MPI_Init(&argc, &argv); //Start up MPI_Comm_rank(MPI_COMM_WORLD,&my_rank); //My rank MPI_Comm_size(MPI_COMM_WORLD, &p); //No. processors MPI_Finalize(); //close up shop ●MPI_COMM_WORLD is a communicator parameter, a collection of processes that can send messages to each other. ●Messages are sent with tags to identify them, allowing specificity beyond using just a source/destination parameter wjb19@psu.edu
• 93. MPI : Datatypes
MPI_CHAR             signed char
MPI_SHORT            signed short int
MPI_INT              signed int
MPI_LONG             signed long int
MPI_UNSIGNED_CHAR    unsigned char
MPI_UNSIGNED_SHORT   unsigned short int
MPI_UNSIGNED         unsigned int
MPI_UNSIGNED_LONG    unsigned long int
MPI_FLOAT            float
MPI_DOUBLE           double
MPI_LONG_DOUBLE      long double
MPI_BYTE
MPI_PACKED
wjb19@psu.edu
• 94. Minimal MPI example
#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
        int rank, size, i;
        int buffer[10];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank > 0)
        {
                for (int i =0; i<10; i++)
                        buffer[i]=i * rank;
                MPI_Send(buffer, 10, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else {
                for (int i=1; i<size; i++){
                        MPI_Recv(buffer, 10, MPI_INT, i, 0, MPI_COMM_WORLD, &status);
                        printf("buffer element 0 : %i from proc : %i \n",buffer[0],i);
                }
        }
        MPI_Finalize();
        return 0;
}
wjb19@psu.edu
  • 95. MPI : Collective Communication ● A communication pattern involving all processes in a communicator is a collective communication eg., a broadcast ● Same data sent to every process in communicator, more efficient than using multiple p2p routines, optimized : MPI_Bcast(void* message, int count, MPI_Datatype type,  int root, MPI_Comm comm) ● Sends copy of data in message from root process to all in comm, a scatter/map operation ● Collective communication is at the heart of efficient parallel operations wjb19@psu.edu
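A small sketch of a broadcast (assuming the usual mpicc/mpirun toolchain): rank 0 sets a parameter and every rank receives the same value with one collective call; the parameter dt is purely illustrative.

  #include "mpi.h"
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
          int rank;
          float dt = 0.0f;                  //hypothetical runtime parameter

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 0)
                  dt = 0.001f;              //only the root knows the value initially

          //every process in MPI_COMM_WORLD receives root's copy of dt
          MPI_Bcast(&dt, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

          printf("rank %i has dt = %f\n", rank, dt);
          MPI_Finalize();
          return 0;
  }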
• 96. Parallel Operations : Reduction ●Data may be gathered/reduced after computation via : MPI_Reduce(void* operand, void* result, int count, MPI_Datatype type, MPI_Op operator, int root, MPI_Comm comm) ●Combines all operands, using operator, and stores the outcome on process root, in result ●A tree-structured reduce at all nodes == MPI_Allreduce ie., every process in comm gets a copy of the result [diagram: processes 1, 2, 3, ..., p-1 reducing in a tree onto root process 0] wjb19@psu.edu
• 97. Reduction Ops
MPI_MAX
MPI_MIN
MPI_SUM
MPI_PROD
MPI_LAND      Logical and
MPI_BAND      Bitwise and
MPI_LOR       Logical or
MPI_BOR       Bitwise or
MPI_LXOR      Logical XOR
MPI_BXOR      Bitwise XOR
MPI_MAXLOC    Max w/ location
MPI_MINLOC    Min w/ location
wjb19@psu.edu
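A brief sketch of MPI_Reduce combining per-rank partial results onto rank 0; the quantity being summed here is just each rank's number, purely for illustration.

  #include "mpi.h"
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          int local = rank;                 //each rank's partial result
          int total = 0;

          //sum the local values across the communicator; only root holds the answer
          MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

          if (rank == 0)
                  printf("sum of ranks 0..%i = %i\n", size-1, total);

          MPI_Finalize();
          return 0;
  }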
• 98. Parallel Operations : Scatter/Gather ●Bulk transfers of many-to-one and one-to-many are accomplished by gather and scatter operations respectively ●These operations form the kernel of matrix/vector operations for example; they are useful for distributing and reassembling arrays [diagram: elements x0..x3 gathered from processes 0-3 onto one process, and a row a00..a03 scattered from one process back to processes 0-3] wjb19@psu.edu
  • 99. Scatter/Gather Syntax ● MPI_Gather(void* send_data, int send_count, MPI_Datatype  send_type, void* recv_data, int recv_count, MPI_Datatype  recv_type, int root, MPI_Comm comm) ● Collects data referenced by send_data from each process in comm and stores data in process rank order on process w/ rank root, in memory referenced by recv_data ● MPI_Scatter(void* send_data, int send_count,  MPI_Datatype send_type, void* recv_data, int recv_count,  MPI_Datatype recv_type, int root, MPI_Comm comm) ● Splits data referenced by send_data on process w/ rank root into segments, send_count elements each, w/ send_type & distributed in order to processes ● For gather result to ALL processes → MPI_Allgather wjb19@psu.edu
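A compact sketch distributing a root-owned array with MPI_Scatter, working on the pieces locally, and collecting them again with MPI_Gather; the chunk size of 4 elements per rank is an arbitrary choice for illustration.

  #include "mpi.h"
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char *argv[])
  {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          const int chunk = 4;                       //elements handled per process
          float *full = NULL;
          if (rank == 0) {
                  full = (float*) malloc(chunk*size*sizeof(float));
                  for (int i = 0; i < chunk*size; i++)
                          full[i] = (float) i;
          }

          float local[4];
          //root splits full[] into chunk-sized pieces, one per rank (including itself)
          MPI_Scatter(full, chunk, MPI_FLOAT, local, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

          for (int i = 0; i < chunk; i++)
                  local[i] *= 2.0f;                  //some local work on this rank's piece

          //reassemble the pieces in rank order back on root
          MPI_Gather(local, chunk, MPI_FLOAT, full, chunk, MPI_FLOAT, 0, MPI_COMM_WORLD);

          if (rank == 0) {
                  printf("full[%i] = %f\n", chunk*size-1, full[chunk*size-1]);
                  free(full);
          }
          MPI_Finalize();
          return 0;
  }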
• 100. Grouping Data I ●Communication is expensive → bundle variables into a single message ●We must define a derived type that can describe the heterogeneous contents of a message using type and displacement pairs ●Several ways to build this MPI_Datatype eg., MPI_Type_struct(int count, int block_lengths[] /*no. entries in each block*/, MPI_Aint displacements[] /*element offset from msg start*/, MPI_Datatype typelist[], MPI_Datatype* new_mpi_t /*a pointer to this new type*/); MPI_Aint allows for addresses > int ●A very general derived type, although arrays to the struct must be constructed explicitly using other MPI commands ●Simpler when less heterogeneous eg., MPI_Type_vector, MPI_Type_contiguous, MPI_Type_indexed wjb19@psu.edu
• 101. Grouping Data II ●Before these derived types can be used by a communication function, they must be committed with an MPI_Type_commit function call (see the sketch below) ●In order for a message to be received, the type signatures at send and receive must be compatible; if a collective communication, the signatures must be identical ●MPI_Pack & MPI_Unpack are useful when messages of heterogeneous data are infrequent, and the cost of constructing a derived type outweighs the benefit ●These methods also allow buffering in user versus system memory, and the number of items transmitted is carried in the message itself ●Grouped data allows for sophisticated objects; we can also create more fine grained communication objects wjb19@psu.edu
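A minimal sketch of the simpler derived-type route: MPI_Type_contiguous builds a type describing a fixed-length block of floats, which is committed and then used to send one 'record' at a time; the record length of 8 is an arbitrary assumption.

  #include "mpi.h"
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
          int rank;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          //describe a record of 8 contiguous floats, then commit before use
          MPI_Datatype record;
          MPI_Type_contiguous(8, MPI_FLOAT, &record);
          MPI_Type_commit(&record);

          float buf[8];
          if (rank == 0) {
                  for (int i = 0; i < 8; i++) buf[i] = 1.5f * i;
                  MPI_Send(buf, 1, record, 1, 0, MPI_COMM_WORLD);   //one record, not 8 floats
          } else if (rank == 1) {
                  MPI_Recv(buf, 1, record, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                  printf("rank 1 received record, last element = %f\n", buf[7]);
          }

          MPI_Type_free(&record);
          MPI_Finalize();
          return 0;
  }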
  • 102. Communicators ● Process subsets or groups expand communication beyond simple p2p and broadcast communication, to create : ● Intra-communicators → communicate among one other and participate in collective communication, composed of : – an ordered collection of processes (group) – a context ● Inter-communicators → communicate between different groups ● Communicators/groups are opaque, internals not directly accessible; these objects are referenced by a handle wjb19@psu.edu
  • 103. Communicators Cont. ● Internal contents manipulated by methods, much like private data in C++ class objects eg., ● int MPI_Group_incl(MPI_Group old_group,int  new_group_size, int ranks_in_old_group[], MPI_Group*  new_group) → create a new_group from old_group, using ranks_in_old_group[] etc ● int MPI_Comm_create(MPI_Comm old_comm, MPI_Group  new_group, MPI_Comm* new_comm) → create a new communicator from the old, with context ● MPI_Comm_group and MPI_Group_incl are local methods without communication, MPI_Comm_create is a collective communication implying synchronization ie,. to establish single context ● Multiple communicators may be created simultaneously using MPI_Comm_split wjb19@psu.edu
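A short sketch using MPI_Comm_split (mentioned above) to create one sub-communicator per pair of ranks; deriving the color from the world rank is purely illustrative.

  #include "mpi.h"
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
          int world_rank, sub_rank, sub_size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

          //processes with the same color end up in the same new communicator;
          //key (here world_rank) sets the rank ordering within it
          int color = world_rank / 2;
          MPI_Comm sub_comm;
          MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

          MPI_Comm_rank(sub_comm, &sub_rank);
          MPI_Comm_size(sub_comm, &sub_size);
          printf("world rank %i -> sub-communicator %i, rank %i of %i\n",
                 world_rank, color, sub_rank, sub_size);

          MPI_Comm_free(&sub_comm);
          MPI_Finalize();
          return 0;
  }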
  • 104. Topologies I ● MPI allows one to associate different addressing schemes to processes within a group ● This is a virtual versus real or physical topology, and is either a graph structure or a (Cartesian) grid; properties: ● Dimensions, w/ – Size of each – Period of each ● Option to have processes reordered optimally within grid ● Method to establish Cartesian grid cart_comm : int MPI_Cart_create(MPI_Comm old_comm, int  number_of_dims, int dim_sizes[], int wrap_around[],  int reorder, MPI_Comm* cart_comm) ● old_comm is typically just MPI_COMM_WORLD created at init wjb19@psu.edu
• 105. Topologies II ●cart_comm will contain the processes from old_comm with associated coordinates, available from MPI_Cart_coords: int coordinates[2]; int my_grid_rank; MPI_Comm_rank(cart_comm, &my_grid_rank); MPI_Cart_coords(cart_comm, my_grid_rank,2,coordinates); ●The call to MPI_Comm_rank is necessary because of process rank reordering (optimization) ●Processes in cart_comm are stored in row major order ●Can also partition into sub-grid(s) using MPI_Cart_sub eg., for a row: int free_coords[2]; MPI_Comm row_comm; //new sub-grid free_coords[0]=0; //bool; first coordinate fixed free_coords[1]=1; //bool; second coordinate free MPI_Cart_sub(cart_comm,free_coords,&row_comm); wjb19@psu.edu
  • 106. Writing Parallel Code ● Assuming we've profiled our code and decided to parallelize, equipped with MPI routines, we must decide whether to take a : ● Domain parallel (divide tasks, similar data) or ● Data parallel (divide data, similar tasks) approach ● Data parallel in general scales much better, implies lower communication overhead ● Regardless, easiest to begin by selecting or designing data structures, and subsequently their distribution using a constructed topology or scatter/gather routines, for example ● Program in modules, beginning with easiest/essential functions (eg., I/O), relegating 'hard' functionality to stubs initially ● Time code sections, look at targets for optimization & redesign ● Only concern yourself with the highest levels of abstraction germane to your problem, use parallel constructs wherever possible wjb19@psu.edu
• 107. A Note on the OSI Model ●We've been playing fast and loose with a variety of communication entities; sockets, networks, protocols like UDP, TCP etc ●The Open Systems Interconnection model separates these entities into 7 layers of abstraction, each layer providing services to the layer immediately above ●Data becomes increasingly fine grained going down from layer 7 to 1 ●As application developers and/or scientists, we need only be concerned with layers 4 and above
Layer            Granularity  Function                          Example
7. Application   data         process accessing network         MPI
6. Presentation  data         encrypt/decrypt, data conversion  MPI
5. Session       data         management                        MPI
4. Transport     segments     reliability & flow control        IB verbs
3. Network       packets      path                              Infiniband
2. Data Link     frames       addressing                        Infiniband
1. Physical      bits         signals/electrical                Infiniband
wjb19@psu.edu