Introduction to OpenMP Scalable Computing Support Center Duke University http://wiki.duke.edu/display/SCSC scsc at duke.edu John Pormann, Ph.D. jbp1 at duke.edu
What is OpenMP? OpenMP is an industry-standard combination of compiler directives and library routines, for shared-memory computers, that allows programmers to specify parallelism in their code without getting too caught up in the details of parallel programming OpenMP is a response to increasing hardware-parallelism in most (if not all) HPC platforms Sort of a “bolt-on” to the language, not a new language unto itself In Fortran, it hides many parallel constructs within comment lines (so your code still compiles/runs just fine on 1 CPU) In C, it uses “pragmas” (older compilers will just ignore the parallel constructs)
Basic SMP Architecture [diagram: four CPUs connected by a memory bus to a single main memory]
“Hello World” Program in OpenMP Hello world! 0 1 2 3 Good-bye!       program hello       print *, ’Hello world!’ c$omp parallel       print *, omp_get_thread_num() c$omp end parallel       print *, ’Good-bye!’       end Hello world! 2 3 0 1 Good-bye! Possible Outputs: int main() { printf("Hello World!"); #pragma omp parallel printf("%i",omp_get_thread_num()); printf("Good-bye!");      return( 0 ); } Hello world! 01 23 Good-bye!
Operational Model for OpenMP Master Thread performs the serial work print *, ’Hello world!’ Master Thread encounters ‘parallel region’ - creates N-1 additional Threads c$omp parallel All Threads (including Master) perform the parallel work print *,  omp_get_thread_num() Master Thread waits for all Threads to finish - implied Barrier at end of every parallel region c$omp end parallel print *, ’Good-bye!’
Operational Model for OpenMP Master Thread performs the serial work print *, ’Hello world!’ Master Thread encounters ‘parallel region’ - creates N-1 additional Threads c$omp parallel All Threads (including Master) perform the parallel work print *,  omp_get_thread_num() Master Thread waits for all Threads to finish - implied Barrier at end of every parallel region c$omp end parallel print *, ’Good-bye!’ Every Parallel Region has a “cost” (i.e. can slow down your code)
What is a Thread? A “Thread” is an operating system construct that contains all of the CPU-state associated with a piece of the parallel program A set of floating-point and integer registers, a set of memory pointers, a stack pointer, a program counter, etc. Sort of a “Virtual CPU” which competes for access to a real CPU in order to do real work Generally, you want to use N Threads on a N CPU machine But you can use any number of Threads on any number of CPUs OS will time-share the system in (we hope) a fair manner For comparison:  a Unix/Linux “Process” is a similar OS construct, except that 2 Processes cannot directly share memory
Compiling OpenMP Programs OpenMP *REQUIRES* an OpenMP-compliant compiler gcc 4.0 has some support gcc 2.x, 3.x can NOT handle OpenMP! Generally, you’ll need a “professional” compiler suite Portland Group, Pathscale, Intel, Sun All you have to do is “turn on” the OpenMP support: There is also a library that is required, but the compiler generally knows to include this automatically % gcc -fopenmp % cc -xopenmp=parallel % pgcc -mp % icc -openmp Note that some compilers require ‘#pragma’ to appear in the first column of the line % xlc -qsmp=omp
OpenMP Basics How does OpenMP know how many threads to use? It builds in the capability to use any number of threads What happens at run-time?  It depends … Sometimes it’s a system-default It may be assigned to you by a queuing system Sometimes it can be overridden by: This can be a problem! It may not be efficient to run a small loop with lots of threads If there isn’t much work to do, the overhead of the parallel region may be greater than the cost of doing the work on 1 thread setenv OMP_NUM_THREADS 4 OMP_NUM_THREADS=4; export OMP_NUM_THREADS
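As a rough sketch of how the thread count can also be queried and set from inside the program (assuming an OpenMP-aware compiler and the standard omp.h header; the 4-thread request is just an illustration):

   #include <stdio.h>
   #include <omp.h>

   int main( void ) {
      /* request 4 threads for later parallel regions; this overrides
         the OMP_NUM_THREADS setting for this run */
      omp_set_num_threads( 4 );

      printf( "up to %i threads will be used\n", omp_get_max_threads() );

#pragma omp parallel
      printf( "hello from thread %i of %i\n",
              omp_get_thread_num(), omp_get_num_threads() );

      return( 0 );
   }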
OpenMP Basics, cont’d How does the compiler know where to end a parallel region? Fortran:  you use ‘c$omp end parallel’ C:  the compiler parallelizes the NEXT STATEMENT only Often, the next statement is a loop ... and the whole loop is parallelized You can always make the next line be an opening curly-brace ‘{‘ Then everything within the braces will be parallelized int main() { #pragma omp parallel      { printf("Hello World!"); printf("%i",omp_get_thread_num()); printf("Good-bye!");      }      return( 0 ); } These dangling braces can be confusing, use indentation and comments to improve readability
OpenMP Basics, cont’d Back to original code, output could be: Why isn’t the output consistent? CPUs are inherently asynchronous … there are operating system processes that pop up now and then that may slow down some CPUs ... the exact timing leads to different outputs This is called a “Race Condition” ... one of the most prevalent bugs in parallel programs Hello world! 0 1 2 3 Good-bye! Hello world! 01 23 Good-bye! Hello world! 2 3 0 1 Good-bye!
Loop Parallelism Loop-level parallelism is the most common approach to extracting parallel work from a computer program E.g. you probably have lots of loops in your program that iterate over very large arrays: Instead of 1 CPU handling all 1,000,000 entries, we use N threads which each handle (1,000,000/N) entries       do i = 1, 1000000          y(i) = alpha*x(i) + y(i)       enddo for(i=0;i<1000000;i++) {     y[i] = alpha*x[i] + y[i] } c$omp parallel do       do i = 1, 1000000          y(i) = alpha*x(i) + y(i)       enddo c$omp end parallel #pragma omp parallel for for(i=0;i<1000000;i++) {     y[i] = alpha*x[i] + y[i] }
Loop Parallelism, cont’d c$omp parallel do       do i = 1, 1000000          y(i) = alpha*x(i) + y(i)       enddo c$omp end parallel #pragma omp parallel for for(i=0;i<1000000;i++) {     y[i] = alpha*x[i] + y[i] } The compiler inserts code to compute the loop bounds for each thread For 10 threads, thread 0 works on i = 1 to 100,000; thread 1 works on 100,001 to 200,000, etc. Since this is a shared memory environment, every thread can “see” all data items in all arrays … so we don’t have to coordinate each thread’s loop bounds with its portion of each array This often makes it easier to program in OpenMP than MPI
Data Sharing Examples c$omp parallel do       do i = 1, 1000000 a(i) = b(i+1) + c(i+5)*d(i-3) enddo c$omp end parallel This is Ok since each thread is only READING different array locations and WRITEs do not conflict with each other (i.e. only one write to a(i) )
Data Sharing Examples, cont’d c$omp parallel do       do i = 1, 1000000          do j = 1, 1000000 a(i,j-1) = b(i+1,j) + c(i)*d(j-1) enddo enddo c$omp end parallel This is Ok since each thread is only READING different array locations and WRITEs do not conflict (i.e. only one thread writes to a(i,j-1) )
Data Sharing Examples, cont’d c$omp parallel do       do i = 1, 1000000 a(i) = 0.5*( a(i) + b(i-1) ) b(i) = 2.0*c(i)+d(i) enddo c$omp end parallel PROBLEM! -- How do we know if b(i-1) is being computed right now by another thread?
Data Sharing Examples, cont’d c$omp parallel do       do i = 1, 1000000 b(i) = 2.0*c(i) + d(i) enddo c$omp end parallel c$omp parallel do       do i = 1, 1000000 a(i) = 0.5*( a(i) + b(i-1) ) enddo c$omp end parallel This is now Ok
A More Complex Loop       do i = 1, 100          do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) enddo enddo for(i=0;i<100;i++) {    for(j=0;j<100;j++) {       x = i/100;       y = j/100;       d[i][j] = mandelbrot(x,y);    } } Now, there is a function call inside the inner-most loop
A More Complex Loop, cont’d c$omp parallel do       do i = 1, 100          do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) enddo enddo c$omp end parallel #pragma omp parallel for for(i=0;i<100;i++) {    for(j=0;j<100;j++) {       x = i/100;       y = j/100;       d[i][j] = mandelbrot(x,y);    } } Any problems?   (what is the value of ‘x’?)
A More Complex Loop, cont’d Recall that all variables are shared across all threads EXCEPT the loop-index/loop-bounds So in this example, x and y are written and over-written by ALL threads for EVERY iteration in the loop This example will compile, and even run, without any errors But it will produce very strange results  … since the value that a thread computes for x may be overwritten just before that thread calls the mandelbrot() function e.g. Thread-0 sets x=0.0, y=0.0 and prepares to call the function but just before the function call, Thread-1 writes x=0.5, y=0.0 so Thread-0 actually calls mandelbrot(0.5,0) instead of (0,0) and then, Thread-0 stores the result into d(1,0)
Private vs. Shared Variables Single-thread portion of the program “c$omp parallel” starts new parallel region i = 51 i = 1 2 Threads are created, each with a different loop-range x = 0.0 “x=(i-1)/100” uses private-i and writes to global-x y = 0.0 x = 0.5 y = 0.0 What values of (x,y) are used depends on the actual system timing … which can change each time the program is run - could change each iteration! mandel(x,y) i = 2 x = 0.01 mandel(x,y) i = 52 y = 0.0
Private vs. Shared Variables, cont’d OpenMP allows the programmer to specify if additional variables need to be Private to each Thread This is specified as a “clause” to the c$omp or #pragma line: c$omp parallel do private(j,x,y)       do i = 1, 100          do j = 1, 100             x = (i-1)/100             y = (j-1)/100             d(i,j) = mandelbrot(x,y)          enddo       enddo c$omp end parallel #pragma omp parallel for private(j,x,y) for(i=0;i<100;i++) {    for(j=0;j<100;j++) {       x = i/100;       y = j/100;       d[i][j] = mandelbrot(x,y);    } }
Private Variable Examples c$omp parallel do private(tmp1,tmp2)       do i = 1, 1000000 tmp1 = ( vm(i) - 75.0 )/alpha tmp2 = ( 33.0 - vm(i) )*beta a(i) = exp(tmp1)/(1.0-exp(tmp2)) enddo c$omp end parallel struct xyz point1,point2; double d; #pragma omp parallel for private(point1,d) for(i=0;i<1000000;i++) { point1.x = i+1; d = distance(point1,point2);    if( d > 100.0 ) { printf("Error: distance too great");    } }
Mixing Shared and Private Variables What if we want to accumulate the total count from all of the mandelbrot() calculations?       total = 0 c$omp parallel do private(j,x,y)       do i = 1, 100          do j = 1, 100             x = (i-1)/100             y = (j-1)/100             d(i,j) = mandelbrot(x,y)             total = total + d(i,j)          enddo       enddo c$omp end parallel total = 0; #pragma omp parallel for private(j,x,y) for(i=0;i<100;i++) {    for(j=0;j<100;j++) {       x = i/100;       y = j/100;       d[i][j] = mandelbrot(x,y);       total += d[i][j];    } } We want ‘total’ to be shared (not private) … which is the default
Synchronization for Shared Variables This is really several discrete steps: Read ‘total’ from global/shared memory into a CPU-register Read ‘d(i,j)’ into a CPU-register Add the two registers together Store the sum back to the ‘total’ global/shared memory location There is a possible race condition:  another Thread could have incremented ‘total’ just after this Thread read the value So our new ‘total’ will be incorrect … but will overwrite any previous (correct) value
Synchronization … Critical Sections OpenMP allows user-defined Critical Sections Only ONE THREAD at a time is allowed in a Critical Section Note that Critical Sections imply SERIAL computation PARALLEL performance will likely be reduced!       total = 0 c$omp parallel do private(j,x,y)       do i = 1, 100          do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) c$omp critical             total = total + d(i,j) c$omp end critical enddo enddo c$omp end parallel total = 0; #pragma omp parallel for private(j,x,y) for(i=0;i<100;i++) {    for(j=0;j<100;j++) {       x = i/100;       y = j/100;       d[i][j] = mandelbrot(x,y); #pragma omp critical       total += d[i][j];    } }
“Named” Critical Regions What if there are different regions, each of which needs a single-Thread restriction, but multiple such regions could execute simultaneously? E.g. compute min and max of a list of items Updating of global-min must be single-Thread-only Updating of global-max must be single-Thread-only But one Thread can update min while another updates max “Named” Critical Sections You can include a (Programmer-defined) name for each Critical Section c$omp critical (MINLOCK) c$omp critical (MAXLOCK)
Named Critical Region Example c$omp parallel do       do i = 1,1000000 c$omp critical (MINLOCK)          if( a(i) < cur_min ) then cur_min = a(i) endif c$omp end critical c$omp critical (MAXLOCK)          if( a(i) > cur_max ) then cur_max = a(i) endif c$omp end critical enddo c$omp end parallel Performance?
Named Critical Region Example, cont’d c$omp parallel do       do i = 1,1000000          if( a(i) < cur_min ) then c$omp critical (MINLOCK)             cur_min = a(i) c$omp end critical          endif          if( a(i) > cur_max ) then c$omp critical (MAXLOCK)             cur_max = a(i) c$omp end critical          endif       enddo c$omp end parallel Is this all we need?
Named Critical Region Example, cont’d c$omp parallel do       do i = 1,1000000          if( a(i) < cur_min ) then c$omp critical (MINLOCK)             if( a(i) < cur_min ) then cur_min = a(i) endif c$omp end critical endif          if( a(i) > cur_max ) then c$omp critical (MAXLOCK)             if( a(i) > cur_max ) then cur_max = a(i) endif c$omp end critical endif enddo c$omp end parallel Why bother?
“Very Small” Critical Sections In some cases, we may use critical sections to provide locks on small increments to a global counter, or other “small” computation We don’t really care what the value is, we just want to increment it What we really want is an ‘atomic’ update of that global variable This uses unique hardware features on the CPU, so it is very fast But there is a limit to what can be done #pragma omp parallel for     for(i=0;i<1000000;i++) { #pragma omp atomic       sum += a[i];     } c$omp parallel do       do i = 1, 1000000 c$omp atomic          sum = sum + a(i) enddo c$omp end parallel
Atomic Operations The OpenMP compiler may or may not make use of the hardware features in the CPU (if you even have them), but it will guarantee the functionality of the atomic operations Fortran: + - * / .AND. .OR. etc. MIN() MAX() IAND() IOR() etc. C/C++: + - * / & ^ | << >> ++ -- += -= *=
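A minimal C sketch of an atomic update (the array, threshold, and counter names here are made up for illustration):

   #include <stdio.h>

   double a[1000000];          /* assume this gets filled in elsewhere */
   double threshold = 0.5;

   void count_hits( void ) {
      int i;
      int hits = 0;            /* shared counter */

#pragma omp parallel for
      for( i=0; i<1000000; i++ ) {
         if( a[i] > threshold ) {
#pragma omp atomic
            hits++;            /* atomic increment, cheaper than a full critical section */
         }
      }

      printf( "hits = %i\n", hits );
   }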
Synchronization … Reduction Operations Where Critical Sections can be any arbitrary piece of code, OpenMP also allows for optimized routines for certain standard operations Accumulate across all Threads, Min across all Threads, etc. Called a “Reduction” Operation       total = 0 c$omp parallel do private(j,x,y) c$omp+ reduction(+:total)       do i = 1, 100          do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y)             total = total + d(i,j) enddo enddo c$omp end parallel total = 0; #pragma omp parallel for private(j,x,y) reduction(+:total) for(i=0;i<100;i++) { for(j=0;j<100;j++) { x = i/100; y = j/100; d[i][j] = mandelbrot(x,y);       total += d[i][j];    } }
Reduction Operations Fortran: + - * .AND.  .OR.  .EQV.  .NEQV. MIN  MAX IAND  IOR  IEOR C: + - * &&  || &  |  ^ Suitable initial values are chosen for the reduction variable E.g. ‘reduction(+:total)’ initializes ‘total=0’ in each Thread
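For example, a dot product with a ‘+’ reduction might look like this (a sketch; the function and array names are placeholders):

   double dot_product( double *x, double *y, int n ) {
      int i;
      double dot = 0.0;

#pragma omp parallel for reduction(+:dot)
      for( i=0; i<n; i++ ) {
         /* each thread accumulates into its own private copy of 'dot'
            (initialized to 0); the copies are summed at the end of the loop */
         dot += x[i]*y[i];
      }

      return( dot );
   }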
Initializing Private Variables What if we want each Thread’s private variable to start off at a specific value? Reduction operations automatically initialize the values to something meaningful (0 or 1) Other major option is to copy the global value into each Thread’s private variable Called “firstprivate(…)” clause
Initializing Private Variables, cont’d var1 = 0 c$omp parallel do private(x) c$omp+ firstprivate(var1)       do i = 1, 100 x = (i-1)/100 var1 = var1 + f(x) enddo       print *,’partial sum’,var1 c$omp end parallel var1 = 0; #pragma omp parallel for private(x) firstprivate(var1) {   for(i=0;i<100;i++) {      x = i/100;      var1 += f(x);   }   printf("partial sum %i",var1); } At the end of the loop, (global) var1 is UNCHANGED! var1 == 0
Private vs. Firstprivate vs. Copyin vs …
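As a rough sketch of the distinction (using a made-up variable): ‘private’ gives each Thread an uninitialized copy, ‘firstprivate’ copies the master’s value into each Thread, and ‘copyin’ does the same for ‘threadprivate’ globals:

   void clause_demo( void ) {
      int offset = 10;

#pragma omp parallel private(offset)
      {
         /* each thread gets a fresh, UNINITIALIZED copy of 'offset' */
      }

#pragma omp parallel firstprivate(offset)
      {
         /* each thread's copy of 'offset' starts at 10, copied from the master */
      }
   }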
OpenMP and Function Calls       total = 0 c$omp parallel do private(j,x,y)       do i = 1, 100          do j = 1, 100             x = (i-1)/100             y = (j-1)/100             d(i,j) = mandelbrot(x,y) c$omp critical             total = total + d(i,j) c$omp end critical          enddo       enddo c$omp end parallel total = 0; #pragma omp parallel for private(j,x,y) for(i=0;i<100;i++) {    for(j=0;j<100;j++) {       x = i/100;       y = j/100;       d[i][j] = mandelbrot(x,y); #pragma omp critical       total += d[i][j];    } } What happens inside the mandelbrot() function? Are variables defined inside mandelbrot() private or shared?
Function Calls, cont’d In C and Fortran, variables defined inside a function are placed on the stack, NOT in a “global” memory area Each Thread has its own stack … so each Thread will have its own set of function-variables on its own stack I.e. variables defined in a function are PRIVATE       integer function mandelbrot(x,y)       real x,y       real temp1,temp2       integer it_ctr c     ** do real work **       mandelbrot = it_ctr       end function int mandelbrot( float x,float y ) {     float temp1, temp2;     int it_ctr;     /* do real work */     return( it_ctr ); } Private
Function Calls, cont’d In C and Fortran, variables defined inside a function are placed on the stack, NOT in a “global” memory area Each Thread has its own stack … so each Thread will have its own set of function-variables on its own stack I.e. variables defined in a function are PRIVATE Private Shared?       integer function mandelbrot(x,y)       real x,y       real temp1,temp2       integer it_ctr c     ** do real work **       mandelbrot = it_ctr       end function int mandelbrot( float x,float y ) {     float temp1, temp2;     int it_ctr;     /* do real work */     return( it_ctr ); }
OpenMP and Function Calls, Take-II HOWEVER, if you have a parallel section that is defined wholly inside a function, then the “usual” rules apply E.g. Master thread, running by itself, calls a function which breaks into a parallel region … then all function-local variables CAN be seen (shared) by all Threads       integer function mandelbrot(x,y)       real x,y       real temp1,temp2       integer it_ctr c$omp parallel do       do i = 1,N          . . . enddo mandelbrot = it_ctr       end function int mandelbrot( float x,float y ) {     float temp1, temp2;     int it_ctr; #pragma omp parallel for     for(i=0;i<N;i++) {        . . .     }     return( it_ctr ); } Inside the loop temp1, temp2, it_ctr are SHARED
Parallel Sections Loop-level parallelism tends to imply a Domain Decomposition or Iterative Parallelism approach to splitting a problem up Parallel Sections can be used if there are fundamentally different algorithms that do not need to interact One Thread needs to act as Server while the others act as Clients One Thread handles all I/O while the others “do the work” Imagine 2 loops which compute the sum and sum-squared of a set of data You could split up each loop with its own ‘c$omp parallel do’ Or you could split the whole program up so that one thread computes the sum and another thread computes sum-squared What if we have more than 2 CPUs?
Parallel Sections, cont’d c$omp parallel do reduction(+:val)       do i = 1,num         val = val + data(i)       enddo c$omp parallel do reduction(+:val2)       do i = 1,num         val2 = val2 + data(i)*data(i)       enddo c$omp parallel sections c$omp section       do i = 1,num         val = val + data(i)       enddo c$omp section       do i = 1,num         val2 = val2 + data(i)*data(i)       enddo c$omp end parallel sections
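The C form uses ‘#pragma omp sections’; a sketch (the function and variable names are just for illustration):

   void sums( double *data, int num, double *sum, double *sumsq ) {
      int i, j;
      double val = 0.0, val2 = 0.0;

#pragma omp parallel sections private(i,j)
      {
#pragma omp section
         for( i=0; i<num; i++ ) {
            val += data[i];              /* one thread computes the sum */
         }

#pragma omp section
         for( j=0; j<num; j++ ) {
            val2 += data[j]*data[j];     /* another thread computes sum-squared */
         }
      }

      *sum   = val;
      *sumsq = val2;
   }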
A Few More Tidbits omp_get_wtime() Returns the wall-clock time (in seconds, as a double-precision value) omp_get_wtick() returns the time-resolution of wtime Note that some other system timers return “User Time”, which is the CPU time summed across all threads (roughly the wall-clock time multiplied by the number of busy threads) Can be confusing if you’re looking for speed-up numbers
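A minimal timing sketch (assuming an x/y/alpha loop like the earlier examples):

   #include <stdio.h>
   #include <omp.h>

   void timed_saxpy( double alpha, double *x, double *y, int n ) {
      int i;
      double t0, t1;

      t0 = omp_get_wtime();

#pragma omp parallel for
      for( i=0; i<n; i++ ) {
         y[i] = alpha*x[i] + y[i];
      }

      t1 = omp_get_wtime();
      printf( "elapsed wall-clock time %g sec (timer resolution %g sec)\n",
              t1-t0, omp_get_wtick() );
   }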
“DIY” Critical Sections You can also create your own “locks” (critical regions) Useful if the number of locks needs to be determined at run-time (recall that CRITICAL sections are defined at compile-time) omp_init_lock( &lock ) omp_set_lock( &lock ) omp_test_lock( &lock ) omp_unset_lock( &lock ) omp_destroy_lock( &lock )
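A minimal sketch of the lock routines (the lock variable must be of type omp_lock_t; the counter here is just an illustration):

   #include <omp.h>

   void locked_updates( int *counter, int n ) {
      int i;
      omp_lock_t lock;

      omp_init_lock( &lock );           /* create the lock (once) */

#pragma omp parallel for
      for( i=0; i<n; i++ ) {
         omp_set_lock( &lock );         /* blocks until the lock is free */
         (*counter)++;                  /* the "critical section" body */
         omp_unset_lock( &lock );       /* release the lock */
      }

      omp_destroy_lock( &lock );        /* clean up */
   }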
The FLUSH Directive Modern CPU architecture doesn’t entirely “play nice” with shared memory programming Cache memory is used to store recently used data items What happens if a cached memory item is needed by another Thread? I.e. the data-item currently resides in cache on CPU#1 and is not visible to CPU#2, how do we force CPU#1 to write the data to main memory? #pragma omp flush (var1,var2,var3) c$omp flush
FLUSH Directive, cont’d while( (volatile)(GlobalFlag) == 0 ); . . . /* on some other thread */ GlobalFlag = 1; Without ‘volatile’ compiler may remove the loop First time through loop, GlobalFlag is read  	into cache . . . and may never be re-read
FLUSH Directive, cont’d   while( (volatile)(GlobalFlag) == 0 ) { #pragma omp flush (GlobalFlag)   }   . . .   /* on some other thread */ GlobalFlag = 1; #pragma omp flush (GlobalFlag) Make sure we read new value Make sure we write new value Fortran: may need to tag GlobalFlag as ‘static’ or ‘save’
Basic Tips OpenMP is built for large-loop parallelism with little synchronization If your loops require tight synchronization, you may have a hard time programming it in OpenMP It is sometimes easier to have a single (outer) OpenMP parallel-for loop which makes function calls to a “worker” routine  This kind of loop can be pipelined for(i=0;i<N;i++) {    a[i] = 0.5*( a[i-1] + b[i] ); } c$omp parallel do private(x)       do i=1,N          x = i*0.1          call DoWork( x )       enddo
One “Big” Parallel Region vs. Several Smaller Ones c$omp parallel do       do i = 1, N          y(i) = alpha*x(i) + y(i)       enddo c$omp end parallel       . . . c$omp parallel do       do i = 1, N          z(i) = beta*y(i) + z(i)       enddo c$omp end parallel #pragma omp parallel for for(i=0;i<N;i++) {     y[i] = alpha*x[i] + y[i] } . . . #pragma omp parallel for for(i=0;i<N;i++) {     z[i] = beta*y[i] + z[i]; } Each parallel region has an implied barrier (high cost)
One “Big” Parallel Region, cont’d c$omp parallel  c$omp do       do i = 1, N          y(i) = alpha*x(i) + y(i)       enddo c$omp end do       . . . c$omp do       do i = 1, N          z(i) = beta*y(i) + z(i)       enddo c$omp end do c$omp end parallel #pragma omp parallel { #pragma omp for   for(i=0;i<N;i++) {     y[i] = alpha*x[i] + y[i]   } . . . #pragma omp for   for(i=0;i<N;i++) {     z[i] = beta*y[i] + z[i];   } }  /* end of parallel region */ This is still inside the parallel region
Single-CPU Region within a Parallel Region Master -- only 1 Thread (and only the Master Thread) executes the code Single -- only 1 Thread (chosen at random) executes the code c$omp parallel  c$omp do       /* parallel */ c$omp end do c$omp master       /* single-cpu */ c$omp end master c$omp end parallel #pragma omp parallel { #pragma omp for     /* parallel */ #pragma omp single   {     /* single-cpu */   } }  /* end of parallel region */
One “Big” Parallel Region, cont’d One (larger) parallel region has only one implied barrier Should be lower “cost” ... hence, better performance The “interior” regions will be executed in parallel which may result in redundant computations If there is too much redundant computation, it could slow down the code Use ‘MASTER’ or ‘SINGLE’ regions if needed Note that a small amount of redundant computation is often cheaper than breaking out to a single-CPU region and then re-starting a parallel region
Performance Issue with Small Loops Each OpenMP parallel region (loop or section) has an overhead associated with it Non-zero time to start up N other Threads Non-zero time to synchronize at the end of the region If you have a very fast CPU and a very small loop, these overheads may be greater than the cost (time) to just do the loop on 1 CPU For small loops, we can tell OpenMP to skip the parallelism by using an ‘if’ clause i.e. only run the loop in parallel *IF* a certain condition holds
Small Loops, cont’d c$omp parallel do if(N>1000)       do i = 1, N y(i) = alpha*x(i) + y(i) enddo c$omp end parallel #pragma omp parallel for if(N>1000) for(i=0;i<N;i++) {     y[i] = alpha*x[i] + y[i] } Where you should put this cut-off is entirely machine-dependent
Performance of Critical Sections Critical Sections can cause huge slow-downs in parallel code Recall, a Critical Section is serial (non-parallel) work, so it is not surprising that it would impact parallel performance Many times, you really want a ‘Reduction’ operation ‘Reduction’ will be much faster than a critical section For “small” Critical Sections, use ‘Atomic’ updates But check the compiler docs first … The Sun compilers turn ‘atomic’ blocks into comparable ‘critical’ sections
Alternatives to Critical Sections       integer temp(MAX_NUM_THREADS), sum       integer self c     assume temp(:) = 0 to start c$omp parallel private(self)       self = omp_get_thread_num() + 1 c$omp do       do i = 1, N          . . . temp(self) = temp(self) + val enddo c$omp end do c$omp end parallel       sum = 0       do i = 1, MAX_NUM_THREADS          sum = sum + temp(i) enddo (note the +1: omp_get_thread_num() is zero-based, but the temp array is 1-based)
“Dangerous” Performance Enhancement You can tell OpenMP to not insert barriers at the end of parallel regions This is “dangerous,” since with no barrier at the end, the Master Thread may begin running code that ASSUMES certain variables are stable c$omp parallel do nowait       do i = 1, N y(i) = alpha*x(i) + y(i) enddo c$omp end parallel       print *, ‘y(100)=‘, y(100) With no barrier at the end of the parallel region, we cannot guarantee that y(100) has been computed yet
Race Conditions A “Race” is a situation where multiple Threads are making progress toward some shared or common goal where timing is critical Generally, this means you’ve ASSUMED something is synchronized, but you haven’t FORCED it to be synchronized E.g. all Threads are trying to find a “hit” in a database When a hit is detected, the Thread increments a global counter Normally, hits are rare events and so incrementing the counter is “not a big deal” But if two hits occur simultaneously, then we have a race condition While one Thread is incrementing the counter, the other Thread may be reading it It may read the old value or new value, depending on the exact timing
Race Conditions, cont’d Some symptoms: Running the code with 4 Threads works fine, running with 5 causes erroneous output Running the code with 4 Threads works fine, running it with 4 Threads again produces different results Sometimes the code works fine, sometimes it crashes after 4 hours, sometimes it crashes after 10 hours, … Program sometimes runs fine, sometimes hits an infinite loop Program output is sometimes jumbled (the usual line-3 appears before line-1) Any kind of indeterminacy in the running of the code, or seemingly random changes to its output, could indicate a race condition
Race Conditions, cont’d Race Conditions can be EXTREMELY hard to debug Generally, race conditions don’t cause crashes or even logic-errors E.g. a race condition leads two Threads to both think they are in charge of data-item-X … nothing crashes, they just keep writing and overwriting X (maybe even doing duplicate computations) Often, race conditions don’t cause crashes at the time they actually occur … the crash occurs much later in the execution and for a totally unrelated reason E.g. a race condition leads one Thread to think that the solver converged but the other Thread thinks we need another iteration … crash occurs because one Thread tried to compute the global residual Sometimes race conditions can lead to deadlock Again, this often shows up as seemingly random deadlock when the same code is run on the same input
Deadlock A “Deadlock” condition occurs when all Threads are stopped at synchronization points and none of them are able to make progress Sometimes this can be something simple:   Thread-1 enters a Critical-Section and waits for Variable-X to be incremented, and then it will release the lock Thread-2 wants to increment X but needs to enter the Critical-Section in order to do that Sometimes it can be more complex or cyclic: Thread-1 enters Crit-Sec-1 and needs to enter Crit-Sec-2 Thread-2 enters Crit-Sec-2 and needs to enter Crit-Sec-3 Thread-3 enters Crit-Sec-3 and …
Starvation “Starvation” occurs when one Thread needs a resource (Critical Section) that is never released by another Thread Thread-1 grabs a lock/enters a critical region in order to write data to a file Every time Thread-1 is about to release the lock, it finds that it has another item to write ... it keeps the lock/re-enters the critical region Thread-2 continues to wait for the lock/critical region but never gets it Sometimes it may show up as “almost Deadlock”, other times it may show up as a performance degradation E.g. when Thread-1 is done writing all of its data, it finally gives the lock over to Thread-2 ... output is correct but performance is terrible
Deadlock, cont’d Symptoms: Program hangs  (!) Debugging deadlock is often straightforward … unless there is also a race condition involved Simple approach:  put print statement in front of all Critical Section and Lock calls Thread-1 acquired lock on region-A Thread-1 acquired lock on region-C Thread-2 acquired lock on region-B Thread-1 release lock on region-C Thread-3 acquired lock on region-C Thread-1 waiting for lock on region-B Thread-2 waiting for lock on region-A
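A sketch of that approach for run-time locks (the helper name and region label are made up; omp_set_lock() calls can be wrapped this way, while ‘critical’ directives would need the prints added around them by hand):

   #include <stdio.h>
   #include <omp.h>

   /* hypothetical helper: every lock acquire/release gets a print */
   void work_on_region_A( omp_lock_t *lockA ) {
      printf( "Thread-%i waiting for lock on region-A\n", omp_get_thread_num() );
      omp_set_lock( lockA );
      printf( "Thread-%i acquired lock on region-A\n", omp_get_thread_num() );

      /* ... work on region-A ... */

      omp_unset_lock( lockA );
      printf( "Thread-%i released lock on region-A\n", omp_get_thread_num() );
   }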
Static Scheduling for Fixed Work-loads OpenMP allows dynamic scheduling of parallel loops If one CPU gets loaded down (by OS tasks), other CPUs will pick up additional iterations in the loop to offset that performance-loss Generally, this works well -- you don’t have to worry about all the random things that might happen, OpenMP does the “smart” thing However, this dynamic scheduling implies some additional “cost” If nothing else, the CPUs must check if other CPUs are loaded If your work-load inside a parallel region is fixed, then you may want to try a “static” work-assignment E.g. the loop is exactly 1,000,000 iterations, so the CPUs should take 100,000 each
Static Scheduling, cont’d c$omp parallel do schedule(static)       do i = 1, 1000000          y(i) = alpha*x(i) + y(i)       enddo c$omp end parallel #pragma omp parallel for schedule(static) for(i=0;i<1000000;i++) { y[i] = alpha*x[i] + y[i] } If the per-iteration work-load is variable,   or you are sharing the CPUs with many other users, then a dynamic scheduling algorithm (the default on some systems) may be better
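For comparison, a dynamic schedule with an explicit chunk size might look like this (the chunk size of 1000 and the work() function are just illustrations):

   double work( double v );    /* hypothetical per-element kernel, defined elsewhere */

   void variable_work( double *x, double *y ) {
      int i;

#pragma omp parallel for schedule(dynamic,1000)
      for( i=0; i<1000000; i++ ) {
         /* iterations are handed out 1000 at a time; threads that finish
            early grab the next available chunk */
         y[i] = work( x[i] );
      }
   }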
Moving Towards Pthreads Pthreads (POSIX Threads) is the other major shared-memory environment It is a little bit like the OpenMP ‘Section’ Directive E.g. you use pthread_create() to create a new Thread and assign it a “work-function” (a section of code to execute) You can use the OpenMP Library Functions to create your own splitting of the loops: self_thread = omp_get_thread_num() num_thr = omp_get_num_threads() num_proc = omp_get_num_procs() You may be able to avoid some of the overheads with multiple parallel regions by “forcing” a consistent splitting onto your data and loops
Alternative Approach to OpenMP datasize = 1000000 c$omp parallel private(i,self_thr,num_thr,istart,ifinish) self_thr = omp_get_thread_num() num_thr = omp_get_num_threads() istart = self_thr*(datasize/num_thr) + 1 ifinish = istart + (datasize/num_thr) - 1       do i = istart, ifinish y(i) = alpha*x(i) + y(i) enddo c$omp end parallel (note that omp_get_thread_num() returns 0 .. num_thr-1)
Alternative Approach to OpenMP, cont’d datasize = 1000000 #pragma omp parallel private(i,self_thr,num_thr,istart,ifinish)       { self_thr = omp_get_thread_num(); num_thr = omp_get_num_threads(); istart = self_thr*(datasize/num_thr); ifinish = istart + (datasize/num_thr); for(i=istart;i<ifinish;i++) { y[i] = alpha*x[i] + y[i];         }       }
For More Info “Parallel Programming in OpenMP” Chandra, Dagum, et al. http://openmp.org Good list of links to other presentations and tutorials Many compiler vendors have good end-user documentation IBM Redbooks, tech-notes, “porting” guides http://docs.sun.com/source/819-3694/index.html
Where to go from here? http://wiki.duke.edu/display/scsc/  . . . scsc at duke.edu Get on our mailing lists:

Contenu connexe

Tendances

Presentation on Shared Memory Parallel Programming
Presentation on Shared Memory Parallel ProgrammingPresentation on Shared Memory Parallel Programming
Presentation on Shared Memory Parallel ProgrammingVengada Karthik Rangaraju
 
Introduction to OpenMP (Performance)
Introduction to OpenMP (Performance)Introduction to OpenMP (Performance)
Introduction to OpenMP (Performance)Akhila Prabhakaran
 
Programming using Open Mp
Programming using Open MpProgramming using Open Mp
Programming using Open MpAnshul Sharma
 
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...Takuo Watanabe
 
A Language Support for Exhaustive Fault-Injection in Message-Passing System M...
A Language Support for Exhaustive Fault-Injection in Message-Passing System M...A Language Support for Exhaustive Fault-Injection in Message-Passing System M...
A Language Support for Exhaustive Fault-Injection in Message-Passing System M...Takuo Watanabe
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Chris Fregly
 
.Net Multithreading and Parallelization
.Net Multithreading and Parallelization.Net Multithreading and Parallelization
.Net Multithreading and ParallelizationDmitri Nesteruk
 
Introduction to Chainer: A Flexible Framework for Deep Learning
Introduction to Chainer: A Flexible Framework for Deep LearningIntroduction to Chainer: A Flexible Framework for Deep Learning
Introduction to Chainer: A Flexible Framework for Deep LearningSeiya Tokui
 
MPI in TNT for parallel processing
MPI in TNT for parallel processingMPI in TNT for parallel processing
MPI in TNT for parallel processingMartín Morales
 
Python your new best friend
Python your new best friendPython your new best friend
Python your new best friendHoang Nguyen
 

Tendances (20)

openmp
openmpopenmp
openmp
 
MPI n OpenMP
MPI n OpenMPMPI n OpenMP
MPI n OpenMP
 
Presentation on Shared Memory Parallel Programming
Presentation on Shared Memory Parallel ProgrammingPresentation on Shared Memory Parallel Programming
Presentation on Shared Memory Parallel Programming
 
Open mp directives
Open mp directivesOpen mp directives
Open mp directives
 
Open mp
Open mpOpen mp
Open mp
 
Introduction to OpenMP
Introduction to OpenMPIntroduction to OpenMP
Introduction to OpenMP
 
Introduction to OpenMP (Performance)
Introduction to OpenMP (Performance)Introduction to OpenMP (Performance)
Introduction to OpenMP (Performance)
 
OpenMp
OpenMpOpenMp
OpenMp
 
Programming using Open Mp
Programming using Open MpProgramming using Open Mp
Programming using Open Mp
 
Openmp
OpenmpOpenmp
Openmp
 
Introduction to MPI
Introduction to MPIIntroduction to MPI
Introduction to MPI
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
Towards an Integration of the Actor Model in an FRP Language for Small-Scale ...
 
A Language Support for Exhaustive Fault-Injection in Message-Passing System M...
A Language Support for Exhaustive Fault-Injection in Message-Passing System M...A Language Support for Exhaustive Fault-Injection in Message-Passing System M...
A Language Support for Exhaustive Fault-Injection in Message-Passing System M...
 
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
Swift for TensorFlow - Tanmay Bakshi - Advanced Spark and TensorFlow Meetup -...
 
Lecture8
Lecture8Lecture8
Lecture8
 
.Net Multithreading and Parallelization
.Net Multithreading and Parallelization.Net Multithreading and Parallelization
.Net Multithreading and Parallelization
 
Introduction to Chainer: A Flexible Framework for Deep Learning
Introduction to Chainer: A Flexible Framework for Deep LearningIntroduction to Chainer: A Flexible Framework for Deep Learning
Introduction to Chainer: A Flexible Framework for Deep Learning
 
MPI in TNT for parallel processing
MPI in TNT for parallel processingMPI in TNT for parallel processing
MPI in TNT for parallel processing
 
Python your new best friend
Python your new best friendPython your new best friend
Python your new best friend
 

En vedette

Biref Introduction to OpenMP
Biref Introduction to OpenMPBiref Introduction to OpenMP
Biref Introduction to OpenMPJerryHe
 
Intel® MPI Library e OpenMP* - Intel Software Conference 2013
Intel® MPI Library e OpenMP* - Intel Software Conference 2013Intel® MPI Library e OpenMP* - Intel Software Conference 2013
Intel® MPI Library e OpenMP* - Intel Software Conference 2013Intel Software Brasil
 
Cloud Services On UI and Ideas for Federated Cloud on idREN
Cloud Services On UI and Ideas for Federated Cloud on idRENCloud Services On UI and Ideas for Federated Cloud on idREN
Cloud Services On UI and Ideas for Federated Cloud on idRENTonny Adhi Sabastian
 
Tutorial on Parallel Computing and Message Passing Model - C2
Tutorial on Parallel Computing and Message Passing Model - C2Tutorial on Parallel Computing and Message Passing Model - C2
Tutorial on Parallel Computing and Message Passing Model - C2Marcirio Chaves
 
Race conditions
Race conditionsRace conditions
Race conditionsMohd Arif
 
Parallel architecture-programming
Parallel architecture-programmingParallel architecture-programming
Parallel architecture-programmingShaveta Banda
 
MPI Presentation
MPI PresentationMPI Presentation
MPI PresentationTayfun Sen
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel ProgrammingUday Sharma
 
NUMA optimized Parallel Breadth first Search on Multicore Single node System
NUMA optimized Parallel Breadth first Search on Multicore Single node SystemNUMA optimized Parallel Breadth first Search on Multicore Single node System
NUMA optimized Parallel Breadth first Search on Multicore Single node SystemMohammad Tahsin Alshalabi
 
HowTo Install openMPI on Ubuntu
HowTo Install openMPI on UbuntuHowTo Install openMPI on Ubuntu
HowTo Install openMPI on UbuntuA Jorge Garcia
 

En vedette (17)

Biref Introduction to OpenMP
Biref Introduction to OpenMPBiref Introduction to OpenMP
Biref Introduction to OpenMP
 
Intel® MPI Library e OpenMP* - Intel Software Conference 2013
Intel® MPI Library e OpenMP* - Intel Software Conference 2013Intel® MPI Library e OpenMP* - Intel Software Conference 2013
Intel® MPI Library e OpenMP* - Intel Software Conference 2013
 
OpenMP
OpenMPOpenMP
OpenMP
 
Lecture9
Lecture9Lecture9
Lecture9
 
Cloud Services On UI and Ideas for Federated Cloud on idREN
Cloud Services On UI and Ideas for Federated Cloud on idRENCloud Services On UI and Ideas for Federated Cloud on idREN
Cloud Services On UI and Ideas for Federated Cloud on idREN
 
Chpt7
Chpt7Chpt7
Chpt7
 
presentation
presentationpresentation
presentation
 
Tutorial on Parallel Computing and Message Passing Model - C2
Tutorial on Parallel Computing and Message Passing Model - C2Tutorial on Parallel Computing and Message Passing Model - C2
Tutorial on Parallel Computing and Message Passing Model - C2
 
Race conditions
Race conditionsRace conditions
Race conditions
 
Parallel architecture-programming
Parallel architecture-programmingParallel architecture-programming
Parallel architecture-programming
 
Openmp combined
Openmp combinedOpenmp combined
Openmp combined
 
Wolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat DresdenWolfgang Lehner Technische Universitat Dresden
Wolfgang Lehner Technische Universitat Dresden
 
MPI Presentation
MPI PresentationMPI Presentation
MPI Presentation
 
Parallel Programming
Parallel ProgrammingParallel Programming
Parallel Programming
 
Bfs dfs
Bfs dfsBfs dfs
Bfs dfs
 
NUMA optimized Parallel Breadth first Search on Multicore Single node System
NUMA optimized Parallel Breadth first Search on Multicore Single node SystemNUMA optimized Parallel Breadth first Search on Multicore Single node System
NUMA optimized Parallel Breadth first Search on Multicore Single node System
 
HowTo Install openMPI on Ubuntu
HowTo Install openMPI on UbuntuHowTo Install openMPI on Ubuntu
HowTo Install openMPI on Ubuntu
 

Similaire à Intro to OpenMP

Similaire à Intro to OpenMP (20)

openmp.ppt
openmp.pptopenmp.ppt
openmp.ppt
 
openmp.ppt
openmp.pptopenmp.ppt
openmp.ppt
 
Loops in Python.pptx
Loops in Python.pptxLoops in Python.pptx
Loops in Python.pptx
 
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users GroupBest Practices for OpenMP on GPUs - OpenMP UK Users Group
Best Practices for OpenMP on GPUs - OpenMP UK Users Group
 
Matlab-3.pptx
Matlab-3.pptxMatlab-3.pptx
Matlab-3.pptx
 
C Tutorials
C TutorialsC Tutorials
C Tutorials
 
Matlab ppt
Matlab pptMatlab ppt
Matlab ppt
 
ppt7
ppt7ppt7
ppt7
 
ppt2
ppt2ppt2
ppt2
 
name name2 n
name name2 nname name2 n
name name2 n
 
test ppt
test ppttest ppt
test ppt
 
name name2 n
name name2 nname name2 n
name name2 n
 
ppt21
ppt21ppt21
ppt21
 
name name2 n
name name2 nname name2 n
name name2 n
 
ppt17
ppt17ppt17
ppt17
 
ppt30
ppt30ppt30
ppt30
 
name name2 n2.ppt
name name2 n2.pptname name2 n2.ppt
name name2 n2.ppt
 
ppt18
ppt18ppt18
ppt18
 
Ruby for Perl Programmers
Ruby for Perl ProgrammersRuby for Perl Programmers
Ruby for Perl Programmers
 
ppt9
ppt9ppt9
ppt9
 

Dernier

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Dernier (20)

Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Intro to OpenMP

  • 1. Introduction toOpenMP Scalable Computing Support Center Duke University http://wiki.duke.edu/display/SCSC scsc at duke.edu John Pormann, Ph.D. jbp1 at duke edu
  • 2. What is OpenMP? OpenMP is an industry-standard combination of compiler directives and library routines, for shared-memory computers, that allow programmers to specify parallelism in their code without getting too caught up in the details of parallel programming OpenMP is a response to increasing hardware-parallelism in most (all) HPC platforms Sort of a “bolt-on” to the language, not a new language unto itself In Fortran, it hides many parallel constructs within comment lines (so your code still compiles/runs just fine on 1 CPU) In C, it uses “pragmas” (older compilers will just ignore the parallel constructs)
  • 3. Basic SMP Architecture CPU CPU CPU CPU Memory Bus Main Memory
  • 4. “Hello World” Program in OpenMP Hello world! 0 1 2 3 Good-bye! program hello print *, ’Hello world!’ c$omp parallel print *, omp_get_thread_num() c$omp end parallel print *, ’Good-bye!’ end Hello world! 2 3 0 1 Good-bye! Possible Outputs: int main() { printf(“Hello World!”); #pragmaomp parallel printf(“%i”,omp_get_thread_num()); printf(“Good-bye!”); return( 0 ); } Hello world! 01 23 Good-bye!
  • 5. Operational Model for OpenMP Master Thread performs the serial work print *, ’Hello world!’ Master Thread encounters ‘parallel region’ - creates N-1 additional Threads c$omp parallel All Threads (including Master) perform the parallel work print *, omp_get_thread_num() Master Thread waits for all Threads to finish - implied Barrier at end of every parallel region c$omp end parallel print *, ’Good-bye!’
  • 6. Operational Model for OpenMP Master Thread performs the serial work print *, ’Hello world!’ Master Thread encounters ‘parallel region’ - creates N-1 additional Threads c$omp parallel All Threads (including Master) perform the parallel work print *, omp_get_thread_num() Master Thread waits for all Threads to finish - implied Barrier at end of every parallel region c$omp end parallel print *, ’Good-bye!’ Every Parallel Region has a “cost” (i.e. can slow down your code)
  • 7. What is a Thread? A “Thread” is an operating system construct that contains all of the CPU-state associated with a piece of the parallel program A set of floating-point and integer registers, a set of memory pointers, a stack pointer, a program counter, etc. Sort of a “Virtual CPU” which competes for access to a real CPU in order to do real work Generally, you want to use N Threads on a N CPU machine But you can use any number of Threads on any number of CPUs OS will time-share the system in (we hope) a fair manner For comparison: a Unix/Linux “Process” is a similar OS construct, except that 2 Processes cannot directly share memory
  • 8. Compiling OpenMP Programs OpenMP *REQUIRES* an OpenMP-compliant compiler gcc 4.0 has some support gcc 2.x, 3.x can NOT handle OpenMP! Generally, you’ll need a “professional” compiler suite Portland Group, Pathscale, Intel, Sun All you have to do is “turn on” the OpenMP support: There is also a library that is required, but the compiler generally knows to include this automatically % gcc -fopenmp % cc -xopenmp=parallel % pgcc -mp % icc -openmp Note that some compilers require ‘#pragma’ to appear in the first column of the line % xlc -qsmp=omp
  • 9. OpenMP Basics How does OpenMP know how many threads to use? It builds in the capability to use any number of threads What happens at run-time? It depends … Sometimes it’s a system-default It may be assigned to you by a queuing system Sometimes it can be overridden by: This can be a problem! It may not be efficient to run a small loop with lots of threads If there isn’t much work to do, the overhead of the parallel region may be greater than the cost of doing the work on 1 thread setenv OMP_NUM_THREADS 4 OMP_NUM_THREADS=4; export OMP_NUM_THREADS
  • 10. OpenMP Basics, cont’d How does the compiler know where to end a parallel region? Fortran: you use ‘c$omp end parallel’ C: the compiler parallelizes the NEXT STATEMENT only Often, the next statement is a loop ... and the whole loop is parallelized You can always make the next line be an opening curly-brace ‘{‘ Then everything within the braces will be parallelized int main() { #pragmaomp parallel { printf(“Hello World!”); printf(“%i”,omp_get_thread_num()); printf(“Good-bye!”); } return( 0 ); } These dangling braces can be confusing, use indentation and comments to improve readability
  • 11. OpenMP Basics, cont’d Back to original code, output could be: Why isn’t the output consistent? CPUs are inherently asynchronous … there are operating system processes that pop up now and then that may slow down some CPUs ... the exact timing leads to different outputs This is called a “Race Condition” ... one of the most prevalent bugs in parallel programs Hello world! 0 1 2 3 Good-bye! Hello world! 01 23 Good-bye! Hello world! 2 3 0 1 Good-bye!
  • 12. Loop Parallelism Loop-level parallelism is the most common approach to extracting parallel work from a computer program E.g. you probably have lots of loops in your program that iterate over very large arrays: Instead of 1 CPU handling all 1,000,000 entries, we use N threads which each handle (1,000,000/N) entries do i = 1, 1000000 y(i) = alpha*x(i) + y(i) enddo for(i=0;i<1000000;i++) { y[i] = alpha*x[i] + y[i] } c$omp parallel do do i = 1, 1000000 y(i) = alpha*x(i) + y(i) enddo c$omp end parallel #pragma omp parallel for for(i=0;i<1000000;i++) { y[i] = alpha*x[i] + y[i] }
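For reference, a complete, compilable C version of this saxpy-style loop might look like the sketch below; the array contents and the value of alpha are illustrative assumptions, not part of the slide:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N 1000000

int main(void) {
    double *x = malloc(N*sizeof(double));
    double *y = malloc(N*sizeof(double));
    double alpha = 2.0;
    int i;

    for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }   /* illustrative data */

    /* Each thread is handed a piece of the 1,000,000-iteration range */
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        y[i] = alpha*x[i] + y[i];
    }

    printf("y[0] = %g, y[N-1] = %g\n", y[0], y[N-1]);
    free(x); free(y);
    return 0;
}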
  • 13. Loop Parallelism, cont’d c$omp parallel do do i = 1, 1000000 y(i) = alpha*x(i) + y(i) enddo c$omp end parallel #pragma omp parallel for for(i=0;i<1000000;i++) { y[i] = alpha*x[i] + y[i] } The compiler inserts code to compute the loop bounds for each thread For 10 threads, thread 0 works on i = 1 to 100,000; thread 1 works on 100,001 to 200,000, etc. Since this is a shared memory environment, every thread can “see” all data items in all arrays … so we don’t have to coordinate each thread’s loop bounds with its portion of each array This often makes it easier to program in OpenMP than MPI
  • 14. Data Sharing Examples c$omp parallel do do i = 1, 1000000 a(i) = b(i+1) + c(i+5)*d(i-3) enddo c$omp end parallel This is Ok since each thread is only READING different array locations and WRITEs do not conflict with each other (i.e. only one write to a(i) )
  • 15. Data Sharing Examples, cont’d c$omp parallel do do i = 1, 1000000 do j = 1, 1000000 a(i,j-1) = b(i+1,j) + c(i)*d(j-1) enddo enddo c$omp end parallel This is Ok since each thread is only READING different array locations and WRITEs do not conflict (i.e. only one thread writes to a(i,j-1) )
  • 16. Data Sharing Examples, cont’d c$omp parallel do do i = 1, 1000000 a(i) = 0.5*( a(i) + b(i-1) ) b(i) = 2.0*c(i)+d(i) enddo c$omp end parallel PROBLEM! -- How do we know if b(i-1) is being computed right now by another thread?
  • 17. Data Sharing Examples, cont’d c$omp parallel do do i = 1, 1000000 b(i) = 2.0*c(i) + d(i) enddo c$omp end parallel c$omp parallel do do i = 1, 1000000 a(i) = 0.5*( a(i) + b(i-1) ) enddo c$omp end parallel This is now Ok
  • 18. A More Complex Loop do i = 1, 100 do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) enddo enddo for(i=0;i<100;i++) { for(j=0;j<100;j++) { x = i/100; y = j/100; d[i][j] = mandelbrot(x,y); } } Now, there is a function call inside the inner-most loop
  • 19. A More Complex Loop, cont’d c$omp parallel do do i = 1, 100 do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) enddo enddo c$omp end parallel #pragma omp parallel for for(i=0;i<100;i++) { for(j=0;j<100;j++) { x = i/100; y = j/100; d[i][j] = mandelbrot(x,y); } } Any problems? (what is the value of ‘x’?)
  • 20. A More Complex Loop, cont’d Recall that, by default, all variables are shared across all threads EXCEPT the index of the parallelized loop So in this example, x and y are written and over-written by ALL threads for EVERY iteration in the loop This example will compile, and even run, without any errors But it will produce very strange results … since the value that a thread computes for x may be overwritten just before that thread calls the mandelbrot() function e.g. Thread-0 sets x=0.0, y=0.0 and prepares to call the function but just before the function call, Thread-1 writes x=0.5, y=0.0 so Thread-0 actually calls mandelbrot(0.5,0) instead of (0,0) and then, Thread-0 stores the result into d(1,1)
  • 21. Private vs. Shared Variables [Timeline diagram: two threads execute the loop at the same time, Thread-0 starting at i = 1 and Thread-1 at i = 51; each has its own private i, but both write to the shared x and y (e.g. x = 0.0 vs x = 0.5) just before calling mandel(x,y)] After the single-thread portion of the program, “c$omp parallel” starts a new parallel region: 2 Threads are created, each with a different loop-range “x=(i-1)/100” uses the private i but writes to the global x What values of (x,y) are used depends on the actual system timing … which can change each time the program is run - could change each iteration!
  • 22. Private vs. Shared Variables, cont’d OpenMP allows the programmer to specify if additional variables need to be Private to each Thread This is specified as a “clause” to the c$omp or #pragma line: c$omp parallel do private(j,x,y) do i = 1, 100 do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) enddo enddo c$omp end parallel #pragma omp parallel for private(j,x,y) for(i=0;i<100;i++) { for(j=0;j<100;j++) { x = i/100; y = j/100; d[i][j] = mandelbrot(x,y); } }
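To show the corrected loop in compilable form, here is a sketch in which mandelbrot_iters() is a stand-in for the slides' mandelbrot() routine (the escape-time computation inside it is just illustrative), and the divisions use 100.0 so they are done in floating point:

#include <stdio.h>
#include <omp.h>

/* Stand-in for the slides' mandelbrot() routine: returns an iteration count */
static int mandelbrot_iters(double x, double y) {
    double zr = 0.0, zi = 0.0;
    int it;
    for (it = 0; it < 255; it++) {
        double t = zr*zr - zi*zi + x;
        zi = 2.0*zr*zi + y;
        zr = t;
        if (zr*zr + zi*zi > 4.0) break;
    }
    return it;
}

int main(void) {
    static int d[100][100];
    int i, j;
    double x, y;

    /* j, x and y must be private: every thread needs its own copy */
    #pragma omp parallel for private(j,x,y)
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            x = i/100.0;
            y = j/100.0;
            d[i][j] = mandelbrot_iters(x, y);
        }
    }
    printf("d[50][50] = %d\n", d[50][50]);
    return 0;
}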
  • 23. Private Variable Examples c$omp parallel do private(tmp1,tmp2) do i = 1, 1000000 tmp1 = ( vm(i) - 75.0 )/alpha tmp2 = ( 33.0 - vm(i) )*beta a(i) = exp(tmp1)/(1.0-exp(tmp2)) enddo c$omp end parallel struct xyz point1,point2; double d; #pragma omp parallel for private(point1,d) for(i=0;i<1000000;i++) { point1.x = i+1; d = distance(point1,point2); if( d > 100.0 ) { printf(“Error: distance too great”); } }
  • 24. Mixing Shared and Private Variables What if we want to accumulate the total count from all of the mandelbrot() calculations? total = 0 c$omp parallel do private(j,x,y) do i = 1, 100 do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) total = total + d(i,j) enddo enddo c$omp end parallel total = 0; #pragma omp parallel for private(j,x,y) for(i=0;i<100;i++) { for(j=0;j<100;j++) { x = i/100; y = j/100; d[i][j] = mandelbrot(x,y); total += d[i][j]; } } We want ‘total’ to be shared (not private) … which is the default
  • 25. Synchronization for Shared Variables The update ‘total = total + d(i,j)’ is really several discrete steps: Read ‘total’ from global/shared memory into a CPU-register Read ‘d(i,j)’ into a CPU-register Add the two registers together Store the sum back to the ‘total’ global/shared memory location There is a possible race condition: another Thread could have incremented ‘total’ just after this Thread read the value So our new ‘total’ will be incorrect … but will overwrite any previous (correct) value
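A small demonstration sketch (not from the slides) that makes this lost-update race visible: the unprotected counter usually ends up short of 1,000,000, while the counter updated inside a critical section does not.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i, unsafe_total = 0, safe_total = 0;

    #pragma omp parallel for
    for (i = 0; i < 1000000; i++) {
        unsafe_total = unsafe_total + 1;   /* unprotected read-modify-write: updates can be lost */

        #pragma omp critical
        safe_total = safe_total + 1;       /* one thread at a time: always reaches 1000000 */
    }

    printf("unsafe_total = %d (often less than 1000000)\n", unsafe_total);
    printf("safe_total   = %d\n", safe_total);
    return 0;
}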
  • 26. Synchronization … Critical Sections OpenMP allows user-defined Critical Sections Only ONE THREAD at a time is allowed in a Critical Section Note that Critical Sections imply SERIAL computation PARALLEL performance will likely be reduced! total = 0 c$omp parallel do private(j,x,y) do i = 1, 100 do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) c$omp critical total = total + d(i,j) c$omp end critical enddo enddo c$omp end parallel total = 0; #pragma omp parallel for private(j,x,y) for(i=0;i<100;i++) { for(j=0;j<100;j++) { x = i/100; y = j/100; d[i][j] = mandelbrot(x,y); #pragma omp critical total += d[i][j]; } }
  • 27. “Named” Critical Regions What if there are different regions, each of which needs a single-Thread restriction, but multiple such regions could execute simultaneously? E.g. compute min and max of a list of items Updating of global-min must be single-Thread-only Updating of global-max must be single-Thread-only But one Thread can update min while another updates max “Named” Critical Sections You can include a (Programmer-defined) name for each Critical Section c$omp critical (MINLOCK) c$omp critical (MAXLOCK)
  • 28. Named Critical Region Example c$omp parallel do do i = 1,1000000 c$omp critical (MINLOCK) if( a(i) < cur_min ) then cur_min = a(i) endif c$omp end critical c$omp critical (MAXLOCK) if( a(i) > cur_max ) then cur_max = a(i) endif c$omp end critical enddo c$omp end parallel Performance?
  • 29. Named Critical Region Example, cont’d c$omp parallel do do i = 1,1000000 if( a(i) < cur_min ) then c$omp critical (MINLOCK) cur_min = a(i) c$omp end critical endif if( a(i) > cur_max ) then c$omp critical (MAXLOCK) cur_max = a(i) c$omp end critical endif enddo c$omp end parallel Is this all we need?
  • 30. Named Critical Region Example, cont’d c$omp parallel do do i = 1,1000000 if( a(i) < cur_min ) then c$omp critical (MINLOCK) if( a(i) < cur_min ) then cur_min = a(i) endif c$omp end critical endif if( a(i) > cur_max ) then c$omp critical (MAXLOCK) if( a(i) > cur_max ) then cur_max = a(i) endif c$omp end critical endif enddo c$omp end parallel Why bother?
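The slides show only the Fortran version of this double-checked pattern; a rough C equivalent (with made-up data) might look like the following, where the cheap unlocked test filters out most iterations and the re-test inside the named critical section guarantees the final answer is correct:

#include <stdio.h>
#include <float.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    double cur_min = DBL_MAX, cur_max = -DBL_MAX;
    int i;

    for (i = 0; i < N; i++) a[i] = (double)((i*37) % 1000);   /* illustrative data */

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        if (a[i] < cur_min) {                       /* cheap, unlocked test */
            #pragma omp critical (MINLOCK)
            {
                if (a[i] < cur_min) cur_min = a[i]; /* re-test once inside the lock */
            }
        }
        if (a[i] > cur_max) {
            #pragma omp critical (MAXLOCK)
            {
                if (a[i] > cur_max) cur_max = a[i];
            }
        }
    }
    printf("min = %g, max = %g\n", cur_min, cur_max);
    return 0;
}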
  • 31. “Very Small” Critical Sections In some cases, we may use critical regions to provide locks on small increments to a global counter, or other “small” computation We don’t really care what the value is, we just want to increment it What we really want is an ‘atomic’ update of that global variable This uses unique hardware features on the CPU, so it is very fast But there is a limit to what can be done #pragma omp parallel for for(i=0;i<1000000;i++) { #pragma omp atomic sum += a[i]; } c$omp parallel do do i = 1, 1000000 c$omp atomic sum = sum + a(i) enddo c$omp end parallel
  • 32. Atomic Operations The OpenMP compiler may or may not make use of the hardware features in the CPU (if you even have them), but it will guarantee the functionality of the atomic operations Fortran: + - * / .AND. .OR. etc. MIN() MAX() IAND IOR etc. C/C++: + - * / & ^ | << >> ++ -- += -= *=
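As an illustration of where atomic updates fit well, here is a small sketch (not from the slides; the hash into bins is arbitrary) in which every thread increments one of a set of shared counters:

#include <stdio.h>
#include <omp.h>

#define NBINS 16
#define N 1000000

int main(void) {
    static int hist[NBINS];   /* zero-initialized shared counters */
    int i;

    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        int bin = (i*2654435761u) % NBINS;   /* illustrative hash into a bin */
        #pragma omp atomic
        hist[bin]++;                         /* lightweight update of one shared counter */
    }

    for (i = 0; i < NBINS; i++) printf("bin %2d: %d\n", i, hist[i]);
    return 0;
}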
  • 33. Synchronization … Reduction Operations Where Critical Sections can be any arbitrary piece of code, OpenMP also allows for optimized routines for certain standard operations Accumulate across all Threads, Min across all Threads, etc. Called a “Reduction” Operation total = 0 c$omp parallel do private(j,x,y) c$omp+ reduction(+:total) do i = 1, 100 do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) total = total + d(i,j) enddo enddo c$omp end parallel total = 0; #pragma omp parallel for private(j,x,y) reduction(+:total) for(i=0;i<100;i++) { for(j=0;j<100;j++) { x = i/100; y = j/100; d[i][j] = mandelbrot(x,y); total += d[i][j]; } }
  • 34. Reduction Operations Fortran: + - * .AND. .OR. .EQV. .NEQV. MIN MAX IAND IOR IEOR C: + - * && || & | ^ Suitable initial values are chosen for the reduction variable E.g. ‘reduction(+:total)’ initializes ‘total=0’ in each Thread
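As a concrete illustration of the reduction clause, here is a minimal compilable C sketch (the data values are made up) that accumulates both a sum and a sum-of-squares in one loop; this is the same computation that the Parallel Sections slides later split across threads in a different way:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double data[N];
    double sum = 0.0, sumsq = 0.0;
    int i;

    for (i = 0; i < N; i++) data[i] = 0.001*i;   /* illustrative data */

    /* Each thread keeps private partial sums (initialized to 0), and
       OpenMP combines them once at the end of the loop */
    #pragma omp parallel for reduction(+:sum) reduction(+:sumsq)
    for (i = 0; i < N; i++) {
        sum   += data[i];
        sumsq += data[i]*data[i];
    }

    printf("sum = %g  sum of squares = %g\n", sum, sumsq);
    return 0;
}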
  • 35. Initializing Private Variables What if we want each Thread’s private variable to start off at a specific value? Reduction operations automatically initialize the values to something meaningful (0 or 1) Other major option is to copy the global value into each Thread’s private variable Called “firstprivate(…)” clause
  • 36. Initializing Private Variables, cont’d var1 = 0 c$omp parallel private(x) c$omp+ firstprivate(var1) c$omp do do i = 1, 100 x = (i-1)/100 var1 = var1 + f(x) enddo c$omp enddo print *,’partial sum’,var1 c$omp end parallel var1 = 0; #pragma omp parallel private(x) firstprivate(var1) { #pragma omp for for(i=0;i<100;i++) { x = i/100; var1 += f(x); } printf(“partial sum %i”,var1); } At the end of the loop, (global) var1 is UNCHANGED! var1 == 0
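A tiny self-contained sketch (variable names are illustrative) of the behaviour described above: firstprivate seeds each thread's private copy from the global value, and the global copy is unchanged after the parallel region.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int offset = 100;   /* the "global" value */

    /* Each thread starts with its own copy of offset, initialized to 100;
       with plain private(offset) the initial value would be undefined */
    #pragma omp parallel firstprivate(offset)
    {
        offset += omp_get_thread_num();
        printf("thread %d sees offset = %d\n", omp_get_thread_num(), offset);
    }

    /* The threads only modified their private copies */
    printf("after the parallel region, offset = %d\n", offset);
    return 0;
}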
  • 37. Private vs. Firstprivate vs. Copyin vs …
  • 38. OpenMP and Function Calls total = 0 c$omp parallel do private(j,x,y) do i = 1, 100 do j = 1, 100 x = (i-1)/100 y = (j-1)/100 d(i,j) = mandelbrot(x,y) c$omp critical total = total + d(i,j) c$omp end critical enddo enddo c$omp end parallel total = 0; #pragma omp parallel for private(j,x,y) for(i=0;i<100;i++) { for(j=0;j<100;j++) { x = i/100; y = j/100; d[i][j] = mandelbrot(x,y); #pragma omp critical total += d[i][j]; } } What happens inside the mandelbrot() function? Are variables defined inside mandelbrot() private or shared?
  • 39. Function Calls, cont’d In C and Fortran, variables defined inside a function are placed on the stack, NOT in a “global” memory area Each Thread has its own stack … so each Thread will have its own set of function-variables on its own stack I.e. variables defined in a function are PRIVATE subroutine mandelbrot(x,y) real x,y real temp1,temp2 integer it_ctr c ** do real work ** mandelbrot = it_ctr end subroutine int mandelbrot( float x,float y ) { float temp1, temp2; int it_ctr; /* do real work */ return( it_ctr ); } Private
  • 40. Function Calls, cont’d In C and Fortran, variables defined inside a function are placed on the stack, NOT in a “global” memory area Each Thread has its own stack … so each Thread will have its own set of function-variables on its own stack I.e. variables defined in a function are PRIVATE Private Shared? subroutine mandelbrot(x,y) real x,y real temp1,temp2 integer it_ctr c ** do real work ** mandelbrot = it_ctr end subroutine int mandelbrot( float x,float y ) { float temp1, temp2; int it_ctr; /* do real work */ return( it_ctr ); }
  • 41. OpenMP and Function Calls, Take-II HOWEVER, if you have a parallel section that is defined wholly inside a function, then the “usual” rules apply E.g. Master thread, running by itself, calls a function which breaks into a parallel region … then all function-local variables CAN be seen (shared) by all Threads subroutine mandelbrot(x,y) real x,y real temp1,temp2 integer it_ctr c$omp parallel do do i = 1,N . . . enddo mandelbrot = it_ctr end subroutine int mandelbrot( float x,float y ) { float temp1, temp2; int it_ctr; #pragma omp parallel for for(i=0;i<N;i++) { . . . } return( it_ctr ); } Inside the loop temp1, temp2, it_ctr are SHARED
  • 42. Parallel Sections Loop-level parallelism tends to imply a Domain Decomposition or Iterative Parallelism approach to splitting a problem up Parallel Sections can be used if there are fundamentally different algorithms that do not need to interact One Thread needs to act as Server while the others act as Clients One Thread handles all I/O while the others “do the work” Imagine 2 loops which compute the sum and sum-squared of a set of data You could split up each loop with its own ‘c$omp parallel do’ Or you could split the whole program up so that one thread computes the sum and another thread computes sum-squared What if we have more than 2 CPUs?
  • 43. Parallel Sections, cont’d c$omp parallel do reduction(+:val) do i = 1,num val = val + data(i) enddo c$omp parallel do reduction(+:val2) do i = 1,num val2 = val2 + data(i)*data(i) enddo c$omp parallel sections c$omp section do i = 1,num val = val + data(i) enddo c$omp section do i = 1,num val2 = val2 + data(i)*data(i) enddo c$omp end parallel sections
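The slides give only the Fortran form of parallel sections; a rough C equivalent of the same sum / sum-of-squares split (with illustrative data, and a private(i) clause so the two sections do not share a loop counter) might look like this:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double data[N];
    double val = 0.0, val2 = 0.0;
    int i;

    for (i = 0; i < N; i++) data[i] = 0.001*i;   /* illustrative data */

    #pragma omp parallel sections private(i)
    {
        #pragma omp section
        {   /* one thread computes the sum ... */
            for (i = 0; i < N; i++) val += data[i];
        }
        #pragma omp section
        {   /* ... while another computes the sum of squares */
            for (i = 0; i < N; i++) val2 += data[i]*data[i];
        }
    }

    printf("sum = %g  sum of squares = %g\n", val, val2);
    return 0;
}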
  • 44. A Few More Tidbits omp_get_wtime() Returns the wall-clock time (in seconds, double-precision value) omp_get_wtick() returns time-resolution of wtime Note that some other system timers return “User Time” which is the wall-clock time multiplied by the number of threads Can be confusing if you’re looking for speed-up numbers
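A minimal sketch of how these timing routines are typically used (the loop being timed is just an illustrative array update):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double y[N];
    double t0, t1;
    int i;

    t0 = omp_get_wtime();            /* wall-clock start, in seconds */
    #pragma omp parallel for
    for (i = 0; i < N; i++) y[i] = 2.0*i + 1.0;
    t1 = omp_get_wtime();

    printf("loop took %g s (timer resolution %g s)\n", t1 - t0, omp_get_wtick());
    return 0;
}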
  • 45. “DIY” Critical Sections You can also create your own “locks” (critical regions) Useful if the number of locks needs to be determined at run-time (recall that CRITICAL sections are defined at compile-time) omp_init_lock( &lock ) omp_set_lock( &lock ) omp_test_lock( &lock ) omp_unset_lock( &lock ) omp_destroy_lock( &lock )
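A sketch of the run-time lock routines listed above, assuming a hypothetical case where the number of independent counters (and therefore locks) is only known when the program starts:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char **argv) {
    int nbuckets = (argc > 1) ? atoi(argv[1]) : 8;   /* number of locks chosen at run-time */
    double *bucket = calloc(nbuckets, sizeof(double));
    omp_lock_t *lock = malloc(nbuckets*sizeof(omp_lock_t));
    int i;

    for (i = 0; i < nbuckets; i++) omp_init_lock(&lock[i]);

    #pragma omp parallel for
    for (i = 0; i < 1000000; i++) {
        int b = i % nbuckets;
        omp_set_lock(&lock[b]);      /* only one thread at a time may update bucket b */
        bucket[b] += 1.0;
        omp_unset_lock(&lock[b]);
    }

    for (i = 0; i < nbuckets; i++) omp_destroy_lock(&lock[i]);
    printf("bucket[0] = %g\n", bucket[0]);
    free(bucket); free(lock);
    return 0;
}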
  • 46. The FLUSH Directive Modern CPU architecture doesn’t entirely “play nice” with shared memory programming Cache memory is used to store recently used data items What happens if a cached memory item is needed by another Thread? I.e. the data-item currently resides in cache on CPU#1 and is not visible to CPU#2, how do we force CPU#1 to write the data to main memory? #pragma omp flush (var1,var2,var3) c$omp flush
  • 47. FLUSH Directive, cont’d while( *(volatile int *)&GlobalFlag == 0 ); . . . /* on some other thread */ GlobalFlag = 1; Without ‘volatile’ the compiler may remove the loop First time through the loop, GlobalFlag is read into a register (or cache) . . . and may never be re-read
  • 48. FLUSH Directive, cont’d while( *(volatile int *)&GlobalFlag == 0 ) { #pragma omp flush (GlobalFlag) } . . . /* on some other thread */ GlobalFlag = 1; #pragma omp flush (GlobalFlag) Make sure we read the new value Make sure we write the new value Fortran: may need to tag GlobalFlag as ‘static’ or ‘save’
  • 49. Basic Tips OpenMP is built for large-loop parallelism with little synchronization If your loops require tight synchronization, you may have a hard time programming it in OpenMP It is sometimes easier to have a single (outer) OpenMP parallel-for loop which makes function calls to a “worker” routine This kind of loop can be pipelined for(i=1;i<N;i++) { a[i] = 0.5*( a[i-1] + b[i] ); } c$omp parallel do private(x) do i=1,N x = i*0.1 call DoWork( x ) enddo
  • 50. One “Big” Parallel Region vs. Several Smaller Ones c$omp parallel do do i = 1, N y(i) = alpha*x(i) + y(i) enddo c$omp end parallel . . . c$omp parallel do do i = 1, N z(i) = beta*y(i) + z(i) enddo c$omp end parallel #pragma omp parallel for for(i=0;i<N;i++) { y[i] = alpha*x[i] + y[i] } . . . #pragma omp parallel for for(i=0;i<N;i++) { z[i] = beta*y[i] + z[i]; } Each parallel region has an implied barrier (high cost)
  • 51. One “Big” Parallel Region, cont’d c$omp parallel c$omp do do i = 1, N y(i) = alpha*x(i) + y(i) enddo c$omp enddo . . . c$omp do do i = 1, N z(i) = beta*y(i) + z(i) enddo c$omp enddo c$omp end parallel #pragma omp parallel { #pragma omp for for(i=0;i<N;i++) { y[i] = alpha*x[i] + y[i] } . . . #pragma omp for for(i=0;i<N;i++) { z[i] = beta*y[i] + z[i]; } } /* end of parallel region */ This is still a parallel section
  • 52. Single-CPU Region within a Parallel Region Master -- only 1 Thread (and only the Master Thread) executes the code Single -- only 1 Thread (chosen at random) executes the code c$omp parallel c$omp do /* parallel */ c$omp enddo c$omp master /* single-cpu */ c$omp end master c$omp end parallel #pragma omp parallel { #pragma omp for /* parallel */ #pragma omp single { /* single-cpu */ } } /* end of parallel region */
  • 53. One “Big” Parallel Region, cont’d One (larger) parallel region has only one implied barrier Should be lower “cost” ... hence, better performance The “interior” regions will be executed in parallel which may result in redundant computations If there is too much redundant computation, it could slow down the code Use ‘MASTER’ or ‘SINGLE’ regions if needed Note that a small amount of redundant computation is often cheaper than breaking out to a single-CPU region and then re-starting a parallel region
  • 54. Performance Issue with Small Loops Each OpenMP parallel region (loop or section) has an overhead associated with it Non-zero time to start up N other Threads Non-zero time to synchronize at the end of the region If you have a very fast CPU and a very small loop, these overheads may be greater than the cost (time) to just do the loop on 1 CPU For small loops, we can tell OpenMP to skip the parallelism by using an ‘if’ clause i.e. only run the loop in parallel *IF* a certain condition holds
  • 55. Small Loops, cont’d c$omp parallel do if(N>1000) do i = 1, N y(i) = alpha*x(i) + y(i) enddo c$omp end parallel #pragma omp parallel for if(N>1000) for(i=0;i<N;i++) { y[i] = alpha*x[i] + y[i] } Where you should put this cut-off is entirely machine-dependent
  • 56. Performance of Critical Sections Critical Sections can cause huge slow-downs in parallel code Recall, a Critical Section is serial (non-parallel) work, so it is not surprising that it would impact parallel performance Many times, you really want a ‘Reduction’ operation ‘Reduction’ will be much faster than a critical section For “small” Critical Sections, use ‘Atomic’ updates But check the compiler docs first … The Sun compilers turn ‘atomic’ blocks into comparable ‘critical’ sections
  • 57. Alternatives to Critical Sections integer temp(MAX_NUM_THREADS), sum integer self c assume temp(:) = 0 to start c$omp parallel private(self) self = omp_get_thread_num() + 1 c$omp do do i = 1, N . . . temp(self) = temp(self) + val enddo c$omp enddo c$omp end parallel sum = 0 do i = 1, MAX_NUM_THREADS sum = sum + temp(i) enddo
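A hedged C rendering of the same per-thread partial-sum idea (array sizes and data are illustrative): each thread owns one slot of temp[], so no critical section or atomic is needed until the short serial combine at the end.

#include <stdio.h>
#include <omp.h>

#define MAX_NUM_THREADS 64
#define N 1000000

int main(void) {
    static int a[N];
    int temp[MAX_NUM_THREADS] = {0};   /* one partial sum per thread */
    int i, sum = 0;

    for (i = 0; i < N; i++) a[i] = 1;  /* illustrative data */

    #pragma omp parallel
    {
        int self = omp_get_thread_num();
        #pragma omp for
        for (i = 0; i < N; i++) {
            temp[self] += a[i];        /* no lock needed: each thread updates only its own slot */
        }
    }

    for (i = 0; i < MAX_NUM_THREADS; i++) sum += temp[i];   /* serial combine */
    printf("sum = %d\n", sum);
    return 0;
}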
  • 58. “Dangerous” Performance Enhancement You can tell OpenMP to not insert barriers at the end of parallel regions This is “dangerous,” since with no barrier at the end, the Master Thread may begin running code that ASSUMES certain variables are stable c$omp parallel do nowait do i = 1, N y(i) = alpha*x(i) + y(i) enddo c$omp end parallel print *, ‘y(100)=‘, y(100) With no barrier at the end of the parallel region, we cannot guarantee that y(100) has been computed yet
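One caution, added here for clarity: in the OpenMP spec the nowait clause attaches to worksharing constructs inside a parallel region (e.g. ‘c$omp end do nowait’ in Fortran or ‘#pragma omp for nowait’ in C), and the implied barrier at the very end of the parallel region itself cannot be removed. A minimal C sketch of the usual pattern, with illustrative arrays:

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double x[N], y[N], z[N];
    int i;

    #pragma omp parallel
    {
        #pragma omp for nowait     /* threads fall through without waiting here ... */
        for (i = 0; i < N; i++) y[i] = 2.0*x[i];

        #pragma omp for            /* ... which is safe only because this loop reads x[],
                                      not the y[] values still being written above */
        for (i = 0; i < N; i++) z[i] = 3.0*x[i];
    }   /* the barrier at the end of the parallel region is always present */

    printf("y[0]=%g z[0]=%g\n", y[0], z[0]);
    return 0;
}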
  • 59. Race Conditions A “Race” is a situation where multiple Threads are making progress toward some shared or common goal where timing is critical Generally, this means you’ve ASSUMED something is synchronized, but you haven’t FORCED it to be synchronized E.g. all Threads are trying to find a “hit” in a database When a hit is detected, the Thread increments a global counter Normally, hits are rare events and so incrementing the counter is “not a big deal” But if two hits occur simultaneously, then we have a race condition While one Thread is incrementing the counter, the other Thread may be reading it It may read the old value or new value, depending on the exact timing
  • 60. Race Conditions, cont’d Some symptoms: Running the code with 4 Threads works fine, running with 5 causes erroneous output Running the code with 4 Threads works fine, running it with 4 Threads again produces different results Sometimes the code works fine, sometimes it crashes after 4 hours, sometimes it crashes after 10 hours, … Program sometimes runs fine, sometimes hits an infinite loop Program output is sometimes jumbled (the usual line-3 appears before line-1) Any kind of indeterminacy in the running of the code, or seemingly random changes to its output, could indicate a race condition
  • 61. Race Conditions, cont’d Race Conditions can be EXTREMELY hard to debug Generally, race conditions don’t cause crashes or even logic-errors E.g. a race condition leads two Threads to both think they are in charge of data-item-X … nothing crashes, they just keep writing and overwriting X (maybe even doing duplicate computations) Often, race conditions don’t cause crashes at the time they actually occur … the crash occurs much later in the execution and for a totally unrelated reason E.g. a race condition leads one Thread to think that the solver converged but the other Thread thinks we need another iteration … crash occurs because one Thread tried to compute the global residual Sometimes race conditions can lead to deadlock Again, this often shows up as seemingly random deadlock when the same code is run on the same input
  • 62. Deadlock A “Deadlock” condition occurs when all Threads are stopped at synchronization points and none of them are able to make progress Sometimes this can be something simple: Thread-1 enters a Critical-Section and waits for Variable-X to be incremented, and then it will release the lock Thread-2 wants to increment X but needs to enter the Critical-Section in order to do that Sometimes it can be more complex or cyclic: Thread-1 enters Crit-Sec-1 and needs to enter Crit-Sec-2 Thread-2 enters Crit-Sec-2 and needs to enter Crit-Sec-3 Thread-3 enters Crit-Sec-3 and …
  • 63. Starvation “Starvation” occurs when one Thread needs a resource (Critical Section) that is never released by another Thread Thread-1 grabs a lock/enters a critical region in order to write data to a file Every time Thread-1 is about to release the lock, it finds that it has another item to write ... it keeps the lock/re-enters the critical region Thread-2 continues to wait for the lock/critical region but never gets it Sometimes it may show up as “almost Deadlock”, other times it may show up as a performance degradation E.g. when Thread-1 is done writing all of its data, it finally gives the lock over to Thread-2 ... output is correct but performance is terrible
  • 64. Deadlock, cont’d Symptoms: Program hangs (!) Debugging deadlock is often straightforward … unless there is also a race condition involved Simple approach: put print statement in front of all Critical Section and Lock calls Thread-1 acquired lock on region-A Thread-1 acquired lock on region-C Thread-2 acquired lock on region-B Thread-1 released lock on region-C Thread-3 acquired lock on region-C Thread-1 waiting for lock on region-B Thread-2 waiting for lock on region-A
  • 65. Static Scheduling for Fixed Work-loads OpenMP allows dynamic scheduling of parallel loops If one CPU gets loaded down (by OS tasks), other CPUs will pick up additional iterations in the loop to offset that performance-loss Generally, this works well -- you don’t have to worry about all the random things that might happen, OpenMP does the “smart” thing However, this dynamic scheduling implies some additional “cost” If nothing else, the CPUs must check if other CPUs are loaded If your work-load inside a parallel region is fixed, then you may want to try a “static” work-assignment E.g. the loop is exactly 1,000,000 iterations, so the CPUs should take 100,000 each
  • 66. Static Scheduling, cont’d c$omp parallel do schedule(static) do i = 1, 1000000 y(i) = alpha*x(i) + y(i) enddo c$omp end parallel #pragma omp parallel for schedule(static) for(i=0;i<1000000;i++) { y[i] = alpha*x[i] + y[i] } If the per-iteration work-load is variable, or you are sharing the CPUs with many other users, then a dynamic scheduling algorithm, e.g. schedule(dynamic), may be better -- note that the default schedule is implementation-defined, and many compilers use static unless told otherwise
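For comparison, a hedged sketch of a dynamic schedule with an explicit chunk size (the chunk size of 1000 and the variable_work() routine are purely illustrative):

#include <stdio.h>
#include <omp.h>

/* Hypothetical stand-in for work whose cost varies from iteration to iteration */
static double variable_work(int i) {
    double s = 0.0;
    int k;
    for (k = 0; k < (i % 1000); k++) s += k*1e-6;
    return s;
}

int main(void) {
    static double y[1000000];
    int i;

    /* Threads grab chunks of 1000 iterations as they become free,
       instead of being handed fixed ranges up front */
    #pragma omp parallel for schedule(dynamic, 1000)
    for (i = 0; i < 1000000; i++) {
        y[i] = variable_work(i);
    }

    printf("y[999999] = %g\n", y[999999]);
    return 0;
}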
  • 67. Moving Towards Pthreads Pthreads (POSIX Threads) is the other major shared-memory environment It is a little bit like the OpenMP ‘Section’ Directive E.g. you use pthread_create() to create a new Thread and assign it a “work-function” (a section of code to execute) You can use the OpenMP Library Functions to create your own splitting of the loops: self_thread = omp_get_thread_num() num_thr = omp_get_num_threads() num_proc = omp_get_num_procs() You may be able to avoid some of the overheads with multiple parallel regions by “forcing” a consistent splitting onto your data and loops
  • 68. Alternative Approach to OpenMP datasize = 1000000 c$omp parallel private(i,self_thr,num_thr,istart,ifinish) self_thr = omp_get_thread_num() num_thr = omp_get_num_threads() istart = self_thr*(datasize/num_thr) + 1 ifinish = istart + (datasize/num_thr) - 1 do i = istart, ifinish y(i) = alpha*x(i) + y(i) enddo c$omp end parallel (recall that omp_get_thread_num() returns 0..N-1, and assume datasize divides evenly by the number of threads)
  • 69. Alternative Approach to OpenMP, cont’d datasize = 1000000 #pragma omp parallel private(i,self_thr,num_thr,istart,ifinish) { self_thr = omp_get_thread_num(); num_thr = omp_get_num_threads(); istart = self_thr*(datasize/num_thr); ifinish = istart + (datasize/num_thr); for(i=istart;i<ifinish;i++) { y[i] = alpha*x[i] + y[i]; } }
  • 70. For More Info “Parallel Programming in OpenMP” Chandra, Dagum, et al. http://openmp.org Good list of links to other presentations and tutorials Many compiler vendors have good end-user documentation IBM Redbooks, tech-notes, “porting” guides http://docs.sun.com/source/819-3694/index.html
  • 73. Performance tuning methods (VTune, parallel tracing)

Editor's notes

  1. Welcome to this slidecast, titled Introduction to OpenMP. My name is John Pormann and I work at the Scalable Computing Support Center at Duke University. Our website and contact info are shown here.