2. Parallel Processes
Multicomputer
• A multicomputer is constructed out of multiple
computers and an interconnection network. The
processors on different computers interact by
passing messages to each other.
Multiprocessor
• It is a computer system with two or more CPUs. It is a
highly integrated system in which all CPUs share
access to a single global memory.
3. Parallel Processes
Uniform Memory Access (UMA)
• It is a shared memory architecture in which
access time to a memory location is
independent of which processor makes the
request or which memory chip contains the
transferred data.
4. PROGRAMMING PARALLEL PROCESSES
Every parallel language must address certain issues, either explicitly
or implicitly:
• There must be a way to create parallel processes.
• There must be a way to coordinate the activities of these processes.
• When processes exchange results, they must communicate and
synchronize with each other.
Communication and synchronization can be accomplished by sharing
variables or by message passing.
5. PROGRAMMING PARALLEL PROCESSES
Communication and synchronization can be accomplished by sharing
variables or by message passing. There are two methods of
synchronization:
• Synchronization for precedence: it guarantees that one event does
not begin until another event has finished.
• Synchronization for mutual exclusion: it guarantees that only one
process at a time enters a critical section of code where a shared
data structure is manipulated.
8. An Illustrative Example for UMA
Let's consider the problem of computing the variance of a list of real numbers. The
values given are r1, r2, r3, ..., rn.
Variance = Σ(ri − m)²/n, where m = Σri/n
Parallel Architecture used is UMA with 4 Processors
n real numbers are stored in shared memory
Four variables are also stored in shared memory
one will contain the grand total
another the mean
a third the global sum of squares
a fourth the variance
9. An Illustrative Example for UMA
Four processes are created, one for each processor.
Each process has a local temporary variable.
Each process adds its share of the n values
Accumulating the sum in its local temporary variable.
When the process is through computing its subtotal, it adds its
subtotal to the shared variable, accumulating the grand total.
Since multiple processes are accessing the same global variable,
that portion of code is a critical section, and the processes must
enforce mutual exclusion.
11. An Illustrative Example for UMA
A barrier synchronization step, inserted in the algorithm after the critical
section, ensures that no process continues until all processes have added
their subtotals to the grand total.
In a sequential step, one process computes the mean by dividing the
grand total by n.
To compute the global sum of squares, the processes go through a process
similar to that which computes the global sum.
Again, the processes must enforce mutual exclusion.
After another barrier synchronization, one process computes the variance
by dividing the sum of squares by the number of values.
12. An Illustrative Example for Multicomputer
Solve the same problem on a four-node multicomputer
There is no shared memory; the n values are
distributed among the local memories of the
nodes.
Each node process has four variables:
two to accumulate sums
one to store the mean
and another to store the variance
Each node process initializes the two accumulators
to 0.
At this point every process has a subtotal; the
four subtotals must be combined to find the
grand total.
14. An Illustrative Example for Multicomputer
After two exchange-and-add
steps, every process has the grand
total.
Every process can divide the
grand total by n to determine the
mean.
15. An Illustrative Example for Multicomputer
A similar set of steps allows the
processes to compute the variance of
the list values.
Every process has the result. One of the
processes can pass the answer back to the
program running on the front end, which
then de-allocates the hypercube,
surrendering access to the nodes.
16. A Sample Application
Integration: find the area under the curve
4/(1 + x²) between 0 and 1, which equals π.
The interval [0, 1] is divided into n
subintervals of width 1/n.
For each of these subintervals the algorithm
computes the area of a rectangle whose
height is such that the curve intersects the top
of the rectangle at its midpoint.
The sum of the areas of the n rectangles
approximates the area under the curve.
17. A Sample Application
This algorithm is data-parallel, since the
areas of all the rectangles can be
computed simultaneously.
Computing the area of each rectangle
requires the same amount of work;
hence load balancing is not an issue.
If the language requires the programmer
to divide the work among the processors,
this can be done easily.
19. FORTRAN 90
In 1978 the ANSI-accredited technical
committee X3J3 began working on a
new version of the FORTRAN language.
In the early 1990s the resulting language,
Fortran 90, was adopted as an ISO and
ANSI standard.
20. FORTRAN 90
Fortran 90 is a superset of FORTRAN 77. It includes all the features of FORTRAN 77, plus
Array operations
Improved facilities for numerical computations
Syntax to allow processors to support short integers, packed logicals, very large
character sets, and high-precision real and complex numbers
User-defined data types, structures, and pointers
Dynamic storage allocation
Modules to support abstract data types
Internal procedures and recursive procedures
Improvements to input-output facilities
New control structures
New intrinsic procedures
Terminal-oriented source form
21. FORTRAN 90
The committee also marked many language features as obsolescent,
including:
arithmetic IF
some DO construct variations
assigned GO TO
assigned formats
and the H edit descriptor
The next revision of the FORTRAN standard may
not contain these features.
22. FORTRAN 90 Programmer‘s Model
The Fortran 90 programmer has a model of parallel computation similar to a
PRAM: a CPU and a vector unit share a single memory.
The CPU executes sequential
instructions, accessing variables
stored in the shared memory.
To execute parallel operations,
the CPU controls the vector unit,
which also stores and fetches
data to and from the shared
memory.
23. FORTRAN 90 Language Features
Fortran 90 gives the programmer the ability to specify the type of variables through type
declaration statements such as
• REAL A, B, C
• INTEGER I
Each type may have several kinds. For example, a real variable may be stored in 4 bytes or
8 bytes. The Fortran 90 programmer may specify explicitly the kind, as well as the type of
the variable, as in the following example:
• REAL ( KIND=LONG) PI
Fortran 90 introduces the notion of an array constant. For example,
• (/ 2, 3, 5, 7, 11 /)
denotes a one-dimensional array with five elements. It is possible to construct an array of
higher dimension by declaring an array constant, then changing its dimensions with the
RESHAPE function.
24. FORTRAN 90 Language Features
An implied DO notation can simplify the specification of array constants.
For example, the array constant
• (/ 2, 4, 6, 8,10 /) may be specified as (/ (I, I = 2,10, 2) /)
Fortran 90 also allows operations on arrays. When applied to an array, the unary intrinsic
operators + and - return an array of the same dimensions, where the elements in the result
array are found by applying the operator to the corresponding elements in the operand
array.
Numerical, relational, and logical binary intrinsic operators can manipulate arrays having the
same dimensions. Each element in the result array is found by applying the operator to the
corresponding elements in the operand arrays.
A binary intrinsic operator can also manipulate an array and a scalar variable, resulting in
an array of the same dimensions as the array operand.
25. FORTRAN 90 Language Features
For example, given the array declarations
• REAL, DIMENSION(100,50) :: X, Y
• REAL, DIMENSION(100) :: Z
the following are examples of legal array expressions:
X + Y: array of shape (100,50), elements X(I,J) + Y(I,J)
X + 1.0: array of shape (100,50), elements X(I,J) + 1.0
X .EQ. Y: logical array of shape (100,50), elements .TRUE. where X(I,J) equals Y(I,J) and .FALSE. otherwise
X(1:100,3) + Z: array of shape (100), elements X(I,3) + Z(I)
26. FORTRAN 90 Language Features
Sometimes it is important to be able to perform an
operation on a subset of the array elements. The WHERE
statement allows the programmer to specify which array
elements are to be active. For example, the statement
WHERE (A > 0.0) A = SQRT(A)
replaces every positive element of A with its square root.
27. FORTRAN 90 Language Features
The WHERE statement divides the array elements into two sets, first performing
one or more array assignments on the elements for which the expression is true,
then performing one or more array assignments on the elements for which the
expression is false. The syntax of the most general form of the WHERE statement is:
WHERE(logical-array-expression)
array-assignment-statements
ELSEWHERE
array-assignment-statements
END WHERE
Finally, new transformational functions allow the reduction of an array to a scalar
value. For example, the function SUM returns the sum of the elements of the array
passed to it as an argument.
30. Sequent C
Sequent computers run the DYNIX operating system, a version of UNIX
tailored for the multiprocessor environment.
In addition to the operating-system calls typically found in a UNIX
system, DYNIX provides a set of routines to facilitate parallel processing.
The commercial parallel programming languages on Sequent hardware
are simple extensions of sequential languages: they allow
programmers to declare shared variables, and the processes
interact via mutual exclusion and barrier synchronization.
The resulting languages are primitive.
31. Sequent C - Shared Data
Parallel processes on the Sequent coordinate their activities by accessing
shared data structures.
The keyword shared, placed before a global variable declaration,
indicates that all processes are to share a single instance of that
variable.
For example, if a 10-element global array is declared int a[10], then
every active process has its own copy of the array; if one process
modifies a value in its copy of a, no other process's copy will change.
On the other hand, if the array is declared shared int a[10], then all
active processes share a single instance of the array, and changes made
by one process can be detected by the other processes.
32. Sequent C - Parallel Processing Function
A program begins execution as a single process. This process is
responsible for executing those parts of the program that are
inherently sequential.
The original process forks a number of other processes, each
performing its share of the work.
The total number of processes accessing shared data cannot
exceed the number of physical processors less one. Because there
are at least as many CPUs as active processes, each process may
execute on its own CPU.
This allows a major reduction in the execution time, assuming that
the computer is not executing any other jobs.
33. Sequent C - Parallel Processing Function
Parallel programs executing on the Sequent
alternate between sequential and parallel
execution.
When control reaches an inherently sequential
portion of the computation, only the original
process executes the code; the remaining processes
wait until control reaches another portion of the
computation that can be divided into pieces and
executed concurrently. The program cycles through
these two modes until termination.
The transition from parallel to sequential execution
is always delimited by a barrier synchronization.
In addition, data dependencies may
require the insertion of barrier
synchronizations within parallel code.
35. Sequent C - Parallel Processing Function
Sequent's microtasking library has seven key functions:
1. m_set_procs(p):
The parent process initializes to value p a shared variable that
controls the number of processes created by a subsequent call to
m_fork.
The value of p cannot exceed the number of physical processors
in the system minus one.
The function also initializes barriers and locks.
36. Sequent C - Parallel Processing Function
2. m_fork(name[,arg,...]):
The parent process creates a number of child processes.
The parent process and the child processes begin executing
function name with the arguments (if any) also specified by the
call to m_fork.
After all the processes (the parent and all the children) have
completed execution of function name, the parent process
resumes execution with the code after m_fork, while the child
processes busy wait until the next call to m_fork.
37. Sequent C - Parallel Processing Function
2. m_fork(name[,arg,...]):
The parent process creates p child processes.
All processes begin executing function name.
When they finish, the parent process resumes
execution with the code after m_fork, while the
child processes busy wait until the next call
to m_fork.
Therefore, the first call to m_fork is more expensive than subsequent
calls, because only the first call entails process creation.
38. Sequent C - Parallel Processing Function
3. m_get_myid: A process calls function m_get_myid to get its
unique process number. If the total number of active processes
is p, then the process number of the parent is 0, while the
process numbers of the child processes range from 1 to p-1.
4. m_get_numprocs: Function m_get_numprocs returns the
number of processes executing in parallel. Given the total
number of processes and its own process number, a process
can determine which portion of a computation is its
responsibility.
39. Sequent C - Parallel Processing Function
5. m_lock, m_unlock: Functions m_lock and m_unlock
ensure mutually exclusive execution of the code that
the two calls surround. Once a process has entered a
block of code delimited by m_lock and m_unlock, no
other process may enter until the first process has left.
7. m_kill_procs: Function m_kill_procs kills the child
processes created by the first call to m_fork.
40. Sequent C - Monitor
Most parallel algorithms implemented on multiprocessors require a
process to perform a series of operations on a shared data structure, as
if it were an atomic operation.
For example, a process may need to fetch the value at the beginning of
a linked list and advance the list pointer to the next list element.
When the hardware cannot perform the entire series of operations as an
atomic operation, the process must have some way to enforce mutual
exclusion, keeping all other processes from referencing the resource while
it is being modified. The piece of code in which mutual exclusion must be
enforced is called a critical section.
41. Sequent C - Monitor
One way to structure accesses to shared resources is by using a monitor.
A monitor consists of variables representing the state of some resource,
procedures that implement operations on it, and initialization code.
The values of the variables are initialized before any procedure in the
monitor is called; these values are retained between procedure invocations
and may be accessed only by procedures in the monitor.
Monitor procedures resemble ordinary procedures in the programming
language with one significant exception. The execution of the procedures in
the same monitor is guaranteed to be mutually exclusive. Hence monitors
are a structured way of implementing mutual exclusion.
44. Sequent C - Monitor
Programming languages that support monitors include Concurrent
Pascal (Brinch Hansen 1975, 1977) and Modula (Wirth 1977a,
1977b, 1977c). Even if your parallel programming language does
not support monitors, you can implement one yourself. For example, in
the Sequent C language, you can implement a monitor by declaring a
shared lock variable for each resource, putting an s_lock statement
that accesses the variable at the start of each procedure, and putting
an s_unlock statement at the end of each procedure. You also must
have enough self-discipline to use only these procedures to access the
shared resource.