3. Motivation
• Distributed memory clusters are becoming pervasive in
industry and academia
• Shells are the default login environment on these
systems
• Shell pipes are commonly used to compose extensible
Unix commands.
• There has been no change to the syntax/semantics of
shell pipes since their invention over 30 years ago.
• Growing need to compose massively parallel jobs
quickly, using existing software
4. Extending Shells for Parallel
Computing
• Build a simple, powerful coordination layer in the shell
• The coordination layer transparently manages the parallelism in the workflow
• The user specifies a parallel computation as a dataflow graph using extensions to the shell
• Provides the ability to combine different tools and quickly build interesting parallel programs.
5. Shell pipe extensions
• Pipeline fork: A | B on n procs
• Pipeline join: A on n procs | B
• Pipeline cycles: (++ n A)
• Pipeline key-value aggregation: A | B on keys
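A hedged sketch of how these forms might look in a session (the filters and file names are illustrative, not from the slides):

  # fork: fan a filter out over 8 processes
  cat access.log | grep ERROR on 8 procs
  # join: many parallel producers feed one consumer
  gen_data on 8 procs | sort
  # key-value aggregation: route tuples to B instances by key
  map on all procs | reduce on keys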
6. Parallel shell tasks extensions
> function foo()
{
    echo "hello world"
}
> foo on all procs      # foo() on all CPUs
> foo on all nodes      # foo() on all nodes
stride:
> foo on 10:2 procs     # 10 tasks, 2 tasks on each node
span:
> foo on 10:2:2 procs   # 10 tasks, 2 tasks on alternate nodes
7. Composing data-flow graphs
• Example 1: A fans out to two tasks, one running B1 and the other B2, which join into C
function B1() { :; }   # placeholder body
function B2() { :; }   # placeholder body
function B()
{
    if (( $_ASPECT_TASKID == 0 )) ; then
        B1
    else
        B2
    fi
}
A | B on 2 procs | C
8. Composing data-flow graphs
• Example 2: map tasks emit key-value tuples into a key-value DHT, which the reduce tasks consume
function map()
{
    emit_tuple -k key -v value
}
function reduce()
{
    consume_tuple -k key -v value
    num=${#value[@]}
    for ((i=0; i < $num; i++)) ; do
        # process key=$key, value=${value[$i]}
        :
    done
}
map on all procs | reduce on keys
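As a concrete illustration, a minimal word-count sketch in the same style, assuming emit_tuple/consume_tuple behave as shown above (the input file and tokenization are illustrative):

  function map()
  {
      # emit one tuple per word, with count 1
      while read -r word ; do
          emit_tuple -k "$word" -v 1
      done
  }
  function reduce()
  {
      # consume_tuple fills $key and the $value array, as above
      consume_tuple -k key -v value
      echo "$key ${#value[@]}"
  }
  tr -s ' ' '\n' < words.txt | map on all procs | reduce on keys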
10. Startup Overlay
• Script may have many instances requiring
startup of parallel tasks
• Motivation for overlay:
– Fast startup of parallel shell workers
– Handles node failures gracefully
• Two level hierarchy: sectors and proxies
• Overlay node addressing: the compute node ID is an 8-bit field (bits 7–0) split into a sector ID and a proxy ID
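A minimal sketch of decoding such an address; the exact bit split is not given on the slide, so an even 4/4 split is assumed here:

  node_id=0x2A                            # 8-bit compute node ID
  sector_id=$(( (node_id >> 4) & 0xF ))   # ASSUMPTION: high nibble = sector id
  proxy_id=$((   node_id       & 0xF ))   # ASSUMPTION: low nibble = proxy id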
11. Fault-Tolerance
• Proxy nodes monitor peers within sector, and
sector heads monitor peer sectors
• Node 0 maintains a list of available nodes in the
overlay in a master_node file
(Diagram: two overlay sectors; each node runs a proxy and exec workers, and node 0 maintains the master_node file.)
18. 1. Process B pipes stdin into stdin_file
A | B on N procs
(Diagram: BASH connects A's stdout to the aspect-agent standing in for B via a pipe (1); the agent's stdin reader drains the pipe into stdin_file.)
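In effect, the agent's reader performs the equivalent of the following (a sketch; stdin_file is the name used on the slide):

  # step (1): drain the pipe from A into a file for later dispatch
  cat - > stdin_file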
19. 2. Constructs command files for each task
A | B on N procs
(Diagram: the agent's command dispatcher (2) constructs one command file per task, each of the form "cat stdin_file | B".)
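A sketch of what the dispatcher might generate, one command file per task; the file names are hypothetical, and how the input is partitioned among tasks is not shown on the slide:

  # step (2): write one command file per task
  for ((t = 0; t < N; t++)) ; do
      echo "cat stdin_file | B" > cmd.$t
  done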
20. 3. 4. and 5. Execute command files in shell workers and marshal results back to shell
A | B on N procs
(Diagram: the dispatcher queues command files (3) to per-node MUXes; each compute node runs "cat stdin_file | B" in shell workers (4); the I/O flushers and MUXes marshal stdout back to the shell (5).)
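A sketch of the worker side of steps (4) and (5), with the queue and MUX plumbing elided (names hypothetical):

  for cmd in cmd.* ; do
      bash "$cmd" > "$cmd.out" 2>&1   # (4) execute in a shell worker
      cat "$cmd.out"                  # (5) stdout marshaled back through the MUXes
  done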
21. 6. Replay command files on failure
A | B on N procs
(Diagram: as in the previous step, plus a replayer (6) that re-executes a failed node's command files in shell workers on the local compute node.)
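A sketch of the replay idea in step (6): because each task's input is captured in stdin_file and its command file is retained, a failed task can simply be re-executed, here in a local shell worker (error handling is simplified):

  if ! bash "$cmd" > "$cmd.out" 2>&1 ; then
      # (6) replay the same command file locally
      bash "$cmd" > "$cmd.out" 2>&1
  fi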
23. 1. Agent inspects and hashes key
A | B on keys
(Diagram: BASH pipes A's output to the aspect-agent for B; the agent's key dispatcher inspects each tuple and hashes its key (1).)
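A sketch of the hashing step; the actual hash function is not specified on the slides, so cksum stands in:

  # (1) hash the tuple's key to pick a compute node
  hash=$(printf '%s' "$key" | cksum | cut -d' ' -f1)
  node=$(( hash % num_nodes ))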
24. 2. Routes key-value to a compute node based on the key hash, and stores it in a hash table
A | B on keys
(Diagram: the key dispatcher routes each key-value pair through the node MUX to the compute node selected by the key hash (2); each node stores its entries in a local gdbm hash table, together forming a distributed hash table.)
25. 3. Each node constructs command files to pipe the key-value entry from its hash table into process B
A | B on keys
(Diagram: each compute node reads the tuples stored in its local gdbm hash table and constructs command files that pipe each key-value entry into an instance of B (3).)
26. 4. Results from the command file executions are marshaled back to the shell
A | B on keys
(Diagram: the stdout of each B instance flows back through the node MUX and the agent's I/O MUX to the shell (4).)
31. TeraSort benchmark:
Parallel bucket sort
• Step 1: spawn the data generator in parallel on each compute node, partitioning data across N nodes: a record belongs to task T if its first 2 bytes fall in the range [2^16 · T/N, 2^16 · (T+1)/N) (see the sketch after this list)
• Step 2: perform sort on local data on each node
• Step 3: merge results onto global file system
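For example, the bucket test for one record in shell arithmetic (variable names are illustrative; key16 is the integer value of the record's first 2 bytes, 0..65535):

  lo=$(( 65536 *  T      / N ))
  hi=$(( 65536 * (T + 1) / N ))
  if (( key16 >= lo && key16 < hi )) ; then
      :   # record belongs to task T's bucket
  fi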
33. Related Work
• Ptolemy – embedded system design
• Yahoo Pipes – web content filtering
• Hadoop – Java implementation of
MapReduce
• Dryad – distributed DAG dataflow computation
34. Conclusion
• A debugger would be extremely helpful. Working on a bashdb implementation.
• A run-time simulator would be helpful to predict performance based on the characteristics of the cluster.
• Still thinking about how to incorporate our extensions for named pipes (i.e., mkfifo).