Japan Lustre User Group 2014
1. Extreme Big Data (EBD)
Convergence of Extreme Computing
and Big Data Technologies
Global Scientific Information and Computing Center
Tokyo Institute of Technology
Satoshi Matsuoka
2. Big Data Examples
Rates and volumes are immense
Social NW
• Facebook
– 1 billion users
– Average 130 friends
– 30 billion pieces of content
shared per month
• Twitter
– 500 million active users
– 340 million tweets per day
• Internet
– 300 million new websites per year
– 48 hours of video to YouTube per minute
– 30,000 YouTube videos played per second
Genomics
Social Simulation
3. Sequencing data (bp/$) becomes x4000 per 5 years
c.f., HPC: x33 in 5 years
• Applications
– Target Area: Planet
(Open Street Map)
– 7 billion people
• Input Data
– Road Network for Planet:
300GB (XML)
– Trip data for 7 billion people
10KB (1trip) x 7 billion = 70TB
– Real-Time Streaming Data
(e.g., Social sensor, physical data)
• Simulated Output for 1 Iteration
– 700TB
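The input and output volumes above are simple arithmetic; a quick sketch to check them (figures from the slide, decimal units assumed):

```python
# Back-of-the-envelope sizing for the planet-scale social simulation
# (figures from the slide: 300 GB road network, 10 KB per trip, 7 billion people).

KB, GB, TB = 10**3, 10**9, 10**12   # decimal units, as used on the slide

road_network = 300 * GB             # OpenStreetMap road network for the planet (XML)
trip_data = 10 * KB * 7 * 10**9     # one 10 KB trip record per person

print(f"trip data: {trip_data / TB:.0f} TB")                      # 70 TB, matching the slide
print(f"output per iteration vs. trip input: {700 * TB / trip_data:.0f}x")  # 10x
```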
Weather
A-1. Quality Control
A-2. Data Processing
30-sec Ensemble Forecast Simulations: 2 PFLOP
Ensemble Data Assimilation: 2 PFLOP
Himawari input: 500MB/2.5min
Ensemble Forecasts
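If the "2 PFLOP" boxes above denote total floating-point operations per phase of the 30-second cycle (an assumption; the slide does not spell out the units), the implied sustained rates work out as:

```python
# Sustained rates implied by the 30-second ensemble weather-prediction cycle.
# Assumption: the two "2 PFLOP" figures are total operations per phase, and both
# phases (data assimilation + forecast simulation) must fit in one 30 s cycle.

PFLOP = 10**15
cycle_s = 30
ops_total = 2 * PFLOP + 2 * PFLOP            # assimilation + forecast

sustained = ops_total / cycle_s              # required sustained FLOP/s
print(f"{sustained / 10**12:.0f} TFLOPS sustained")   # 133 TFLOPS sustained

# Himawari satellite input stream: 500 MB every 2.5 minutes
himawari_bw = 500 * 10**6 / 150              # bytes/s
print(f"{himawari_bw / 10**6:.1f} MB/s")     # 3.3 MB/s
```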
8. Future "Extreme Big Data"
• NOT mining TByte silo data
• Peta~Zettabytes of data
• Ultra high-BW data streams
• Highly unstructured, irregular data
• Complex correlations between data from multiple sources
• Extreme capacity, bandwidth, and compute all required
15. A: 0.57, B: 0.19, C: 0.19, D: 0.05
November 15, 2010
Graph 500 Takes Aim at a New Kind of HPC
Richard Murphy (Sandia NL → Micron)
"I expect that this ranking may at times look very different from the TOP500 list. Cloud architectures will almost certainly dominate a major chunk of part of the list."
The 8th Graph500 List (June 2014): K Computer #1, TSUBAME2 #12
Koji Ueno, Tokyo Institute of Technology/RIKEN AICS
RIKEN Advanced Institute for Computational Science (AICS)'s K computer is ranked No. 1 on the Graph500 Ranking of Supercomputers with 17977.1 GE/s on Scale 40 on the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014. Congratulations from the Graph500 Executive Committee. (#1 K Computer)
Reality: Top500 Supercomputers Dominate
Global Scientific Information and Computing Center, Tokyo Institute of Technology's TSUBAME 2.5 is ranked No. 12 on the Graph500 Ranking of Supercomputers with 1280.43 GE/s on Scale 36 on the 8th Graph500 list published at the International Supercomputing Conference, June 22, 2014. Congratulations from the Graph500 Executive Committee. (#12 TSUBAME2)
No Cloud IDCs at all
23. A Major Northern Japanese Cloud Datacenter (2013)
– Core: 2 x Juniper MX480 to the Internet, 10GbE
– 2 zone switches (Virtual Chassis): 2 x Juniper EX8208, 10GbE
– Per zone: Juniper EX4200 switches (LACP), 700 nodes per zone
– 8 zones, total 5600 nodes
– Injection 1GBps/Node, Bisection 160 Gbps
Supercomputer: Tokyo Tech. TSUBAME 2.0, #4 Top500 (2010)
– Advanced silicon photonics: 40G on a single CMOS die, 1490nm DFB, 100km fiber
– ~1500 compute and storage nodes, full-bisection multi-rail optical network
– Injection 80GBps/Node, Bisection 220 Tbps
– x1000!
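The "x1000!" comparison between the two networks above can be checked directly from the quoted bisection figures:

```python
# Bisection bandwidth: TSUBAME 2.0's full-bisection optical network vs. the
# cloud datacenter's tree network (both figures taken from the slides).

Gbps, Tbps = 10**9, 10**12

cloud_bisection = 160 * Gbps       # 8 zones, 5600 nodes
tsubame_bisection = 220 * Tbps     # ~1500 nodes, full bisection, multi-rail

print(f"ratio: x{tsubame_bisection / cloud_bisection:.0f}")   # x1375, i.e. roughly x1000
```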
24. Towards Extreme-scale
Supercomputers and BigData Machines
• Computation
– Increase in Parallelism, Heterogeneity, Density
• Multi-core, Many-core processors
• Heterogeneous processors
• Hierarchical Memory/Storage Architecture
– NVM (Non-Volatile Memory),
SCM (Storage Class Memory)
– Next-gen HDDs (SMR),
Tapes (LTFS)
Problems: Algorithm, Network, Locality, Power, FT, Productivity, Storage Hierarchy, I/O, Scalability, Heterogeneity
25. Extreme Big Data (EBD)
Next Generation Big Data
Infrastructure Technologies Towards
Yottabyte/Year
Principal Investigator
Satoshi Matsuoka
Global Scientific Information and
Computing Center
Tokyo Institute of Technology
2014/11/05 JST CREST Big Data Symposium
26. EBD Research Scheme
Future Non-Silo Extreme Big Data Apps:
– Large-Scale Metagenomics
– Massive Sensors and Data Assimilation in Weather Prediction
– Ultra-Large-Scale Graphs and Social Infrastructures
Co-Design with EBD System Software (incl. EBD Object System):
– EBD Bag, Cartesian Plane, Graph Store, EBD KVS (distributed KVS spanning ~1000km)
Co-Design with a Convergent Architecture (Phases 1~4): Large-Capacity NVM, High-Bisection NW
– DRAM + NVM/Flash stacks on a TSV interposer: 2Tbps HBM, 4~6 HBM channels, 1.5TB/s DRAM and NVM BW, 30PB/s I/O BW possible → 1 Yottabyte / Year
– vs. today: supercomputers are compute/batch-oriented; Cloud+IDC (PCB, one high-powered main CPU plus low-power CPUs) has very low BW efficiency
⇒ Exascale Big Data HPC
(Embedded poster: Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka (TITECH), "A Multi-GPU Read Alignment Algorithm")
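The memory-bandwidth figures quoted above are internally consistent; a quick check, assuming the 1.5TB/s DRAM figure corresponds to the 6-channel configuration:

```python
# Checking the slide's memory-bandwidth arithmetic: 2 Tbps per HBM channel,
# 4~6 channels per TSV-interposer package.

Tbps = 10**12
channel_bw_bytes = 2 * Tbps / 8            # 0.25 TB/s per channel

for channels in (4, 5, 6):
    print(f"{channels} channels: {channels * channel_bw_bytes / 10**12:.2f} TB/s")
# The 6-channel case gives the 1.5 TB/s DRAM bandwidth quoted on the slide.
```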
28. 100,000-Times-Fold EBD "Convergent" System Overview
Tasks 5-1~5-3 & Task 6: EBD Application Co-Design and Validation
– Large-Scale Genomic Correlation
– Data Assimilation in Large-Scale Sensors and Exascale Atmospherics
– Large-Scale Graphs and Social Infrastructure Apps
Task 4: EBD Performance Modeling & Evaluation
Task 3: EBD Programming System (EBD Bag, Cartesian Plane, KVS, Graph Store, EBD KVS spanning ~1000km)
Task 2: EBD "converged" Real-Time Resource Scheduling; EBD Distributed Object Store on 100,000 NVM Extreme Compute and Data Nodes
Task 1: Ultra-Parallel, Low-Power I/O EBD "Convergent" Supercomputer (TSUBAME 2.0/2.5 → TSUBAME 3.0)
– Ultra high-BW, low-latency NVM and NW; processor-in-memory, 3D stacking
– ~10TB/s → ~100TB/s → ~10PB/s
29. 100,000-Times-Fold EBD "Convergent" System Overview
Programming systems: SQL for EBD; XPregel (Graph); Workflow/Scripting Languages for EBD; MapReduce for EBD; Message Passing (MPI, X10) for EBD; PGAS/Global Array for EBD
EBD Abstract Data Models (Distributed Array, Key-Value, Sparse Data Model, Tree, etc.)
EBD Algorithm Kernels (Search/Sort, Matching, Graph Traversals, etc.)
System software: EBD Burst I/O Buffer; EBD Network Topology and Routing; EBD File System; EBD Data Object; EBD Bag; EBD KVS (spanning ~1000km); Graph Store
Storage and platforms: NVM; HPC Storage; Web Object Storage; Cloud Datacenter; TSUBAME 3.0; TSUBAME-GoldenBox; Interconnect (InfiniBand, 100GbE); Network (SINET5); Intercloud/Grid (HPCI)
Apps: Large-Scale Genomic Correlation; Data Assimilation in Large-Scale Sensors and Exascale Atmospherics; Large-Scale Graphs and Social Infrastructure Apps
Embedded poster excerpt ("A Multi-GPU Read Alignment Algorithm", Aleksandr Drozd, Naoya Maruyama, Satoshi Matsuoka, TITECH): DNA consists of units called nucleotides. To decipher the information it contains, we need to determine their order; this task is important for many areas of science and medicine. Modern sequencing techniques break the molecule into pieces (called reads) that are processed separately to increase sequencing throughput. Reads must then be aligned to the reference sequence to determine their position in the molecule; this process is called alignment.
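The alignment process described in the poster excerpt can be illustrated minimally. The poster's actual algorithm is a multi-GPU implementation; this exact-match sketch (with a made-up reference and reads) only shows what "aligning a read" means:

```python
# Minimal illustration of read alignment: locating each short read's position
# in a reference sequence. Real aligners use indexes (e.g. suffix arrays or an
# FM-index) and tolerate mismatches; exact string search stands in for that here.

def align(reference: str, reads: list[str]) -> dict[str, int]:
    """Map each read to its first exact-match position in the reference (-1 if unmapped)."""
    return {read: reference.find(read) for read in reads}

reference = "ACGTACGTTAGC"            # toy reference sequence
reads = ["GTAC", "TTAG", "CCCC"]      # toy reads; the last one does not occur
print(align(reference, reads))        # {'GTAC': 2, 'TTAG': 7, 'CCCC': -1}
```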
34. EBD-I/O Device: A Prototype of Local Storage
Configuration [Shirahata, Sato et al. GTC2014]: 16 cards of mSATA SSD devices
– Capacity: 256GB x 16 → 4TB; Read BW: 0.5GB/s x 16 → 8GB/s
– High bandwidth and IOPS, huge capacity, low cost, power efficient

3.2 Burst Buffer System
To solve the problems in a flat buffer system, we consider a burst buffer system [21]. A burst buffer is a storage space to bridge the gap in latency and bandwidth between node-local storage and the PFS, and is shared by a subset of compute nodes. Although additional nodes are required, a burst buffer can offer a system many advantages, including higher reliability and efficiency over a flat buffer system. A burst buffer system is more reliable for checkpointing because burst buffers are located on a smaller number of dedicated I/O nodes, so the probability of lost checkpoints is decreased. In addition, even if a large number of compute nodes fail concurrently, an application can still access the checkpoints from the burst buffer. A burst buffer system provides more efficient utilization of storage resources for partial restart of uncoordinated checkpointing because the processes involved in the restart can exploit higher storage bandwidth. For example, if compute nodes 1 and 3 are in the same cluster and both restart from a failure, the processes can utilize all SSD bandwidth, unlike in a flat buffer system. This capability accelerates the partial restart of uncoordinated checkpoint/restart.
Table 1: Node specification
CPU: Intel Core i7-3770K (3.50GHz x 4 cores)
Memory: Cetus DDR3-1600 (16GB)
M/B: GIGABYTE GA-Z77X-UD5H
SSD: Crucial m4 mSATA 256GB CT256M4SSD3 (peak read: 500MB/s, peak write: 260MB/s)
SATA converter: KOUTECH IO-ASS110 mSATA to 2.5" SATA Device Converter with Metal Frame
RAID card: Adaptec RAID 7805Q ASR-7805Q Single
Photos: a single mSATA SSD; 8 integrated mSATA SSDs; RAID cards; prototype/test machine
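The partial-restart argument in Sec. 3.2 can be put into a toy model. The SSD bandwidth matches Table 1; the buffer size, checkpoint size, and two-process restart scenario are illustrative assumptions, not figures from the paper:

```python
# Toy model of partial restart: in a flat buffer system each restarting process
# reads only its own node-local SSD, while a burst buffer lets the few
# restarting processes share all SSDs behind the buffer.

ssd_bw = 0.5           # GB/s per SSD (cf. the mSATA devices in Table 1)
ssds_per_buffer = 4    # SSDs aggregated into one burst buffer (assumed)
ckpt_size = 10.0       # GB checkpoint per process (assumed)

# Two processes (e.g. on compute nodes 1 and 3) restart after a failure:
flat_time = ckpt_size / ssd_bw                             # each limited to one SSD
burst_time = 2 * ckpt_size / (ssd_bw * ssds_per_buffer)    # share all 4 SSDs

print(f"flat buffer:  {flat_time:.0f} s per process")      # 20 s
print(f"burst buffer: {burst_time:.0f} s total")           # 10 s
```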
36. Sorting for EBD
Plugging in GPUs for large-scale sorting [Shamoto, Sato et al. BigData 2014]
• GPU implementation of splitter-based sorting (HykSort)
• Weak scaling performance (Grand Challenge on TSUBAME2.5)
– 1 ~ 1024 nodes (2 ~ 2048 GPUs), 2 processes per node, 2GB of 64-bit integers per node
– Measured keys/second (billions) vs. # of processes for HykSort with 1 thread, 6 threads, and GPU + 6 threads
– Chart annotations: x1.4, x3.61, x389; peak 0.25 [TB/s]
– c.f. Yahoo/Hadoop Terasort: 0.02 [TB/s] (including I/O)
• Performance prediction (PCIe_#: # GB/s bandwidth)
– x2.2 speedup over the CPU-based implementation when the PCIe bandwidth of the interconnect between CPU and GPU increases to 50GB/s
– 8.8% reduction of overall runtime when the accelerators work 4 times faster than K20x
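The splitter-based sorting idea behind HykSort can be sketched in a few lines. This is not the paper's implementation (which distributes the steps over MPI ranks and offloads the local sorts to GPUs); here plain Python lists stand in for per-process key arrays:

```python
# Sketch of splitter-based (sample) sorting, the idea underlying HykSort.
import random

def splitter_sort(per_proc_keys):
    p = len(per_proc_keys)
    # 1. Sort locally (the step the paper accelerates with GPUs).
    local = [sorted(keys) for keys in per_proc_keys]
    # 2. Sample candidate splitters from every process, pick p-1 global splitters.
    sample = sorted(x for keys in local for x in keys[::max(1, len(keys) // p)])
    splitters = [sample[(i * len(sample)) // p] for i in range(1, p)]
    # 3. "All-to-all": route each key to the bucket owning its splitter range.
    buckets = [[] for _ in range(p)]
    for keys in local:
        for x in keys:
            buckets[sum(x >= s for s in splitters)].append(x)
    # 4. Final local sort per bucket; concatenating buckets is globally sorted.
    return [sorted(b) for b in buckets]

data = [[random.randrange(10**6) for _ in range(1000)] for _ in range(4)]
result = splitter_sort(data)
flat = [x for b in result for x in b]
assert flat == sorted(x for keys in data for x in keys)
```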
37. Graph500 Benchmark  http://www.graph500.org
! New BigData benchmark based on large-scale graph search for ranking supercomputers
! BFS (Breadth-First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree = edgefactor (=16)
! Evaluation criteria: TEPS (Traversed Edges Per Second), the problem size (SCALE) that can be solved on a system, and the minimum execution time
Benchmark flow (input parameters: SCALE, edgefactor = 16):
1. Generation → 2. Construction → 3. BFS + Validation, repeated x64
Each iteration reports BFS time, traversed edges, and TEPS; the reported score is the median TEPS (TEPS ratio) over the 64 iterations
• Kronecker graph: 2^SCALE vertices and 2^(SCALE+4) edges; synthetic scale-free network
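The measurement loop described above can be sketched as follows. A small random graph stands in for the Kronecker generator, and the TEPS accounting is simplified; this only illustrates the benchmark's structure, not its official rules:

```python
# Sketch of the Graph500 measurement loop: BFS from random sources, scored in
# TEPS (traversed edges per second), median over 64 iterations.
import random
import time
from collections import deque

def bfs(adj, src):
    """Return the BFS parent map and the number of edge scans."""
    parent = {src: src}
    q = deque([src])
    traversed = 0
    while q:
        u = q.popleft()
        for v in adj[u]:
            traversed += 1                  # every scanned edge counts
            if v not in parent:
                parent[v] = u
                q.append(v)
    return parent, traversed

# Small stand-in graph: 2^10 vertices, edgefactor 16 (the real benchmark
# uses a Kronecker generator with 2^SCALE vertices).
n, edgefactor = 2**10, 16
adj = {v: [] for v in range(n)}
for _ in range(edgefactor * n):
    a, b = random.randrange(n), random.randrange(n)
    adj[a].append(b)
    adj[b].append(a)                        # undirected

teps = []
for _ in range(64):                         # the benchmark runs 64 BFS iterations
    t0 = time.perf_counter()
    _, m = bfs(adj, random.randrange(n))
    teps.append(m / (time.perf_counter() - t0))

teps.sort()
print(f"median TEPS: {teps[32]:.3g}")       # Graph500 reports the median
```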
38. Scalable distributed memory BFS for Graph500
Koji Ueno (Tokyo Institute of Technology) et al.
! What’s the best algorithm for distributed memory
BFS?
We proposed and explored many
optimizations.
Optimizations SC11 ISC12 SC12 ISC14
2D decomposition ✓ ✓ ✓ ✓
vertex sorting ✓
direction optimization ✓
data compression ✓ ✓ ✓
sparse vector with pop counting ✓
adaptive data representation ✓
overlapped communication ✓ ✓ ✓ ✓
shared memory ✓
GPGPU ✓ ✓
(✓ = optimization used in that version)
Graph500 score history of TSUBAME2 — Performance (GTEPS): SC'11: 100, ISC'12: 317, SC'12: 462, ISC'14: 1,280
Continuous effort to improve performance
Optimized for various machines
Machine # of nodes Performance
K computer 65536 5524 GTEPS
TSUBAME2.5 1024 1280 GTEPS
TSUBAME-KFC 32 104 GTEPS
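The "direction optimization" listed in the table above (due to Beamer et al.) can be sketched on a single node: switch from top-down frontier expansion to bottom-up parent search when the frontier gets large. The TSUBAME2 implementation combines this with 2D decomposition, compression, and GPUs; this sketch shows only the switching idea, with an assumed threshold parameter `alpha`:

```python
# Single-node sketch of direction-optimizing BFS: top-down expansion for small
# frontiers, bottom-up parent search for large ones.

def bfs_direction_opt(adj, src, alpha=4):
    n = len(adj)
    total_edges = sum(map(len, adj))
    parent = [-1] * n
    parent[src] = src
    frontier = {src}
    while frontier:
        if sum(len(adj[u]) for u in frontier) * alpha > total_edges:
            # Bottom-up: each unvisited vertex scans its neighbors for a parent.
            nxt = set()
            for v in range(n):
                if parent[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            parent[v] = u
                            nxt.add(v)
                            break
        else:
            # Top-down: expand the frontier along outgoing edges.
            nxt = set()
            for u in frontier:
                for v in adj[u]:
                    if parent[v] == -1:
                        parent[v] = u
                        nxt.add(v)
        frontier = nxt
    return parent

print(bfs_direction_opt([[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]], 0))  # [0, 0, 0, 1, 3]
```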
45. Expectations for Next-Gen Storage System
(Towards TSUBAME3.0)
• Achieving high IOPS
– Many apps with massive small I/O ops
• graph, etc.
• Utilizing NVM devices
– Discrete local SSDs on TSUBAME2
– How to aggregate them?
• Stability/Reliability as Archival Storage
• I/O resource reduction/consolidation
– Can we afford a large number of OSSs to achieve ~TB/s throughput?
– Many constraints
• Space, Power, Budget, etc.
46. Current Status
• New Approaches
• Tsubame 2.0 has pioneered the use of local flash storage as
a high-IOPS alternative to an external PFS
• Tiered and hybrid storage environments, combining (node-)local flash with an external PFS
• Industry Status
• High-performance, high-capacity flash (and other new semiconductor devices) is becoming available at reasonable cost
• New approaches/interfaces for using high-performance devices (e.g., NVM Express)