HPCS16 - Frederick Lefebvre - Bridging the last mile
1. A platform for data management and analytics in campuses and research labs
Frédérick Lefebvre
frederick.lefebvre@calculquebec.ca
2. ● Compute Canada and its regional partners have put a lot of work into interconnecting their infrastructure through CANARIE's and the NRENs' high-speed networks
● 10 GbE right now / 100 GbE for all new systems
● 25 Globus/GridFTP data transfer nodes have been deployed to facilitate data movement across the Compute Canada infrastructure (a transfer sketch follows below)
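As a minimal sketch of what such a DTN-to-DTN transfer looks like when driven programmatically with the Globus Python SDK (the client ID, endpoint UUIDs and paths below are placeholders, not values from the talk):

```python
# Minimal sketch: submit a Globus transfer between two DTN endpoints.
# CLIENT_ID, endpoint UUIDs and paths are placeholders (assumptions).
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"
SRC_ENDPOINT = "source-endpoint-uuid"
DST_ENDPOINT = "destination-endpoint-uuid"

# Interactive native-app login to obtain a transfer token
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
tokens = auth_client.oauth2_exchange_code_for_tokens(input("Code: "))
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

# Build and submit the transfer task
tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)
task = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="lab-to-CC")
task.add_item("/data/experiment/", "/project/experiment/", recursive=True)
print("Task ID:", tc.submit_transfer(task)["task_id"])
```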
3. Fast data transfers between datacenters are great, but what about everyone else?
4. ● Data doesn't just magically appear on Compute Canada's systems.
● It gets created "somewhere", has a life of its own, comes to our systems for a brief time, and goes back home...
5. Utilization data from the CC Globus infrastructure over the past 2 years supports this model
6. ● Transfers to and from our infrastructure
○ More data moves back out but not by much
7. ● As we centralize resources, we are moving storage and computing further away from researchers
● Visualization and real-time computation, as well as application development and prototyping, can be impaired by the increased latency to the systems and their teams
8. ● There is a need to improve the tools available to researchers to facilitate their use of Advanced Research Computing (ARC) resources.
○ Improved end-to-end networking
○ Wider deployment of data movement and pre-processing infrastructure
9. ● Deploy Data Transfer Nodes (DTNs) close to where data is generated and extend the Science DMZ all the way to the labs
○ DTNs administered by the local ARC team
○ Local ingestion points can be dedicated to a research lab or to the whole campus
Based on the FIONA DTN developed by SDSC for the Pacific Research Platform
https://fasterdata.es.net/science-dmz/DTN/fiona-flash-i-o-network-appliance/
10. ● Science DMZ
○ Dedicated research network
○ Away from firewalls
○ All the way to the researchers
Ref: Science DMZ - es.net
http://fasterdata.es.net/science-dmz/science-dmz-architecture/
11. ● High-speed data transfers need purpose-built Data Transfer Nodes
● Above all, they require fast drives to prevent disk I/O from becoming the bottleneck
● Spinning disks are seldom usable unless you are going to have lots of them
○ Think tens of them to achieve 40 Gbps! (rough arithmetic below)
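A rough back-of-the-envelope check of that claim, using an assumed per-disk throughput that is not from the talk:

```python
# 40 Gbps expressed in MB/s, versus an assumed ~200 MB/s sustained
# sequential rate for a single 7,200 RPM spinning disk (assumption).
target_gbps = 40
target_mb_per_s = target_gbps * 1000 / 8        # = 5000 MB/s
per_disk_mb_per_s = 200                         # assumed; varies by drive
print(target_mb_per_s / per_disk_mb_per_s)      # -> 25 drives, before RAID/filesystem overhead
```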
12. ● Modern processors have much more power than what is required to move data from drives to networks
● The fast I/O of a DTN and its large memory make it ideal for running streaming workloads, data analytics and general data transformation
● Why let it sit idle?
13. ● Enhance the DTNs with the ability to run code on local data through a web interface
○ Focus for now on scripting languages and big data analytics with Apache Spark (see the sketch below)
○ Creates an environment where data can be ingested, explored, modified and then moved elsewhere
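For instance, a minimal PySpark sketch of exploring a dataset sitting on the DTN's local storage; the file path and column names are invented for illustration:

```python
# Minimal sketch: explore a locally ingested dataset with Apache Spark.
# The path and column names are placeholders, not from the talk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dtn-local-exploration").getOrCreate()

# Read data straight off the DTN's fast local storage
df = spark.read.csv("/dtn/scratch/sensor_readings.csv", header=True, inferSchema=True)

# Quick look before deciding what to ship to the parallel systems
df.printSchema()
df.groupBy("station").agg({"temperature": "avg"}).show()

spark.stop()
```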
14.
15. ● GridFTP server inside a container, bound to specific cores (core-pinning sketch below)
● All other cores shared by the OS and user code
● JupyterLab to manage and launch users' Notebooks
● Authentication against the CC LDAP directory
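A minimal sketch of the core-pinning idea, assuming the GridFTP service is containerized with Docker; the image name and core range are illustrative assumptions, not the actual deployment:

```python
# Sketch: start a containerized GridFTP server pinned to a few cores,
# leaving the rest for the OS and user notebooks/analytics.
# Image name and core range are assumptions for illustration.
import subprocess

GRIDFTP_CORES = "0-3"      # cores reserved for the transfer service
IMAGE = "example/gridftp-server:latest"

subprocess.run(
    [
        "docker", "run", "-d",
        "--name", "dtn-gridftp",
        "--cpuset-cpus", GRIDFTP_CORES,   # pin the container to these cores
        "--network", "host",              # DTN transfers use the host network
        IMAGE,
    ],
    check=True,
)
```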
16.
17. ● perfSONAR in containers (in progress)
● Scale out whole Notebooks or Apache Spark workloads to a parallel cluster (in progress)
● Network export of local storage
● Automated data transformation pipelines (see the sketch below)
● Software building blocks & code snippets in the Notebooks
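As a hypothetical sketch of what an automated transformation pipeline on the DTN could look like (directory layout and the transform step are assumptions, not part of the talk):

```python
# Hypothetical sketch: watch an ingest directory, transform new files,
# and drop the results where a Globus transfer can pick them up.
# Directory names and the transform step are assumptions.
import shutil
import time
from pathlib import Path

INCOMING = Path("/dtn/incoming")
OUTGOING = Path("/dtn/outgoing")

def transform(src: Path, dst: Path) -> None:
    # Placeholder transformation: in practice this could anonymize,
    # re-encode or index the file before it leaves the campus.
    shutil.copy2(src, dst)

seen = set()
while True:
    for f in INCOMING.glob("*.dat"):
        if f not in seen:
            transform(f, OUTGOING / f.name)
            seen.add(f)
    time.sleep(30)   # poll every 30 seconds
```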
18. Example workflow (S3 sensor data):
1. Sensors upload data to local storage through an S3 API (sketch below)
2. The researcher explores their data with R and Apache Spark in a Notebook
3. Data is anonymized
4. Anonymized data is transferred to a CC system using Globus
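Step 1 might look like this on the sensor side, assuming the DTN exposes an S3-compatible object store; the endpoint, credentials, bucket and key are placeholders:

```python
# Sketch: a sensor (or its gateway) pushing readings to the DTN's
# S3-compatible local storage. Endpoint, credentials, bucket and key
# are placeholders for illustration.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://dtn.example.campus:9000",  # local S3-compatible service
    aws_access_key_id="SENSOR_KEY",
    aws_secret_access_key="SENSOR_SECRET",
)

s3.upload_file(
    Filename="readings-2016-06-01.csv",
    Bucket="field-sensors",
    Key="station-07/readings-2016-06-01.csv",
)
```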
19. Example workflow (genomics):
1. Sequencers output data to local storage through a CIFS share
2. FASTQ files are preprocessed locally (sketch below)
3. Files are characterized and indexed
4. Data is transferred to a parallel system for further processing
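A minimal sketch of the local characterization in steps 2-3, using only the Python standard library; the file name and the chosen metrics are illustrative assumptions:

```python
# Sketch: characterize a gzipped FASTQ file locally before deciding
# what to ship to the parallel systems. File name and metrics are
# illustrative assumptions.
import gzip

def characterize_fastq(path):
    reads = 0
    bases = 0
    with gzip.open(path, "rt") as fh:
        for i, line in enumerate(fh):
            if i % 4 == 1:          # the second line of every 4-line record is the sequence
                reads += 1
                bases += len(line.strip())
    return {"reads": reads, "bases": bases,
            "mean_read_length": bases / reads if reads else 0}

print(characterize_fastq("sample_R1.fastq.gz"))
```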
20. ● A gateway to get researchers' data onto Compute Canada's infrastructure
● A local platform for data exploration & visualization, pre-processing and prototyping
21. ● A generic web portal to submit workloads on ARC systems
○ We have automated node reservation to scale out Notebooks on Colosse
○ The way we do it on Colosse requires the portal to be a submit host
○ There has to be a better way. A web API? (hypothetical sketch below)
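Purely as a hypothetical sketch of what such a web API could look like (the slide only raises the question; the URL, payload fields and token below are invented):

```python
# Hypothetical sketch: submitting a Notebook scale-out job through a
# REST API instead of requiring the portal to be a submit host.
# URL, payload fields and token are invented for illustration.
import requests

API_URL = "https://arc.example.ca/api/v1/jobs"

job = {
    "cluster": "colosse",
    "type": "notebook-scaleout",
    "nodes": 4,
    "walltime": "02:00:00",
    "notebook_id": "abc123",
}

resp = requests.post(
    API_URL,
    json=job,
    headers={"Authorization": "Bearer USER_TOKEN"},
    timeout=30,
)
resp.raise_for_status()
print("Submitted job:", resp.json()["job_id"])
```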
22. Processors: 2x Xeon E5-2640 v4 (40 logical cores)
Memory: 128 GB DDR4
Network interfaces: Mellanox ConnectX-3 Pro, dual-port 40GbE
Drives for OS: 2x 128 GB SATA SSD
Local storage (performance option): 8x 400 GB NVMe drives
Local storage (capacity option): 24x 8 TB NL-SAS drives
● Cost is from ~12K to 25K and up
○ Storage is the differentiator
23. ● There is a need for high-speed data transport services on campuses and in larger labs
● Local computing capabilities create new opportunities for quick innovation
● We envision a model where researchers finance their local portal to size it to their needs
24. ● We have selected 2 pilot sites that will be deployed this summer
● You can participate by:
○ Becoming a pilot site
○ Contributing to the platform design and development
○ Letting us know how we can improve the model
○ Helping us find a better name…
● Contact us: frederick.lefebvre@calculquebec.ca