A platform for data management and
analytics in campuses and research labs
Frédérick Lefebvre
frederick.lefebvre@calculquebec.ca
● Compute Canada and its regional partners have put a lot
of work into using CANARIE’s and the NRENs’ networks to
interconnect their infrastructure at high speed
● 10 GbE right now / 100 GbE for all new systems
● 25 Globus/GridFTP data transfer nodes have been
deployed to facilitate data movement across the Compute
Canada infrastructure
Fast data transfers between datacenters
are great, but what about everyone else?
● Data doesn't just magically appear on
Compute Canada’s systems.
● It gets created “somewhere”, has a life of its
own, comes to our systems for a brief time and
goes back home...
Utilization data from the CC Globus infrastructure over the
past 2 years supports this model
● Transfers to and from our infrastructure
○ More data moves back out but not by much
● As we centralize resources, we are moving
storage and computing further away from
researchers
● Visualization, real-time computations as well
as application development and prototyping
can be impaired by the increased latency with
the systems and their teams
● There is a need to improve tools available to
researchers to facilitate their use of Advanced
Research Computing resources.
○ Improved end-to-end networking
○ Wider deployment of data movement and pre-
processing infrastructure
● Deploy Data Transfer Nodes (DTNs) close to
where data is generated and extend the
Science-DMZ all the way to the labs
○ DTNs administered by the local ARC team
○ Local ingestion points can be dedicated to a research
lab or the whole campus
Based on the Fiona DTN developed by SDSC for the Pacific Research Platform
https://fasterdata.es.net/science-dmz/DTN/fiona-flash-i-o-network-appliance/
● Science-DMZ
○ Dedicated research network
○ Away from firewalls
○ All the way to the researchers
Ref: Science-dmz - es.net
http://fasterdata.es.net/science-dmz/science-dmz-architecture/
● High-speed data transfers need purpose-built
Data Transfer Nodes
● Above all, they require fast drives to prevent
disk I/O from becoming the bottleneck
● Spinning disks are seldom usable unless you
are going to have lots of them
○ Think tens of them to achieve 40 Gbps!
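As a rough sanity check on the "tens of disks" figure, the arithmetic can be sketched as follows. The per-drive throughput numbers are assumptions chosen for illustration (about 150 MB/s sustained for a spinning disk, about 2,000 MB/s for an NVMe drive), not measurements from this platform, and the sketch ignores RAID and filesystem overhead.

```python
import math

def drives_needed(link_gbps: float, drive_mb_per_s: float) -> int:
    """Return the minimum number of drives whose combined sequential
    throughput matches a network link of `link_gbps` gigabits per second."""
    link_mb_per_s = link_gbps * 1000 / 8  # Gbit/s -> MB/s (decimal units)
    return math.ceil(link_mb_per_s / drive_mb_per_s)

# Assumed sustained sequential rates, for illustration only.
SPINNING_DISK_MB_S = 150
NVME_DRIVE_MB_S = 2000

print(drives_needed(40, SPINNING_DISK_MB_S))  # 34
print(drives_needed(40, NVME_DRIVE_MB_S))     # 3
```

Under these assumptions a 40 Gbps link (5,000 MB/s) needs roughly 34 spinning disks but only a handful of NVMe drives.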
● Modern processors have much more power
than what is required to move data from drives
to networks
● The fast I/O of a DTN and its large memory
make it ideal to run streaming workloads, data
analytics and general data transformation
● Why let it sit idle?
● Enhance the DTNs with the ability to run code
on local data through a web interface
○ Focus for now on scripting languages and big data
analytics with Apache Spark
○ Creates an environment where data can be ingested,
explored, modified and then moved elsewhere
GridFTP server inside a container, bound to specific cores
All other cores shared by the OS and user code
● JupyterLab to manage and launch users’ Notebooks
● Authentication against the CC LDAP directory
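The core split described above can be illustrated with a small sketch. The slides do not say how the container is bound to its cores, so this example uses plain process CPU affinity as a stand-in; the function names and the 8-of-40 split are hypothetical, and `os.sched_setaffinity` is only available on Linux.

```python
import os

def split_cores(total_cores: int, reserved: int) -> tuple[set[int], set[int]]:
    """Partition logical core IDs into a set reserved for the transfer
    service and a set left to the OS and user code."""
    if not 0 < reserved < total_cores:
        raise ValueError("reserved must be between 1 and total_cores - 1")
    transfer = set(range(reserved))
    shared = set(range(reserved, total_cores))
    return transfer, shared

def pin_process(pid: int, cores: set[int]) -> bool:
    """Restrict `pid` to `cores` where the platform supports it (Linux).
    Returns False when CPU affinity is not available."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(pid, cores)
    return True

# Hypothetical split on a 40-logical-core node: 8 cores for the transfer service.
transfer_cores, shared_cores = split_cores(40, 8)
print(sorted(transfer_cores))  # [0, 1, 2, 3, 4, 5, 6, 7]
print(len(shared_cores))       # 32
```

Keeping the two sets disjoint is the point of the design: transfers keep a predictable share of CPU even when user Notebooks are busy.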
● perfSONAR in containers (in progress)
● Scale out whole Notebooks or Apache Spark
workloads to a parallel cluster (in progress)
● Network export of local storage
● Automated data transformation pipelines
● Software building blocks & code snippets in
the Notebooks
1. Sensors upload data to local storage through an S3 API
2. Researcher explores their data with R and Apache Spark in a Notebook
3. Data is anonymized
4. Anonymized data is transferred to a CC system using Globus
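The slides do not say how the data is anonymized in step 3. One common approach, shown here only as a minimal sketch, is to replace direct identifiers with keyed hashes before the data leaves the local node (strictly speaking this is pseudonymization). The column names, sample records and key below are all hypothetical.

```python
import csv
import hashlib
import hmac
import io

def pseudonymize(value: str, key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (hex, truncated)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def anonymize_csv(text: str, id_columns: list[str], key: bytes) -> str:
    """Return CSV text with the listed identifier columns pseudonymized."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for col in id_columns:
            row[col] = pseudonymize(row[col], key)
        writer.writerow(row)
    return out.getvalue()

# Hypothetical sensor records; the column names are illustrative.
sample = "sensor_id,participant,reading\ns1,alice,0.42\ns2,alice,0.57\n"
print(anonymize_csv(sample, ["participant"], key=b"replace-with-a-secret-key"))
```

Because the hash is keyed, the same participant maps to the same token across files, so records can still be joined after transfer without exposing the original identifier.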
1. Sequencers output data on local storage through CIFS share
2. Fastq files are preprocessed locally
3. Files are characterized and indexed
4. Data is transferred to parallel system for further processing
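Step 3 is not detailed in the slides. As a minimal sketch of what characterizing and indexing a FASTQ file might look like, the example below computes a few summary statistics and maps each read ID to its record number. It assumes the common four-lines-per-record layout; the sample reads are made up.

```python
import io
from typing import Iterator, TextIO

def read_fastq(handle: TextIO) -> Iterator[tuple[str, str, str]]:
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed record near {header!r}")
        yield header[1:].split()[0], seq, qual

def characterize(handle: TextIO) -> dict:
    """Summarize a FASTQ stream and index each read ID by record number."""
    index, reads, total_bases, gc = {}, 0, 0, 0
    for read_id, seq, _qual in read_fastq(handle):
        index[read_id] = reads
        reads += 1
        total_bases += len(seq)
        gc += sum(seq.upper().count(base) for base in "GC")
    return {
        "reads": reads,
        "mean_length": total_bases / reads if reads else 0.0,
        "gc_fraction": gc / total_bases if total_bases else 0.0,
        "index": index,
    }

# Made-up two-read sample for illustration.
sample = "@r1 lane1\nGATTACA\n+\nIIIIIII\n@r2 lane1\nGGCC\n+\nIIII\n"
summary = characterize(io.StringIO(sample))
print(summary["reads"], summary["mean_length"], round(summary["gc_fraction"], 3))
# prints: 2 5.5 0.545
```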
● A gateway to get researchers’ data onto
Compute Canada’s infrastructure
● A local platform for data exploration &
visualization, pre-processing and prototyping
● A generic web portal to submit workloads on
ARC systems
○ We have automated node reservation to scale out
Notebooks on Colosse.
○ The way we do it on Colosse requires the portal to be
a submit host
○ There has to be a better way. Web API?
Processors                        2x Xeon E5-2640v4 = 40 logical cores
Memory                            128 GB DDR4
Network interfaces                Mellanox ConnectX-3 Pro dual port 40GbE
Drives for OS                     2x 128 GB SATA SSD
Local storage (Perf. option)      8x 400 GB NVMe drives
Local storage (Capacity option)   24x 8 TB NL-SAS drives
● Cost is from ~12K to 25K and up
○ storage is the differentiator
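To put the two storage options in perspective, the following sketch works out their raw capacity and how long it would take to move a full node's worth of data over a single 40 Gbps port. It uses decimal units and ignores RAID, filesystem and protocol overhead, so the results are best-case estimates rather than expected performance.

```python
def raw_capacity_tb(drive_count: int, drive_tb: float) -> float:
    """Raw capacity in terabytes, before any RAID or filesystem overhead."""
    return drive_count * drive_tb

def drain_time_hours(capacity_tb: float, link_gbps: float) -> float:
    """Hours needed to move `capacity_tb` over a link at full line rate."""
    bytes_total = capacity_tb * 1e12
    bytes_per_s = link_gbps * 1e9 / 8
    return bytes_total / bytes_per_s / 3600

options = {
    "performance (8x 400 GB NVMe)": raw_capacity_tb(8, 0.4),
    "capacity (24x 8 TB NL-SAS)": raw_capacity_tb(24, 8),
}
for name, tb in options.items():
    print(f"{name}: {tb:.1f} TB raw, {drain_time_hours(tb, 40):.2f} h at 40 Gbps")
```

At line rate the performance option (3.2 TB) empties in about 11 minutes, while the capacity option (192 TB) takes over 10 hours.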
● There is a need for high speed data transport
services in campuses and larger labs
● Local computing capabilities create new
opportunities for quick innovation
● We envision a model where researchers
finance their local portal to size it up to their
needs
● We have selected 2 pilot sites that will be
deployed this summer
● You can participate by:
○ Becoming a pilot site
○ Contributing to the platform design and development
○ Letting us know how we can improve the model
○ Helping us find a better name…
● Contact us: frederick.lefebvre@calculquebec.ca

HPCS16 - Frederick Lefebvre - Bridging the last mile
