SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats

Speaker: LIN Qian
http://www.comp.nus.edu.sg/~linqian
Scientific data analysis today
• Increasingly data-intensive
  – Volume approximately doubles each year
• Stored in certain specialized formats
  – NetCDF, HDF5, ADIOS ...
• Popularity of MapReduce and its variants
  – Free accessibility
  – Easy programmability
  – Good scalability
  – Built-in fault tolerance
NetCDF
• Network Common Data Form
HDF5
• Hierarchical Data Format
Scientific data analysis today (cont.)
• “Store-first-analyze-after”
  – Reload data into another file system
     o E.g., load data from PVFS to HDFS
  – Reload data into another data format
     o E.g., load NetCDF/HDF5 data into a specific format
• Problems
  – Long data migration/transformation time
  – Stressing network and disks
SciMATE
• In-situ scientific data analysis
  – MapReduce with AlternaTE API
  – Supporting NetCDF, HDF5 and flat-files
     o No data reloading!
  – Transparent to app developers

• Optimized for
  – Access strategies
  – Access patterns
System overview
Scientific Data Processing Module
(Figure label: Runtime System)
Integrating a new data format
• Data adaption layer is customizable
  – Third-party adapter
  – Open for extension but closed for
    modification
• A new adapter must implement the generic block
  loader interface
  – Partitioning function and auxiliary
    functions
  – Data access functions
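The interface requirements listed above can be sketched as follows. This is only an illustration in Python, not SciMATE's actual API: the class names `BlockLoader` and `FlatFileLoader`, the method names `shape` and `partition`, and the row-based block layout are all assumptions made for this example. Only the split into a partitioning function, auxiliary functions, and data access functions comes from the slide.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple

Block = Tuple[int, int]  # (first_row, row_count); a hypothetical block descriptor


class BlockLoader(ABC):
    """Hypothetical generic block loader interface (names are assumptions)."""

    @abstractmethod
    def shape(self) -> Tuple[int, int]:
        """Auxiliary function: return (rows, columns) of the dataset."""

    @abstractmethod
    def full_read(self, block: Block) -> List[List[float]]:
        """Data access function: return every row of the given block."""

    def partition(self, num_blocks: int) -> List[Block]:
        """Partitioning function: split the rows into near-equal blocks."""
        if num_blocks < 1:
            raise ValueError("num_blocks must be at least 1")
        rows, _ = self.shape()
        base, extra = divmod(rows, num_blocks)
        blocks, start = [], 0
        for i in range(num_blocks):
            count = base + (1 if i < extra else 0)
            if count:  # skip empty blocks when there are more blocks than rows
                blocks.append((start, count))
            start += count
        return blocks


class FlatFileLoader(BlockLoader):
    """Toy third-party adapter; an in-memory table stands in for a flat file."""

    def __init__(self, table: Sequence[Sequence[float]]) -> None:
        self._table = [list(row) for row in table]

    def shape(self) -> Tuple[int, int]:
        return (len(self._table), len(self._table[0]) if self._table else 0)

    def full_read(self, block: Block) -> List[List[float]]:
        first, count = block
        return [row[:] for row in self._table[first:first + count]]
```

In this sketch a new format is supported by adding another subclass, while the code that calls `partition` and `full_read` stays unchanged, which is one way to read "open for extension but closed for modification".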
Data access strategies and patterns
• full_read()
  – too expensive for reading small data
    subsets
• partial_read()
  – Strided pattern
     o partial_read_by_block()
  – Column pattern
     o partial_read_by_column()
  – Discrete point pattern
     o partial_read_by_list()
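A minimal Python sketch of the difference between the two strategies, assuming a two-dimensional dataset stored in row-major order. The function names follow the slide, but the class `RowMajorStore`, every parameter list, and the `elements_read` counter are assumptions made for illustration; the slide does not give the real signatures.

```python
from typing import List, Sequence, Tuple


class RowMajorStore:
    """Toy dataset: rows * cols values in row-major order (an assumption)."""

    def __init__(self, values: Sequence[float], rows: int, cols: int) -> None:
        if len(values) != rows * cols:
            raise ValueError("values must hold rows * cols elements")
        self._values = list(values)
        self.rows, self.cols = rows, cols
        self.elements_read = 0  # total elements fetched so far

    def _read(self, offset: int, count: int) -> List[float]:
        # Every access goes through here so the cost can be compared.
        self.elements_read += count
        return self._values[offset:offset + count]

    def full_read(self) -> List[float]:
        """Fetch the whole dataset, even if only a subset is needed."""
        return self._read(0, self.rows * self.cols)

    def partial_read_by_block(self, start: int, block_len: int,
                              stride: int, num_blocks: int) -> List[float]:
        """Strided pattern: num_blocks blocks of block_len, stride apart."""
        out: List[float] = []
        for i in range(num_blocks):
            out.extend(self._read(start + i * stride, block_len))
        return out

    def partial_read_by_column(self, col: int) -> List[float]:
        """Column pattern: one element from every row."""
        return [self._read(r * self.cols + col, 1)[0] for r in range(self.rows)]

    def partial_read_by_list(self, points: Sequence[Tuple[int, int]]) -> List[float]:
        """Discrete point pattern: an explicit list of (row, col) positions."""
        return [self._read(r * self.cols + c, 1)[0] for r, c in points]
```

For a 3 x 4 store, reading one column through `partial_read_by_column` fetches 3 elements, whereas `full_read` fetches all 12 before the caller can discard the rest.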
Access Pattern Optimization
• Strided pattern
  – directly supported by API
• Discrete point pattern
  – no optimization
• Column pattern
  – fixed-size column read
  – contiguous column read
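The two column-read variants can be contrasted with a small Python sketch. It rests on an assumed interpretation: a fixed-size column read issues one request per fixed-width chunk of the requested columns (width 1 by default), while a contiguous column read merges adjacent columns into runs and issues one request per run. The function names and return shapes below are hypothetical and only count requests; they do not perform any I/O.

```python
from typing import Iterable, List, Tuple


def fixed_size_column_reads(columns: Iterable[int], width: int = 1) -> List[List[int]]:
    """Group the requested columns into chunks of at most `width` columns."""
    if width < 1:
        raise ValueError("width must be at least 1")
    cols = sorted(set(columns))
    return [cols[i:i + width] for i in range(0, len(cols), width)]


def contiguous_column_reads(columns: Iterable[int]) -> List[Tuple[int, int]]:
    """Merge adjacent column indices into inclusive (first, last) runs."""
    runs: List[Tuple[int, int]] = []
    for c in sorted(set(columns)):
        if runs and c == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], c)  # extend the current run
        else:
            runs.append((c, c))          # start a new run
    return runs


wanted = [0, 1, 2, 5, 6]
print(len(fixed_size_column_reads(wanted)))   # one request per column
print(contiguous_column_reads(wanted))        # one request per contiguous run
```

Under these assumptions, the columns 0, 1, 2, 5, 6 need five fixed-size requests but only two contiguous ones; the more scattered the requested columns are, the smaller the gain from merging.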
Evaluation
• System functionality and scalability
  – 16 GB datasets
  – Data processing times
     o k-means, PCA, kNN
     o thread scalability, node scalability
  – Data loading times
     o k-means, PCA
     o node scalability
• Partial read vs. Full read
• Fixed-size column read vs. Contiguous column
  read
Thread scalability
Node scalability (data processing)
Node scalability (data loading)
Fixed-size column read vs. Contiguous column read
(Figure panels: NetCDF, HDF5)
Contiguous column read
NetCDF shows better column non-contiguity tolerance than HDF5.
Conclusion and Future Work
• Conclusion
  – Avoid bulk data transfers and vast data
    transformation
  – Provide a customizable data format
    adaption API
  – Support optimized read via access
    strategies & patterns
• Future Work
  – Compare with SciHadoop