Pivotal eXtension Framework
Sameer Tiwari
Hadoop Storage Architect, Pivotal Inc.
stiwari@gopivotal.com, @sameertech
Data Analysis Timeline
• ISAM files: COBOL/JCL
• RDBMS: SQL
• HDFS files: Map Reduce/Hive
Data Analysis Timeline
• HDFS files: Map Reduce/Hive, SQL
Simplified View of Co-existence
[Diagram: SQL over RDBMS files on one side; Map Reduce, Hive, HBase over HDFS files on the other; "The Great Divide" between them]
PXF addresses the divide.
Pivotal eXtension Framework (PXF)
• History
o Based on external table functionality of RDBMS
o Built at Pivotal by a small team in Israel
• Goals
o Single hop
o No Materialization of data
o Fully parallel for high throughput
o Extensible
Motivation for building PXF
• Use the SQL engine’s statistical and analytic functions (e.g. MADlib) on third-party data stores, such as:
o HBase data
o Hive data
o Native data on HDFS in a variety of formats
• Join in-database dimensions with other fact tables
• Fast ingest of data into SQL native format (insert into … select * from …)
Motivation for building PXF
• Enterprises love the cheap storage offered by HDFS and want to store their data there
• M/R is very limiting
• Integration with third-party systems, e.g. Accumulo
• Existing techniques involved copying data to HDFS, which is brittle and inefficient
High Level Flow
[Diagram: the SQL engine asks the Name Node "Where is the data for table foo?" and is told "On DataNodes 1, 3 and 5", out of Data Nodes 1 to 5]
- Protocol is HTTP
- End points are running on all data nodes
Major components
• Fragmenter
o Get the locations of fragments for a table
• Accessor
o Understand and read the fragment, return records
• Resolver
o Convert the records into a SQL engine format
• Analyzer
o Provide source stats to the Query optimizer
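As a rough illustration of how the first three components could chain together, here is a minimal in-memory sketch. The interface names, method names and data are hypothetical stand-ins, not the PXF API; the Analyzer is left out because it only feeds the optimizer, and the real framework reads fragments in parallel rather than in a loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Toy pipeline: Fragmenter -> Accessor -> Resolver. All names are illustrative stand-ins. */
final class PxfPipelineSketch {

    /** Fragmenter role: list the fragments (here, chunks of lines) that make up a source. */
    interface Fragmenter { List<List<String>> getFragments(String source); }

    /** Accessor role: read raw records out of one fragment. */
    interface Accessor { Iterator<String> read(List<String> fragment); }

    /** Resolver role: turn one raw record into fields for the SQL engine. */
    interface Resolver { List<String> resolve(String record); }

    /** Walks every fragment, reads its records and resolves each one into a row. */
    static List<List<String>> scan(String source, Fragmenter f, Accessor a, Resolver r) {
        List<List<String>> rows = new ArrayList<>();
        for (List<String> fragment : f.getFragments(source)) { // sequential here for simplicity
            Iterator<String> records = a.read(fragment);
            while (records.hasNext()) {
                rows.add(r.resolve(records.next()));
            }
        }
        return rows;
    }

    /** Two hard-coded fragments of comma-separated text stand in for data on HDFS. */
    static List<List<String>> demo() {
        Fragmenter fragmenter = source -> Arrays.asList(
                Arrays.asList("1,alpha", "2,beta"),
                Arrays.asList("3,gamma"));
        Accessor accessor = List::iterator;
        Resolver resolver = record -> Arrays.asList(record.split(","));
        return scan("pxf-data", fragmenter, accessor, resolver);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the three roles behind separate interfaces is what lets a new data store be supported by swapping one piece, for example a different Accessor, without touching the others.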
PXF Architecture
[Diagram: a PSQL client runs
select * from external table foo location=”pxf://namenode:50070/financedata”
against the HAWQ Master. Components shown: HAWQ Master, HAWQ Segment, a Data Node container with end-points (PXF Fragmenter, PXF Accessor/Resolver, local HDFS), Zookeeper, and M/R, Pig, Hive on native PHD. Steps 0 to 6 carry metadata (splits[..], getSplit(0)); steps A to B carry data (PXFWritable).]
Classes
• The four major components are defined as interfaces and base classes that can be extended, e.g. the Fragmenter:
/*
 * Class holding information about fragments (FragmentInfo)
 */
public class FragmentsOutput {
    public FragmentsOutput();
    public void addFragment(String sourceName, String[] replicas, byte[] metadata);
    public void addFragment(String sourceName, String[] replicas, byte[] metadata,
                            String userData);
    public List<FragmentInfo> getFragments();
}
Accessor Interface
/*
 * Internal interface that defines the access to data on the source
 * data store (e.g., a file on HDFS, a region of an HBase table, etc.).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IReadAccessor {
    public boolean openForRead() throws Exception;
    public OneRow readNextObject() throws Exception;
    public void closeForRead() throws Exception;
}

/*
 * An interface for writing data into a data store
 * (e.g., a sequence file on HDFS).
 * All classes that implement actual access to such data sources must
 * respect this interface.
 */
public interface IWriteAccessor {
    public boolean openForWrite() throws Exception;
    public OneRow writeNextObject(OneRow onerow) throws Exception;
    public void closeForWrite() throws Exception;
}
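To make the open/read/close contract concrete, here is a minimal sketch of a read accessor backed by an in-memory list instead of an HDFS split. The OneRow class and the interface are redeclared locally as simplified stand-ins so the example compiles on its own; returning null from readNextObject to signal end of data is an assumption of this sketch, as is the line-number key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class InMemoryAccessorSketch {

    /** Simplified stand-in for PXF's OneRow: a key plus an opaque payload. */
    static final class OneRow {
        final Object key;
        final Object data;
        OneRow(Object key, Object data) { this.key = key; this.data = data; }
    }

    /** Local copy of the three methods listed for IReadAccessor. */
    interface IReadAccessor {
        boolean openForRead() throws Exception;
        OneRow readNextObject() throws Exception;
        void closeForRead() throws Exception;
    }

    /** Hypothetical accessor that serves lines from a list. */
    static final class ListAccessor implements IReadAccessor {
        private final List<String> lines;
        private Iterator<String> cursor;
        private long lineNumber;

        ListAccessor(List<String> lines) { this.lines = lines; }

        @Override public boolean openForRead() {
            cursor = lines.iterator();
            lineNumber = 0;
            return true; // this sketch always has something it can open
        }

        @Override public OneRow readNextObject() {
            if (cursor == null || !cursor.hasNext()) {
                return null; // assumed end-of-data signal
            }
            return new OneRow(lineNumber++, cursor.next());
        }

        @Override public void closeForRead() { cursor = null; }
    }

    /** Drives an accessor the way a caller might: open, drain, always close. */
    static List<String> readAll(IReadAccessor accessor) throws Exception {
        List<String> out = new ArrayList<>();
        if (!accessor.openForRead()) {
            return out;
        }
        try {
            OneRow row;
            while ((row = accessor.readNextObject()) != null) {
                out.add(row.key + ":" + row.data);
            }
        } finally {
            accessor.closeForRead();
        }
        return out;
    }

    /** Reads two sample lines and wraps checked exceptions for convenience. */
    static List<String> demo() {
        try {
            return readAll(new ListAccessor(Arrays.asList("a,1", "b,2")));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```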
Resolver Interface
/*
 * Interface that defines the deserialization of one record brought from
 * the data Accessor. Every implementation of a deserialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface IReadResolver {
    public List<OneField> getFields(OneRow row) throws Exception;
}

/*
 * Interface that defines the serialization of data read from the DB
 * into a OneRow object.
 * Every implementation of a serialization method
 * (e.g., Writable, Avro, ...) must implement this interface.
 */
public interface IWriteResolver {
    public OneRow setFields(DataInputStream inputStream) throws Exception;
}
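A read resolver for delimited text might look roughly like the sketch below. OneRow, OneField and the interface are simplified local stand-ins (the string type tags "integer" and "text" are invented for this example), and the sample row matches the three-column shape of the dummy_tbl table used later in the deck.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

final class DelimitedResolverSketch {

    /** Simplified stand-in for PXF's OneRow. */
    static final class OneRow {
        final Object key;
        final Object data;
        OneRow(Object key, Object data) { this.key = key; this.data = data; }
    }

    /** Simplified stand-in for PXF's OneField: a type tag plus a value. */
    static final class OneField {
        final String type;
        final Object val;
        OneField(String type, Object val) { this.type = type; this.val = val; }
        @Override public String toString() { return type + "=" + val; }
    }

    /** Local copy of the single method listed for IReadResolver. */
    interface IReadResolver {
        List<OneField> getFields(OneRow row) throws Exception;
    }

    /** Hypothetical resolver: splits a text record and converts each piece by column type. */
    static final class DelimitedResolver implements IReadResolver {
        private final String delimiter;
        private final String[] columnTypes; // "integer" or "text" in this sketch

        DelimitedResolver(String delimiter, String[] columnTypes) {
            this.delimiter = delimiter;
            this.columnTypes = columnTypes;
        }

        @Override public List<OneField> getFields(OneRow row) {
            // Quote the delimiter so characters like '|' are not treated as regex syntax.
            String[] parts = ((String) row.data).split(Pattern.quote(delimiter), -1);
            if (parts.length != columnTypes.length) {
                throw new IllegalArgumentException(
                        "expected " + columnTypes.length + " fields, got " + parts.length);
            }
            List<OneField> fields = new ArrayList<>();
            for (int i = 0; i < parts.length; i++) {
                Object val = "integer".equals(columnTypes[i])
                        ? (Object) Integer.valueOf(parts[i].trim())
                        : parts[i];
                fields.add(new OneField(columnTypes[i], val));
            }
            return fields;
        }
    }

    /** Resolves one sample record shaped like (int1 integer, word text, int2 integer). */
    static String demo() {
        DelimitedResolver resolver =
                new DelimitedResolver(",", new String[] {"integer", "text", "integer"});
        return resolver.getFields(new OneRow(0L, "1,hello,42")).toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```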
Analyzer Interface
/*
 * Abstract class that defines getting statistics for ANALYZE.
 * GetEstimatedStats returns statistics for a given path
 * (block size, number of blocks, number of tuples (rows)).
 * Used when calling ANALYZE on a PXF external table, to get the
 * table's statistics that are used by the optimizer to plan queries.
 */
public abstract class Analyzer extends Plugin {
    public Analyzer(InputData metaData) {
        super(metaData);
    }

    /**
     * data is a data source name (e.g., file, dir, wildcard, table name);
     * returns the data statistics in JSON format.
     *
     * NOTE: It is highly recommended to implement extremely fast logic
     * that returns *estimated* statistics. Scanning all the data for exact
     * statistics is considered bad practice.
     */
    public String GetEstimatedStats(String data) throws Exception {
        /* Return default values */
        return DataSourceStatsInfo.dataToJSON(new DataSourceStatsInfo());
    }
}
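One way to follow the "estimate, do not scan" advice is to extrapolate the row count from the file size and a small sample of rows. The sketch below shows only that arithmetic; the Stats class, its JSON field names and the estimate method are hypothetical and are not the DataSourceStatsInfo API.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class EstimatedStatsSketch {

    /** Hypothetical holder for the three statistics named in the Analyzer comment. */
    static final class Stats {
        final long blockSize;
        final long numberOfBlocks;
        final long numberOfTuples;

        Stats(long blockSize, long numberOfBlocks, long numberOfTuples) {
            this.blockSize = blockSize;
            this.numberOfBlocks = numberOfBlocks;
            this.numberOfTuples = numberOfTuples;
        }

        /** Illustrative JSON layout; the real format may differ. */
        String toJson() {
            return "{\"blockSize\":" + blockSize
                    + ",\"numberOfBlocks\":" + numberOfBlocks
                    + ",\"numberOfTuples\":" + numberOfTuples + "}";
        }
    }

    /** Estimates the tuple count from the file size and the average size of a few sampled rows. */
    static Stats estimate(long fileSizeBytes, long blockSizeBytes, List<String> sampleRows) {
        if (blockSizeBytes <= 0) {
            throw new IllegalArgumentException("block size must be positive");
        }
        long blocks = (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
        long sampleBytes = 0;
        for (String row : sampleRows) {
            sampleBytes += row.getBytes(StandardCharsets.UTF_8).length + 1; // +1 for the newline
        }
        long tuples = sampleRows.isEmpty() ? 0 : fileSizeBytes * sampleRows.size() / sampleBytes;
        return new Stats(blockSizeBytes, blocks, tuples);
    }

    /** A 1000-byte file with 256-byte blocks and two 6-character sample rows. */
    static String demo() {
        return estimate(1000, 256, Arrays.asList("1,aa,2", "3,bb,4")).toJson();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The estimate only needs the file length and a handful of rows, so its cost stays flat as the data grows, at the price of being off when row sizes vary widely.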
Syntax - Long Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?
FRAGMENTER=com.pivotal.pxf.fragmenters.HdfsDataFragmenter&
ACCESSOR=com.pivotal.pxf.accessors.LineBreakAccessor&
RESOLVER=com.pivotal.pxf.resolvers.StringPassResolver&
ANALYZER=com.pivotal.pxf.analyzers.HdfsAnalyzer')
format 'TEXT' (delimiter = ',');
Say WHAT???
Syntax - Short Form
CREATE EXTERNAL TABLE dummy_tbl (int1 integer, word text, int2 integer)
location('pxf://localhost:50070/pxf-data?profile=HdfsTextSimple')
format 'TEXT' (delimiter = ',');
Whew!!
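The two forms above suggest that a profile is shorthand for a set of four plugin classes. The sketch below expands the short-form location into the long form under that assumption; the mapping of HdfsTextSimple to exactly these four classes is inferred from the two slides, and the expand helper is hypothetical rather than how PXF resolves profiles internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ProfileExpansionSketch {

    /** Assumed plugin set for HdfsTextSimple, in the order used in the long-form example. */
    static Map<String, String> hdfsTextSimple() {
        Map<String, String> plugins = new LinkedHashMap<>();
        plugins.put("FRAGMENTER", "com.pivotal.pxf.fragmenters.HdfsDataFragmenter");
        plugins.put("ACCESSOR", "com.pivotal.pxf.accessors.LineBreakAccessor");
        plugins.put("RESOLVER", "com.pivotal.pxf.resolvers.StringPassResolver");
        plugins.put("ANALYZER", "com.pivotal.pxf.analyzers.HdfsAnalyzer");
        return plugins;
    }

    /** Rewrites "...?profile=NAME" into the equivalent long-form location string. */
    static String expand(String location) {
        String marker = "?profile=";
        int at = location.indexOf(marker);
        if (at < 0) {
            return location; // no profile option, so leave the location untouched
        }
        String profile = location.substring(at + marker.length());
        if (!"HdfsTextSimple".equals(profile)) {
            throw new IllegalArgumentException("unknown profile: " + profile);
        }
        StringBuilder longForm = new StringBuilder(location.substring(0, at)).append('?');
        String separator = "";
        for (Map.Entry<String, String> plugin : hdfsTextSimple().entrySet()) {
            longForm.append(separator).append(plugin.getKey()).append('=').append(plugin.getValue());
            separator = "&";
        }
        return longForm.toString();
    }

    public static void main(String[] args) {
        System.out.println(expand("pxf://localhost:50070/pxf-data?profile=HdfsTextSimple"));
    }
}
```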
Built in Profiles
• A number of profiles are built in, and more are being contributed
o HBase, Hive, HDFS Text, Avro, SequenceFiles, GemFireXD, Accumulo, Cassandra, JSON
o PXF will be open-sourced completely, for use with your favorite SQL engine.
o But you can write your own connectors right now and use them with HAWQ.
Predicate Pushdown
• SQL engines may push parts of the “WHERE” clause down to PXF.
• e.g. “where id > 500 and id < 1000”
• PXF provides a FilterBuilder class
• Filters can be combined together
• Simple expression “constant <OP> column”
• Complex expression “object(s) <OP> object(s)”
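To illustrate how simple filters can be combined, here is a small sketch that models “where id > 500 and id < 1000” as two column-versus-constant comparisons joined by AND. It does not use or reproduce the FilterBuilder class; the Filter interface, the Op enum and the helper methods are all hypothetical.

```java
import java.util.Collections;
import java.util.Map;

final class PushdownFilterSketch {

    /** Comparison operators supported by this sketch. */
    enum Op { LT, GT, EQ }

    /** A pushed-down predicate evaluated against one row of integer columns. */
    interface Filter {
        boolean accepts(Map<String, Integer> row);
    }

    /** Simple expression: one column compared with one constant. */
    static Filter simple(String column, Op op, int constant) {
        return row -> {
            Integer value = row.get(column);
            if (value == null) {
                return false; // a missing column never matches in this sketch
            }
            switch (op) {
                case LT: return value < constant;
                case GT: return value > constant;
                default: return value == constant;
            }
        };
    }

    /** Complex expression: two filters combined with logical AND. */
    static Filter and(Filter left, Filter right) {
        return row -> left.accepts(row) && right.accepts(row);
    }

    /** Builds the filter for: where id > 500 and id < 1000. */
    static Filter idBetween500And1000() {
        return and(simple("id", Op.GT, 500), simple("id", Op.LT, 1000));
    }

    public static void main(String[] args) {
        Filter filter = idBetween500And1000();
        System.out.println(filter.accepts(Collections.singletonMap("id", 750)));
        System.out.println(filter.accepts(Collections.singletonMap("id", 1000)));
    }
}
```

Evaluating such a filter inside the accessor means rows that fail it never cross the wire to the SQL engine, which is the point of pushing the predicate down.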
Demo
• Create a text file on HDFS
• Create a table using a SQL engine (HAWQ) on HDFS
• Create an external table using PXF
• Select from both tables separately
• Finally run a join across both tables
More info online...
• http://docs.gopivotal.com/pivotalhd/PXFInstallationandAdministration.html
• http://docs.gopivotal.com/pivotalhd/PXFExternalTableandAPIReference.html
Questions?
PXF Framework for SQL Analytics on Hadoop Data
