Apache Pig for Data Science
Casey Stella
June, 2014
Casey Stella (Hortonworks) Apache Pig for Data Science June, 2014
Table of Contents
Preliminaries
Apache Pig
Pig in the Data Science Toolbag
Understanding Your Data
Machine Learning with Pig
Applying Models with Pig
Unstructured Data Analysis with Pig
Questions & Bibliography
Introduction
• I’m a Principal Architect at Hortonworks
• I work primarily doing Data Science in the Hadoop Ecosystem
• Prior to this, I’ve spent my time and had a lot of fun
◦ Doing data mining on medical data at Explorys using the Hadoop
ecosystem
◦ Doing signal processing on seismic data at Ion Geophysical using
MapReduce
◦ Being a graduate student in the Math department at Texas A&M in
algorithmic complexity theory
• I’m going to talk about Apache Pig’s role for doing scalable data
science.
Apache Pig: What is it?
Pig is a high level scripting language for operating on large datasets
inside Hadoop
• Transforms high level data operations into MapReduce/Tez jobs
• Optimizes such that the minimal number of MapReduce/Tez jobs
need be run
• Familiar relational primitives available
• Extensible via User Defined Functions and Loaders for customized
data processing and formats
Apache Pig: A Familiar Example
SENTENCES = load '...' as (sentence:chararray);
WORDS = foreach SENTENCES
        generate flatten(TOKENIZE(sentence))
        as word;
WORD_GROUPS = group WORDS by word;
WORD_COUNTS = foreach WORD_GROUPS
              generate group as word, COUNT(WORDS);
store WORD_COUNTS into '...';
Understanding Data
“80% of the work in any data project is in cleaning the
data.”
- D.J. Patil, Data Jujitsu
Understanding Data
A core prerequisite to analyzing data is understanding the data's shape
and distribution. This requires (among other things):
• Computing distribution statistics on data
• Sampling data
Understanding Data: Datafu
An Apache Incubator project called DataFu¹ provides some of this
tooling in the form of Pig UDFs:
• Computing quantiles of data
• Sampling
◦ Bernoulli sampling by probability (built into Pig)
◦ Simple Random Sample
◦ Reservoir sampling
◦ Weighted sampling without replacement
◦ Random Sample with replacement
¹ http://datafu.incubator.apache.org/
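To make one of the sampling schemes listed above concrete, here is a minimal Java sketch of reservoir sampling (the classic "Algorithm R"). It is an illustration only, not DataFu's implementation; the class name ReservoirSample and its method are hypothetical. It keeps a uniform random sample of k items from a stream whose length is not known in advance, holding only k items in memory.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not part of DataFu or Pig.
public class ReservoirSample {

    // Returns a uniform random sample of up to k items from the stream.
    public static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Pick a slot uniformly in [0, seen); the new item replaces
                // a reservoir entry with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }
}
```

If the stream holds fewer than k items, the sketch simply returns all of them in their original order.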
Case Study: Bootstrapping
Bootstrapping is a resampling technique for measuring the accuracy of
sample estimates. It works by computing an estimator (such as the
mean) across a set of random samples drawn with replacement from an
original (possibly large) dataset.
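The idea can be sketched on a single machine in a few lines of Java. This is a minimal illustration under assumed names (BootstrapMean, resampledMeans and standardError are hypothetical), not the distributed approach used in Pig: it draws resamples with replacement, computes the mean of each, and reports the spread of those means as the bootstrap standard error.

```java
import java.util.Random;

// Hypothetical single-machine illustration of bootstrapping the mean.
public class BootstrapMean {

    public static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    // Draws `rounds` resamples of the same size as `data`, with
    // replacement, and returns the mean of each resample.
    public static double[] resampledMeans(double[] data, int rounds, Random rng) {
        double[] means = new double[rounds];
        double[] resample = new double[data.length];
        for (int r = 0; r < rounds; r++) {
            for (int i = 0; i < data.length; i++) {
                resample[i] = data[rng.nextInt(data.length)];
            }
            means[r] = mean(resample);
        }
        return means;
    }

    // Sample standard deviation of the resampled means, which serves as
    // the bootstrap estimate of the standard error of the mean.
    public static double standardError(double[] means) {
        double m = mean(means);
        double ss = 0.0;
        for (double x : means) {
            ss += (x - m) * (x - m);
        }
        return Math.sqrt(ss / (means.length - 1));
    }

    public static void main(String[] args) {
        double[] data = {2.0, 4.0, 4.0, 5.0, 7.0, 9.0}; // made-up values
        double[] means = resampledMeans(data, 1000, new Random(7));
        System.out.println("mean estimate: " + mean(data));
        System.out.println("bootstrap standard error: " + standardError(means));
    }
}
```

On a large dataset the inner resampling loop is the part that has to be distributed.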
Case Study: Bootstrapping
Datafu provides two tools which can be used together to provide that
random sample with replacement:
• SimpleRandomSampleWithReplacementVote – Ranks multiple
candidates for each position in a sample
• SimpleRandomSampleWithReplacementElect – Chooses, for each
position in the sample, the candidate with the lowest score
The DataFu docs provide an example² of generating a bootstrap of the
mean estimator.
² http://datafu.incubator.apache.org/docs/datafu/guide/sampling.html
What is Machine Learning?
Machine learning is the study of systems that can learn from data.
The general tasks fall into one of two categories:
• Unsupervised Learning
◦ Clustering
◦ Outlier detection
◦ Market Basket Analysis
• Supervised Learning
◦ Classification
◦ Regression
◦ Recommendation
Building Machine Learning Models with Pig
Machine Learning at scale in Hadoop generally falls into two
categories:
• Build one large model on all (or almost all) of the data
• Sample the large dataset and build the model based on that sample
Pig can assist in intelligently sampling down the large data into a
training set. You can then use your favorite ML algorithm (which can
be run on the JVM) to generate a machine learning model.
Applying Models with Pig
Pig shines at batch application of an existing ML model. This
generally is of the form:
• Train a model out-of-band
• Write a UDF in Java or another JVM language which can apply the
model to a data point
• Call the UDF from a Pig script to distribute the application of the
model across your data in parallel
What is Natural Language Processing?
• Natural language processing is the field of Computer Science,
Linguistics & Math that covers computer understanding and
manipulation of human language.
◦ Historically, linguists hand-coded rules to accomplish much of the analysis
◦ Most modern approaches involve machine learning
• Mature field with many useful libraries on the JVM
◦ Apache OpenNLP
◦ Stanford CoreNLP
◦ MALLET
Natural Language Processing with Large Data
• Generally low-volume, complex analysis
◦ Big companies often don’t have a ton of natural language data
◦ Previously dropped because they were unable to analyze it
• Sometimes high-volume, complex analysis
◦ Search Engines
◦ Social media content analysis
• Typically many small-data problems in parallel
◦ Often requires only the context of a single document
◦ Ideal for encapsulating as Pig UDFs
Natural Language Processing: Demo
• Stanford CoreNLP integrated the work of Richard Socher et al. [2],
which uses recursive deep neural networks to predict the sentiment of
movie reviews.
• There is a large set of IMDB movie reviews used to evaluate
sentiment analysis [1].
• Let’s look at how to encapsulate this into a Pig UDF and run it on
some movie review data.
The UDF
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class ANALYZE_SENTIMENT extends EvalFunc<String>
{
  @Override
  public String exec(Tuple objects) throws IOException
  {
    // The document text is the first (and only) field of the input tuple
    String document = (String)objects.get(0);
    // Nothing to score for a missing or empty document
    if(document == null || document.length() == 0)
    {
      return null;
    }
    // Call out to our handler that we wrote to do the sentiment analysis
    SentimentClass sentimentClass = SentimentAnalyzer.INSTANCE.apply(document);
    return sentimentClass == null ? null : sentimentClass.toString();
  }
}
The Pig Script
REGISTER ./Pig_for_Data_Science-1.0-SNAPSHOT.jar
DEFINE SENTIMENT_ANALYSIS com.caseystella.ds.pig.ANALYZE_SENTIMENT;
DOCUMENTS_POS = LOAD ...
DOCUMENTS_NEG = LOAD ...
NEG_DOCS_WITH_SENTIMENT = foreach DOCUMENTS_NEG generate 'NEGATIVE' as true_sentiment
                                                        , document as document;
POS_DOCS_WITH_SENTIMENT = foreach DOCUMENTS_POS generate 'POSITIVE' as true_sentiment
                                                        , document as document;
DOCS_WITH_SENTIMENT = UNION NEG_DOCS_WITH_SENTIMENT, POS_DOCS_WITH_SENTIMENT;
PREDICTED_SENTIMENT = foreach DOCS_WITH_SENTIMENT generate
                        SENTIMENT_ANALYSIS(document) as predicted_sentiment
                      , true_sentiment
                      , document;
STORE PREDICTED_SENTIMENT INTO ...
Results
• Executed on a sample of 1022 positive and negative documents.
• Overall Accuracy of 77.2%
                      Actual
Predicted    Positive  Negative  Total
Positive          367       114    481
Negative          119       422    541
Total             486       536   1022
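The overall accuracy follows directly from the table: the correct predictions sit on the diagonal, so accuracy = (367 + 422) / 1022 = 789 / 1022 ≈ 0.772. A small Java sketch of that arithmetic, using a hypothetical ConfusionMatrix helper that is not part of the talk's code:

```java
// Hypothetical helper for illustration; not part of the talk's code.
public class ConfusionMatrix {

    // counts[predicted][actual], where index 0 = positive and 1 = negative.
    public static double accuracy(long[][] counts) {
        long correct = 0;
        long total = 0;
        for (int p = 0; p < counts.length; p++) {
            for (int a = 0; a < counts[p].length; a++) {
                total += counts[p][a];
                if (p == a) {
                    // Diagonal cells are the correctly classified documents.
                    correct += counts[p][a];
                }
            }
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Cell counts taken from the results table above.
        long[][] counts = { {367, 114}, {119, 422} };
        System.out.println("accuracy = " + accuracy(counts));
    }
}
```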
Questions
Thanks for your attention! Questions?
• Code & scripts for this talk available on my github presentation
page.³
• Find me at http://caseystella.com
• Twitter handle: @casey_stella
• Email address: cstella@hortonworks.com
³ http://github.com/cestella/presentations/
Bibliography
[1] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang,
Andrew Y. Ng, and Christopher Potts. Learning word vectors for
sentiment analysis. In Proceedings of the 49th Annual Meeting of
the Association for Computational Linguistics: Human Language
Technologies, pages 142–150, Portland, Oregon, USA, June 2011.
Association for Computational Linguistics.
[2] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang,
Christopher D. Manning, Andrew Y. Ng, and Christopher Potts.
Recursive deep models for semantic compositionality over a
sentiment treebank. In Proceedings of the 2013 Conference on
Empirical Methods in Natural Language Processing, pages
1631–1642, Stroudsburg, PA, October 2013. Association for
Computational Linguistics.
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 

Dernier (20)

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 

Apache Pig for Data Scientists

  • 1. Apache Pig for Data Science Casey Stella June, 2014 Casey Stella (Hortonworks) Apache Pig for Data Science June, 2014
  • 2. Table of Contents
    Preliminaries
    Apache Pig
    Pig in the Data Science Toolbag
    Understanding Your Data
    Machine Learning with Pig
    Applying Models with Pig
    Unstructured Data Analysis with Pig
    Questions & Bibliography
  • 3. Introduction
    • I’m a Principal Architect at Hortonworks
    • I work primarily doing Data Science in the Hadoop Ecosystem
    • Prior to this, I’ve spent my time and had a lot of fun
      ◦ Doing data mining on medical data at Explorys using the Hadoop ecosystem
      ◦ Doing signal processing on seismic data at Ion Geophysical using MapReduce
      ◦ Being a graduate student in the Math department at Texas A&M in algorithmic complexity theory
    • I’m going to talk about Apache Pig’s role in doing scalable data science.
  • 8. Apache Pig: What is it?
    Pig is a high-level scripting language for operating on large datasets inside Hadoop
    • Transforms high-level data operations into MapReduce/Tez jobs
    • Optimizes such that the minimal number of MapReduce/Tez jobs need be run
    • Familiar relational primitives available
    • Extensible via User Defined Functions and Loaders for customized data processing and formats
  • 9. Apache Pig: A Familiar Example
        SENTENCES = load '...' as (sentence:chararray);
        WORDS = foreach SENTENCES generate flatten(TOKENIZE(sentence)) as word;
        WORD_GROUPS = group WORDS by word;
        WORD_COUNTS = foreach WORD_GROUPS generate group as word, COUNT(WORDS);
        store WORD_COUNTS into '...';
  • 10. Understanding Data
    “80% of the work in any data project is in cleaning the data.” (D.J. Patil, Data Jujitsu)
  • 11. Understanding Data
    A core prerequisite to analyzing data is understanding its shape and distribution. This requires (among other things):
    • Computing distribution statistics on data
    • Sampling data
  • 17. Understanding Data: Datafu
    An Apache Incubating project called datafu (http://datafu.incubator.apache.org/) provides some of this tooling in the form of Pig UDFs:
    • Computing quantiles of data
    • Sampling
      ◦ Bernoulli sampling by probability (built into Pig)
      ◦ Simple Random Sample
      ◦ Reservoir sampling
      ◦ Weighted sampling without replacement
      ◦ Random Sample with replacement
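To make one of the listed sampling techniques concrete, here is a minimal single-machine sketch of reservoir sampling (the classic "Algorithm R") in plain Java. It is not datafu's implementation; the class name `ReservoirSampler` and its `sample` method are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Illustrative reservoir sampler (Algorithm R); hypothetical, not datafu's code. */
final class ReservoirSampler {
    /** Returns up to k items chosen uniformly at random from a stream of unknown length. */
    static <T> List<T> sample(Iterable<T> stream, int k, Random rng) {
        List<T> reservoir = new ArrayList<>();
        long seen = 0;
        for (T item : stream) {
            seen++;
            if (reservoir.size() < k) {
                // The first k items always fill the reservoir.
                reservoir.add(item);
            } else {
                // Draw j uniformly from [0, seen); the new item replaces slot j
                // only when j < k, which happens with probability k / seen.
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }
}
```

The appeal of this approach is that it needs only one pass and memory proportional to k, so the size of the input does not have to be known in advance.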
  • 18. Case Study: Bootstrapping
    Bootstrapping is a resampling technique for measuring the accuracy of sample estimates. It does this by computing an estimator (such as the mean) across a set of random samples drawn with replacement from an original (possibly large) dataset.
  • 19. Case Study: Bootstrapping
    Datafu provides two tools which can be used together to provide that random sample with replacement:
    • SimpleRandomSampleWithReplacementVote: ranks multiple candidates for each position in a sample
    • SimpleRandomSampleWithReplacementElect: chooses, for each position in the sample, the candidate with the lowest score
    The datafu docs provide an example of generating a bootstrap of the mean estimator (http://datafu.incubator.apache.org/docs/datafu/guide/sampling.html).
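The bootstrap idea itself can be sketched on a single machine in plain Java. This is an illustrative sketch only: it does not use datafu's Vote/Elect UDFs, and the class `BootstrapMean` and its methods are hypothetical names. It resamples the data with replacement, computes the mean of each resample, and reports the spread of those means as an estimate of the standard error.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative bootstrap of the mean estimator; hypothetical, not datafu's code. */
final class BootstrapMean {
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Draws `rounds` resamples with replacement and returns the mean of each one. */
    static double[] resampledMeans(double[] data, int rounds, Random rng) {
        double[] means = new double[rounds];
        for (int r = 0; r < rounds; r++) {
            double sum = 0.0;
            // Each resample has the same size as the original dataset.
            for (int i = 0; i < data.length; i++) {
                sum += data[rng.nextInt(data.length)];
            }
            means[r] = sum / data.length;
        }
        return means;
    }

    /** Sample standard deviation of the resampled means (the bootstrap standard error). */
    static double standardError(double[] means) {
        double m = mean(means);
        double squaredDiffs = 0.0;
        for (double v : means) {
            squaredDiffs += (v - m) * (v - m);
        }
        return Math.sqrt(squaredDiffs / (means.length - 1));
    }
}
```

A caller would pass the resampled means to `standardError`; a small value suggests the mean estimate is stable across resamples.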
  • 22. What is Machine Learning?
    Machine learning is the study of systems that can learn from data. The general tasks fall into one of two categories:
    • Unsupervised Learning
      ◦ Clustering
      ◦ Outlier detection
      ◦ Market Basket Analysis
    • Supervised Learning
      ◦ Classification
      ◦ Regression
      ◦ Recommendation
  • 24. Building Machine Learning Models with Pig
    Machine Learning at scale in Hadoop generally falls into two categories:
    • Build one large model on all (or almost all) of the data
    • Sample the large dataset and build the model based on that sample
    Pig can assist in intelligently sampling down the large data into a training set. You can then use your favorite ML algorithm (which can be run on the JVM) to generate a machine learning model.
  • 27. Applying Models with Pig
    Pig shines at batch application of an existing ML model. This generally takes the form:
    • Train a model out-of-band
    • Write a UDF in Java or another JVM language which can apply the model to a data point
    • Call the UDF from a Pig script to distribute the application of the model across your data in parallel
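The three steps above can be mimicked in plain Java as a rough, single-machine sketch. Everything here is an assumption made for illustration: `LinearModel` stands in for a model trained out-of-band, `classify` plays the role of the logic a UDF's `exec` method would wrap, and a parallel stream stands in for Pig distributing the work across a cluster.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical pre-trained model: a linear scorer with fixed weights. */
final class LinearModel {
    private final double[] weights;
    private final double bias;

    LinearModel(double[] weights, double bias) {
        this.weights = weights.clone();
        this.bias = bias;
    }

    /** Applies the model to a single data point (the part a UDF would wrap). */
    String classify(double[] features) {
        double score = bias;
        for (int i = 0; i < weights.length; i++) {
            score += weights[i] * features[i];
        }
        return score >= 0 ? "POSITIVE" : "NEGATIVE";
    }
}

/** Stand-in for the Pig script: applies the model to every point independently. */
final class BatchScorer {
    static List<String> applyAll(LinearModel model, List<double[]> points) {
        // Each point is scored on its own, so the work parallelizes trivially;
        // collecting to a list preserves the input order.
        return points.parallelStream()
                     .map(model::classify)
                     .collect(Collectors.toList());
    }
}
```

The point of the design is that scoring one record needs no knowledge of any other record, which is what lets a framework such as Pig spread the calls across many workers.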
  • 29. What is Natural Language Processing?
    • Natural language processing is the field of Computer Science, Linguistics & Math that covers computer understanding and manipulation of human language.
      ◦ Historically, linguists hand-coded rules to accomplish much of the analysis
      ◦ Most modern approaches involve using Machine Learning
    • Mature field with many useful libraries on the JVM
      ◦ Apache OpenNLP
      ◦ Stanford CoreNLP
      ◦ MALLET
  • 32. Natural Language Processing with Large Data
    • Generally low-volume, complex analysis
      ◦ Big companies often don’t have a ton of natural language data
      ◦ What they had was often dropped previously because they were unable to analyze it
    • Sometimes high-volume, complex analysis
      ◦ Search Engines
      ◦ Social media content analysis
    • Typically many small-data problems in parallel
      ◦ Often requires only the context of a single document
      ◦ Ideal for encapsulating as Pig UDFs
  • 33. Natural Language Processing: Demo
    • Stanford CoreNLP integrated the work of Richard Socher et al. [2], which uses recursive deep neural networks to predict the sentiment of movie reviews.
    • There is a large set of IMDB movie reviews used to evaluate sentiment analysis [1].
    • Let’s look at how to encapsulate this into a Pig UDF and run it on some movie review data.
  • 34. The UDF
        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;

        // Pig UDF that returns the predicted sentiment class of a document.
        public class ANALYZE_SENTIMENT extends EvalFunc<String> {
            @Override
            public String exec(Tuple objects) throws IOException {
                // The first field of the input tuple is the document text.
                String document = (String) objects.get(0);
                if (document == null || document.length() == 0) {
                    return null;
                }
                // Call out to our handler that we wrote to do the sentiment analysis
                SentimentClass sentimentClass = SentimentAnalyzer.INSTANCE.apply(document);
                return sentimentClass == null ? null : sentimentClass.toString();
            }
        }
  • 35. The Pig Script
        REGISTER ./Pig_for_Data_Science-1.0-SNAPSHOT.jar
        DEFINE SENTIMENT_ANALYSIS com.caseystella.ds.pig.ANALYZE_SENTIMENT;
        DOCUMENTS_POS = LOAD ...
        DOCUMENTS_NEG = LOAD ...
        NEG_DOCS_WITH_SENTIMENT = foreach DOCUMENTS_NEG generate 'NEGATIVE' as true_sentiment, document as document;
        POS_DOCS_WITH_SENTIMENT = foreach DOCUMENTS_POS generate 'POSITIVE' as true_sentiment, document as document;
        DOCS_WITH_SENTIMENT = UNION NEG_DOCS_WITH_SENTIMENT, POS_DOCS_WITH_SENTIMENT;
        PREDICTED_SENTIMENT = foreach DOCS_WITH_SENTIMENT generate SENTIMENT_ANALYSIS(document) as predicted_sentiment, true_sentiment, document;
        STORE PREDICTED_SENTIMENT INTO ...
  • 36. Results
    • Executed on a sample of 1022 positive and negative documents.
    • Overall accuracy of 77.2%

                            Actual Positive   Actual Negative   Total
        Predicted Positive        367               114           481
        Predicted Negative        119               422           541
        Total                     486               536          1022
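The reported accuracy follows directly from the table: the correctly classified documents are the 367 true positives plus the 422 true negatives, so accuracy is (367 + 422) / 1022 = 789 / 1022 ≈ 0.772. The short Java sketch below reproduces that arithmetic. The class `ConfusionMatrix` is a hypothetical helper, and the precision and recall methods are additions not shown on the slide; they assume the rows of the table are predictions and the columns are actual labels.

```java
/** Hypothetical helper that derives summary metrics from a 2x2 confusion matrix. */
final class ConfusionMatrix {
    private final long truePos;   // predicted positive, actually positive
    private final long falsePos;  // predicted positive, actually negative
    private final long falseNeg;  // predicted negative, actually positive
    private final long trueNeg;   // predicted negative, actually negative

    ConfusionMatrix(long truePos, long falsePos, long falseNeg, long trueNeg) {
        this.truePos = truePos;
        this.falsePos = falsePos;
        this.falseNeg = falseNeg;
        this.trueNeg = trueNeg;
    }

    long total() {
        return truePos + falsePos + falseNeg + trueNeg;
    }

    /** Fraction of all documents whose predicted label matches the actual label. */
    double accuracy() {
        return (double) (truePos + trueNeg) / total();
    }

    /** Fraction of predicted positives that are actually positive. */
    double precision() {
        return (double) truePos / (truePos + falsePos);
    }

    /** Fraction of actual positives that were predicted positive. */
    double recall() {
        return (double) truePos / (truePos + falseNeg);
    }
}
```

With the slide's counts, `new ConfusionMatrix(367, 114, 119, 422).accuracy()` evaluates to 789 / 1022, which rounds to the 77.2% quoted above.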
  • 37. Questions
    Thanks for your attention! Questions?
    • Code & scripts for this talk are available on my GitHub presentations page (http://github.com/cestella/presentations/).
    • Find me at http://caseystella.com
    • Twitter handle: @casey_stella
    • Email address: cstella@hortonworks.com
  • 38. Bibliography
    [1] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
    [2] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642, Stroudsburg, PA, October 2013. Association for Computational Linguistics.