SlideShare une entreprise Scribd logo
1  sur  16
Télécharger pour lire hors ligne
Alan F. Gates,[object Object],Yahoo!,[object Object],Pig, Making Hadoop Easy,[object Object]
Who Am I?,[object Object],Pig committer,[object Object],Hadoop PMC Member,[object Object],An architect in Yahoo!grid team,[object Object],Or, as one coworker put it, “the lipstick on the Pig”,[object Object]
Who are you?,[object Object]
Motivation By Example,[object Object],   Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18 - 25.,[object Object],Load Users,[object Object],Load Pages,[object Object],Filter by age,[object Object],Join on name,[object Object],Group on url,[object Object],Count clicks,[object Object],Order by clicks,[object Object],Take top 5,[object Object]
In Map Reduce,[object Object]
In Pig Latin,[object Object],Users = load‘users’as (name, age);Fltrd = filter Users by        age >= 18 and age <= 25; Pages = load ‘pages’ as (user, url);Jnd = joinFltrdby name, Pages by user;Grpd = groupJndbyurl;Smmd = foreachGrpdgenerate group,COUNT(Jnd) as clicks;Srtd = orderSmmdby clicks desc;Top5 = limitSrtd 5;store Top5 into‘top5sites’;,[object Object]
Performance,[object Object],0.1,[object Object],0.4,,[object Object],0.5,[object Object],0.2,[object Object],0.3,[object Object],0.6, ,[object Object],0.7,[object Object]
Why not SQL?,[object Object],Data Factory,[object Object],Pig,[object Object],Pipelines,[object Object],Iterative Processing,[object Object],Research,[object Object],Data Warehouse,[object Object],Hive,[object Object],BI Tools,[object Object],Analysis,[object Object],Data Collection,[object Object]
Pig Highlights,[object Object],User defined functions (UDFs) can be written for column transformation (TOUPPER), or aggregation (SUM),[object Object],UDFs can be written to take advantage of the combiner,[object Object],Four join implementations built in:  hash, fragment-replicate, merge, skewed,[object Object],Multi-query:  Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned,[object Object],Order by provides total ordering across reducers in a balanced way,[object Object],Writing load and store functions is easy once an InputFormat and OutputFormat exist,[object Object],Piggybank, a collection of user contributed UDFs,[object Object]
Who uses Pig for What?,[object Object],70% of production jobs at Yahoo (10ks per day),[object Object],Also used by Twitter, LinkedIn, Ebay, AOL, …,[object Object],Used to,[object Object],Process web logs,[object Object],Build user behavior models,[object Object],Process images,[object Object],Build maps of the web,[object Object],Do research on raw data sets,[object Object]
Accessing Pig,[object Object],Submit a script directly,[object Object],Grunt, the pig shell,[object Object],PigServer Java class, a JDBC like interface,[object Object]
Components,[object Object],Job executes on cluster,[object Object],Hadoop Cluster,[object Object],Pig resides on user machine,[object Object],User machine,[object Object],No need to install anything extra on your Hadoop cluster.,[object Object]
How It Works,[object Object],Pig Latin,[object Object],A = LOAD ‘myfile’,[object Object],    AS (x, y, z);,[object Object],B = FILTER A by x > 0; ,[object Object],C = GROUP B BY x;,[object Object],D = FOREACH A GENERATE,[object Object],x, COUNT(B);,[object Object],STORE D INTO ‘output’;,[object Object],pig.jar:,[object Object],[object Object]
checks
optimizes
plans execution

Contenu connexe

Tendances

Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingMitsuharu Hamba
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoopjeffturner
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache HadoopChristopher Pezza
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pigdaijy
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Uwe Printz
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverDataWorks Summit
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - OverviewJay
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar ReportAtul Kushwaha
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdfEdureka!
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 

Tendances (19)

Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Hive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReadingHive vs Pig for HadoopSourceCodeReading
Hive vs Pig for HadoopSourceCodeReading
 
Intro to Hadoop
Intro to HadoopIntro to Hadoop
Intro to Hadoop
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Introduction to Apache Hadoop
Introduction to Apache HadoopIntroduction to Apache Hadoop
Introduction to Apache Hadoop
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Pig programming is more fun: New features in Pig
Pig programming is more fun: New features in PigPig programming is more fun: New features in Pig
Pig programming is more fun: New features in Pig
 
MapReduce basic
MapReduce basicMapReduce basic
MapReduce basic
 
Introduction to Pig
Introduction to PigIntroduction to Pig
Introduction to Pig
 
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)Introduction to the Hadoop Ecosystem (FrOSCon Edition)
Introduction to the Hadoop Ecosystem (FrOSCon Edition)
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Hadoop - Overview
Hadoop - OverviewHadoop - Overview
Hadoop - Overview
 
Hadoop Seminar Report
Hadoop Seminar ReportHadoop Seminar Report
Hadoop Seminar Report
 
Hadoop Administration pdf
Hadoop Administration pdfHadoop Administration pdf
Hadoop Administration pdf
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation Hadoop
 
2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao2008 Ur Tech Talk Zshao
2008 Ur Tech Talk Zshao
 
Apache Pig
Apache PigApache Pig
Apache Pig
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 

En vedette

Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopDavid Yahalom
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 

En vedette (10)

Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBase
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start Tutorial
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 
A beginners guide to Cloudera Hadoop
A beginners guide to Cloudera HadoopA beginners guide to Cloudera Hadoop
A beginners guide to Cloudera Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 

Similaire à Pig, Making Hadoop Easy

Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateYahoo Developer Network
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigJason Shao
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2Wes Floyd
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlKhanderao Kand
 
Pig on spark
Pig on sparkPig on spark
Pig on sparkSigmoid
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latinknowbigdata
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Trainingstratapps
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010Thejas Nair
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingYahoo Developer Network
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages JaunesDataiku
 
Ben ford intro
Ben ford introBen ford intro
Ben ford introPuppet
 
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordPuppet
 
Using Google (Cloud) APIs
Using Google (Cloud) APIsUsing Google (Cloud) APIs
Using Google (Cloud) APIswesley chun
 
Mashup University 4: Intro To Mashups
Mashup University 4: Intro To MashupsMashup University 4: Intro To Mashups
Mashup University 4: Intro To MashupsJohn Herren
 
Practical pig
Practical pigPractical pig
Practical pigtrihug
 

Similaire à Pig, Making Hadoop Easy (20)

Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan GateApache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
Apache Hadoop India Summit 2011 talk "Pig - Making Hadoop Easy" by Alan Gate
 
03 pig intro
03 pig intro03 pig intro
03 pig intro
 
Apache PIG introduction
Apache PIG introductionApache PIG introduction
Apache PIG introduction
 
Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Sql saturday pig session (wes floyd) v2
Sql saturday   pig session (wes floyd) v2Sql saturday   pig session (wes floyd) v2
Sql saturday pig session (wes floyd) v2
 
Big dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosqlBig dataarchitecturesandecosystem+nosql
Big dataarchitecturesandecosystem+nosql
 
Pig on spark
Pig on sparkPig on spark
Pig on spark
 
Introduction to pig & pig latin
Introduction to pig & pig latinIntroduction to pig & pig latin
Introduction to pig & pig latin
 
Big Data Hadoop Training
Big Data Hadoop TrainingBig Data Hadoop Training
Big Data Hadoop Training
 
apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010apache pig performance optimizations talk at apachecon 2010
apache pig performance optimizations talk at apachecon 2010
 
Hadoop overview
Hadoop overviewHadoop overview
Hadoop overview
 
Hadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data ProcessingHadoop @ Yahoo! - Internet Scale Data Processing
Hadoop @ Yahoo! - Internet Scale Data Processing
 
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages JaunesBreizhJUG - Janvier 2014 - Big Data -  Dataiku - Pages Jaunes
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
 
Pig latin
Pig latinPig latin
Pig latin
 
Ben ford intro
Ben ford introBen ford intro
Ben ford intro
 
Telemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben FordTelemetry doesn't have to be scary; Ben Ford
Telemetry doesn't have to be scary; Ben Ford
 
Hi5 Open Social
Hi5   Open SocialHi5   Open Social
Hi5 Open Social
 
Using Google (Cloud) APIs
Using Google (Cloud) APIsUsing Google (Cloud) APIs
Using Google (Cloud) APIs
 
Mashup University 4: Intro To Mashups
Mashup University 4: Intro To MashupsMashup University 4: Intro To Mashups
Mashup University 4: Intro To Mashups
 
Practical pig
Practical pigPractical pig
Practical pig
 

Plus de Nick Dimiduk

Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
Apache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixNick Dimiduk
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 ReleaseNick Dimiduk
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014Nick Dimiduk
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101Nick Dimiduk
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low LatencyNick Dimiduk
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for ArchitectsNick Dimiduk
 
HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)Nick Dimiduk
 
Bring Cartography to the Cloud
Bring Cartography to the CloudBring Cartography to the Cloud
Bring Cartography to the CloudNick Dimiduk
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for ArchitectsNick Dimiduk
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)Nick Dimiduk
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLNick Dimiduk
 

Plus de Nick Dimiduk (13)

Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
Apache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - PhoenixApache Big Data EU 2015 - Phoenix
Apache Big Data EU 2015 - Phoenix
 
Apache HBase 1.0 Release
Apache HBase 1.0 ReleaseApache HBase 1.0 Release
Apache HBase 1.0 Release
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
 
HBase Blockcache 101
HBase Blockcache 101HBase Blockcache 101
HBase Blockcache 101
 
HBase Data Types
HBase Data TypesHBase Data Types
HBase Data Types
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low Latency
 
Apache HBase for Architects
Apache HBase for ArchitectsApache HBase for Architects
Apache HBase for Architects
 
HBase Data Types (WIP)
HBase Data Types (WIP)HBase Data Types (WIP)
HBase Data Types (WIP)
 
Bring Cartography to the Cloud
Bring Cartography to the CloudBring Cartography to the Cloud
Bring Cartography to the Cloud
 
HBase for Architects
HBase for ArchitectsHBase for Architects
HBase for Architects
 
HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)HBase Client APIs (for webapps?)
HBase Client APIs (for webapps?)
 
Introduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQLIntroduction to Hadoop, HBase, and NoSQL
Introduction to Hadoop, HBase, and NoSQL
 

Pig, Making Hadoop Easy

Notes de l'éditeur

  1. How many have used Pig? How many have looked at it and have a basic understanding of it?
  2. Demo script:Show group query first, talk about: load and schema (none, declared, from data) data types data sources need not be from HDFS or even from files parallel clause, how parallelism is determined on maps how grouping works in Pig LatinSo far what I’ve shown you is a simple join/group query. Now let’s look at something less straight forward in SQLOften people want to group data a number of different ways. Look at multiquery script: Note how there’s a branch in the logic nowOften want to operate on the result of each record in a previous statement. Look at top5 query Note nested foreach allows you to operate on each record coming out of group by Since result of group by is a bag in each record, can apply operators to that bag Currently support order, distinct, filter, limit Use of flatten at the end Use of positional parametersThere will always be logic you need to write that you can’t get from Pig Latin. This is where rich support of UDFs come in. Look at session query Note registering UDF UDF now called like any other Pig builtin function (in fact Pig builtins implemented as UDFs)Look at SessionAnalysis.java Class name is UDF name Input to UDF is always a Tuple, avoids need to declare expected input, means UDF has to check what it gets Talk about how projection of bags works Talk about how EvalFunc is templatized on return typeAlso easy to write load and store functions to fit your data needs