Apache Pig
Making data transformation easy
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce Problem Solving
Complex problem
Map Reduce Problem Solving
➢ We need to solve a complex problem
➢ The atomic operations we need are more complex than plain M/R
➢ Java is not a data-oriented language → low productivity
➢ Any solutions?
Apache Pig to the rescue!
Join in Apache Hadoop
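// DeliveryFileMapper.java: map side of a reduce-side join
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DeliveryFileMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private String cellNumber, deliveryCode, fileTag = "DR~";

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        String[] splitarray = line.split(",");
        cellNumber = splitarray[0].trim();
        deliveryCode = splitarray[1].trim();
        // The cell number is the join key; the "DR~" tag marks the source file
        output.collect(new Text(cellNumber), new Text(fileTag + deliveryCode));
    }
}

// SmsReducer.java: reduce side of the join
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SmsReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    private String customerName, deliveryReport;
    private static Map<String, String> DeliveryCodesMap = new HashMap<String, String>();

    public void configure(JobConf job) {
        // loadDeliveryStatusCodes() is not shown on the slide
        loadDeliveryStatusCodes();
    }

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // Values tagged "CD" carry the customer name, values tagged "DR" the delivery code
        while (values.hasNext()) {
            String currValue = values.next().toString();
            String[] valueSplitted = currValue.split("~");
            if (valueSplitted[0].equals("CD"))
                customerName = valueSplitted[1].trim();
            else if (valueSplitted[0].equals("DR"))
                deliveryReport = DeliveryCodesMap.get(valueSplitted[1].trim());
        }
        if (customerName != null && deliveryReport != null)
            output.collect(new Text(customerName), new Text(deliveryReport));
        else if (customerName == null)
            output.collect(new Text("customerName"), new Text(deliveryReport));
        else if (deliveryReport == null)
            output.collect(new Text(customerName), new Text("deliveryReport"));
    }
}
** Extracted from http://kickstarthadoop.blogspot.com.es/2011/09/joins-with-plain-map-reduce.html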
Join in Apache Pig
C = JOIN A BY keyA, B BY keyB;
Apache Pig overview
➢ Framework layer on top of HDFS and Hadoop
➢ Developed at Yahoo in 2006
➢ Users: Yahoo, LinkedIn, Twitter, IBM, etc.
➢ Latest major release: 0.14.0 (November 2014)
http://pig.apache.org/
Apache Hadoop vs. Apache Pig
Apache Hadoop:
➢ M/R as atomic operations
➢ Java is not data oriented
➢ Full flexibility inside M/R jobs
➢ Efficiency
Apache Pig:
➢ ETL operations: Join, Filter, Group, etc.
➢ Pig Latin: data scripting language
➢ UDFs in Java (and other languages)
➢ Overhead of the translation to M/R
Pig Programming Model: Data
➢ Pig operations operate on relations
➢ A relation is a bag
➢ A bag is a collection of tuples
➢ A tuple is an ordered set of fields
➢ A field is any type of data
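The nesting described above can be pictured with ordinary Java collections. This is only an illustrative sketch, not how Pig stores data: the class and method names (`PigDataModelSketch`, `tuple`, `bag`, `students`) are made up for this example, and the sample rows are borrowed from the Students relation used later in the deck.

```java
import java.util.Arrays;
import java.util.List;

// A plain-Java picture of Pig's nesting: field -> tuple -> bag -> relation.
// All names here are hypothetical; Pig has its own Tuple and DataBag types.
class PigDataModelSketch {
    // A tuple is an ordered list of fields; a field can hold any type.
    static List<Object> tuple(Object... fields) {
        return Arrays.asList(fields);
    }

    // A bag is a collection of tuples.
    @SafeVarargs
    static List<List<Object>> bag(List<Object>... tuples) {
        return Arrays.asList(tuples);
    }

    // A small relation: simply the outermost bag.
    static List<List<Object>> students() {
        return bag(
            tuple(1L, "John", "Doe", "M", 18),
            tuple(2L, "Mary", "Doe", "F", 20));
    }

    public static void main(String[] args) {
        // A field may itself be a bag: (gender, {bag of student tuples}).
        List<Object> grouped = tuple("M", bag(tuple(1L, "John", "Doe", "M", 18)));
        System.out.println(students());
        System.out.println(grouped);
    }
}
```

The last tuple in `main` shows why the model is recursive: a field can itself be a bag, which is the shape that grouping operations produce.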
Sounds complicated… but it’s not!
➢ Basic data types:
○ Boolean: true, false
○ Int and Long: 1, 2, 3, 4, 5
○ Float and Double: 2.3, 1.4, 4.5
○ Chararray: 'Hello', 'I am a string'
○ DateTime: 2014-09-11T12:20:14.1234+00:00
○ … and more, but you probably won't use them very often
➢ Tuple: a catch-all data type, e.g. (1,John,Doe,M,18)
➢ Bag: a collection of tuples, e.g. {(1,John,Doe,M,18),(2,Mary,Doe,F,20)}
➢ And relations? Just the outermost (distributed) bags
Loading data?
➢ Loading data? Not yet: first, let's meet our friend Grunt
➢ Interactive Pig shell → nice for debugging and experimenting
➢ pig -x local or pig -x mapred
Loading data?
➢ Data source: local file system or HDFS (usually!)
➢ LOAD instruction:
○ Data is automatically loaded into a distributed relation
Students = LOAD 'student_path' USING PigStorage('\t', '-noschema') AS
    (student_id: long, name: chararray, surname: chararray, gender: chararray, age: int);
○ Students: relation name
○ 'student_path': path on the local disk or HDFS
○ PigStorage: connector
○ '\t': field separator
○ AS (...): tuple schema
Grades = LOAD 'grade_path' USING PigStorage(',', '-schema');
○ Grades: relation name
○ 'grade_path': path on the local disk or HDFS
○ ',': field separator
○ '-schema': load the schema from the .pig_schema file
Checking relations’ content
➢ DUMP instruction:
○ Prints the content of a relation to standard output
DUMP Students;
(1,John,Doe,M,18)
(2,Mary,Doe,F,20)
(3,Lara,Croft,F,25)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(6,Sarah,Kerrigan,F,21)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(9,Princess,Peach,F,21)
(10,Peter,Parker,M,23)
Checking relations’ content
➢ DESCRIBE instruction:
○ Prints the schema of the relation to standard output
DESCRIBE Students;
Students: {student_id: long,name: chararray,surname: chararray,gender: chararray,age: int}
Checking relations’ content
➢ ILLUSTRATE instruction:
○ Prints the schema of the relation and an example tuple to standard output
ILLUSTRATE Students;
------------------------------------------------------------------------------------------------
| Students | student_id:long | name:chararray | surname:chararray | gender:chararray | age:int |
------------------------------------------------------------------------------------------------
|          | 9               | Princess       | Peach             | F                | 21      |
------------------------------------------------------------------------------------------------
Operating on relations
➢ FOREACH instruction:
○ Generates a new relation by projecting data from a relation
StudentsProj = FOREACH Students GENERATE student_id, name, age;
○ StudentsProj: new relation; Students: base relation; student_id, name, age: projected data
○ We can generate new data too!
StudentsProj = FOREACH Students GENERATE student_id, CONCAT(name, surname) AS full_name, age;
Operating on relations
➢ FOREACH instruction:
○ Let us execute the instruction and… it seems that nothing happens!
○ We got some tracing output with LOAD, DUMP, and ILLUSTRATE…
○ Any ideas on this issue?
Operating on relations
➢ Pig employs lazy evaluation
➢ Computation is only triggered by:
○ DUMP and STORE (and ILLUSTRATE, on a sample of the data)
➢ LOAD on its own only declares the relation
➢ Pig keeps a DAG of the M/R jobs needed to compute the relations (optimized!)
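Java streams behave in a comparable way, which makes them a handy analogy for the lazy evaluation described above. The sketch below is not Pig code: `declareProjection` stands in for a FOREACH ... GENERATE statement and `dump` for DUMP/STORE, and both names (as well as the `processed` counter) are invented for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: a Java stream pipeline is declared first and executed later,
// much like a Pig relation is declared and only computed on DUMP or STORE.
class LazyEvaluationSketch {
    // Counts how many elements have actually been processed.
    static final AtomicInteger processed = new AtomicInteger();

    // Declares a projection without running it (plays the role of FOREACH ... GENERATE).
    static Stream<String> declareProjection(List<String> names) {
        return names.stream().map(n -> {
            processed.incrementAndGet();
            return n.toUpperCase();
        });
    }

    // Asking for output forces the computation (plays the role of DUMP or STORE).
    static List<String> dump(Stream<String> plan) {
        return plan.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Stream<String> plan = declareProjection(Arrays.asList("John", "Mary", "Lara"));
        System.out.println("processed after declaring: " + processed.get());
        List<String> result = dump(plan);
        System.out.println("processed after dump: " + processed.get() + " " + result);
    }
}
```

Declaring the pipeline leaves the counter untouched; only the call to `dump` makes the elements flow through the mapping step.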
Exercise: Who is under 25?
➢ Extend the Students relation with a field that indicates whether each student is under 25 years old
(1,John,Doe,M,18,true)
(2,Mary,Doe,F,20,true)
(3,Lara,Croft,F,25,false)
(4,Sherlock,Holmes,M,36,false)
(5,John,Watson,M,38,false)
(6,Sarah,Kerrigan,F,21,true)
...
Operating on relations
➢ FILTER instruction:
○ Generate a new relation by filtering data on a relation
StudentsFilt= FILTER Students BY age > 24 AND age < 34;
DUMP StudentsFilt;
(3,Lara,Croft,F,25)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
○ StudentsFilt: new relation; Students: base relation; BY age > 24 AND age < 34: condition to fulfill
Operating on relations
➢ SPLIT instruction:
○ Splits a relation into multiple relations based on conditions
SPLIT Students INTO StudentsMale IF gender == 'M', StudentsFemale OTHERWISE;
DUMP StudentsMale;
(1,John,Doe,M,18)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(10,Peter,Parker,M,23)
○ Students: base relation; StudentsMale and StudentsFemale: new relations
○ IF gender == 'M': condition a tuple must fulfill to enter the new relation
○ OTHERWISE means the rest
Operating on relations
➢ SPLIT instruction:
○ Conditions may overlap: a tuple goes into every relation whose condition it fulfills
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder30 IF age<30, OtherStudents OTHERWISE;
DUMP OtherStudents;
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
Operating on relations
➢ GROUP instruction:
○ Creates tuples with the key and a bag of the tuples that share that key value
StudentsGr = GROUP Students BY gender;
DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(3,Lara,Croft,F,25),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36),(1,John,Doe,M,18)})
DESCRIBE StudentsGr;
StudentsGr: {group: chararray,Students: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
○ StudentsGr: new relation (with a new schema!); Students: base relation
○ BY gender: the values of these fields are used to make the groups
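What GROUP computes can be modelled in a few lines of plain Java. This is a single-machine sketch under simplifying assumptions (tuples as `List<Object>`, the key given by its position), whereas Pig runs the grouping as M/R jobs; `GroupSketch` and `groupBy` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Single-machine model of GROUP ... BY: one entry per key value, holding
// the bag of complete tuples that share that key.
class GroupSketch {
    static Map<Object, List<List<Object>>> groupBy(List<List<Object>> relation, int keyIndex) {
        Map<Object, List<List<Object>>> groups = new LinkedHashMap<>();
        for (List<Object> tuple : relation) {
            // The whole tuple goes into the bag, including the key field itself.
            groups.computeIfAbsent(tuple.get(keyIndex), k -> new ArrayList<>()).add(tuple);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<List<Object>> students = Arrays.asList(
            Arrays.<Object>asList(1L, "John", "Doe", "M", 18),
            Arrays.<Object>asList(2L, "Mary", "Doe", "F", 20),
            Arrays.<Object>asList(3L, "Lara", "Croft", "F", 25));
        // Field 3 is the gender, as in GROUP Students BY gender.
        System.out.println(groupBy(students, 3));
    }
}
```

Each map entry corresponds to one output tuple of the grouped relation: the key plays the role of the `group` field and the list plays the role of the bag named after the base relation.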
Operating on relations
➢ GROUP instruction:
○ We can group multiple relations at once; this creates one bag per relation
StudentsCoGr = GROUP StudentsUnder25 BY gender, OtherStudents BY gender;
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36)})
DESCRIBE StudentsCoGr;
StudentsCoGr: {group: chararray,StudentsUnder25: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)},OtherStudents: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
○ StudentsUnder25, OtherStudents: base relations; StudentsCoGr: new relation (with a new schema!)
○ BY gender: the values of these fields are used to make the groups
Operating on relations
➢ Nested FOREACH:
○ Operates on the data in the bags inside a relation and then projects
StudentsNested = FOREACH StudentsGr {
    Information = FOREACH Students GENERATE name, surname;
    GENERATE group AS gender, Information AS student_information;
}
DUMP StudentsNested;
(F,{(Princess,Peach),(Sarah,Kerrigan),(Lara,Croft),(Mary,Doe)})
(M,{(Peter,Parker),(Tony,Stark),(Bruce,Wayne),(John,Watson),(Sherlock,Holmes),(John,Doe)})
○ StudentsGr: base relation; Students: bag inside the base relation
○ Information: new bag; StudentsNested: new relation; the final GENERATE projects the result
Operating on relations
➢ (inner) JOIN instruction:
○ Our classic database operator for relations!
StudentsGrades = JOIN Students BY student_id, Grades BY student_id;
DUMP StudentsGrades;
(1,John,Doe,M,18,1,Physics,2.3)
(1,John,Doe,M,18,1,Biology,4.5)
(1,John,Doe,M,18,1,Engineering,7.7)
(1,John,Doe,M,18,1,Math,5.6)
(2,Mary,Doe,F,20,2,Engineering,6.7)
(2,Mary,Doe,F,20,2,Physics,6.7)
…
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
○ Students, Grades: base relations; StudentsGrades: new relation (with a new schema!)
○ BY student_id: the values of these fields are used to match tuples
Operating on relations
➢ (left) JOIN instruction:
○ Keeps every tuple of the left relation, even those without a match in the right one
StudentsGrades = JOIN Students BY student_id LEFT, Grades BY student_id;
DUMP StudentsGrades;
(6,Sarah,Kerrigan,F,21,,,)
(7,Bruce,Wayne,M,32,7,Engineering,8.5)
(7,Bruce,Wayne,M,32,7,Physics,8.9)
(7,Bruce,Wayne,M,32,7,Math,8.5)
(8,Tony,Stark,M,33,8,Math,6.7)
…
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
○ Students: left relation; Grades: right relation; StudentsGrades: new relation (with a new schema!)
○ Do not forget the LEFT keyword!
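The difference between the inner and the left join can be sketched with a naive nested loop in plain Java. The assumptions are deliberate simplifications: the join key is field 0 of every tuple, keys are never null, and everything fits in memory. Pig itself executes joins as distributed M/R jobs, and `JoinSketch` and `join` are names invented for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Naive nested-loop model of JOIN on field 0 of each tuple (keys assumed non-null).
class JoinSketch {
    // keepUnmatchedLeft = false models the inner join; true models JOIN ... LEFT,
    // where left tuples without a match are kept and padded with nulls.
    static List<List<Object>> join(List<List<Object>> left, List<List<Object>> right,
                                   int rightWidth, boolean keepUnmatchedLeft) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> l : left) {
            boolean matched = false;
            for (List<Object> r : right) {
                if (l.get(0).equals(r.get(0))) {
                    List<Object> joined = new ArrayList<>(l);
                    joined.addAll(r); // fields of both tuples, key included twice
                    out.add(joined);
                    matched = true;
                }
            }
            if (!matched && keepUnmatchedLeft) {
                List<Object> padded = new ArrayList<>(l);
                for (int i = 0; i < rightWidth; i++) {
                    padded.add(null);
                }
                out.add(padded);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> students = Arrays.asList(
            Arrays.<Object>asList(6L, "Sarah"),
            Arrays.<Object>asList(7L, "Bruce"));
        List<List<Object>> grades = Arrays.asList(
            Arrays.<Object>asList(7L, "Physics", 8.9),
            Arrays.<Object>asList(7L, "Math", 8.5));
        System.out.println(join(students, grades, 3, false)); // inner: only student 7
        System.out.println(join(students, grades, 3, true));  // left: student 6 padded with nulls
    }
}
```

The padded tuple for student 6 corresponds to the `(6,Sarah,Kerrigan,F,21,,,)` line in the left-join output, where the empty positions are nulls.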
Operating on relations
➢ CROSS instruction:
○ Cartesian product of two or more relations
StudentsCr = CROSS Students, Grades;
DUMP StudentsCr;
(10,Peter,Parker,M,23,10,Physics,3.3)
(10,Peter,Parker,M,23,9,Physics,5.0)
(10,Peter,Parker,M,23,7,Physics,8.9)
(10,Peter,Parker,M,23,5,Physics,4.5)
(10,Peter,Parker,M,23,4,Physics,6.6)
(10,Peter,Parker,M,23,3,Physics,5.7)
(10,Peter,Parker,M,23,2,Physics,6.7)
(10,Peter,Parker,M,23,1,Physics,2.3)
…
DESCRIBE StudentsCr;
StudentsCr: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
○ Students, Grades: relations 1 and 2; StudentsCr: new relation (with a new schema!)
Operating on relations
➢ UNION instruction:
○ Merges multiple relations into a single relation
StudentsUnion = UNION Students, Grades;
DUMP StudentsUnion;
(1,John,Doe,M,18)
(1,Math,5.6)
(2,Mary,Doe,F,20)
(2,Math,8.9)
(3,Lara,Croft,F,25)
(3,Math,7.1)
…
DESCRIBE StudentsUnion;
Schema for StudentsUnion unknown.
○ Students, Grades: relations 1 and 2; StudentsUnion: new relation
○ UNION does not preserve the schema when the relations' schemas differ!
Operating on relations
➢ DISTINCT instruction:
○ Only preserves unique tuples
Courses = FOREACH Grades GENERATE course AS course;
UniqueCourses = DISTINCT Courses;
DUMP UniqueCourses;
(Math)
(Biology)
(Physics)
(Engineering)
○ UniqueCourses: new relation
Operating on relations
➢ ORDER BY instruction:
○ Sorts a relation by a specific criterion
SortedGrades = ORDER Grades BY mark DESC;
DUMP SortedGrades;
(2,Biology,10.0)
(10,Engineering,10.0)
(10,Math,10.0)
(5,Biology,10.0)
(5,Engineering,9.0)
(7,Physics,8.9)
…
○ Grades: base relation; SortedGrades: new relation
○ BY mark: field(s) used to sort
○ Sort criterion: DESC (descending) or ASC (ascending)
Operating on relations
➢ LIMIT instruction:
○ Truncates the relation to a maximum size
BestGrades = LIMIT SortedGrades 3;
DUMP BestGrades;
(10,Math,10.0)
(10,Engineering,10.0)
(2,Biology,10.0)
○ SortedGrades: base relation; BestGrades: new relation; 3: maximum number of tuples
Operating on relations
➢ RANK instruction:
○ Prepends the position of each tuple in the relation
RankedGrades = RANK SortedGrades;
DUMP RankedGrades;
(1,2,Biology,10.0)
(2,10,Engineering,10.0)
(3,10,Math,10.0)
(4,5,Biology,10.0)
(5,5,Engineering,9.0)
…
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
○ SortedGrades: base relation; RankedGrades: new relation; the first field is the rank number!
Operating on relations
➢ RANK instruction:
○ We can also sort and rank in one step!
RankedGrades = RANK SortedGrades BY student_id ASC, mark DESC;
DUMP RankedGrades;
(1,1,Engineering,7.7)
(2,1,Math,5.6)
(3,1,Biology,4.5)
(4,1,Physics,2.3)
(5,2,Biology,10.0)
…
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
○ SortedGrades: base relation; RankedGrades: new relation
○ BY student_id ASC, mark DESC: fields to sort by
Operating on relations
➢ SAMPLE instruction:
○ Draws a random sample of the relation
SampledGrades = SAMPLE Grades 0.05;
DUMP SampledGrades;
(4,Engineering,8.0)
○ Grades: base relation; SampledGrades: new relation; 0.05: proportion to sample
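One way to picture proportional sampling is a coin flip per tuple. The sketch below assumes exactly that model, so the size of the result varies from run to run and is only close to the requested proportion on average; it is not taken from Pig's implementation. `SampleSketch` and `sample` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Assumed model of proportional sampling: each tuple is kept independently
// with probability `fraction`, so the output size is not fixed.
class SampleSketch {
    static <T> List<T> sample(List<T> relation, double fraction, Random rng) {
        List<T> out = new ArrayList<>();
        for (T tuple : relation) {
            // nextDouble() is uniform in [0, 1): 0.0 keeps nothing, 1.0 keeps everything.
            if (rng.nextDouble() < fraction) {
                out.add(tuple);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> grades = Arrays.asList("(1,Math,5.6)", "(2,Math,8.9)", "(3,Math,7.1)", "(4,Math,2.3)");
        System.out.println(sample(grades, 0.5, new Random()));
    }
}
```

Under this model a small relation sampled at 0.05 will often yield a single tuple or none at all, which fits the one-tuple output shown on the slide.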
Exercise: Top grades
➢ Get the 3 top grades for each student
(1,{(Engineering,7.7),(Math,5.6),(Biology,4.5)})
(2,{(Biology,10.0),(Math,8.9),(Engineering,6.7)})
(3,{(Math,7.1),(Physics,5.7),(Engineering,4.3)})
(4,{(Engineering,8.0),(Biology,6.7),(Physics,6.6)})
(5,{(Biology,10.0),(Engineering,9.0),(Math,6.7)})
(6,{(,)})
...
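Before writing the Pig script, it can help to pin down the computation the exercise asks for: group the grades by student, sort each group by mark in descending order, and keep at most three. The plain-Java sketch below models only that logic; it is not a Pig solution, it does not produce the empty entry for students without grades, and `TopGradesSketch`, `Grade` and `topPerStudent` are invented names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Models the logic of the exercise: group by student, sort by mark, keep the top n.
class TopGradesSketch {
    static final class Grade {
        final long studentId;
        final String course;
        final double mark;

        Grade(long studentId, String course, double mark) {
            this.studentId = studentId;
            this.course = course;
            this.mark = mark;
        }

        @Override
        public String toString() {
            return "(" + course + "," + mark + ")";
        }
    }

    static Map<Long, List<Grade>> topPerStudent(List<Grade> grades, int n) {
        // Step 1: one bag of grades per student.
        Map<Long, List<Grade>> byStudent = new TreeMap<>();
        for (Grade g : grades) {
            byStudent.computeIfAbsent(g.studentId, k -> new ArrayList<>()).add(g);
        }
        // Step 2: inside each bag, sort by mark (descending) and truncate to n.
        for (List<Grade> bag : byStudent.values()) {
            bag.sort(Comparator.comparingDouble((Grade g) -> g.mark).reversed());
            if (bag.size() > n) {
                bag.subList(n, bag.size()).clear();
            }
        }
        return byStudent;
    }

    public static void main(String[] args) {
        List<Grade> grades = Arrays.asList(
            new Grade(1L, "Physics", 2.3), new Grade(1L, "Biology", 4.5),
            new Grade(1L, "Engineering", 7.7), new Grade(1L, "Math", 5.6));
        System.out.println(topPerStudent(grades, 3));
    }
}
```

The two steps map onto operators already seen in the deck: a grouping, followed by a sort and a truncation applied inside each bag.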
Operating on relations
➢ CUBE instruction:
○ Is this really useful? Yes! Many aggregates with just one operation
CubedGrades = CUBE Grades BY CUBE(student_id,course);
CubedGrades = FOREACH CubedGrades GENERATE group, AVG(cube.mark);
DUMP CubedGrades;
((,Math),7.188888888888889)
((,Biology),7.8)
((,Physics),5.375)
((,Engineering),6.877777777777778)
((,),6.729032258064516)
((2,Math),8.9)
((2,Biology),10.0)
((2,),8.075)
…
Operating on relations
➢ CUBE/ROLLUP instruction:
○ Like the standard CUBE, but null values are introduced from right to left
RolledGrades = CUBE Grades BY ROLLUP(course,student_id);
RolledGrades = FOREACH RolledGrades GENERATE group, AVG(cube.mark);
DUMP RolledGrades;
((Math,),7.188888888888889)
((Math,2),8.9)
((Math,3),7.1)
((Math,4),2.3)
((Math,5),6.7)
((Math,7),8.5)
((Math,8),6.7)
((Math,9),8.9)
…
○ The order of the fields in ROLLUP matters!
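The difference between CUBE and ROLLUP comes down to which combinations of dimensions are aggregated. The sketch below only enumerates those combinations, using the dimension names as placeholders and `null` for a dimension that has been aggregated away; it computes no averages. `CubeRollupSketch`, `cube` and `rollup` are names made up for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Enumerates the grouping combinations produced by CUBE and by ROLLUP.
class CubeRollupSketch {
    // CUBE: every subset of the dimensions is kept once (2^n combinations).
    static List<List<String>> cube(List<String> dims) {
        List<List<String>> out = new ArrayList<>();
        int n = dims.size();
        for (int mask = 0; mask < (1 << n); mask++) {
            List<String> combo = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                combo.add((mask & (1 << i)) != 0 ? dims.get(i) : null);
            }
            out.add(combo);
        }
        return out;
    }

    // ROLLUP: only prefixes are kept (n + 1 combinations), so nulls
    // appear from right to left and the order of the dimensions matters.
    static List<List<String>> rollup(List<String> dims) {
        List<List<String>> out = new ArrayList<>();
        int n = dims.size();
        for (int keep = n; keep >= 0; keep--) {
            List<String> combo = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                combo.add(i < keep ? dims.get(i) : null);
            }
            out.add(combo);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> dims = Arrays.asList("course", "student_id");
        System.out.println("CUBE:   " + cube(dims));
        System.out.println("ROLLUP: " + rollup(dims));
    }
}
```

With `(course, student_id)`, the rollup never produces the combination where only `student_id` is kept, which is why per-student averages appear in the CUBE output but not in the ROLLUP output.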
Operating on relations
➢ ASSERT instruction:
○ Asserts that the whole relation fulfills a condition
○ Useful for debugging
ASSERT Grades BY mark > 0.0, 'marks should be greater than 0';
○ Grades: base relation; mark > 0.0: condition; the quoted text is the error message
Finally, storing data!
➢ STORE instruction:
○ Stores the relation into the local FS or HDFS (usually!)
○ Useful for debugging
STORE BestGrades INTO 'best_grades_path' USING PigStorage('\t', '-noschema');
○ BestGrades: relation; 'best_grades_path': path where the data is stored
○ PigStorage: connector; '\t': field separator
Problems solved?!
Only these operations?
➢ ASSERT
➢ COGROUP
➢ CROSS
➢ CUBE
➢ DISTINCT
➢ FILTER
➢ FOREACH
➢ GROUP
➢ JOIN
➢ LIMIT
➢ LOAD
➢ ORDER, RANK
➢ SAMPLE
➢ SPLIT
➢ UNION
➢ DUMP, ILLUSTRATE, DESCRIBE
Functions & user defined functions
➢ Transform data in data projections
➢ Built-in functions:
○ math functions, string functions, datetime functions, casting functions, etc.
➢ User defined functions:
○ Our own functions written in Java, Python, Ruby, JavaScript, etc.
Functions & user defined functions
➢ Bag functions:
○ AVG/MAX/MIN/SUM: compute the average/max/min/sum of a bag of numeric values
GradesGr = GROUP Grades BY course;
GradesAvg = FOREACH GradesGr GENERATE group AS course, AVG(Grades.mark) AS avg_mark;
DUMP GradesAvg;
(Math,7.188888888888889)
(Biology,7.8)
(Physics,5.375000000000001)
(Engineering,6.877777777777777)
○ Grades.mark: use only this field of the tuples in the bag
Functions & user defined functions
➢ Bag functions:
○ COUNT: number of (non-null) elements in a bag
GradesCount = FOREACH GradesGr GENERATE group AS course, COUNT(Grades) AS number_students;
DUMP GradesCount;
(Math,9)
(Biology,5)
(Physics,8)
(Engineering,9)
Functions & user defined functions
➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on the input; on a bag, it produces one tuple per element of the bag
DUMP GradesGr;
(Math,{(8,Math,6.7),(1,Math,5.6),(10,Math,10.0),(9,Math,8.9),(2,Math,8.9),(3,Math,7.1),(4,Math,2.3),(5,Math,6.7),(7,Math,8.5)})
(Biology,{(5,Biology,10.0),(4,Biology,6.7),(2,Biology,10.0),(1,Biology,4.5),(9,Biology,7.8)})
...
GradesFlat = FOREACH GradesGr GENERATE group AS course, FLATTEN(Grades.mark) AS mark;
DUMP GradesFlat;
(Math,6.7)
(Math,5.6)
(Math,10.0)
…
Functions & user defined functions
➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on the input; on a tuple, it promotes the tuple's fields to the top level
GradesTuple = FOREACH Grades GENERATE student_id, TOTUPLE(course, mark) AS tuple_mark;
DUMP GradesTuple;
(1,(Math,5.6))
(2,(Math,8.9))
(3,(Math,7.1))
(4,(Math,2.3))
...
GradesUntupled = FOREACH GradesTuple GENERATE student_id AS student_id, FLATTEN(tuple_mark);
DUMP GradesUntupled;
(1,Math,5.6)
(2,Math,8.9)
(3,Math,7.1)
(4,Math,2.3)
…
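The two behaviors of FLATTEN can be modelled side by side in plain Java. The sketch assumes the simplest case, one key field plus one nested bag or tuple, and uses lists for both; `FlattenSketch`, `flattenBag` and `flattenTuple` are hypothetical names, not Pig API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models the two behaviors of FLATTEN for a (key, nested) pair.
class FlattenSketch {
    // On a bag: one output tuple per element of the bag, each repeating the key.
    static List<List<Object>> flattenBag(Object key, List<Object> bag) {
        List<List<Object>> out = new ArrayList<>();
        for (Object element : bag) {
            out.add(Arrays.asList(key, element));
        }
        return out;
    }

    // On a tuple: still one output tuple, with the nested fields promoted to the top level.
    static List<Object> flattenTuple(Object key, List<Object> nested) {
        List<Object> out = new ArrayList<>();
        out.add(key);
        out.addAll(nested);
        return out;
    }

    public static void main(String[] args) {
        // (Math,{6.7,5.6,10.0}) becomes three tuples.
        System.out.println(flattenBag("Math", Arrays.<Object>asList(6.7, 5.6, 10.0)));
        // (1,(Math,5.6)) becomes the single tuple (1,Math,5.6).
        System.out.println(flattenTuple(1L, Arrays.<Object>asList("Math", 5.6)));
    }
}
```

Flattening a bag changes the number of tuples, while flattening a tuple only changes their shape.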
Functions & user defined functions
➢ Bag/Tuple functions:
○ SUBTRACT: tuples in the first bag that are not in the second
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsSub = FOREACH StudentsCoGr GENERATE group, SUBTRACT(StudentsUnder25, StudentsUnder20);
DUMP StudentsSub;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})
Functions & user defined functions
➢ Bag/Tuple functions:
○ DIFF: tuples that appear in only one of the two bags
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsDiff = FOREACH StudentsCoGr GENERATE group, DIFF(StudentsUnder25, StudentsUnder20);
DUMP StudentsDiff;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})
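In the examples above SUBTRACT and DIFF return the same bags, but only because the second bag happens to be contained in the first. The plain-Java sketch below models the two definitions given on the slides so that the difference becomes visible; it uses integers instead of tuples for brevity, and `BagDifferenceSketch`, `subtract` and `diff` are invented names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Models SUBTRACT (one-sided) and DIFF (two-sided) on small in-memory bags.
class BagDifferenceSketch {
    // Elements of the first bag that are not in the second.
    static <T> List<T> subtract(List<T> first, List<T> second) {
        List<T> out = new ArrayList<>();
        for (T t : first) {
            if (!second.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Elements that appear in exactly one of the two bags.
    static <T> List<T> diff(List<T> first, List<T> second) {
        List<T> out = subtract(first, second);
        out.addAll(subtract(second, first));
        return out;
    }

    public static void main(String[] args) {
        List<Integer> under25 = Arrays.asList(10, 1);
        List<Integer> other = Arrays.asList(1, 5);
        System.out.println(subtract(under25, other)); // 5 is ignored
        System.out.println(diff(under25, other));     // 5 is reported as well
    }
}
```

As soon as the second bag holds an element that the first one lacks, `diff` reports it and `subtract` does not.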
Functions & user defined functions
➢ Math functions:
○ Common math functions for numeric values:
■ ABS
■ EXP
■ FLOOR
■ LOG
■ RANDOM
■ ROUND
■ SQRT
■ ...
Functions & user defined functions
➢ String functions:
○ Transform chararrays:
■ ENDSWITH
■ LOWER
■ UPPER
■ SUBSTRING
■ TRIM
■ REPLACE
■ ...
Functions & user defined functions
➢ Datetime functions:
○ Get information on dates and timestamps:
■ AddDuration
■ CurrentTime
■ ToDate
■ ToString
■ ToUnixTime
■ DaysBetween
■ ...
User defined functions
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// UDF that returns the input bag with its tuples in random order
public class SHUFFLE extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null)
            throw new IOException("Invalid input: null");
        if (input.size() != 1)
            throw new IOException("Expected one argument");
        if (input.get(0) == null)
            return null;
        DataBag bag = (DataBag) input.get(0);
        // Copy the tuples into a list so that they can be shuffled
        List<Tuple> l = new ArrayList<Tuple>();
        for (Tuple t : bag)
            l.add(t);
        Collections.shuffle(l);
        DataBag resBag = BagFactory.getInstance().newDefaultBag(l);
        return resBag;
    }

    // The output has the same schema as the input bag
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(input.getField(0));
        } catch (Exception e) {
            return null;
        }
    }
}
More functions: DataFu Pig
➢ Library of useful UDFs, released in 2010
➢ Created by the LinkedIn engineering team:
○ Stats: variance, quantiles, median, etc.
○ Bags: concat, append, prepend, etc.
○ Sampling
○ PageRank
○ Session estimation
➢ Latest major release: 1.2.0 (Dec, 2013)
http://datafu.incubator.apache.org/
How to use UDF libraries
REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);
DUMP StudentBagConcat;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18),(1,John,Doe,M,18)})
○ REGISTER and DEFINE indicate the UDF to be included and the name to call it by
Scripting
-- Libraries and UDFs
REGISTER lib/datafu-1.2.0.jar;
DEFINE BagConcat datafu.pig.bags.BagConcat();
-- Load data ($student_file is a parameter)
Students = LOAD '$student_file' USING PigStorage('\t', '-noschema') AS
    (student_id: long, name: chararray, surname: chararray, gender: chararray, age: int);
-- Transform data
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);
-- Store data ($output is a parameter)
STORE StudentBagConcat INTO '$output' USING PigStorage('\t', '-schema');
Calling a script
pig -x mapred -f myscript.pig -param student_file=students.csv -param output=myoutput_path
○ -x mapred: execution mode
○ -f myscript.pig: script file
○ -param name=value: parameter definition
More on load/store
➢ Not limited to plain text
➢ Multiple supported formats: JSON, Avro, Accumulo, etc.
➢ Connectors to data sources: MongoDB, Cassandra, HBase, etc.
Exercise: Product association
➢ Detect pairs of products bought together (e.g., chairs and tables)
➢ Goal: recommend related products
➢ Association score:
Product association
➢ Purchases: purchases.tsv
product_id  user_id  price  date
1           23       14.5   2014-03-03
4           15       11.2   2014-08-09
88          3        48.3   2011-01-01
...
➢ Products: products.tsv
product_id  status
1           ok
5           ko
99          ok
...
Time to work!
Final notes: Pros & cons
Pros:
➢ Clear and simple syntax
➢ Interactive client
➢ Transparent M/R jobs
➢ Integration with Java and others
Cons:
➢ Not as flexible as Hadoop
➢ Oriented towards ETL, not AI
➢ No loops
Extra information
➢ http://pig.apache.org/
➢ Programming Pig. Alan Gates. O'Reilly
➢ StackOverflow

Apache Pig: Making data transformation easy

  • 1. Apache Pig Making data transformation easy Víctor Sánchez Anguix Universitat Politècnica de València MSc. In Artificial Intelligence, Pattern Recognition, and Digital Image Course 2014/2015
  • 2. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Map Reduce Problem Solving Complex problem
  • 3. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Map Reduce Problem Solving Complex problem
  • 4. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Map Reduce Problem Solving ➢ Need to solve complex problem ➢ More complex atomic operations than M/R ➢ Java is not a data oriented language → Low productivity ➢ Any solutions?
  • 5. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Apache Pig to the rescue!
  • 6. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Join in Apache Hadoop public class DeliveryFileMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text>{ private String cellNumber,deliveryCode,fileTag="DR~"; public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { String line = value.toString(); String splitarray[] = line.split(","); cellNumber = splitarray[0].trim(); deliveryCode = splitarray[1].trim(); output.collect(new Text(cellNumber), new Text (fileTag+deliveryCode)); } } ** Extracted from http://kickstarthadoop.blogspot.com. es/2011/09/joins-with-plain-map-reduce.html public class SmsReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> { private String customerName,deliveryReport; private static Map<String,String> DeliveryCodesMap= new HashMap<String,String>(); public void configure(JobConf job){ loadDeliveryStatusCodes(); } public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException{ while (values.hasNext()){ String currValue = values.next().toString(); String valueSplitted[] = currValue.split("~"); if(valueSplitted[0].equals("CD")) customerName=valueSplitted[1].trim(); else if(valueSplitted[0].equals("DR")) deliveryReport = DeliveryCodesMap.get (valueSplitted[1].trim()); } if(customerName!=null && deliveryReport!=null) output.collect(new Text(customerName), new Text (deliveryReport)); else if(customerName==null) output.collect(new Text("customerName"), new Text (deliveryReport)); else if(deliveryReport==null) output.collect(new Text(customerName), new Text ("deliveryReport")); }
  • 7. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Join in Apache Pig
  • 8. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Join in Apache Pig A = JOIN A BY keyA, B BY keyB;
  • 9. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Apache Pig overview ➢ Framework layer over HDFS and Hadoop ➢ Developed by Yahoo at 2006 ➢ Users: Yahoo, Linkedin, Twitter, IBM, etc. ➢ Last major release: 0.14.0 (November 2014) http://pig.apache.org/
  • 10. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Apache Hadoop vs. Apache Pig ➢ M/R as atomic operations ➢ Java is not data oriented ➢ M/R inner flexibility ➢ Efficiency ➢ ETL operations: Join, Filter, Group, etc. ➢ Pig Latin: Data scripting language ➢ UDF with Java (and others) ➢ Transform to M/R overhead
  • 11. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Pig Programming Model: Data ➢ Pig operations operate on relations ➢ A relation is a bag ➢ A bag is a collection of tuples ➢ A tuple is an ordered set of fields ➢ A field is any type of data
  • 12. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Sounds complicated… but it’s not! ➢ Basic data types: ○ Boolean: True, False ○ Int and Long: 1, 2, 3, 4, 5 ○ Float and Double: 2.3, 1.4, 4.5 ○ Chararray: ‘Hello’, ‘I am a string’ ○ DateTime: 2014-09-11T12:20:14.1234+00:00 ○ … more but you won’t probably use them very often
  • 13. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Sounds complicated… but it’s not! ➢ Tuple: A catch-all data type
  • 14. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Sounds complicated… but it’s not! ➢ Bag:
  • 15. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Sounds complicated… but it’s not! ➢ Bag: ➢ And relations? Just the most outer (distributed) bags
  • 16. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Loading data? ➢ Loading data? No, first let’s meet our friend Grunt ➢ Interactive pig shell → Nice for debugging/experimenting ➢ pig -x local or pig -x mapred
  • 17. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Loading data? ➢ Data source: Local or HDFS (usually!) ➢ LOAD instruction: ○ Data is automatically loaded in a distributed relation Students = LOAD ‘student_path’ USING PigStorage( ‘t’, ‘-noschema’ ) AS (student_id: Long, name: Chararray, surname: Chararray, gender: Chararray, age: Int); Relation Name Path to HD/HDFS Connector Field separator Tuple schema
  • 18. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Loading data? ➢ Data source: Local or HDFS (usually!) ➢ LOAD instruction: ○ Data is automatically loaded in a distributed relation Grades = LOAD ‘grade_path’ USING PigStorage( ‘,’, ‘-schema’ ); Relation Name Path to HD/HDFS Connector Field separator Load schema from .pig_schema
  • 19. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Checking relations’ content ➢ DUMP instruction: ○ Prints the content of a relation at standard output DUMP Students; (1,John,Doe,M,18) (2,Mary,Doe,F,20) (3,Lara,Croft,F,25) (4,Sherlock,Holmes,M,36) (5,John,Watson,M,38) (6,Sarah,Kerrigan,F,21) (7,Bruce,Wayne,M,32) (8,Tony,Stark,M,33) (9,Princess,Peach,F,21) (10,Peter,Parker,M,23) grunt> Relation Name
  • 20. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Checking relations’ content ➢ DESCRIBE instruction: ○ Prints the schema of the relation at standard output DESCRIBE Students; Students: {student_id: long,name: chararray,surname: chararray,gender: chararray,age: int} grunt> Relation Name
  • 21. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Checking relations’ content ➢ ILLUSTRATE instruction: ○ Prints the schema of the relation and a tuple example at standard output ILLUSTRATE Students; ---------------------------------------------------------------------------- --------------------------------------- | Students | student_id:long | name:chararray | surname:chararray | gender:chararray | age:int | ---------------------------------------------------------------------------- --------------------------------------- | | 9 | Princess | Peach | F | 21 | ---------------------------------------------------------------------------- --------------------------------------- grunt> Relation Name
  • 22. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ FOREACH instruction: ○ Generate new relations by projecting data of a relation StudentsProj= FOREACH Students GENERATE student_id, name, age; Relation Name Base relation Projected data
  • 23. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ FOREACH instruction: ○ Generate new relations by projecting data of a relation StudentsProj= FOREACH Students GENERATE student_id, CONCAT(name, surname) AS full_name, age; Relation Name Base relation Projected data We can generate new data too!!
  • 24. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ FOREACH instruction: ○ Let us execute the instruction and… it seems that nothing happens! ○ We had some tracing output with LOAD, DUMP, and ILLUSTRATE… ○ Any ideas on this issue?
  • 25. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ Pig employs lazy evaluation ➢ Computation only when: ○ LOAD, ILLUSTRATE, DUMP, STORE ➢ Pig keeps a DAG on MR jobs needed to compute relations (optimized!)
  • 26. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image ➢ Extend Student relation to add a field that determines if the students is under 25 years (1,John,Doe,M,18,true) (2,Mary,Doe,F,20,true) (3,Lara,Croft,F,25,false) (4,Sherlock,Holmes,M,36,false) (5,John,Watson,M,38,false) (6,Sarah,Kerrigan,F,21,true) ... Exercise: Who is under 25?
  • 27. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ FILTER instruction: ○ Generate a new relation by filtering data on a relation StudentsFilt= FILTER Students BY age > 24 AND age < 34; DUMP StudentsFilt; (3,Lara,Croft,F,25) (7,Bruce,Wayne,M,32) (8,Tony,Stark,M,33) Relation Name Base relation Condition to fulfill
  • 28. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ SPLIT instruction: ○ Splits a relation into multiple relations based on conditions SPLIT Students INTO StudentsMale IF gender == ‘M’, StudentsFemale OTHERWISE; DUMP StudentsMale; (1,John,Doe,M,18) (4,Sherlock,Holmes,M,36) (5,John,Watson,M,38) (7,Bruce,Wayne,M,32) (8,Tony,Stark,M,33) (10,Peter,Parker,M,23) Base relation New relation Condition to fulfill by new relation. Otherwise means the rest New relation Condition to fulfill by new relation
  • 29. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ SPLIT instruction: ○ Splits a relation into multiple relations based on conditions SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder30 IF age<30, OtherStudents OTHERWISE; DUMP OtherStudents; (4,Sherlock,Holmes,M,36) (5,John,Watson,M,38) (8,Tony,Stark,M,33)
  • 30. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ GROUP instruction: ○ Creates tuples with the key and a of bag tuples with the same key values StudentsGr = GROUP Students BY gender; DUMP StudentsGr; (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(3,Lara,Croft,F,25),(2, Mary,Doe,F,20)}) (M,{(10,Peter,Parker,M,23),(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John, Watson,M,38),(4,Sherlock,Holmes,M,36),(1,John,Doe,M,18)}) DESCRIBE StudentsGr; StudentsGr: {group: chararray,Students: {(student_id: long,name: chararray, surname: chararray,gender: chararray,age: int)}} Base relation New relation Use these fields’ values to make groups New schema!
  • 31. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ GROUP instruction: ○ We can use multiple relations. Creates one bag per relation StudentsGr = GROUP StudentsUnder25 BY gender, OtherStudents BY gender; DUMP StudentsGr; (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{}) (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(8,Tony,Stark,M,33),(5,John, Watson,M,38),(4,Sherlock,Holmes,M,36)}) DESCRIBE StudentsGr; StudentsCoGr: {group: chararray,StudentsUnder25: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)},OtherStudents: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}} Base relation New relation Use these fields’ values to make groups New schema! Base relation
  • 32. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ Nested FOREACH: ○ Operate on data in bags inside a relation and then project StudentsNested = FOREACH StudentsGr{ Information = FOREACH Students GENERATE name, surname; GENERATE group AS gender, Information AS student_information; } DUMP StudentsNested; (F,{(Princess,Peach),(Sarah,Kerrigan),(Lara,Croft),(Mary,Doe)}) (M,{(Peter,Parker),(Tony,Stark),(Bruce,Wayne),(John,Watson),(Sherlock, Holmes),(John,Doe)}) Base relation New relation Bag inside base relation Finally project New bag
  • 33. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image Operating on relations ➢ (inner) JOIN instruction: ○ Our classic database operator for relations! StudentsGrades= JOIN Students BY student_id, Grades BY student_id; DUMP StudentsGrades; (1,John,Doe,M,18,1,Physics,2.3) (1,John,Doe,M,18,1,Biology,4.5) (1,John,Doe,M,18,1,Engineering,7.7) (1,John,Doe,M,18,1,Math,5.6) (2,Mary,Doe,F,20,2,Engineering,6.7) (2,Mary,Doe,F,20,2,Physics,6.7) … DESCRIBE StudentsGrades; StudentsGrades: {Students::student_id: long,Students::name: chararray, Students::surname: chararray,Students::gender: chararray,Students::age: int, Grades::student_id: long,Grades::course: chararray,Grades::mark: double} Base relation 1 New relation Use these fields’ values to group New schema! Base relation
  • 34. Operating on relations
    ➢ (left) JOIN instruction:
      ○ Our classic database operator for relations!
    StudentsGrades = JOIN Students BY student_id LEFT, Grades BY student_id;
    DUMP StudentsGrades;
    (6,Sarah,Kerrigan,F,21,,,)
    (7,Bruce,Wayne,M,32,7,Engineering,8.5)
    (7,Bruce,Wayne,M,32,7,Physics,8.9)
    (7,Bruce,Wayne,M,32,7,Math,8.5)
    (8,Tony,Stark,M,33,8,Math,6.7)
    …
    DESCRIBE StudentsGrades;
    StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
    Slide callouts: Students is the left relation; Grades is the right relation; StudentsGrades is the new relation; do not forget the LEFT keyword; the result has a new schema.
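The difference between the inner and the left join can be sketched in plain Python. This is only a model of the semantics under simple assumptions (in-memory lists, tuples as rows, `None` standing in for Pig's null); `join` and its parameters are illustrative names, not Pig APIs.

```python
def join(left, right, left_key, right_key, right_width, how="inner"):
    """Hash-join two lists of tuples; how='left' keeps unmatched left rows."""
    # Index the right relation by its join key.
    index = {}
    for r in right:
        index.setdefault(right_key(r), []).append(r)

    out = []
    for l in left:
        matches = index.get(left_key(l), [])
        if matches:
            out.extend(l + r for r in matches)
        elif how == "left":
            # No match: pad the right-hand fields with nulls.
            out.append(l + (None,) * right_width)
    return out

students = [(6, "Sarah", "Kerrigan", "F", 21), (8, "Tony", "Stark", "M", 33)]
grades = [(8, "Math", 6.7)]

inner = join(students, grades, lambda s: s[0], lambda g: g[0], 3)
left = join(students, grades, lambda s: s[0], lambda g: g[0], 3, how="left")
print(inner)
print(left)
```

Sarah Kerrigan has no grades, so she disappears from the inner join but survives the left join with empty grade fields, matching the `(6,Sarah,Kerrigan,F,21,,,)` tuple on the slide.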
  • 35. Operating on relations
    ➢ CROSS instruction:
      ○ Cartesian product of two or more relations
    StudentsCr = CROSS Students, Grades;
    DUMP StudentsCr;
    (10,Peter,Parker,M,23,10,Physics,3.3)
    (10,Peter,Parker,M,23,9,Physics,5.0)
    (10,Peter,Parker,M,23,7,Physics,8.9)
    (10,Peter,Parker,M,23,5,Physics,4.5)
    (10,Peter,Parker,M,23,4,Physics,6.6)
    (10,Peter,Parker,M,23,3,Physics,5.7)
    (10,Peter,Parker,M,23,2,Physics,6.7)
    (10,Peter,Parker,M,23,1,Physics,2.3)
    …
    DESCRIBE StudentsCr;
    StudentsCr: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
    Slide callouts: Students is relation 1; Grades is relation 2; StudentsCr is the new relation; the result has a new schema.
  • 36. Operating on relations
    ➢ UNION instruction:
      ○ Merges multiple relations into a single relation
    StudentsUnion = UNION Students, Grades;
    DUMP StudentsUnion;
    (1,John,Doe,M,18)
    (1,Math,5.6)
    (2,Mary,Doe,F,20)
    (2,Math,8.9)
    (3,Lara,Croft,F,25)
    (3,Math,7.1)
    …
    DESCRIBE StudentsUnion;
    Schema for StudentsUnion unknown.
    Slide callouts: Students is relation 1; Grades is relation 2; StudentsUnion is the new relation. UNION does not preserve schemas here, because the two inputs have different schemas!
  • 37. Operating on relations
    ➢ DISTINCT instruction:
      ○ Only preserves unique tuples
    Courses = FOREACH Grades GENERATE course AS course;
    UniqueCourses = DISTINCT Courses;
    DUMP UniqueCourses;
    (Math)
    (Biology)
    (Physics)
    (Engineering)
    Slide callout: UniqueCourses is the new relation.
  • 38. Operating on relations
    ➢ ORDER BY instruction:
      ○ Sorts a relation by a specific criterion
    SortedGrades = ORDER Grades BY mark DESC;
    DUMP SortedGrades;
    (2,Biology,10.0)
    (10,Engineering,10.0)
    (10,Math,10.0)
    (5,Biology,10.0)
    (5,Engineering,9.0)
    (7,Physics,8.9)
    …
    Slide callouts: Grades is the base relation; SortedGrades is the new relation; mark is the field used to sort (several fields are allowed); the sort criterion is DESC (descending) or ASC (ascending).
  • 39. Operating on relations
    ➢ LIMIT instruction:
      ○ Truncates the relation's size
    BestGrades = LIMIT SortedGrades 3;
    DUMP BestGrades;
    (10,Math,10.0)
    (10,Engineering,10.0)
    (2,Biology,10.0)
    Slide callouts: SortedGrades is the base relation; BestGrades is the new relation; 3 is the maximum number of tuples.
  • 40. Operating on relations
    ➢ RANK instruction:
      ○ Adds the position of each tuple in the relation as a new first field
    RankedGrades = RANK SortedGrades;
    DUMP RankedGrades;
    (1,2,Biology,10.0)
    (2,10,Engineering,10.0)
    (3,10,Math,10.0)
    (4,5,Biology,10.0)
    (5,5,Engineering,9.0)
    …
    DESCRIBE RankedGrades;
    RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
    Slide callouts: SortedGrades is the base relation; RankedGrades is the new relation; the first field is the rank number!
  • 41. Operating on relations
    ➢ RANK instruction:
      ○ We can also sort and rank!
    RankedGrades = RANK SortedGrades BY student_id ASC, mark DESC;
    DUMP RankedGrades;
    (1,1,Engineering,7.7)
    (2,1,Math,5.6)
    (3,1,Biology,4.5)
    (4,1,Physics,2.3)
    (5,2,Biology,10.0)
    …
    DESCRIBE RankedGrades;
    RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
    Slide callouts: SortedGrades is the base relation; RankedGrades is the new relation; student_id and mark are the fields to sort by.
  • 42. Operating on relations
    ➢ SAMPLE instruction:
      ○ Sample the relation!
    SampledGrades = SAMPLE Grades 0.05;
    DUMP SampledGrades;
    (4,Engineering,8.0)
    Slide callouts: Grades is the base relation; SampledGrades is the new relation; 0.05 is the proportion to sample.
  • 43. Exercise: Top grades
    ➢ Get the 3 top grades for each student
    (1,{(Engineering,7.7),(Math,5.6),(Biology,4.5)})
    (2,{(Biology,10.0),(Math,8.9),(Engineering,6.7)})
    (3,{(Math,7.1),(Physics,5.7),(Engineering,4.3)})
    (4,{(Engineering,8.0),(Biology,6.7),(Physics,6.6)})
    (5,{(Biology,10.0),(Engineering,9.0),(Math,6.7)})
    (6,{(,)})
    ...
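Before writing the Pig script, it can help to pin down the expected result in plain Python. The sketch below only models the logic (group by student, sort each group by mark, keep the first three); it is not the Pig solution, and `top_grades` is an illustrative name. It also ignores students without grades, such as student 6 in the expected output.

```python
from collections import defaultdict

def top_grades(grades, n=3):
    """Return {student_id: [(course, mark), ...]} with the n highest marks per student."""
    by_student = defaultdict(list)
    for student_id, course, mark in grades:
        by_student[student_id].append((course, mark))
    return {
        student_id: sorted(marks, key=lambda cm: cm[1], reverse=True)[:n]
        for student_id, marks in by_student.items()
    }

# Grades of student 1, taken from the JOIN slide.
grades = [
    (1, "Physics", 2.3),
    (1, "Biology", 4.5),
    (1, "Engineering", 7.7),
    (1, "Math", 5.6),
]
top = top_grades(grades)
print(top[1])
```

For student 1 this yields Engineering, Math and Biology, which agrees with the first tuple of the expected output.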
  • 44. Operating on relations
    ➢ CUBE instruction:
      ○ Is this really useful? Yes! Many aggregates with just one operation
    CubedGrades = CUBE Grades BY CUBE(student_id,course);
    CubedGrades = FOREACH CubedGrades GENERATE group, AVG(cube.mark);
    DUMP CubedGrades;
    ((,Math),7.188888888888889)
    ((,Biology),7.8)
    ((,Physics),5.375)
    ((,Engineering),6.877777777777778)
    ((,),6.729032258064516)
    ((2,Math),8.9)
    ((2,Biology),10.0)
    ((2,),8.075)
    …
  • 45. Operating on relations
    ➢ CUBE/ROLLUP instruction:
      ○ Like the standard CUBE, but null values are introduced from right to left
    RolledGrades = CUBE Grades BY ROLLUP(course,student_id);
    RolledGrades = FOREACH RolledGrades GENERATE group, AVG(cube.mark);
    DUMP RolledGrades;
    ((Math,),7.188888888888889)
    ((Math,2),8.9)
    ((Math,3),7.1)
    ((Math,4),2.3)
    ((Math,5),6.7)
    ((Math,7),8.5)
    ((Math,8),6.7)
    ((Math,9),8.9)
    …
    Slide callout: the order of the fields in ROLLUP matters!
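The difference between CUBE and ROLLUP comes down to which group keys each input tuple contributes to. The following plain-Python sketch enumerates those keys for one tuple, with `None` standing in for Pig's null. The function names are illustrative and this is not Pig's implementation.

```python
from itertools import combinations

def cube_keys(row_key):
    """All 2^n keys obtained by replacing any subset of the fields with None."""
    n = len(row_key)
    keys = []
    for k in range(n, -1, -1):
        for kept in combinations(range(n), k):
            keys.append(tuple(v if i in kept else None for i, v in enumerate(row_key)))
    return keys

def rollup_keys(row_key):
    """The n+1 keys obtained by nulling fields one at a time, from right to left."""
    n = len(row_key)
    return [row_key[:k] + (None,) * (n - k) for k in range(n, -1, -1)]

print(cube_keys((2, "Math")))
print(rollup_keys(("Math", 2)))
```

With two fields, CUBE produces four keys per tuple and ROLLUP three; ROLLUP never produces a key such as `(None, 2)`, which is why the field order matters.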
  • 46. Operating on relations
    ➢ ASSERT instruction:
      ○ Assert that the whole relation fulfills a condition
      ○ Useful for debugging
    ASSERT Grades BY mark > 0.0, 'marks should be greater than 0';
    Slide callouts: Grades is the base relation; mark > 0.0 is the condition; the quoted string is the error message.
  • 47. Finally, storing data!
    ➢ STORE instruction:
      ○ Stores the relation into the local FS or HDFS (usually!)
      ○ Useful for debugging
    STORE BestGrades INTO 'best_grades_path' USING PigStorage('\t', '-noschema');
    Slide callouts: BestGrades is the relation; 'best_grades_path' is the path where the data is stored; PigStorage is the connector; '\t' is the field separator.
  • 48. Problems solved?!
  • 49. Only these operations?
    ➢ ASSERT
    ➢ GROUP
    ➢ CROSS
    ➢ CUBE
    ➢ DISTINCT
    ➢ FILTER
    ➢ FOREACH
    ➢ GROUP
    ➢ JOIN
    ➢ LIMIT
    ➢ LOAD
    ➢ ORDER, RANK
    ➢ SAMPLE
    ➢ SPLIT
    ➢ UNION
    ➢ DUMP, ILLUSTRATE, DESCRIBE
  • 50. Functions & user defined functions
    ➢ Transform data in data projections
    ➢ Built-in functions:
      ○ Math functions, string functions, datetime functions, casting functions, etc.
    ➢ User defined functions:
      ○ Our own functions written in Java, Python, Ruby, JavaScript, etc.
  • 51. Functions & user defined functions
    ➢ Bag functions:
      ○ AVG/MAX/MIN/SUM: compute the average/max/min/sum of a bag of numeric values
    GradesGr = GROUP Grades BY course;
    GradesAvg = FOREACH GradesGr GENERATE group AS course, AVG(Grades.mark) AS avg_mark;
    DUMP GradesAvg;
    (Math,7.188888888888889)
    (Biology,7.8)
    (Physics,5.375000000000001)
    (Engineering,6.877777777777777)
    Slide callout: Grades.mark employs only the mark field of the tuples in the bag.
  • 52. Functions & user defined functions
    ➢ Bag functions:
      ○ COUNT: number of elements (not null) in a bag
    GradesCount = FOREACH GradesGr GENERATE group AS course, COUNT(Grades) AS number_students;
    DUMP GradesCount;
    (Math,9)
    (Biology,5)
    (Physics,8)
    (Engineering,9)
  • 53. Functions & user defined functions
    ➢ Bag/Tuple functions:
      ○ FLATTEN: behavior depends on input. Flattening a bag produces one output tuple per element of the bag.
    DUMP GradesGr;
    (Math,{(8,Math,6.7),(1,Math,5.6),(10,Math,10.0),(9,Math,8.9),(2,Math,8.9),(3,Math,7.1),(4,Math,2.3),(5,Math,6.7),(7,Math,8.5)})
    (Biology,{(5,Biology,10.0),(4,Biology,6.7),(2,Biology,10.0),(1,Biology,4.5),(9,Biology,7.8)})
    ...
    GradesFlat = FOREACH GradesGr GENERATE group AS course, FLATTEN(Grades.mark) AS mark;
    DUMP GradesFlat;
    (Math,6.7)
    (Math,5.6)
    (Math,10.0)
    …
  • 54. Functions & user defined functions
    ➢ Bag/Tuple functions:
      ○ FLATTEN: behavior depends on input. Flattening a tuple promotes its fields to top-level fields.
    GradesTuple = FOREACH Grades GENERATE student_id, TOTUPLE(course, mark) AS tuple_mark;
    DUMP GradesTuple;
    (1,(Math,5.6))
    (2,(Math,8.9))
    (3,(Math,7.1))
    (4,(Math,2.3))
    ...
    GradesUntupled = FOREACH GradesTuple GENERATE student_id AS student_id, FLATTEN(tuple_mark);
    DUMP GradesUntupled;
    (1,Math,5.6)
    (2,Math,8.9)
    (3,Math,7.1)
    (4,Math,2.3)
    …
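The two FLATTEN behaviors shown on the last two slides can be modeled in a few lines of plain Python, using a list for a bag and a tuple for a tuple. The helper names are illustrative; this is a sketch of the semantics, not Pig code.

```python
def flatten_bag(row, i):
    """Flattening a bag at position i yields one output row per tuple in the bag."""
    return [row[:i] + t + row[i + 1:] for t in row[i]]

def flatten_tuple(row, i):
    """Flattening a tuple at position i splices its fields into the row."""
    return row[:i] + row[i] + row[i + 1:]

grouped_row = ("Math", [(6.7,), (5.6,), (10.0,)])   # (course, bag of marks)
tupled_row = (1, ("Math", 5.6))                     # (student_id, (course, mark))

print(flatten_bag(grouped_row, 1))
print(flatten_tuple(tupled_row, 1))
```

Flattening a bag multiplies rows, while flattening a tuple only widens a single row.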
  • 55. Functions & user defined functions
    ➢ Bag/Tuple functions:
      ○ SUBTRACT: tuples in the first bag that are not in the second
    SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
    StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
    DUMP StudentsCoGr;
    (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
    (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
    StudentsSub = FOREACH StudentsCoGr GENERATE group, SUBTRACT(StudentsUnder25, StudentsUnder20);
    DUMP StudentsSub;
    (F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
    (M,{(10,Peter,Parker,M,23)})
  • 56. Functions & user defined functions
    ➢ Bag/Tuple functions:
      ○ DIFF: non-overlapping tuples of two bags
    DUMP StudentsCoGr;
    (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
    (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
    StudentsDiff = FOREACH StudentsCoGr GENERATE group, DIFF(StudentsUnder25, StudentsUnder20);
    DUMP StudentsDiff;
    (F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
    (M,{(10,Peter,Parker,M,23)})
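On the slides' data SUBTRACT and DIFF return the same bags, because StudentsUnder20 is a subset of StudentsUnder25. A small plain-Python sketch with one made-up tuple shows where they differ: SUBTRACT is one-sided, DIFF is symmetric. The functions are illustrative models, and the `(11, "Hypothetical")` row does not appear in the course data.

```python
def subtract(bag1, bag2):
    """Tuples of bag1 that do not appear in bag2."""
    return [t for t in bag1 if t not in bag2]

def diff(bag1, bag2):
    """Tuples that appear in exactly one of the two bags."""
    return subtract(bag1, bag2) + subtract(bag2, bag1)

under25 = [(10, "Peter"), (1, "John")]
# (11, "Hypothetical") is an invented row that is only in the second bag.
under20 = [(1, "John"), (11, "Hypothetical")]

print(subtract(under25, under20))
print(diff(under25, under20))
```

SUBTRACT drops the row that only exists in the second bag, whereas DIFF keeps it.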
  • 57. Functions & user defined functions
    ➢ Math functions:
      ○ Common math functions for numeric values:
        ■ ABS
        ■ EXP
        ■ FLOOR
        ■ LOG
        ■ RANDOM
        ■ ROUND
        ■ SQRT
        ■ ...
  • 58. Functions & user defined functions
    ➢ String functions:
      ○ Transform chararrays:
        ■ ENDSWITH
        ■ LOWER
        ■ UPPER
        ■ SUBSTRING
        ■ TRIM
        ■ REPLACE
        ■ ...
  • 59. Functions & user defined functions
    ➢ Datetime functions:
      ○ Get information on dates and timestamps:
        ■ AddDuration
        ■ CurrentTime
        ■ ToDate
        ■ ToString
        ■ ToUnixTime
        ■ DaysBetween
        ■ ...
  • 60. User defined functions
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    // Returns the input bag with its tuples in random order.
    public class SHUFFLE extends EvalFunc<DataBag> {
        @Override
        public DataBag exec( Tuple input ) throws IOException {
            if( input == null )
                throw new IOException( "Invalid input: null" );
            if( input.size() != 1 )
                throw new IOException( "Expected one argument" );
            if( input.get( 0 ) == null )
                return null;
            DataBag bag = (DataBag) input.get( 0 );
            // Copy the tuples into a list so that they can be shuffled.
            List<Tuple> l = new ArrayList<Tuple>();
            for( Tuple t : bag )
                l.add( t );
            Collections.shuffle( l );
            return BagFactory.getInstance().newDefaultBag( l );
        }

        // The output has the same schema as the input bag.
        @Override
        public Schema outputSchema( Schema input ) {
            try {
                return new Schema( input.getField( 0 ) );
            } catch( Exception e ) {
                return null;
            }
        }
    }
  • 61. More functions: Datafu Pig
    ➢ Library of useful UDFs, released in 2010
    ➢ Created by the LinkedIn engineering team:
      ○ Stats: variance, quantiles, median, etc.
      ○ Bags: concat, append, prepend, etc.
      ○ Sampling
      ○ Page rank
      ○ Session estimation
    ➢ Last major release: 1.2.0 (Dec, 2013)
    http://datafu.incubator.apache.org/
  • 62. How to use UDF libraries
    REGISTER lib/datafu-1.2.0.jar;
    DEFINE BagConcat datafu.pig.bags.BagConcat();
    DUMP StudentsCoGr;
    (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
    (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
    StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25,StudentsUnder20);
    DUMP StudentBagConcat;
    (F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)})
    (M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18),(1,John,Doe,M,18)})
    Slide callout: DEFINE indicates the UDF to be included and the name used to call it.
  • 63. Scripting
    Libraries and UDFs:
    REGISTER lib/datafu-1.2.0.jar;
    DEFINE BagConcat datafu.pig.bags.BagConcat();
    Load data ($student_file is a parameter):
    Students = LOAD '$student_file' USING PigStorage('\t', '-noschema') AS (student_id: long, name: chararray, surname: chararray, gender: chararray, age: int);
    Transform data:
    SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
    StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
    StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25,StudentsUnder20);
    Store data ($output is a parameter):
    STORE StudentBagConcat INTO '$output' USING PigStorage('\t', '-schema');
  • 64. Calling a script
    pig -x mapred -f myscript.pig -param student_file=students.csv -param output=myoutput_path
    Slide callouts: -x sets the execution mode; -f gives the script file; each -param is a parameter definition.
  • 65. More on load/store
    ➢ Not limited to plain text
    ➢ Multiple supported formats: JSON, Avro, Accumulo, etc.
    ➢ Connectors to data sources: MongoDB, Cassandra, HBase, etc.
  • 66. Exercise: Product association
    ➢ Detect pairs of products bought together (e.g., chairs and tables)
    ➢ Goal: recommend related products
    ➢ Association score:
  • 67. Product association
    ➢ Purchases: purchases.tsv
        product_id  user_id  price  date
        1           23       14.5   2014-03-03
        4           15       11.2   2014-08-09
        88          3        48.3   2011-01-01
        ...
    ➢ Products: products.tsv
        product_id  status
        1           ok
        5           ko
        99          ok
        ...
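As a starting point for the exercise, the sketch below counts how many users bought each pair of products, in plain Python. It is only one possible first step: the association score from the slides is not reproduced here, the `status` column of products.tsv is ignored, and the purchase rows are made up for illustration (they follow the purchases.tsv layout but are not the course data).

```python
from collections import defaultdict
from itertools import combinations

def co_purchase_counts(purchases):
    """Count, for each unordered product pair, the users who bought both products."""
    products_by_user = defaultdict(set)
    for product_id, user_id, _price, _date in purchases:
        products_by_user[user_id].add(product_id)

    counts = defaultdict(int)
    for products in products_by_user.values():
        # Sort so that each pair is always stored as (smaller_id, larger_id).
        for pair in combinations(sorted(products), 2):
            counts[pair] += 1
    return dict(counts)

# Hypothetical rows: (product_id, user_id, price, date)
purchases = [
    (1, 23, 14.5, "2014-03-03"),
    (4, 23, 11.2, "2014-08-09"),
    (1, 15, 14.5, "2014-05-01"),
    (4, 15, 11.2, "2014-05-02"),
    (88, 3, 48.3, "2011-01-01"),
]
pair_counts = co_purchase_counts(purchases)
print(pair_counts)
```

In Pig terms this corresponds roughly to grouping purchases by user and pairing products within each group, followed by a GROUP and COUNT over the pairs.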
  • 68. Time to work!
  • 69. Final notes: Pros & cons
    Pros:
    ➢ Clear and simple syntax
    ➢ Interactive client
    ➢ Transparent M/R jobs
    ➢ Integration with Java and others
    Cons:
    ➢ Not as flexible as Hadoop
    ➢ Oriented towards ETL, not AI
    ➢ No loops
  • 70. Extra information
    ➢ http://pig.apache.org/
    ➢ Programming Pig. Alan Gates. Ed. O'Reilly
    ➢ StackOverflow