This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to the use of Apache Pig as an ETL tool over Hadoop.
1. Apache Pig
Making data transformation easy
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc. in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
2. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce Problem Solving
Complex problem
4. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Map Reduce Problem Solving
➢ Need to solve a complex problem
➢ It requires atomic operations more complex than map and reduce
➢ Java is not a data-oriented language → low productivity
➢ Any solutions?
5. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Apache Pig to the rescue!
6. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Join in Apache Hadoop
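import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Map side of a reduce-side join: tags each delivery record with "DR~"
public class DeliveryFileMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private String cellNumber, deliveryCode, fileTag = "DR~";

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Input line format: cellNumber,deliveryCode
        String line = value.toString();
        String splitarray[] = line.split(",");
        cellNumber = splitarray[0].trim();
        deliveryCode = splitarray[1].trim();
        output.collect(new Text(cellNumber), new Text(fileTag + deliveryCode));
    }
}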
** Extracted from http://kickstarthadoop.blogspot.com.es/2011/09/joins-with-plain-map-reduce.html
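import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Reduce side of the join: values arrive tagged "CD~" (customer) or "DR~" (delivery)
public class SmsReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    private String customerName, deliveryReport;
    private static Map<String, String> DeliveryCodesMap = new HashMap<String, String>();

    public void configure(JobConf job) {
        // Helper not shown on the slide; see the original post
        loadDeliveryStatusCodes();
    }

    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        while (values.hasNext()) {
            String currValue = values.next().toString();
            String valueSplitted[] = currValue.split("~");
            if (valueSplitted[0].equals("CD"))
                customerName = valueSplitted[1].trim();
            else if (valueSplitted[0].equals("DR"))
                deliveryReport = DeliveryCodesMap.get(valueSplitted[1].trim());
        }
        if (customerName != null && deliveryReport != null)
            output.collect(new Text(customerName), new Text(deliveryReport));
        else if (customerName == null)
            output.collect(new Text("customerName"), new Text(deliveryReport));
        else if (deliveryReport == null)
            output.collect(new Text(customerName), new Text("deliveryReport"));
    }
}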
8. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Join in Apache Pig
C = JOIN A BY keyA, B BY keyB;
9. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Apache Pig overview
➢ Framework layer over HDFS and Hadoop
➢ Developed at Yahoo in 2006
➢ Users: Yahoo, LinkedIn, Twitter, IBM, etc.
➢ Last major release: 0.14.0 (November 2014)
http://pig.apache.org/
10. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Apache Hadoop vs. Apache Pig
Apache Hadoop:
➢ M/R as atomic operations
➢ Java is not data oriented
➢ M/R inner flexibility
➢ Efficiency
Apache Pig:
➢ ETL operations: Join, Filter, Group, etc.
➢ Pig Latin: Data scripting language
➢ UDF with Java (and others)
➢ Overhead of translating to M/R
11. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Pig Programming Model: Data
➢ Pig operations operate on relations
➢ A relation is a bag
➢ A bag is a collection of tuples
➢ A tuple is an ordered set of fields
➢ A field is any type of data
12. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Sounds complicated… but it’s not!
➢ Basic data types:
○ Boolean: true, false
○ Int and Long: 1, 2, 3, 4, 5
○ Float and Double: 2.3, 1.4, 4.5
○ Chararray: 'Hello', 'I am a string'
○ DateTime: 2014-09-11T12:20:14.1234+00:00
○ … and more, but you probably won’t use them very often
13. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Sounds complicated… but it’s not!
➢ Tuple: A catch-all data type
15. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Sounds complicated… but it’s not!
➢ Bag:
➢ And relations? Just the outermost (distributed) bags
16. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Loading data?
➢ Loading data? No, first let’s meet our friend Grunt
➢ Interactive Pig shell → Nice for debugging/experimenting
➢ pig -x local or pig -x mapred
17. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Loading data?
➢ Data source: Local or HDFS (usually!)
➢ LOAD instruction:
○ Data is automatically loaded into a distributed relation
Students = LOAD 'student_path' USING PigStorage('\t', '-noschema') AS (student_id: Long, name: Chararray, surname: Chararray, gender: Chararray, age: Int);
Statement parts: Students = relation name; 'student_path' = path on local disk/HDFS; PigStorage = connector; '\t' = field separator; AS (…) = tuple schema
18. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Loading data?
➢ Data source: Local or HDFS (usually!)
➢ LOAD instruction:
○ Data is automatically loaded into a distributed relation
Grades = LOAD 'grade_path' USING PigStorage(',', '-schema');
Statement parts: Grades = relation name; 'grade_path' = path on local disk/HDFS; PigStorage = connector; ',' = field separator; '-schema' = load the schema from .pig_schema
19. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Checking relations’ content
➢ DUMP instruction:
○ Prints the content of a relation to standard output
DUMP Students;
(1,John,Doe,M,18)
(2,Mary,Doe,F,20)
(3,Lara,Croft,F,25)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(6,Sarah,Kerrigan,F,21)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(9,Princess,Peach,F,21)
(10,Peter,Parker,M,23)
Statement parts: Students = relation name
20. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Checking relations’ content
➢ DESCRIBE instruction:
○ Prints the schema of the relation to standard output
DESCRIBE Students;
Students: {student_id: long,name: chararray,surname: chararray,gender: chararray,age: int}
Statement parts: Students = relation name
21. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Checking relations’ content
➢ ILLUSTRATE instruction:
○ Prints the schema of the relation and an example tuple to standard output
ILLUSTRATE Students;
------------------------------------------------------------------------------------------------
| Students | student_id:long | name:chararray | surname:chararray | gender:chararray | age:int |
------------------------------------------------------------------------------------------------
|          | 9               | Princess       | Peach             | F                | 21      |
------------------------------------------------------------------------------------------------
Statement parts: Students = relation name
22. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ FOREACH instruction:
○ Generates a new relation by projecting data of a relation
StudentsProj = FOREACH Students GENERATE student_id, name, age;
Statement parts: StudentsProj = new relation name; Students = base relation; student_id, name, age = projected data
23. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ FOREACH instruction:
○ Generates a new relation by projecting data of a relation
StudentsProj = FOREACH Students GENERATE student_id, CONCAT(name, surname) AS full_name, age;
Statement parts: StudentsProj = new relation name; Students = base relation; student_id, CONCAT(name, surname) AS full_name, age = projected data (we can generate new data too!)
24. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ FOREACH instruction:
○ Let us execute the instruction and… it seems that nothing happens!
○ We had some tracing output with LOAD, DUMP, and ILLUSTRATE…
○ Any ideas on this issue?
25. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ Pig employs lazy evaluation
➢ Computation is triggered only by:
○ DUMP, STORE, ILLUSTRATE
➢ LOAD and the other operators are only validated and added to the plan
➢ Pig keeps a DAG of the MR jobs needed to compute relations (optimized!)
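The idea of lazy evaluation can be sketched with a toy Python model. This is not Pig code: the `Relation` class and its `foreach`, `filter`, and `dump` methods are hypothetical names used only for illustration. Operators record a plan, and the data source is read only when `dump` is called.

```python
# Toy model of lazy evaluation: operators only record a plan;
# nothing is read or computed until dump() is called.
class Relation:
    def __init__(self, source, steps=()):
        self.source = source   # callable that returns the input tuples
        self.steps = steps     # recorded transformations (the "plan")

    def foreach(self, fn):
        # Record a projection; do not run it yet
        return Relation(self.source, self.steps + (("foreach", fn),))

    def filter(self, pred):
        # Record a filter; do not run it yet
        return Relation(self.source, self.steps + (("filter", pred),))

    def dump(self):
        # Only here is the source read and the plan executed
        rows = self.source()
        for kind, fn in self.steps:
            if kind == "foreach":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


loads = []  # records every time the data source is actually read


def load_students():
    loads.append("read")
    return [(1, "John", 18), (3, "Lara", 25), (7, "Bruce", 32)]


students = Relation(load_students)
proj = students.foreach(lambda t: (t[0], t[2])).filter(lambda t: t[1] > 24)
reads_before_dump = len(loads)   # still 0: nothing has been executed
result = proj.dump()             # triggers the read and both steps
```

Pig goes further than this sketch, since it can optimize the recorded plan before turning it into MR jobs.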
26. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Exercise: Who is under 25?
➢ Extend the Students relation with a field that indicates whether each student is under 25 years old
(1,John,Doe,M,18,true)
(2,Mary,Doe,F,20,true)
(3,Lara,Croft,F,25,false)
(4,Sherlock,Holmes,M,36,false)
(5,John,Watson,M,38,false)
(6,Sarah,Kerrigan,F,21,true)
...
27. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ FILTER instruction:
○ Generates a new relation by filtering the data of a relation
StudentsFilt = FILTER Students BY age > 24 AND age < 34;
DUMP StudentsFilt;
(3,Lara,Croft,F,25)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
Statement parts: StudentsFilt = new relation name; Students = base relation; age > 24 AND age < 34 = condition to fulfill
28. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ SPLIT instruction:
○ Splits a relation into multiple relations based on conditions
SPLIT Students INTO StudentsMale IF gender == 'M', StudentsFemale OTHERWISE;
DUMP StudentsMale;
(1,John,Doe,M,18)
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
(10,Peter,Parker,M,23)
Statement parts: Students = base relation; StudentsMale, StudentsFemale = new relations; IF gender == 'M' = condition to fulfill by the new relation; OTHERWISE = the rest
29. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ SPLIT instruction:
○ Splits a relation into multiple relations based on conditions
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder30 IF age<30, OtherStudents OTHERWISE;
DUMP OtherStudents;
(4,Sherlock,Holmes,M,36)
(5,John,Watson,M,38)
(7,Bruce,Wayne,M,32)
(8,Tony,Stark,M,33)
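A point worth checking by hand is that SPLIT conditions may overlap: a tuple goes to every relation whose condition it fulfills, and OTHERWISE collects the tuples that match none. The following Python sketch models that behavior with the ten students printed by DUMP Students; the `split` helper is a hypothetical illustration, not part of Pig.

```python
# Students as printed by DUMP Students:
# (student_id, name, surname, gender, age)
students = [
    (1, "John", "Doe", "M", 18),
    (2, "Mary", "Doe", "F", 20),
    (3, "Lara", "Croft", "F", 25),
    (4, "Sherlock", "Holmes", "M", 36),
    (5, "John", "Watson", "M", 38),
    (6, "Sarah", "Kerrigan", "F", 21),
    (7, "Bruce", "Wayne", "M", 32),
    (8, "Tony", "Stark", "M", 33),
    (9, "Princess", "Peach", "F", 21),
    (10, "Peter", "Parker", "M", 23),
]


def split(relation, branches, otherwise):
    """Send each tuple to every branch it matches; unmatched tuples go to `otherwise`."""
    out = {name: [] for name, _ in branches}
    out[otherwise] = []
    for t in relation:
        matched = False
        for name, pred in branches:
            if pred(t):
                out[name].append(t)
                matched = True
        if not matched:
            out[otherwise].append(t)
    return out


result = split(
    students,
    [("StudentsUnder25", lambda t: t[4] < 25),
     ("StudentsUnder30", lambda t: t[4] < 30)],
    "OtherStudents",
)
under25_ids = [t[0] for t in result["StudentsUnder25"]]
under30_ids = [t[0] for t in result["StudentsUnder30"]]
other_ids = [t[0] for t in result["OtherStudents"]]
```

In this model every student under 25 also appears in StudentsUnder30, so the output relations are not a partition of the input.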
30. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ GROUP instruction:
○ Creates tuples with the key and a bag of the tuples that share that key value
StudentsGr = GROUP Students BY gender;
DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(3,Lara,Croft,F,25),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36),(1,John,Doe,M,18)})
DESCRIBE StudentsGr;
StudentsGr: {group: chararray,Students: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
Statement parts: StudentsGr = new relation; Students = base relation; gender = field(s) whose values are used to make groups; the result has a new schema!
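The shape of a grouped relation, one (group, bag) tuple per key, can be modeled in a few lines of Python. This is only a sketch of the semantics on a reduced, made-up version of the student data; `group_by` is a hypothetical helper, and the sort is there only to make the result deterministic (Pig does not guarantee an order).

```python
from collections import defaultdict

# Reduced student tuples for illustration: (student_id, name, gender)
students = [
    (1, "John", "M"),
    (2, "Mary", "F"),
    (3, "Lara", "F"),
    (4, "Sherlock", "M"),
]


def group_by(relation, key):
    """Return one (group, bag) pair per distinct key value."""
    bags = defaultdict(list)
    for t in relation:
        bags[key(t)].append(t)   # the bag keeps the whole original tuple
    return sorted(bags.items())  # sorted only for a deterministic result


grouped = group_by(students, key=lambda t: t[2])
```

Note that each bag still contains the grouping field, just as the gender column remains inside the tuples of the Students bag in the DUMP output.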
31. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ GROUP instruction:
○ We can group multiple relations at once (a cogroup). Creates one bag per relation
StudentsGr = GROUP StudentsUnder25 BY gender, OtherStudents BY gender;
DUMP StudentsGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(8,Tony,Stark,M,33),(7,Bruce,Wayne,M,32),(5,John,Watson,M,38),(4,Sherlock,Holmes,M,36)})
DESCRIBE StudentsGr;
StudentsGr: {group: chararray,StudentsUnder25: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)},OtherStudents: {(student_id: long,name: chararray,surname: chararray,gender: chararray,age: int)}}
Statement parts: StudentsGr = new relation; StudentsUnder25, OtherStudents = base relations; gender = field(s) whose values are used to make groups; the result has a new schema!
32. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ Nested FOREACH:
○ Operate on the bags inside a relation and then project
StudentsNested = FOREACH StudentsGr {
    Information = FOREACH Students GENERATE name, surname;
    GENERATE group AS gender, Information AS student_information;
}
DUMP StudentsNested;
(F,{(Princess,Peach),(Sarah,Kerrigan),(Lara,Croft),(Mary,Doe)})
(M,{(Peter,Parker),(Tony,Stark),(Bruce,Wayne),(John,Watson),(Sherlock,Holmes),(John,Doe)})
Statement parts: StudentsNested = new relation; StudentsGr = base relation; Students = bag inside the base relation; Information = new bag; the final GENERATE projects the result
33. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ (inner) JOIN instruction:
○ Our classic database operator for relations!
StudentsGrades = JOIN Students BY student_id, Grades BY student_id;
DUMP StudentsGrades;
(1,John,Doe,M,18,1,Physics,2.3)
(1,John,Doe,M,18,1,Biology,4.5)
(1,John,Doe,M,18,1,Engineering,7.7)
(1,John,Doe,M,18,1,Math,5.6)
(2,Mary,Doe,F,20,2,Engineering,6.7)
(2,Mary,Doe,F,20,2,Physics,6.7)
…
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
Statement parts: StudentsGrades = new relation; Students = base relation 1; Grades = base relation 2; student_id = fields whose values are used to match tuples; the result has a new schema!
34. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ (left) JOIN instruction:
○ Our classic database operator for relations!
StudentsGrades = JOIN Students BY student_id LEFT, Grades BY student_id;
DUMP StudentsGrades;
(6,Sarah,Kerrigan,F,21,,,)
(7,Bruce,Wayne,M,32,7,Engineering,8.5)
(7,Bruce,Wayne,M,32,7,Physics,8.9)
(7,Bruce,Wayne,M,32,7,Math,8.5)
(8,Tony,Stark,M,33,8,Math,6.7)
…
DESCRIBE StudentsGrades;
StudentsGrades: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
Statement parts: StudentsGrades = new relation; Students = left relation; Grades = right relation; LEFT = do not forget this one!; the result has a new schema!
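The difference between the inner and the left join can be sketched in Python. This models the semantics only, on a reduced and partly made-up subset of the data; the `join` helper and its parameters are hypothetical, and nulls are represented by `None`.

```python
# Reduced data for illustration
students = [(6, "Sarah"), (7, "Bruce"), (8, "Tony")]            # (student_id, name)
grades = [(7, "Engineering", 8.5), (7, "Physics", 8.9),
          (8, "Math", 6.7)]                                     # (student_id, course, mark)


def join(left, right, lkey, rkey, left_outer=False):
    """Concatenate matching tuples; optionally keep unmatched left tuples."""
    width = len(right[0]) if right else 0
    out = []
    for l in left:
        matches = [r for r in right if r[rkey] == l[lkey]]
        if matches:
            out.extend(l + r for r in matches)
        elif left_outer:
            # No match on the right: pad the right-hand fields with nulls
            out.append(l + (None,) * width)
    return out


inner = join(students, grades, 0, 0)
left = join(students, grades, 0, 0, left_outer=True)
```

Student 6 has no grades, so it disappears from the inner join and shows up in the left join with empty right-hand fields, as in the (6,Sarah,Kerrigan,F,21,,,) tuple on the slide.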
35. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ CROSS instruction:
○ Cartesian product of two or more relations
StudentsCr = CROSS Students, Grades;
DUMP StudentsCr;
(10,Peter,Parker,M,23,10,Physics,3.3)
(10,Peter,Parker,M,23,9,Physics,5.0)
(10,Peter,Parker,M,23,7,Physics,8.9)
(10,Peter,Parker,M,23,5,Physics,4.5)
(10,Peter,Parker,M,23,4,Physics,6.6)
(10,Peter,Parker,M,23,3,Physics,5.7)
(10,Peter,Parker,M,23,2,Physics,6.7)
(10,Peter,Parker,M,23,1,Physics,2.3)
…
DESCRIBE StudentsCr;
StudentsCr: {Students::student_id: long,Students::name: chararray,Students::surname: chararray,Students::gender: chararray,Students::age: int,Grades::student_id: long,Grades::course: chararray,Grades::mark: double}
Statement parts: StudentsCr = new relation; Students = relation 1; Grades = relation 2; the result has a new schema!
36. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ UNION instruction:
○ Merges multiple relations into a single relation
StudentsUnion = UNION Students, Grades;
DUMP StudentsUnion;
(1,John,Doe,M,18)
(1,Math,5.6)
(2,Mary,Doe,F,20)
(2,Math,8.9)
(3,Lara,Croft,F,25)
(3,Math,7.1)
…
DESCRIBE StudentsUnion;
Schema for StudentsUnion unknown.
Statement parts: StudentsUnion = new relation; Students = relation 1; Grades = relation 2
UNION does not preserve schemas when the input schemas differ!
37. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ DISTINCT instruction:
○ Keeps only unique tuples
Courses = FOREACH Grades GENERATE course AS course;
UniqueCourses = DISTINCT Courses;
DUMP UniqueCourses;
(Math)
(Biology)
(Physics)
(Engineering)
Statement parts: UniqueCourses = new relation
38. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ ORDER BY instruction:
○ Sorts a relation by one or more fields
SortedGrades = ORDER Grades BY mark DESC;
DUMP SortedGrades;
(2,Biology,10.0)
(10,Engineering,10.0)
(10,Math,10.0)
(5,Biology,10.0)
(5,Engineering,9.0)
(7,Physics,8.9)
…
Statement parts: SortedGrades = new relation; Grades = base relation; mark = field(s) used to sort; DESC = sort order, either DESC (descending) or ASC (ascending)
39. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ LIMIT instruction:
○ Truncates a relation to a maximum number of tuples
BestGrades = LIMIT SortedGrades 3;
DUMP BestGrades;
(10,Math,10.0)
(10,Engineering,10.0)
(2,Biology,10.0)
Statement parts: BestGrades = new relation; SortedGrades = base relation; 3 = maximum number of tuples
40. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ RANK instruction:
○ Prepends the position of each tuple in the relation
RankedGrades = RANK SortedGrades;
DUMP RankedGrades;
(1,2,Biology,10.0)
(2,10,Engineering,10.0)
(3,10,Math,10.0)
(4,5,Biology,10.0)
(5,5,Engineering,9.0)
…
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
Statement parts: RankedGrades = new relation; SortedGrades = base relation; rank_SortedGrades = rank number!
41. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ RANK instruction:
○ We can also sort and rank!
RankedGrades = RANK SortedGrades BY student_id ASC, mark DESC;
DUMP RankedGrades;
(1,1,Engineering,7.7)
(2,1,Math,5.6)
(3,1,Biology,4.5)
(4,1,Physics,2.3)
(5,2,Biology,10.0)
…
DESCRIBE RankedGrades;
RankedGrades: {rank_SortedGrades: long,student_id: long,course: chararray,mark: double}
Statement parts: RankedGrades = new relation; SortedGrades = base relation; student_id ASC, mark DESC = fields to sort
42. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
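Operating on relations
➢ SAMPLE instruction:
○ Randomly samples the relation: each tuple is kept with the given probability, so the output size is approximate
SampledGrades = SAMPLE Grades 0.05;
DUMP SampledGrades;
(4,Engineering,8.0)
Statement parts: SampledGrades = new relation; Grades = base relation; 0.05 = proportion to sample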
43. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Exercise: Top grades
➢ Get the 3 top grades for each student
(1,{(Engineering,7.7),(Math,5.6),(Biology,4.5)})
(2,{(Biology,10.0),(Math,8.9),(Engineering,6.7)})
(3,{(Math,7.1),(Physics,5.7),(Engineering,4.3)})
(4,{(Engineering,8.0),(Biology,6.7),(Physics,6.6)})
(5,{(Biology,10.0),(Engineering,9.0),(Math,6.7)})
(6,{(,)})
...
44. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ CUBE instruction:
○ Is this really useful? Yes! Many aggregates with just one operation
CubedGrades = CUBE Grades BY CUBE(student_id, course);
CubedGrades = FOREACH CubedGrades GENERATE group, AVG(cube.mark);
DUMP CubedGrades;
((,Math),7.188888888888889)
((,Biology),7.8)
((,Physics),5.375)
((,Engineering),6.877777777777778)
((,),6.729032258064516)
((2,Math),8.9)
((2,Biology),10.0)
((2,),8.075)
…
45. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ CUBE/ROLLUP instruction:
○ Like standard CUBE, but null values are introduced from right to left
RolledGrades = CUBE Grades BY ROLLUP(course, student_id);
RolledGrades = FOREACH RolledGrades GENERATE group, AVG(cube.mark);
DUMP RolledGrades;
((Math,),7.188888888888889)
((Math,2),8.9)
((Math,3),7.1)
((Math,4),2.3)
((Math,5),6.7)
((Math,7),8.5)
((Math,8),6.7)
((Math,9),8.9)
…
Statement parts: ROLLUP(course, student_id) = order matters!
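To see why the order of dimensions matters for ROLLUP but not for CUBE, it can help to list the group keys that a single input tuple contributes to. The Python sketch below models this; `cube_keys` and `rollup_keys` are hypothetical helpers, and `None` stands for the null values shown as empty fields in the DUMP output.

```python
from itertools import combinations


def cube_keys(values):
    """All 2^n combinations: each dimension is either kept or replaced by None."""
    n = len(values)
    keys = []
    for k in range(n, -1, -1):
        for kept in combinations(range(n), k):
            keys.append(tuple(v if i in kept else None
                              for i, v in enumerate(values)))
    return keys


def rollup_keys(values):
    """Only n + 1 combinations: None is introduced from right to left."""
    n = len(values)
    return [tuple(values[:k]) + (None,) * (n - k) for k in range(n, -1, -1)]


# Group keys fed by the grade tuple (2, Math, 8.9)
cube = cube_keys((2, "Math"))          # CUBE(student_id, course)
rollup = rollup_keys(("Math", 2))      # ROLLUP(course, student_id)
```

With two dimensions, CUBE yields four keys per tuple while ROLLUP yields three; in this model ROLLUP(course, student_id) never produces a per-student total such as (,2).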
46. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Operating on relations
➢ ASSERT instruction:
○ Asserts that every tuple of the relation fulfills a condition
○ Useful for debugging
ASSERT Grades BY mark > 0.0, 'marks should be greater than 0';
Statement parts: Grades = base relation; mark > 0.0 = condition; 'marks should be greater than 0' = error message
47. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Finally, storing data!
➢ STORE instruction:
○ Stores the relation into the local FS or HDFS (usually!)
○ Useful for debugging
STORE BestGrades INTO 'best_grades_path' USING PigStorage('\t', '-noschema');
Statement parts: BestGrades = relation; 'best_grades_path' = path to store data; PigStorage = connector; '\t' = field separator
48. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Problems solved?!
49. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Only these operations?
➢ ASSERT
➢ COGROUP
➢ CROSS
➢ CUBE
➢ DISTINCT
➢ FILTER
➢ FOREACH
➢ GROUP
➢ JOIN
➢ LIMIT
➢ LOAD
➢ ORDER, RANK
➢ SAMPLE
➢ SPLIT
➢ UNION
➢ DUMP, ILLUSTRATE, DESCRIBE
50. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ Functions transform data inside projections
➢ Built-in functions:
○ math functions, string functions, datetime functions, casting functions, etc.
➢ User defined functions:
○ Our own functions written in Java, Python, Ruby, JavaScript, etc.
51. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ Bag functions:
○ AVG/MAX/MIN/SUM: compute the average/max/min/sum of a bag of numeric values
GradesGr = GROUP Grades BY course;
GradesAvg = FOREACH GradesGr GENERATE group AS course, AVG(Grades.mark) AS avg_mark;
DUMP GradesAvg;
(Math,7.188888888888889)
(Biology,7.8)
(Physics,5.375000000000001)
(Engineering,6.877777777777777)
Statement parts: Grades.mark = employ only this field of the tuples in the bag
52. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ Bag functions:
○ COUNT: number of non-null elements in a bag
GradesCount = FOREACH GradesGr GENERATE group AS course, COUNT(Grades) AS number_students;
DUMP GradesCount;
(Math,9)
(Biology,5)
(Physics,8)
(Engineering,9)
53. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on input
DUMP GradesGr;
(Math,{(8,Math,6.7),(1,Math,5.6),(10,Math,10.0),(9,Math,8.9),(2,Math,8.9),(3,Math,7.1),(4,Math,2.3),(5,Math,6.7),(7,Math,8.5)})
(Biology,{(5,Biology,10.0),(4,Biology,6.7),(2,Biology,10.0),(1,Biology,4.5),(9,Biology,7.8)})
...
GradesFlat = FOREACH GradesGr GENERATE group AS course, FLATTEN(Grades.mark) AS mark;
DUMP GradesFlat;
(Math,6.7)
(Math,5.6)
(Math,10.0)
…
54. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ Bag/Tuple functions:
○ FLATTEN: behavior depends on input
GradesTuple = FOREACH Grades GENERATE student_id, TOTUPLE(course, mark) AS tuple_mark;
DUMP GradesTuple;
(1,(Math,5.6))
(2,(Math,8.9))
(3,(Math,7.1))
(4,(Math,2.3))
...
GradesUntupled = FOREACH GradesTuple GENERATE student_id AS student_id, FLATTEN(tuple_mark);
DUMP GradesUntupled;
(1,Math,5.6)
(2,Math,8.9)
(3,Math,7.1)
(4,Math,2.3)
…
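The two behaviors of FLATTEN can be modeled in Python as follows. This is a sketch of the semantics only: bags are represented as lists of tuples, and `flatten_bag` and `flatten_tuple` are hypothetical helper names.

```python
def flatten_bag(rows, bag_index):
    """FLATTEN on a bag: one output row per element of the bag."""
    out = []
    for row in rows:
        for element in row[bag_index]:
            out.append(row[:bag_index] + tuple(element) + row[bag_index + 1:])
    return out


def flatten_tuple(rows, tuple_index):
    """FLATTEN on a tuple: its fields become top-level fields of the same row."""
    return [row[:tuple_index] + tuple(row[tuple_index]) + row[tuple_index + 1:]
            for row in rows]


# (course, bag of marks), like GradesGr projected to Grades.mark
grouped = [("Math", [(6.7,), (5.6,), (10.0,)]), ("Latin", [])]
flat_bag = flatten_bag(grouped, 1)

# (student_id, (course, mark)), like GradesTuple
nested = [(1, ("Math", 5.6)), (2, ("Math", 8.9))]
flat_tuple = flatten_tuple(nested, 1)
```

Flattening a bag multiplies rows and, in this model, a row with an empty bag (the made-up "Latin" course) produces no output at all, whereas flattening a tuple keeps the number of rows unchanged.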
55. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ Bag/Tuple functions:
○ SUBTRACT: Tuples in the first bag that are not in the second
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsSub = FOREACH StudentsCoGr GENERATE group, SUBTRACT(StudentsUnder25, StudentsUnder20);
DUMP StudentsSub;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})
56. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ Bag/Tuple functions:
○ DIFF: Tuples that appear in only one of the two bags
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentsDiff = FOREACH StudentsCoGr GENERATE group, DIFF(StudentsUnder25, StudentsUnder20);
DUMP StudentsDiff;
(F,{(2,Mary,Doe,F,20),(6,Sarah,Kerrigan,F,21),(9,Princess,Peach,F,21)})
(M,{(10,Peter,Parker,M,23)})
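SUBTRACT and DIFF return the same result on the slides only because every student under 20 is also under 25. A small Python model makes the difference visible; `subtract` and `diff` are hypothetical helpers that treat bags as lists of tuples, and the second pair of bags is made up for illustration.

```python
def subtract(bag1, bag2):
    """Tuples of bag1 that do not appear in bag2 (one-sided difference)."""
    return [t for t in bag1 if t not in bag2]


def diff(bag1, bag2):
    """Tuples that appear in exactly one of the two bags (two-sided difference)."""
    return ([t for t in bag1 if t not in bag2] +
            [t for t in bag2 if t not in bag1])


# Case from the slides (male students): the second bag is contained in the first
under25 = [(10, "Peter", 23), (1, "John", 18)]
under20 = [(1, "John", 18)]
same_sub = subtract(under25, under20)
same_diff = diff(under25, under20)

# Made-up case where neither bag contains the other
a = [(1,), (2,)]
b = [(2,), (3,)]
only_in_a = subtract(a, b)
in_one_only = diff(a, b)
```

In the second case the two-sided difference also returns the tuple that is only in the second bag, which the one-sided difference never does.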
57. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ Math functions:
○ Common math functions for numeric values:
■ ABS
■ EXP
■ FLOOR
■ LOG
■ RANDOM
■ ROUND
■ SQRT
■ ...
58. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ String functions:
○ Transform chararrays:
■ ENDSWITH
■ LOWER
■ UPPER
■ SUBSTRING
■ TRIM
■ REPLACE
■ ...
59. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Functions & user defined functions
➢ Datetime functions:
○ Get information on dates and timestamps:
■ AddDuration
■ CurrentTime
■ ToDate
■ ToString
■ ToUnixTime
■ DaysBetween
■ ...
60. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
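User defined functions
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// UDF that returns the tuples of the input bag in random order
public class SHUFFLE extends EvalFunc<DataBag> {
    @Override
    public DataBag exec(Tuple input) throws IOException {
        if (input == null)
            throw new IOException("Invalid input: null");
        if (input.size() != 1)
            throw new IOException("Expected one argument");
        if (input.get(0) == null)
            return null;
        DataBag bag = (DataBag) input.get(0);
        // Copy the bag into a list so that it can be shuffled
        List<Tuple> l = new ArrayList<Tuple>();
        for (Tuple t : bag)
            l.add(t);
        Collections.shuffle(l);
        return BagFactory.getInstance().newDefaultBag(l);
    }

    // The output bag has the same schema as the input bag
    @Override
    public Schema outputSchema(Schema input) {
        try {
            return new Schema(input.getField(0));
        } catch (Exception e) {
            return null;
        }
    }
}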
61. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
More functions: DataFu Pig
➢ Library of useful UDFs, released in 2010
➢ Created by the LinkedIn engineering team:
○ Stats: variance, quantiles, median, etc.
○ Bags: concat, append, prepend, etc.
○ Sampling
○ PageRank
○ Session estimation
➢ Last major release: 1.2.0 (Dec, 2013)
http://datafu.incubator.apache.org/
62. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
How to use UDF libraries
REGISTER lib/datafu-1.2.0.jar
DEFINE BagConcat datafu.pig.bags.BagConcat();
DUMP StudentsCoGr;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)},{})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18)},{(1,John,Doe,M,18)})
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);
DUMP StudentBagConcat;
(F,{(9,Princess,Peach,F,21),(6,Sarah,Kerrigan,F,21),(2,Mary,Doe,F,20)})
(M,{(10,Peter,Parker,M,23),(1,John,Doe,M,18),(1,John,Doe,M,18)})
Statement parts: DEFINE = indicates the UDF to be included and its name
63. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Scripting
-- Libraries and UDFs
REGISTER lib/datafu-1.2.0.jar
DEFINE BagConcat datafu.pig.bags.BagConcat();
-- Load data ($student_file is a parameter)
Students = LOAD '$student_file' USING PigStorage('\t', '-noschema') AS (student_id: Long, name: Chararray, surname: Chararray, gender: Chararray, age: Int);
-- Transform data
SPLIT Students INTO StudentsUnder25 IF age<25, StudentsUnder20 IF age<20, OtherStudents OTHERWISE;
StudentsCoGr = GROUP StudentsUnder25 BY gender, StudentsUnder20 BY gender;
StudentBagConcat = FOREACH StudentsCoGr GENERATE group, BagConcat(StudentsUnder25, StudentsUnder20);
-- Store data ($output is a parameter)
STORE StudentBagConcat INTO '$output' USING PigStorage('\t', '-schema');
64. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Calling a script
pig -x mapred -f myscript.pig -param student_file=students.csv -param output=myoutput_path
Command parts: -x mapred = execution mode; -f myscript.pig = script file; -param name=value = parameter definition
65. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
More on load/store
➢ Not limited to plain text
➢ Multiple supported formats: JSON, Avro, etc.
➢ Connectors to data sources: MongoDB, Cassandra, HBase, Accumulo, etc.
66. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Exercise: Product association
➢ Detect pairs of products bought together (e.g., chairs and tables)
➢ Goal: recommend related products
➢ Association score:
67. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Product association
➢ Purchases: purchases.tsv
product_id  user_id  price  date
1           23       14.5   2014-03-03
4           15       11.2   2014-08-09
88          3        48.3   2011-01-01
...
➢ Products: products.tsv
product_id  status
1           ok
5           ko
99          ok
...
68. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Time to work!
69. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Final notes: Pros & cons
Pros:
➢ Clear and simple syntax
➢ Interactive client
➢ Transparent M/R jobs
➢ Integration with Java and others
Cons:
➢ Not as flexible as Hadoop
➢ Oriented towards ETL, not AI
➢ No loops
70. Apache Pig: Making data transformation easy. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Extra information
➢ http://pig.apache.org/
➢ Programming Pig. Alan Gates. Ed. O’Reilly
➢ Stack Overflow