SlideShare une entreprise Scribd logo
1  sur  23
Orbitz Worldwide 
When You Can’t Start From Scratch 
Building a Data Pipeline with the Tools You Have 
Oct 1, 2014
Agenda 
• Introduction 
• Motivation 
• Consumption 
• Storage at Rest 
• Transport 
• Dead Simple Data Collection 
• Key Takeaways 
page 1
About Us 
• Steve Hoffman 
• Senior Principal Engineer - Operations 
• @bacoboy 
• Author of Apache Flume Book (bitly.com/flumebook) 
- Conference Discount until Oct 8 
- Print books: qLyVgEb5d 
- eBook - gON73ZL77 
• Ken Dallmeyer 
• Lead Engineer – Machine Learning 
• We work at Orbitz – www.orbitz.com 
• Leading Online Travel Agency for 14 years 
• Always Hiring! @OrbitzTalent 
page 2
Motivation 
• What we have: 
• Big Website -> Mountains of Logs 
• What we want: 
• Finding customer insight in logs 
• What we do: 
• Spending disproportionate amount 
of time scrubbing logs into 
parsable data 
• Multiple 1-off transports into 
Hadoop 
page 3
Logs != Data 
• Logs 
• Humans read them 
• Short lifespan – what’s broken now? 
• Data 
• Programs read them 
• Long lifespan – Find trends over a period of time 
• Developer changes logs ‘cause its useful to them -> breaks your MR 
job. 
page 4
Look Familiar? 
20140922-100028.260|I|loyalty-003|loyalty-1.23- 
0|ORB|3F3BA823C747FF17~wl000000000000000422c3b14201cdd0|3a 
3f3b95||com.orbitz.loyalty.LoyaltyDataServiceImpl:533|Loya 
lty+Member+ABC123ZZ+has+already+earned+USD18.55+and+is+eli 
gible+for+USD31.45 
{"timestamp":"1411398028260","server":"loyalty-003", 
"pos":"ORB","sessionID":"3F3BA823C747FF17", 
"requestID":"wl000000000000000422c3b14201cdd0", 
"loyalityID":"ABC123ZZ”,"loyalityEarnedAmount":"USD18.55", 
"loyalityEligibleAmount":"USD31.45"} 
page 5
Are we asking the correct question? 
• Not “How should I store data?” 
• But, How do people consume the data? 
• Through Queries? Hive 
• Through Key-Value lookups? Hbase 
• Custom code? MapReduce 
• Existing data warehouse? 
• Web UI/Command Line/Excel(gasp)? 
Start with consumption and work backwards 
page 6
Consumption 
• Hive Tables turned out to be the Orbitz common denominator 
• We like Hive because 
• SQL ~= HQL – people understand tables/columns 
• Its a lightweight queryable datasource 
• Something easy to change without a lot of overhead 
• Can join with other people’s hive tables 
• BUT… 
• Each table was its own MapReduce job 
• Too much time spent hunting/scrubbing data than actually using it 
page 7
Storage at Rest 
• How can we generalize to our data feed to be readable by Hive? 
• Options: 
• Character delimited text files 
- But brittle to change 
- Cannot remove fields 
- Order matters 
• Avro records 
- Schema defines table 
- Tight coupling with transport to HDFS handoff or verbose passing schema 
- Changes mean re-encoding old data to match new schema 
• HBase 
- Good for flexibility 
- Key selection is very important and hard to change 
- Bad for ad-hoc non-key querying 
page 8
Storage at Rest 
• Our solution: 
• Use Avro with a Map<String, String> schema for storage 
• A custom Hive SerDe to map Hive columns to keys in the map. 
• Storage is independent from Consumption 
• New keys just sit until Hive mapping updated 
• Deleted keys return NULL if not there 
• Only Requirements: 
• Bucket Name (aka table name) 
• Timestamp (UTC) 
page 9
Storage at Rest 
• Stored in 
• hdfs://server/root/#{bucket}/#{YYYY}/#{MM}/#{DD}/#{HH}/log.XXXXX.avro 
• Avro Schema: Map<String,String> 
• Create external hive table: 
CREATE EXTERNAL TABLE foo ( 
col1 STRING, 
col2 INT 
) 
PARTITIONED BY ( 
dt STRING, 
hour STRING 
) 
ROW FORMAT SERDE 'com.orbitz.hadoop.hive.avro.AvroSerDe’ 
STORED AS 
INPUTFORMAT 
'com.orbitz.hadoop.hive.avro.AvroContainerInputFormat’ 
OUTPUTFORMAT 
'com.orbitz.hadoop.hive.avro.AvroContainerOutputFormat’ 
LOCATION ‘/root/foo’ 
TBLPROPERTIES( 
'column.mapping' = ’col1,col2’ 
page 10 );
Care and Feeding 
• Issues 
• Hive partitions aren’t automatically added. (Vote for HIVE-6589). 
- A cron job to add a new partition every hour. 
• Nice to Have 
• Would be nice to extend schema rather than set properties 
- col1 STRING KEYED BY ‘some_other_key’ 
page 11
Transport 
• Lots of Options at the time 
• Flume 
• Syslog variants 
• Logstash 
• Newer options 
• Storm 
• Spark 
• Kite SDK 
• And probably so many more 
page 12
Transport 
• We chose Flume, since at the time it was best option 
• HDFS aware 
• Did time-bucketing 
• Provided a buffering tier 
- Inevitable Hadoop maintenance 
- Isolates us from Hadoop versions and incompatibilities (less of an issue today) 
• Have ‘localhost’ agent to simplify configuration 
• Use provisioning tool to externalize configuration of where to send for the 
environments 
page 13
Transport Plan 
• Application writes generic JSON payload using Avro client to local 
Flume agent 
• Local Flume agent forwards to collector tier 
• Collector Tier to HDFS 
• However, an additional Java agent on every application server = big 
memory footprint 
page 14
Transport Updated 
• Application write JSON to local syslog agent 
• (already there doing log work /var/log/*) 
• Local syslog agent to flume collector tier 
• Flume collector to HDFS 
page 15
Care and Feeding 
• Issues 
• Hive partitions aren’t automatically added. (Vote for HIVE-6589). 
- A cron job to add a new partition every hour. 
• Flume streaming data creates lots of little files (Bad for NameNode) 
- A cron job to combine many tiny poorly compresed files into 1 better compressed 
avro file once per hour (similar to in functionality to HBase compaction) 
- Create custom serializer to write Map<String,String> instead of default Flume 
Avro record format. 
• Syslog 
- Need to pass single line of data in syslog format. Multiple lines, non-ascii, etc. 
would cause problems. Just need to make sure JSON coming in has special 
characters escaped out. 
page 16
Dead Simple Data Collection 
• Want a low barrier to entry. Think log4j or another simple API 
public sendData(Map<String,String> myData); 
But Beware of creating a new standard… 
page 17 
http://xkcd.com/927/
Dead Simple Data Collection 
• Thought about using Flume log4j Appender, but would have to wrap 
JSON payload creation anyway 
• Logs != Data 
• log.warn(DATA) ?? 
• We already were using ERMA which was close enough to this. May 
not be right choice for you. 
• Create custom ERMA monitor processor and renderer to create the 
payload for syslog 
- Make sure it was RFC5424 
- Assemble JSON payload 
- Add “default fields” like timestamp, session id, etc. 
page 18
Dead Simple Data Collection 
• But what about more complicated data structures? 
• Flatten Keys with dot notation 
• Hash? 
• {a:{b:6}}  {a.b:6} 
• Arrays? 
• {ts:4, data:[4,6,7]} {ts:4, data.0:4, data.1:6, 
data.2:7, data.length=3} 
• It depends… Again think consumption first – Hive tables are flat 
page 19
Care and Feeding 
• Issues 
• Hive partitions aren’t automatically added. (Vote for HIVE-6589). 
- A cron job to add a new partition every hour. 
• Flume streaming data creates lots of little files (Bad for NameNode) 
- A cron job to combine many tiny poorly compresed files into 1 better compressed 
avro file once per hour (similar to in functionality to HBase compaction) 
- Create custom serializer to write Map<String,String> instead of default Flume 
Avro record format. 
• Syslog 
- Need to pass single line of data in syslog format. Multiple lines, non-ascii, etc. 
would cause problems. Just need to make sure JSON coming in has special 
characters escaped out. 
• Application 
- Whatever data emitter we choose, needs to be async 
• Scaling and Monitoring 
- Be aware that as we add more applications, we will need to scale 
the Collectors and Hadoop page 20
Key Takeaways 
• How you consume the data should drive your solution 
• Decouple Storage from API and Transport 
• If 100% persistence then use a DATABASE. 
• Use in-memory when possible 
• much faster than disk = less hardware you have to buy === value of the 
data/ what is really lost if you lost an hour/day/all? how soon to recover 
• Minimize transforms at source, en-route, and destination 
• Minimize hops from Source to Destination 
• Data as a Minimal Viable Product, not a data warehouse – grow 
organically as your applications will. 
page 21
Thanks 
Thanks! 
& 
Questions? 
page 22

Contenu connexe

Dernier

Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxAleenaJamil4
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 

Dernier (20)

Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
detection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptxdetection and classification of knee osteoarthritis.pptx
detection and classification of knee osteoarthritis.pptx
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 

En vedette

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by HubspotMarius Sescu
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTExpeed Software
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)contently
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024Albert Qian
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsKurio // The Social Media Age(ncy)
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summarySpeakerHub
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next Tessa Mero
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best PracticesVit Horky
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project managementMindGenius
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...RachelPearson36
 

En vedette (20)

2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot2024 State of Marketing Report – by Hubspot
2024 State of Marketing Report – by Hubspot
 
Everything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPTEverything You Need To Know About ChatGPT
Everything You Need To Know About ChatGPT
 
Product Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage EngineeringsProduct Design Trends in 2024 | Teenage Engineerings
Product Design Trends in 2024 | Teenage Engineerings
 
How Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental HealthHow Race, Age and Gender Shape Attitudes Towards Mental Health
How Race, Age and Gender Shape Attitudes Towards Mental Health
 
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdfAI Trends in Creative Operations 2024 by Artwork Flow.pdf
AI Trends in Creative Operations 2024 by Artwork Flow.pdf
 
Skeleton Culture Code
Skeleton Culture CodeSkeleton Culture Code
Skeleton Culture Code
 
PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024PEPSICO Presentation to CAGNY Conference Feb 2024
PEPSICO Presentation to CAGNY Conference Feb 2024
 
Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)Content Methodology: A Best Practices Report (Webinar)
Content Methodology: A Best Practices Report (Webinar)
 
How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024How to Prepare For a Successful Job Search for 2024
How to Prepare For a Successful Job Search for 2024
 
Social Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie InsightsSocial Media Marketing Trends 2024 // The Global Indie Insights
Social Media Marketing Trends 2024 // The Global Indie Insights
 
Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024Trends In Paid Search: Navigating The Digital Landscape In 2024
Trends In Paid Search: Navigating The Digital Landscape In 2024
 
5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary5 Public speaking tips from TED - Visualized summary
5 Public speaking tips from TED - Visualized summary
 
ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd ChatGPT and the Future of Work - Clark Boyd
ChatGPT and the Future of Work - Clark Boyd
 
Getting into the tech field. what next
Getting into the tech field. what next Getting into the tech field. what next
Getting into the tech field. what next
 
Google's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search IntentGoogle's Just Not That Into You: Understanding Core Updates & Search Intent
Google's Just Not That Into You: Understanding Core Updates & Search Intent
 
How to have difficult conversations
How to have difficult conversations How to have difficult conversations
How to have difficult conversations
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Time Management & Productivity - Best Practices
Time Management & Productivity -  Best PracticesTime Management & Productivity -  Best Practices
Time Management & Productivity - Best Practices
 
The six step guide to practical project management
The six step guide to practical project managementThe six step guide to practical project management
The six step guide to practical project management
 
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
 

Big Data Everywhere Chicago: When You Can't Start From Scratch -- Building a Data Pipelines with the Tools You Have. An Orbitz Case Study. (Orbitz)

  • 1. Orbitz Worldwide When You Can’t Start From Scratch Building a Data Pipeline with the Tools You Have Oct 1, 2014
  • 2. Agenda • Introduction • Motivation • Consumption • Storage at Rest • Transport • Dead Simple Data Collection • Key Takeaways page 1
  • 3. About Us • Steve Hoffman • Senior Principal Engineer - Operations • @bacoboy • Author of Apache Flume Book (bitly.com/flumebook) - Conference Discount until Oct 8 - Print books: qLyVgEb5d - eBook - gON73ZL77 • Ken Dallmeyer • Lead Engineer – Machine Learning • We work at Orbitz – www.orbitz.com • Leading Online Travel Agency for 14 years • Always Hiring! @OrbitzTalent page 2
  • 4. Motivation • What we have: • Big Website -> Mountains of Logs • What we want: • Finding customer insight in logs • What we do: • Spending disproportionate amount of time scrubbing logs into parsable data • Multiple 1-off transports into Hadoop page 3
  • 5. Logs != Data • Logs • Humans read them • Short lifespan – what’s broken now? • Data • Programs read them • Long lifespan – Find trends over a period of time • Developer changes logs ‘cause its useful to them -> breaks your MR job. page 4
  • 6. Look Familiar? 20140922-100028.260|I|loyalty-003|loyalty-1.23- 0|ORB|3F3BA823C747FF17~wl000000000000000422c3b14201cdd0|3a 3f3b95||com.orbitz.loyalty.LoyaltyDataServiceImpl:533|Loya lty+Member+ABC123ZZ+has+already+earned+USD18.55+and+is+eli gible+for+USD31.45 {"timestamp":"1411398028260","server":"loyalty-003", "pos":"ORB","sessionID":"3F3BA823C747FF17", "requestID":"wl000000000000000422c3b14201cdd0", "loyalityID":"ABC123ZZ”,"loyalityEarnedAmount":"USD18.55", "loyalityEligibleAmount":"USD31.45"} page 5
  • 7. Are we asking the correct question? • Not “How should I store data?” • But, How do people consume the data? • Through Queries? Hive • Through Key-Value lookups? Hbase • Custom code? MapReduce • Existing data warehouse? • Web UI/Command Line/Excel(gasp)? Start with consumption and work backwards page 6
  • 8. Consumption • Hive Tables turned out to be the Orbitz common denominator • We like Hive because • SQL ~= HQL – people understand tables/columns • Its a lightweight queryable datasource • Something easy to change without a lot of overhead • Can join with other people’s hive tables • BUT… • Each table was its own MapReduce job • Too much time spent hunting/scrubbing data than actually using it page 7
  • 9. Storage at Rest • How can we generalize to our data feed to be readable by Hive? • Options: • Character delimited text files - But brittle to change - Cannot remove fields - Order matters • Avro records - Schema defines table - Tight coupling with transport to HDFS handoff or verbose passing schema - Changes mean re-encoding old data to match new schema • HBase - Good for flexibility - Key selection is very important and hard to change - Bad for ad-hoc non-key querying page 8
  • 10. Storage at Rest • Our solution: • Use Avro with a Map<String, String> schema for storage • A custom Hive SerDe to map Hive columns to keys in the map. • Storage is independent from Consumption • New keys just sit until Hive mapping updated • Deleted keys return NULL if not there • Only Requirements: • Bucket Name (aka table name) • Timestamp (UTC) page 9
  • 11. Storage at Rest • Stored in • hdfs://server/root/#{bucket}/#{YYYY}/#{MM}/#{DD}/#{HH}/log.XXXXX.avro • Avro Schema: Map<String,String> • Create external hive table: CREATE EXTERNAL TABLE foo ( col1 STRING, col2 INT ) PARTITIONED BY ( dt STRING, hour STRING ) ROW FORMAT SERDE 'com.orbitz.hadoop.hive.avro.AvroSerDe’ STORED AS INPUTFORMAT 'com.orbitz.hadoop.hive.avro.AvroContainerInputFormat’ OUTPUTFORMAT 'com.orbitz.hadoop.hive.avro.AvroContainerOutputFormat’ LOCATION ‘/root/foo’ TBLPROPERTIES( 'column.mapping' = ’col1,col2’ page 10 );
  • 12. Care and Feeding • Issues • Hive partitions aren’t automatically added. (Vote for HIVE-6589). - A cron job to add a new partition every hour. • Nice to Have • Would be nice to extend schema rather than set properties - col1 STRING KEYED BY ‘some_other_key’ page 11
  • 13. Transport • Lots of Options at the time • Flume • Syslog variants • Logstash • Newer options • Storm • Spark • Kite SDK • And probably so many more page 12
  • 14. Transport • We chose Flume, since at the time it was best option • HDFS aware • Did time-bucketing • Provided a buffering tier - Inevitable Hadoop maintenance - Isolates us from Hadoop versions and incompatibilities (less of an issue today) • Have ‘localhost’ agent to simplify configuration • Use provisioning tool to externalize configuration of where to send for the environments page 13
  • 15. Transport Plan • Application writes generic JSON payload using Avro client to local Flume agent • Local Flume agent forwards to collector tier • Collector Tier to HDFS • However, an additional Java agent on every application server = big memory footprint page 14
  • 16. Transport Updated • Application write JSON to local syslog agent • (already there doing log work /var/log/*) • Local syslog agent to flume collector tier • Flume collector to HDFS page 15
  • 17. Care and Feeding • Issues • Hive partitions aren’t automatically added. (Vote for HIVE-6589). - A cron job to add a new partition every hour. • Flume streaming data creates lots of little files (Bad for NameNode) - A cron job to combine many tiny poorly compresed files into 1 better compressed avro file once per hour (similar to in functionality to HBase compaction) - Create custom serializer to write Map<String,String> instead of default Flume Avro record format. • Syslog - Need to pass single line of data in syslog format. Multiple lines, non-ascii, etc. would cause problems. Just need to make sure JSON coming in has special characters escaped out. page 16
  • 18. Dead Simple Data Collection • Want a low barrier to entry. Think log4j or another simple API public sendData(Map<String,String> myData); But Beware of creating a new standard… page 17 http://xkcd.com/927/
  • 19. Dead Simple Data Collection • Thought about using Flume log4j Appender, but would have to wrap JSON payload creation anyway • Logs != Data • log.warn(DATA) ?? • We already were using ERMA which was close enough to this. May not be right choice for you. • Create custom ERMA monitor processor and renderer to create the payload for syslog - Make sure it was RFC5424 - Assemble JSON payload - Add “default fields” like timestamp, session id, etc. page 18
  • 20. Dead Simple Data Collection • But what about more complicated data structures? • Flatten Keys with dot notation • Hash? • {a:{b:6}}  {a.b:6} • Arrays? • {ts:4, data:[4,6,7]} {ts:4, data.0:4, data.1:6, data.2:7, data.length=3} • It depends… Again think consumption first – Hive tables are flat page 19
  • 21. Care and Feeding • Issues • Hive partitions aren’t automatically added. (Vote for HIVE-6589). - A cron job to add a new partition every hour. • Flume streaming data creates lots of little files (Bad for NameNode) - A cron job to combine many tiny poorly compresed files into 1 better compressed avro file once per hour (similar to in functionality to HBase compaction) - Create custom serializer to write Map<String,String> instead of default Flume Avro record format. • Syslog - Need to pass single line of data in syslog format. Multiple lines, non-ascii, etc. would cause problems. Just need to make sure JSON coming in has special characters escaped out. • Application - Whatever data emitter we choose, needs to be async • Scaling and Monitoring - Be aware that as we add more applications, we will need to scale the Collectors and Hadoop page 20
  • 22. Key Takeaways • How you consume the data should drive your solution • Decouple Storage from API and Transport • If 100% persistence then use a DATABASE. • Use in-memory when possible • much faster than disk = less hardware you have to buy === value of the data/ what is really lost if you lost an hour/day/all? how soon to recover • Minimize transforms at source, en-route, and destination • Minimize hops from Source to Destination • Data as a Minimal Viable Product, not a data warehouse – grow organically as your applications will. page 21
  • 23. Thanks Thanks! & Questions? page 22

Notes de l'éditeur

  1. Orbitz.com – the flagship brand of Orbitz Worldwide – has been in operation since 2001. DUNCLLC (American Airlines joined too) Public OWW – 2003 Cendant – 2004 Travelport, Blackstone - 2006 Public (OWW) – 2007
  2. Ken’s Quote: “I work on the Machine Learning team, but I spend more time writing ETL and cleaning data then actual machine learning!”
  3. Prepopulate with basic stuff: time, session id, whatever
  4. ERMA = Extremely Reusable Monitoring API)