Functional Big Data
Agenda 
MapReduce 
Google 
Scaling Out 
Key Value Store 
Chaining 
Fault Tolerance 
Functional Example 
Business Problem 
Design 
Processes 
Schema 
Big Data Guidelines
MapReduce
Google MapReduce 
+ Paper published in 2004 
+ Implemented in 2003 
+ Production use at Google 
+ Built for Google 
+ Not open sourced
Google in 2004 
+ Clusters of 100s or 1000s of servers 
o Linux 
o dual-processor x86 
o 2-4 GB memory 
o 100BaseT or GigE 
o inexpensive IDE hard drives 
+ Servers fail every day 
+ Network maintenance is constant
Scaling Out 
+ Scaling up (faster computer) doesn’t get far 
+ Scaling out is the only next step 
+ Hundreds/thousands of modest computers 
outperform the biggest single computers 
+ Scaling one to a few is hard 
+ Scaling a few to many is easy 
+ Scaling many to massive is (almost) trivial
Concurrency
Intermediate Data 
+ Input data is split between the workers 
+ Map workers create key/value pairs 
+ Each reduce worker reads in all intermediate 
data for its keys and sorts it by key 
+ Reduce workers then iterate over the sorted 
data producing a result for each key
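
The four steps above can be sketched in a single process. This is a minimal illustration, not Google's implementation: the names `map_links`, `reduce_count` and `run_mapreduce`, the simulated workers, and the inbound-link counting task are all assumptions chosen for the example.

```python
from collections import defaultdict
from itertools import chain


def map_links(page):
    """Map: emit a (target, 1) pair for every outbound link on a page."""
    _source, targets = page
    return [(target, 1) for target in targets]


def reduce_count(key, values):
    """Reduce: produce one result for a key from all of its values."""
    return key, sum(values)


def run_mapreduce(inputs, mapper, reducer, workers=2):
    # Split the input between the (simulated) map workers.
    splits = [inputs[i::workers] for i in range(workers)]
    # Each map worker creates intermediate key/value pairs.
    intermediate = chain.from_iterable(
        mapper(item) for split in splits for item in split
    )
    # Group the intermediate data by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)
    # Iterate over the keys in sorted order, producing one result per key.
    return [reducer(key, grouped[key]) for key in sorted(grouped)]


# Hypothetical crawl output: (page, pages it links to).
pages = [
    ("a.example", ["b.example", "c.example"]),
    ("b.example", ["c.example"]),
    ("d.example", ["c.example", "b.example"]),
]
result = run_mapreduce(pages, map_links, reduce_count)
print(result)  # [('b.example', 2), ('c.example', 3)]
```

In a real deployment the splits, the intermediate pairs and the reduce partitions would live on different servers; only the data flow is shown here.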
Key Value Store
Rinse and Repeat 
+ Often the results of one MapReduce are 
used as input to another 
+ Building on a powerful basic functional 
model, complex data processing can be 
accomplished
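
As a rough sketch of chaining, the output of one pass can be fed straight into the next. The helper `mapreduce` and the two jobs (a word count followed by grouping words by their count) are hypothetical examples, not taken from the slides.

```python
from collections import defaultdict


def mapreduce(records, mapper, reducer):
    """Run one in-process MapReduce pass and return a list of (key, result)."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return [(key, reducer(key, grouped[key])) for key in sorted(grouped)]


# Job 1: count the occurrences of each word.
def split_words(line):
    return [(word, 1) for word in line.split()]


def total(_key, values):
    return sum(values)


# Job 2: take job 1's (word, count) output and group the words by count.
def invert(pair):
    word, count = pair
    return [(count, word)]


def collect(_key, values):
    return sorted(values)


lines = ["map reduce map", "reduce map chain"]
word_counts = mapreduce(lines, split_words, total)
words_by_count = mapreduce(word_counts, invert, collect)
print(words_by_count)  # [(1, ['chain']), (2, ['reduce']), (3, ['map'])]
```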
Chaining
Fault Tolerance 
+ Likelihood of failure rises with number of 
servers and processing time 
+ Resiliency is a necessity at scale 
+ Scheduler/Supervisor (master) reassigns 
failed jobs and ensures reduce workers find 
the (right) data
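
One way to picture the master's reassignment step is the sketch below. The function `reassign_failed`, the task and worker identifiers, and the round-robin policy are assumptions for illustration; the slides do not say how the replacement worker is chosen.

```python
def reassign_failed(assignments, alive):
    """Return new task -> worker assignments after moving tasks off dead workers.

    assignments: dict mapping task id -> worker id
    alive: set of worker ids that answered the last ping
    """
    if not alive:
        raise RuntimeError("no live workers left to take over the work")
    live = sorted(alive)
    updated = {}
    moved = 0
    for task, worker in sorted(assignments.items()):
        if worker in alive:
            updated[task] = worker
        else:
            # Round-robin the orphaned tasks over the surviving workers.
            updated[task] = live[moved % len(live)]
            moved += 1
    return updated


assignments = {"map-0": "w1", "map-1": "w2", "map-2": "w3", "map-3": "w2"}
alive = {"w1", "w3"}  # w2 stopped answering pings
updated = reassign_failed(assignments, alive)
print(updated)
```

Reduce workers would then consult the updated table to find out which worker now holds the intermediate data for each map task.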
Scheduling
Supervision
Functional Example
Example Business Problem 
Scenario: 
A mobile operator wants to know if an instant 
messaging (IM) service would be useful to 
current subscribers. 
Question: 
What percentage of text messages (SMS) 
are part of a conversation?
Challenge 
✓ 10 million subscribers 
✓ average of 100 SMS a month per subscriber 
✓ ∴ one billion SMS each month 
✓ call detail records (CDR) include SMS but also 
voice and data events 
✓ ∴ 20 billion (20,000,000,000) records/month
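
The arithmetic behind these figures can be written out as follows. The 30-day month used for the average record rate is an assumption, and the rate is only an average, not a peak.

```latex
\begin{align*}
\text{SMS per month} &= 10^{7}\ \text{subscribers} \times 100 = 10^{9} \\
\text{SMS share of all records} &= \frac{10^{9}}{2 \times 10^{10}} = 5\% \\
\text{average record rate} &= \frac{2 \times 10^{10}}{30 \times 86\,400\ \text{s}} \approx 7\,700\ \text{records/s}
\end{align*}
```

So the filter discards about 95% of what it reads, which is why it runs before any per-subscriber work.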
Requirements 
+ Identify SMS conversations 
o messages sent or received with one other party 
o interval between messages < 10 minutes 
o at least three messages exchanged 
+ Provide result as 
o ratio of conversational to non-conversational SMS 
o per subscriber 
o per month
Process Design
Filter 
+ Read events from CDR files 
o records are in chronological order 
o read files in chronological order 
+ Discard non-SMS events 
+ Distribute SMS events to Map processes 
o Consistent distribution by subscriber
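
A minimal sketch of the filter step follows. The CSV layout (timestamp, event type, subscriber, other party), the `SMS` type tag and the phone numbers are hypothetical; real CDR formats vary by operator and vendor.

```python
import csv
import io

# Hypothetical CDR layout: timestamp, event type, subscriber, other party.
SAMPLE_CDR = """\
2024-01-01T09:00:00,VOICE,0771234567,0777654321
2024-01-01T09:01:00,SMS,0771234567,0777654321
2024-01-01T09:02:00,DATA,0771234568,
2024-01-01T09:03:00,SMS,0771234568,0771234567
"""


def sms_events(cdr_file):
    """Yield only the SMS events from a CDR file, preserving file order."""
    for timestamp, event_type, subscriber, other_party in csv.reader(cdr_file):
        if event_type == "SMS":
            yield timestamp, subscriber, other_party


events = list(sms_events(io.StringIO(SAMPLE_CDR)))
for event in events:
    print(event)
```

Because the generator keeps the file order, events reach the map processes in chronological order as long as the files themselves are read chronologically.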
Hashing 
+ To analyze interval between 
messages one process must 
handle all events for a 
particular subscriber 
+ Simple Hash: 
o M = last four digits of subscriber’s 
mobile number 
o N = number of processes available 
o Pid = M rem N
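
The simple hash can be sketched in a few lines. The function name `process_id` and the sample numbers are made up for the example; for non-negative integers Python's `%` gives the same result as `rem`.

```python
def process_id(mobile_number, process_count):
    """Pid = M rem N, where M is the last four digits of the number."""
    m = int(mobile_number[-4:])
    return m % process_count


numbers = ("0771234567", "0771234568", "0779994567")
pids = [process_id(number, 8) for number in numbers]
print(pids)  # [7, 0, 7]
```

Every event for a given subscriber lands on the same process, which is what the interval analysis needs. Different subscribers may share a process, as the first and third numbers do here.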
Map 
+ Read subscriber’s stored data 
+ Find other party in set 
+ Increment total count of messages 
+ Is previous message < 10 minutes ago? 
o Is next previous message < 10m before previous? 
▪ Increment conversational messages count 
+ Update previous and next previous times
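
The map steps above might look like this in code. It is a sketch under assumptions: the `OtherParty` field names, the nested-dictionary store and `map_event` are illustrative, and the sketch deliberately keeps the gap the editor's notes mention, so the first two messages of a conversation are not counted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

WINDOW = timedelta(minutes=10)


@dataclass
class OtherParty:
    total: int = 0
    conversational: int = 0
    previous: Optional[datetime] = None
    next_previous: Optional[datetime] = None


def map_event(store, subscriber, other_party, when):
    """Apply one SMS event to the subscriber's stored data."""
    parties = store.setdefault(subscriber, {})
    party = parties.setdefault(other_party, OtherParty())
    party.total += 1
    # Third or later message of a run in which each gap is under 10 minutes.
    if party.previous is not None and when - party.previous < WINDOW:
        if (party.next_previous is not None
                and party.previous - party.next_previous < WINDOW):
            party.conversational += 1
    party.next_previous, party.previous = party.previous, when


store = {}
start = datetime(2024, 1, 1, 9, 0)
for minutes in (0, 3, 5, 8, 30):
    map_event(store, "0771234567", "0777654321", start + timedelta(minutes=minutes))
party = store["0771234567"]["0777654321"]
print(party.total, party.conversational)  # 5 2
```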
Schema Design
Interim Data 
+ We are using an in-memory key-value store 
+ The key is the subscriber number 
+ The value is a set of OtherParty 
+ OtherParty data structure contains counts 
+ When the map is complete we transfer the 
data to disk for persistence
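
A possible shape for the interim data and its transfer to disk is sketched below. The JSON file format, the `persist` helper and the use of a dictionary keyed by the other party's number (rather than a literal set) are assumptions made for the example.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class OtherParty:
    total: int = 0
    conversational: int = 0


def persist(store, path):
    """Write the in-memory store to disk once the map phase is complete."""
    plain = {
        subscriber: {number: asdict(party) for number, party in parties.items()}
        for subscriber, parties in store.items()
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(plain, handle)


# Key: subscriber number. Value: the other parties that subscriber messaged.
store = {"0771234567": {"0777654321": OtherParty(total=5, conversational=2)}}

path = os.path.join(tempfile.mkdtemp(), "interim.json")
persist(store, path)
with open(path, encoding="utf-8") as handle:
    reloaded = json.load(handle)
print(reloaded)
```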
Reduce 
+ Collect intermediate data 
from disk copies 
+ Iterate through all parties for 
each subscriber 
+ Total all party counts 
+ Provide result as percentage 
of conversational messages 
to total messages
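
The reduce step can be sketched as a sum over all parties for each subscriber. The dictionary layout of the interim data and the name `reduce_subscriber` are assumptions, and the counts are invented sample values.

```python
def reduce_subscriber(parties):
    """Total the counts over all parties and return the conversational percentage."""
    total = sum(p["total"] for p in parties.values())
    conversational = sum(p["conversational"] for p in parties.values())
    return 100.0 * conversational / total if total else 0.0


# Hypothetical interim data as it might be read back from disk.
interim = {
    "0771234567": {
        "0777654321": {"total": 5, "conversational": 2},
        "0770000001": {"total": 15, "conversational": 8},
    },
    "0771234568": {
        "0771234567": {"total": 4, "conversational": 0},
    },
}
results = {sub: reduce_subscriber(parties) for sub, parties in interim.items()}
print(results)  # {'0771234567': 50.0, '0771234568': 0.0}
```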
Big Data Guidelines 
+ Find opportunities for concurrency 
+ Choose the right containers for your data 
+ Use memory as effectively as possible 
+ Minimize copying data 
+ Avoid any unnecessary overhead 
+ Anything you are going to do hundreds of 
billions of times should be efficient!
Thank you.
SLASSCOM TECH TALKS 
https://www.facebook.com/SlasscomTechnologyForum 
http://www.slasscom.lk/events 
https://twitter.com/slasscom 
www.slideshare.net/slasscomtechforum


Editor's notes

  1. Successfully handling really big data requires massive concurrency, and in the real world this requires fault tolerance.
  2. Google didn’t invent map and reduce but they were the first to apply the paradigm in a general way on a massive scale.
  3. … or, more probably, a number of results. By dividing the work we can assign it to many servers. This concurrency is what allows scale.
  4. Here is an example of something which Google do as part of their core business. Google places web sites which are linked to by many other web sites higher in search results (PageRank). To determine this a map reads web pages found by crawlers and creates key/value pairs. These are written in memory and then pushed out in blocks to disk. A reduce reads these disk blocks and sorts all the intermediate data by key. The reduce function then iterates over all the pairs for a key and outputs one result for each key.
  5. The results from one MapReduce can be, and often are, provided as input for further MapReduce runs.
  6. Something like RAID, maybe Redundant Array of Inexpensive Servers (RAIS)? They can and do fail individually without the system failing.
  7. The user process forks all of the other processes which will be used including a master process. The master then assigns those processes work to perform, either map or reduce roles.
  8. The master process monitors each worker by sending a ping periodically. When it detects that a server has failed (or is no longer reachable) it will reassign that server’s work to another worker. After this reassignment each of the reduce workers will be notified to ignore the failed server and instead get the interim data from the newly assigned server.
  9. This is a contrived example.
  10. That’s billion with a ‘B’. In Canada that’s 1,000 million.
  11. There is an obvious hole in this pseudo code: the first two messages of the conversation are not included in the conversational totals. I could have accommodated that but I left it out to keep the example as simple as possible.