2. Agenda
MapReduce
Google
Scaling Out
Key Value Store
Chaining
Fault Tolerance
Functional Example
Business Problem
Design
Processes
Schema
Big Data Guidelines
4. Google MapReduce
+ Paper published in 2004
+ Implemented in 2003
+ Production use at Google
+ Built for Google
+ Not open sourced
5. Google in 2004
+ Clusters of 100s or 1000s of servers
o Linux
o dual-processor x86
o 2-4 GB memory
o 100BaseT or GigE
o inexpensive IDE hard drives
+ Servers fail every day
+ Network maintenance is constant
6. Scaling Out
+ Scaling up (faster computer) doesn’t get far
+ Scaling out is the only next step
+ Hundreds/thousands of modest computers
outperform the biggest single computers
+ Scaling one to a few is hard
+ Scaling a few to many is easy
+ Scaling many to massive is (almost) trivial
8. Intermediate Data
+ Input data is split between the workers
+ Map workers create key/value pairs
+ Reduce workers read in all intermediate
data and sort by key
+ Reduce workers then iterate over the sorted
data producing a result for each key
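As a concrete (if sequential) illustration of that flow, here is a hedged Erlang sketch: a map function emits key/value pairs, the pairs are grouped by key, and a reduce function produces one result per key. The module and function names are illustrative assumptions only and are not Google's API.

-module(mr_sketch).
-export([mapreduce/3, wordcount/1]).

%% Apply Map to every input, group the emitted {Key, Value} pairs by key,
%% then apply Reduce to each key and its list of values.
mapreduce(Map, Reduce, Inputs) ->
    Pairs = lists:flatmap(Map, Inputs),
    ByKey = lists:foldl(fun({K, V}, Acc) ->
                                maps:update_with(K, fun(Vs) -> [V | Vs] end, [V], Acc)
                        end, #{}, Pairs),
    maps:map(Reduce, ByKey).

%% Example: count how often each word occurs across a list of "documents".
wordcount(Docs) ->
    Map = fun(Doc) -> [{Word, 1} || Word <- string:lexemes(Doc, " ")] end,
    Reduce = fun(_Word, Counts) -> lists:sum(Counts) end,
    mapreduce(Map, Reduce, Docs).

For example, mr_sketch:wordcount(["to be or not to be"]) returns #{"be" => 2, "not" => 1, "or" => 1, "to" => 2}.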
10. Rinse and Repeat
+ Often the results of one MapReduce are
used as input to another
+ By building on this simple but powerful
functional model, complex data processing
can be accomplished
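To make the chaining concrete, here is a hedged sketch that feeds the output of the wordcount pass into a second pass that groups words by how often they occur. It assumes the illustrative mr_sketch module from the previous sketch.

-module(mr_chain).
-export([chain/1]).

%% First pass: word counts. Second pass: group words by their count,
%% using the first pass's output as the second pass's input.
chain(Docs) ->
    Counts = maps:to_list(mr_sketch:wordcount(Docs)),
    Map    = fun({Word, N}) -> [{N, Word}] end,
    Reduce = fun(_N, Words) -> lists:sort(Words) end,
    mr_sketch:mapreduce(Map, Reduce, Counts).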
12. Fault Tolerance
+ Likelihood of failure rises with number of
servers and processing time
+ Resiliency is a necessity at scale
+ Scheduler/Supervisor (master) reassigns
failed jobs and ensures reduce workers find
the (right) data
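A minimal Erlang sketch of this reassignment idea, under the assumption that "work" can be expressed as a fun applied to a task: the master spawns and monitors one worker per task, and when a worker dies its task is handed to a freshly spawned replacement. This is only a skeleton of the supervision loop, not Google's scheduler.

-module(master_sketch).
-export([run/2]).

%% Run WorkFun on every task in Tasks using one worker process per task.
%% If a worker dies before reporting, its task is handed to a replacement.
run(WorkFun, Tasks) ->
    Running = maps:from_list([start_worker(WorkFun, T) || T <- Tasks]),
    wait(WorkFun, Running, []).

start_worker(WorkFun, Task) ->
    Master = self(),
    {Pid, Ref} = spawn_monitor(fun() -> Master ! {done, self(), WorkFun(Task)} end),
    {Pid, {Ref, Task}}.

wait(_WorkFun, Running, Results) when map_size(Running) =:= 0 ->
    Results;
wait(WorkFun, Running, Results) ->
    receive
        {done, Pid, Result} ->
            {Ref, _Task} = maps:get(Pid, Running),
            erlang:demonitor(Ref, [flush]),
            wait(WorkFun, maps:remove(Pid, Running), [Result | Results]);
        {'DOWN', Ref, process, Pid, _Reason} ->
            %% The worker failed before reporting: reassign its task.
            {Ref, Task} = maps:get(Pid, Running),
            {NewPid, New} = start_worker(WorkFun, Task),
            wait(WorkFun, maps:put(NewPid, New, maps:remove(Pid, Running)), Results)
    end.

For example, master_sketch:run(fun(X) -> X * X end, [1, 2, 3]) returns [1, 4, 9] in completion order.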
16. Example Business Problem
Scenario:
A mobile operator wants to know if an instant
messaging (IM) service would be useful to
current subscribers.
Question:
What percentage of text messages (SMS)
are part of a conversation?
17. Challenge
✓ 10 million subscribers
✓ average of 100 SMS a month per subscriber
✓ ∴ one billion SMS each month
✓ call detail records (CDR) include SMS but also
voice and data events
✓ ∴ 20 billion (20,000,000,000) records/month
18. Requirements
+ Identify SMS conversations
o messages sent or received with one other party
o interval between messages < 10 minutes
o at least three messages exchanged
+ Provide result as
o ratio of conversational to non-conversational SMS
o per subscriber
o per month
20. Filter
+ Read events from CDR files
o records are in chronological order
o read files in chronological order
+ Discard non-SMS events
+ Distribute SMS events to Map processes
o Consistent distribution by subscriber
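A hedged sketch of the filter step. The CDR shape (a map with a type field) and RouteFun are illustrative assumptions; RouteFun stands in for the hashing scheme on the next slide and returns the map process that should receive the event.

-module(filter_sketch).
-export([sms_only/1, dispatch/2]).

%% Keep only SMS events from a chronologically ordered list of CDRs.
sms_only(CDRs) ->
    [CDR || CDR = #{type := Type} <- CDRs, Type =:= sms].

%% Send each SMS event to the map process chosen by RouteFun, so that all
%% of a subscriber's events end up on the same process.
dispatch(CDRs, RouteFun) ->
    lists:foreach(fun(CDR) -> RouteFun(CDR) ! CDR end, sms_only(CDRs)),
    ok.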
21. Hashing
+ To analyze the interval between
messages, one process must
handle all events for a
particular subscriber
+ Simple Hash:
o M = last four digits of subscriber’s
mobile number
o N = number of processes available
o Pid = M rem N
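A hedged Erlang sketch of that hash, assuming the subscriber number is a plain digit string and the available map processes are held in a tuple; only the "last four digits rem N" rule comes from the slide.

-module(hash_sketch).
-export([worker_for/2]).

%% Route a subscriber to one of N worker processes: M is the last four
%% digits of the mobile number, the worker index is M rem N.
worker_for(Msisdn, Workers) when is_list(Msisdn), is_tuple(Workers) ->
    LastFour = lists:reverse(lists:sublist(lists:reverse(Msisdn), 4)),
    M = list_to_integer(LastFour),
    N = tuple_size(Workers),
    element((M rem N) + 1, Workers).   %% element/2 is 1-based

With Workers = {W0, W1, W2, W3} and Msisdn = "16135551234", M is 1234, 1234 rem 4 is 2, and the third element (W2) handles that subscriber.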
22. Map
+ Read subscriber’s stored data
+ Find other party in set
+ Increment total count of messages
+ Was the previous message < 10 minutes ago?
o Was the next previous message < 10 minutes before the previous?
o If both, increment the conversational message count
+ Update the previous and next previous times
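A hedged Erlang sketch of that per-message update. The #party{} record shape and the function name are assumptions; only the 10-minute rule and the counting logic come from the slide, and (as noted in the speaker notes) the first two messages of a conversation are not counted.

-module(map_sketch).
-export([handle_sms/3]).

%% Per-(subscriber, other party) state: message counts plus the timestamps
%% (in seconds) of the previous and next previous messages exchanged.
-record(party, {total = 0, conv = 0, prev, next_prev}).

-define(GAP, 600).   %% 10 minutes

%% Update one subscriber's state (a map of OtherParty => #party{}) for one
%% SMS. As on the slide, the first two messages of a conversation are not
%% counted as conversational.
handle_sms(OtherParty, Time, State) ->
    P0 = maps:get(OtherParty, State, #party{}),
    P1 = P0#party{total = P0#party.total + 1},
    P2 = case {P1#party.prev, P1#party.next_prev} of
             {Prev, NextPrev} when is_integer(Prev), is_integer(NextPrev),
                                   Time - Prev < ?GAP, Prev - NextPrev < ?GAP ->
                 P1#party{conv = P1#party.conv + 1};
             _ ->
                 P1
         end,
    maps:put(OtherParty, P2#party{prev = Time, next_prev = P1#party.prev}, State).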
24. Interim Data
+ We are using an in-memory key value store
+ The key is the subscriber number
+ The value is a set of OtherParty
+ OtherParty data structure contains counts
+ When the map is complete we transfer the
data to disk for persistence
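A hedged ETS sketch of the interim store: the key is the subscriber number, the value is that subscriber's OtherParty state as built by the map sketch above. ets:new/2, ets:lookup/2, ets:insert/2 and ets:tab2file/2 are real OTP calls; the table name, filename and the dependence on map_sketch are assumptions.

-module(store_sketch).
-export([new/0, update/4, persist/2]).

%% In-memory key/value store: the key is the subscriber number, the value
%% is that subscriber's OtherParty state as built by map_sketch above.
new() ->
    ets:new(interim, [set, public]).

update(Table, Subscriber, OtherParty, Time) ->
    State = case ets:lookup(Table, Subscriber) of
                [{Subscriber, S}] -> S;
                []                -> #{}
            end,
    ets:insert(Table, {Subscriber, map_sketch:handle_sms(OtherParty, Time, State)}).

%% When the map is complete, copy the table to disk so the reduce workers
%% can read it even if this worker goes away.
persist(Table, Filename) ->
    ok = ets:tab2file(Table, Filename).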
25. Reduce
+ Collect intermediate data
from disk copies
+ Iterate through all parties for
each subscriber
+ Total all party counts
+ Provide result as percentage
of conversational messages
to total messages
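A hedged sketch of the reduce step, assuming the map workers persisted their ETS tables with ets:tab2file/2 as above. The filenames and module names are assumptions.

-module(reduce_sketch).
-export([percentages/1]).

%% Same record as in map_sketch; in real code it would live in a shared
%% header (.hrl) rather than being repeated.
-record(party, {total = 0, conv = 0, prev, next_prev}).

%% Read the interim tables the map workers saved to disk and produce, per
%% subscriber, the percentage of messages that were part of a conversation.
percentages(Filenames) ->
    lists:foldl(fun(File, Acc) ->
                        {ok, Tab} = ets:file2tab(File),
                        Acc1 = ets:foldl(fun add_subscriber/2, Acc, Tab),
                        ets:delete(Tab),
                        Acc1
                end, #{}, Filenames).

add_subscriber({Subscriber, Parties}, Acc) ->
    {Total, Conv} = maps:fold(fun(_Party, #party{total = T, conv = C}, {TA, CA}) ->
                                      {TA + T, CA + C}
                              end, {0, 0}, Parties),
    maps:put(Subscriber, 100 * Conv / Total, Acc).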
26. Big Data Guidelines
+ Find opportunities for concurrency
+ Choose the right containers for your data
+ Use memory as effectively as possible
+ Minimize copying data
+ Avoid any unnecessary overhead
+ Anything you are going to do hundreds of
billions of times should be efficient!
Successfully handling really big data requires massive concurrency, and in the real world that requires fault tolerance.
Google didn’t invent map and reduce, but they were the first to apply the paradigm in a general way on a massive scale.
… or, more probably, a number of results. By dividing the work we can assign it to many servers. This concurrency is what allows scale.
Here is an example of something Google does as part of its core business. Google ranks web sites that are linked to by many other web sites higher in search results (PageRank). To determine this, a map reads web pages found by crawlers and creates key/value pairs. These are written in memory and then pushed out in blocks to disk. A reduce reads these disk blocks and sorts all the intermediate data by key. The reduce function then iterates over all the pairs for a key and outputs one result for each key.
The results from one MapReduce can, and often are, provided as input for further MapReduce runs.
Something like RAID, maybe a Redundant Array of Inexpensive Servers (RAIS)? They can and do fail individually without the system failing.
The user process forks all of the other processes which will be used, including a master process. The master then assigns those processes work to perform, in either map or reduce roles.
The master process monitors each worker by pinging it periodically. When it detects that a server has failed (or is no longer reachable), it reassigns that server’s work to another worker. After this reassignment, each of the reduce workers is notified to ignore the failed server and instead get the interim data from the newly assigned server.
This is a contrived example.
That’s billion with a ‘B’. In Canada that’s 1,000 million.
There is an obvious hole in this pseudocode: the first two messages of a conversation are not included in the conversational totals. I could have accommodated that, but I left it out to keep the example as simple as possible.