BIG DATA CONCEPTS

Serkan ÖZAL
Middle East Technical University
Ankara/TURKEY
October 2013
Contents
 Big Data and Scalability
 NoSQL
   Column Stores
   Key-Value Stores
   Document Stores
   Graph Database Systems
 Batch Data Processing
   MapReduce
   Hadoop
 Running Analytical Queries over Offline Big Data
   Hive
   Pig
 RealTime Data Processing
   Storm

Big Data
 Collect.
 Store.
 Organize.
 Analyze.
 Share.

Data growth outruns the ability to manage it so we
need scalable solutions.

Scalability - Vertical
 Increasing server capacity.
 Adding more CPU, RAM.
 Managing is hard.
 Possible downtime.

Scalability - Horizontal
 Adding servers to existing system with little effort, aka elastically scalable.
 Bugs, hardware errors, things fail all the time.
 It should become cheaper. Cost efficiency.
 Shared nothing.
 Use of commodity/cheap hardware.
 Heterogeneous systems.
 Controlled concurrency (avoid locks).
 Service Oriented Architecture. Local states.
 Decentralized to reduce bottlenecks.
 Avoid single points of failure.
 Asynchrony.
 Symmetry: you don’t have to know what is happening. All nodes should be symmetric.

NoSQL
 Stands for Not Only SQL
 Class of non-relational data storage systems
 Usually do not require a fixed table schema, nor do they use the concept of joins
 Offers BASE properties instead of ACID (Atomicity, Consistency, Isolation, Durability) properties
   Basically Available: Nodes in a distributed environment can go down, but the whole system shouldn’t be affected.
   Soft State (scalable): The state of the system and data changes over time.
   Eventual Consistency: Given enough time, data will be consistent across the distributed system.

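Eventual consistency can be illustrated with a minimal sketch. The `Replica` class, the integer version numbers, and the `anti_entropy` helper are hypothetical names chosen for illustration; real systems use more elaborate schemes (vector clocks, read repair, gossip), so this is only one simple way the idea might look.

```python
class Replica:
    """Toy replica that keeps the newest version of each key (last write wins)."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value, version):
        current = self.data.get(key)
        # Accept the write only if it is newer than what we already hold.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Exchange entries so every replica ends up with the newest version of each key."""
    for source in replicas:
        for key, (version, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, version)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], version=1)  # the write reaches only replica a at first
print(b.read("cart"))                 # stale read: None
anti_entropy([a, b])                  # replicas synchronize "given enough time"
print(b.read("cart"))                 # ['book']
```

Between the write and the synchronization step, a reader of replica `b` sees stale data; that window is the price paid for staying available when nodes cannot reach each other.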
What is Wrong With RDBMS?
 One size fits all? Not really.
 Rigid schema design.
 Harder to scale.
 Replication.
 Joins across multiple nodes? Hard.
 How does RDBMS handle data growth? Hard.

Advantages & Disadvantages
 Advantages
   Flexible schema
   Quicker/cheaper to set up
   Massive scalability
   Eventual consistency -> higher performance & availability
 Disadvantages
   No declarative query language -> more programming
   Eventual consistency -> fewer guarantees

NoSQL Systems
 Several incarnations
 Column Stores
 Key-Value Stores
 Document Stores
 Graph Databases

Column Stores
 Each storage block contains data from only one column
 More efficient than row (or document) store if:
   Multiple rows/records/documents are inserted at the same time so updates of column blocks can be aggregated
   Retrievals access only some of the columns in a row/record/document
 Example Systems
   HBase

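The difference between row and column layout can be sketched in a few lines. The sample records and the `to_columns` helper are hypothetical and only illustrate the idea; they do not reflect how HBase stores data on disk.

```python
# Hypothetical toy data; the field names are illustrative only.
rows = [
    {"id": 1, "name": "ada", "age": 36},
    {"id": 2, "name": "bob", "age": 41},
    {"id": 3, "name": "cem", "age": 29},
]


def to_columns(records):
    """Pivot row-oriented records into one list ("block") per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


columns = to_columns(rows)

# A query over a single column touches only that column's block,
# instead of reading every full record.
average_age = sum(columns["age"]) / len(columns["age"])
print(columns["age"], average_age)
```

Inserting several records at once appends to each column block in one pass, which is the aggregation benefit mentioned above.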
Key-Value Stores
 Extremely simple interface
   Data model: (key, value) pairs
   Operations: Insert(key,value), Fetch(key), Update(key), Delete(key)
 Implementation: efficiency, scalability, fault tolerance
   Records distributed to nodes based on key
   Replication
   Single-record transactions, “eventual consistency”
 Example systems
   Google BigTable, Amazon Dynamo, Cassandra, Voldemort, …

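A minimal sketch of how records might be distributed to nodes by key and copied to replicas is given below. `TinyKeyValueCluster`, the modulo-based placement, and the default node and replica counts are assumptions made for illustration; production systems such as Dynamo or Cassandra use consistent hashing and asynchronous replication instead.

```python
import hashlib


class TinyKeyValueCluster:
    """Toy key-value store: each key is hashed to a node and copied to replicas."""

    def __init__(self, node_count=4, replicas=2):
        self.nodes = [dict() for _ in range(node_count)]
        self.replicas = replicas

    def _owners(self, key):
        # A stable hash, so the same key always maps to the same nodes.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def insert(self, key, value):
        for index in self._owners(key):
            self.nodes[index][key] = value

    update = insert  # in this sketch an update simply overwrites the value

    def fetch(self, key):
        # Read from the first owner that holds the key.
        for index in self._owners(key):
            if key in self.nodes[index]:
                return self.nodes[index][key]
        return None

    def delete(self, key):
        for index in self._owners(key):
            self.nodes[index].pop(key, None)


cluster = TinyKeyValueCluster()
cluster.insert("user:42", "alice")
print(cluster.fetch("user:42"))  # alice
cluster.delete("user:42")
print(cluster.fetch("user:42"))  # None
```

This sketch writes to all replicas synchronously, so it does not show eventual consistency; it only illustrates key-based partitioning and replication.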
Document Stores
 Like Key-Value Stores except value is document
   Data model: (key, document) pairs
   Document: JSON, XML, other semistructured formats
   Basic operations: Insert(key,document), Fetch(key), Update(key), Delete(key)
   Also Fetch based on document contents
 Example systems
   CouchDB, MongoDB, SimpleDB, …

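The extra capability of a document store, fetching by document contents, can be sketched as follows. `TinyDocumentStore` and its `find` method are hypothetical and match only top-level fields; they are not the API of CouchDB, MongoDB, or SimpleDB.

```python
import json


class TinyDocumentStore:
    """Toy document store holding JSON-like documents under string keys."""

    def __init__(self):
        self._docs = {}

    def insert(self, key, document):
        # Round-trip through JSON to check serializability and store a copy.
        self._docs[key] = json.loads(json.dumps(document))

    def fetch(self, key):
        return self._docs.get(key)

    def delete(self, key):
        self._docs.pop(key, None)

    def find(self, **criteria):
        """Return keys of documents whose top-level fields match all criteria."""
        return [
            key
            for key, doc in self._docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]


store = TinyDocumentStore()
store.insert("u1", {"name": "Ada", "city": "Ankara", "tags": ["admin"]})
store.insert("u2", {"name": "Bob", "city": "Izmir"})
print(store.find(city="Ankara"))  # ['u1']
```

Note that the two documents have different fields, which is the flexible-schema property: nothing forces `u2` to carry a `tags` field.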
Graph Database Systems
 A graph is a collection of nodes (things) and edges (relationships) that connect pairs of nodes.
 Attach properties (key-value pairs) on nodes and relationships
 Relationships connect two nodes and both nodes and relationships can hold an arbitrary number of key-value pairs.
 A graph database can be thought of as a key-value store, with full support for relationships.
 Example systems
   Neo4j, FlockDB, Pregel, …

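A property graph of this kind can be sketched with plain dictionaries and lists. `TinyGraph`, the `KNOWS` relationship type, and the sample people are hypothetical; real graph databases add indexing, persistence, and a query language on top of this model.

```python
class TinyGraph:
    """Toy property graph: nodes and relationships both carry key-value pairs."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = []  # (source id, relationship type, target id, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, rel_type, target, **properties):
        self.edges.append((source, rel_type, target, properties))

    def neighbors(self, source, rel_type):
        """Follow outgoing relationships of one type from a node."""
        return [t for s, r, t, _ in self.edges if s == source and r == rel_type]


graph = TinyGraph()
graph.add_node("alice", age=30)
graph.add_node("bob", age=25)
graph.add_node("carol", age=35)
graph.add_edge("alice", "KNOWS", "bob", since=2010)
graph.add_edge("bob", "KNOWS", "carol", since=2012)

# Friends of friends: follow KNOWS twice starting from alice.
friends_of_friends = [
    c for b in graph.neighbors("alice", "KNOWS") for c in graph.neighbors(b, "KNOWS")
]
print(friends_of_friends)  # ['carol']
```

Traversals like this one are where a graph model tends to be more natural than expressing the same question as repeated joins.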
Batch Data Processing
 Batch data processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time.
 Data is collected, entered, processed and then the batch results are produced.
 Data is processed by worker nodes and managed by a master node.
 Analytical queries can be executed in a distributed fashion over offline big data. User queries are converted to MapReduce jobs that run on the cluster.
   Hive: Supports an SQL-like query language
   Pig: More like a scripting language

Map-Reduce
 MapReduce is a programming model and software framework first developed by Google (Google’s MapReduce paper was published in 2004)
 A method for distributing a task across multiple nodes
 Each node processes data assigned to that node
 Consists of two developer-created phases
   Map
   Reduce
 Nice features of MapReduce systems include:
   reliably processing a job even when machines die
   parallelization on thousands of machines

Map
 "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1) // emit(key, value)

Reduce
 "Reduce" step: After the map step, the master node collects the answers of all the sub-problems and groups all values by their keys. Each unique key and its values are sent to a distinct reducer. This means that each reducer processes a distinct key and its values independently of the other reducers.

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)

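The word-count pseudocode can be turned into a small single-process sketch. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical; a real framework would run the map and reduce calls on different machines and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(name, document):
    """Map: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, partial_counts):
    """Reduce: sum the partial counts for one word."""
    return word, sum(partial_counts)


def word_count(documents):
    pairs = [
        pair for name, text in documents.items() for pair in map_phase(name, text)
    ]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


documents = {"a.txt": "big data big ideas", "b.txt": "big clusters"}
print(word_count(documents))
# {'big': 3, 'data': 1, 'ideas': 1, 'clusters': 1}
```

Because each `reduce_phase` call sees only one key and its values, the calls are independent and could run in parallel on different nodes.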
Map-Reduce Overview

Hadoop
 Hadoop is a scalable fault-tolerant distributed system for data storage and processing.
 Core Hadoop has two main components:
   (1) HDFS: Hadoop Distributed File System is a reliable, self-healing, redundant, high-bandwidth, distributed file system optimized for large files.
   (2) MapReduce: Fault-tolerant distributed processing
 Operates on unstructured and structured data
 Open source under the friendly Apache License

Hadoop Architecture

HDFS
 Inspired by Google File System
 Scalable, distributed, portable file system written in Java for Hadoop framework
 Primary distributed storage used by Hadoop applications
 HDFS can be part of a Hadoop cluster or can be a stand-alone general purpose distributed file system
 An HDFS cluster primarily consists of
   NameNode that manages file system metadata
   DataNode that stores actual data
 Stores very large files in blocks across machines in a large cluster
 Reliability and fault tolerance ensured by replicating data across multiple hosts

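The block and replica idea can be sketched as a mapping that a NameNode-like component might keep. The `place_blocks` function, the round-robin placement, the 128 MB block size, and the replication factor of 3 are assumptions for illustration; actual HDFS placement is configurable and rack-aware.

```python
def place_blocks(file_size, block_size, datanodes, replication=3):
    """Split a file into blocks and pick distinct DataNodes for each block's replicas."""
    block_count = -(-file_size // block_size)  # ceiling division
    copies = min(replication, len(datanodes))
    placement = {}
    for block_id in range(block_count):
        start = block_id % len(datanodes)  # simple round-robin, not rack-aware
        placement[block_id] = [
            datanodes[(start + offset) % len(datanodes)] for offset in range(copies)
        ]
    return placement


MB = 1024 * 1024
# Metadata of the kind a NameNode tracks: block id -> hosts holding a replica.
metadata = place_blocks(300 * MB, 128 * MB, ["dn1", "dn2", "dn3", "dn4"])
for block_id, hosts in metadata.items():
    print(block_id, hosts)
```

With every block stored on several hosts, losing one DataNode leaves at least one other copy of each block available for reads and re-replication.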
HDFS Architecture

Running Analytical Queries over Offline Big Data
 Hadoop is great for large-data processing!
 But writing Java programs for everything is verbose and slow
 Not everyone wants to (or can) write Java code
 Solution: develop higher-level data processing languages
   Hive: HQL is like SQL
   Pig: Pig Latin is a bit like Perl

Hive
A system for querying and managing structured data built on top of Hadoop
 Uses Map-Reduce for execution
 HDFS for storage, but works with any system that implements the Hadoop FS API
Key Building Principles:
 Structured data with rich data types (structs, lists and maps)
 Directly query data from different formats (text/binary) and file formats (Flat/Sequence)
 SQL as a familiar programming tool and for standard analytics
 Allow embedded scripts for extensibility
Hive - Example
SELECT foo FROM sample WHERE ds='2012-02-24';

-- Create HDFS dir for the output
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';

-- Create local dir for the output
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;

SELECT MAX(foo) FROM sample;
SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;

-- Hive allows the FROM clause to come first
-- Store the results into a table
FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;

Pig
 A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
 Compiles down to MapReduce jobs
 Developed by Yahoo!
 Open-source language
 Common design patterns as key words (joins, distinct, counts)
 Data flow analysis
   A script can map to multiple map-reduce jobs
 Can be interactive mode
   Issue commands and get results

Pig - Example
-- Read file from HDFS; the input format is tab-delimited text; define run-time schema
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);

-- Filter the rows on predicates
clean1 = FILTER raw BY id > 20 AND id < 100;

-- For each row, do some transformation
clean2 = FOREACH clean1 GENERATE
    user, time,
    org.apache.pig.tutorial.sanitze(query) as query;

-- Grouping of records
user_groups = GROUP clean2 BY (user, query);

-- Compute aggregation for each group
user_query_counts = FOREACH user_groups
    GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);

-- Store the output in a file (text, comma delimited)
STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');

RealTime Data Processing
 In contrast, real time data processing involves a continual input, process and output of data.
 Size of data is not certain at the beginning of the processing.
 Data must be processed in a small time period (or near real time).
 Storm is one of the most popular open source implementations of realtime data processing.
 Notably used by Twitter.
Storm
 Is a highly distributed realtime computation system.
 Provides general primitives to do realtime computation.
 Can be used with any programming language.
 It's scalable and fault-tolerant.

Architecture – General Overview

Architecture - Components
 Nimbus
   Master node (similar to Hadoop JobTracker)
   Responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.
 ZooKeeper
   Used for cluster coordination between Nimbus and Supervisors.
   Nimbus and Supervisors are fail-fast and stateless; all state is kept in ZooKeeper or on local disk.
 Supervisor
   Runs worker processes.
   The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.

Main Concepts – General Overview

Main Concepts - Components
 Streams
   Unbounded sequence of tuples (input or output)
 Spouts
   Source of streams
   Such as reading data from Twitter streaming API
 Bolts
   Process input streams, do some processing and possibly emit new streams
 Topologies
   Network of spouts and bolts

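The relationship between spouts, bolts, and topologies can be sketched with plain Python generators. This does not use the Storm API; `sentence_spout`, `split_bolt`, and `count_bolt` are hypothetical names, and a short fixed list stands in for an unbounded stream.

```python
from collections import Counter


def sentence_spout():
    """Spout: source of a stream. A finite list stands in for an unbounded feed."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) after each input tuple."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Topology: spout -> split bolt -> count bolt
topology = count_bolt(split_bolt(sentence_spout()))
for emitted in topology:
    print(emitted)
```

Each tuple flows through the whole chain as soon as it is produced, in contrast to a batch job that waits until all input has been collected.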
Thanks for Listening

Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 

Big data concepts

  • 1. BIG DATA CONCEPTS Serkan ÖZAL Middle East Technical University Ankara/TURKEY October 2013
  • 2. Contents  Big Data and Scalability  NoSQL  Column Stores  Key-Value Stores  Document Stores  Graph Database Systems  Batch Data Processing  MapReduce  Hadoop  Running Analytical Queries over Offline Big Data  Hive  Pig  RealTime Data Processing  Storm
  • 3. Big Data  Collect.  Store.  Organize.  Analyze.  Share. Data growth outruns the ability to manage it, so we need scalable solutions.
  • 4. Scalability - Vertical  Increasing server capacity.  Adding more CPU, RAM.  Managing is hard.  Possible downtime.
  • 5. Scalability - Horizontal  Adding servers to an existing system with little effort, aka elastically scalable.  Bugs, hardware errors, things fail all the time.  It should become cheaper. Cost efficiency.  Shared nothing.  Use of commodity/cheap hardware.  Heterogeneous systems.  Controlled concurrency (avoid locks).  Service Oriented Architecture. Local states.  Decentralized to reduce bottlenecks.  Avoid single points of failure.  Asynchrony.  Symmetry: you don’t have to know what is happening. All nodes should be symmetric.
  • 6. NoSQL  Stands for Not Only SQL  Class of non-relational data storage systems  Usually do not require a fixed table schema, nor do they use the concept of joins  Offers BASE properties instead of ACID (Atomicity, Consistency, Isolation, Durability) properties  Basically Available: Nodes in a distributed environment can go down, but the whole system shouldn’t be affected.  Soft State (scalable): The state of the system and data changes over time.  Eventual Consistency: Given enough time, data will be consistent across the distributed system.
  • 7. What is Wrong With RDBMS?  One size fits all? Not really.  Rigid schema design.  Harder to scale.  Replication.  Joins across multiple nodes? Hard.  How does an RDBMS handle data growth? Hard.
  • 8. Advantages & Disadvantages  Advantages  Flexible schema  Quicker/cheaper to set up  Massive scalability  Eventual consistency: higher performance & availability  Disadvantages  No declarative query language: more programming  Eventual consistency: fewer guarantees
  • 9. NoSQL Systems  Several incarnations  Column Stores  Key-Value Stores  Document Stores  Graph Databases
  • 10. Column Stores  Each storage block contains data from only one column  More efficient than a row (or document) store if:  Multiple rows/records/documents are inserted at the same time, so updates of column blocks can be aggregated  Retrievals access only some of the columns in a row/record/document  Example Systems  HBase
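The retrieval point on the column-store slide can be illustrated with a minimal sketch. The records and field names below are hypothetical, and plain Python lists stand in for storage blocks; this is not how HBase or any real system lays out data on disk.

```python
# Hypothetical records; the field names are illustrative only.
rows = [
    {"id": 1, "name": "ada", "age": 36},
    {"id": 2, "name": "bob", "age": 41},
    {"id": 3, "name": "eve", "age": 29},
]


def to_columns(records):
    """Pivot row-oriented records into one list ("block") per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def average_age_row_store(records):
    # Every full record is visited even though only "age" is needed.
    return sum(record["age"] for record in records) / len(records)


def average_age_column_store(columns):
    # Only the "age" block is read; "id" and "name" are never touched.
    ages = columns["age"]
    return sum(ages) / len(ages)


columns = to_columns(rows)
```

Both functions return the same average; the difference is how much data each one has to read to get there.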
  • 11. Key-Value Stores  Extremely simple interface  Data model: (key, value) pairs  Operations: Insert(key,value), Fetch(key), Update(key), Delete(key)  Implementation: efficiency, scalability, fault-tolerance  Records distributed to nodes based on key  Replication  Single-record transactions, “eventual consistency”  Example systems  Google BigTable, Amazon Dynamo, Cassandra, Voldemort, …
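As a rough sketch of the key-value interface and of "records distributed to nodes based on key", the toy class below hashes each key to pick an owning node and copies the record to the next node as a replica. The class name, the modulo placement and the synchronous replication are all assumptions made for illustration; real systems such as Dynamo use consistent hashing and asynchronous replication, which is where eventual consistency comes from.

```python
import hashlib


class TinyKVCluster:
    """In-memory stand-in for a cluster; each dict plays the role of a node."""

    def __init__(self, node_count=3, replicas=2):
        self.nodes = [dict() for _ in range(node_count)]
        self.replicas = replicas

    def _owners(self, key):
        # Hash the key to choose the first node, then take the following
        # nodes (wrapping around) as replicas.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        first = int(digest, 16) % len(self.nodes)
        return [(first + i) % len(self.nodes) for i in range(self.replicas)]

    def insert(self, key, value):
        for index in self._owners(key):
            self.nodes[index][key] = value

    def fetch(self, key):
        for index in self._owners(key):
            if key in self.nodes[index]:
                return self.nodes[index][key]
        return None

    def update(self, key, value):
        if self.fetch(key) is None:
            raise KeyError(key)
        self.insert(key, value)

    def delete(self, key):
        for index in self._owners(key):
            self.nodes[index].pop(key, None)
```

Because every operation touches a single key, no operation ever needs to coordinate across unrelated nodes, which is what makes this model easy to scale out.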
  • 12. Document Stores  Like Key-Value Stores, except the value is a document  Data model: (key, document) pairs  Document: JSON, XML, other semistructured formats  Basic operations: Insert(key,document), Fetch(key), Update(key), Delete(key)  Also Fetch based on document contents  Example systems  CouchDB, MongoDB, SimpleDB, …
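The extra capability of a document store, fetching by document contents rather than only by key, might be sketched as follows. `TinyDocumentStore` and its `find` method are hypothetical names, documents are plain Python dicts, and the linear scan stands in for the indexes a real system would use.

```python
class TinyDocumentStore:
    """Toy (key, document) store where documents are plain dicts."""

    def __init__(self):
        self._docs = {}

    def insert(self, key, document):
        self._docs[key] = document

    def fetch(self, key):
        return self._docs.get(key)

    def delete(self, key):
        self._docs.pop(key, None)

    def find(self, **criteria):
        # Fetch based on document contents: return the keys of all
        # documents whose fields match every given criterion.
        return [
            key
            for key, doc in self._docs.items()
            if all(doc.get(field) == value for field, value in criteria.items())
        ]
```

For example, after inserting a few user documents, `store.find(city="Ankara")` would return the keys of the documents whose `city` field equals `"Ankara"`.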
  • 13. Graph Database Systems  A graph is a collection of nodes (things) and edges (relationships) that connect pairs of nodes.  Attach properties (key-value pairs) to nodes and relationships  Relationships connect two nodes, and both nodes and relationships can hold an arbitrary number of key-value pairs.  A graph database can be thought of as a key-value store with full support for relationships.  Example systems  Neo4j, FlockDB, Pregel, …
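The property-graph model described on this slide can be sketched in a few lines. `TinyGraph` and its methods are hypothetical and unrelated to the Neo4j or FlockDB APIs; the sketch only shows nodes and relationships both carrying key-value properties.

```python
class TinyGraph:
    """Toy property graph: nodes and edges both hold key-value properties."""

    def __init__(self):
        self.nodes = {}   # node id -> properties
        self.edges = []   # (source, label, target, properties)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, label, target, **properties):
        self.edges.append((source, label, target, properties))

    def neighbors(self, node_id, label=None):
        # Follow outgoing relationships, optionally filtered by label.
        return [
            target
            for source, edge_label, target, _ in self.edges
            if source == node_id and (label is None or edge_label == label)
        ]
```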
  • 14. Batch Data Processing  Batch data processing is an efficient way of processing high volumes of data, where a group of transactions is collected over a period of time.  Data is collected, entered and processed, and then the batch results are produced.  Data is processed by worker nodes and managed by a master node.  Analytical queries can be executed in a distributed manner over offline big data. User queries are converted to MapReduce jobs that run on the cluster.  Hive: Supports an SQL-like query language  Pig: More like a scripting language
  • 15. Map-Reduce  MapReduce is a programming model and software framework first developed by Google (Google’s MapReduce paper was published in 2004)  A method for distributing a task across multiple nodes  Each node processes data assigned to that node  Consists of two developer-created phases  Map  Reduce  Nice features of MapReduce systems include:  reliably processing a job even when machines die  parallelization on thousands of machines
  • 16. Map  "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.
      function map(String name, String document):
        // name: document name
        // document: document contents
        for each word w in document:
          emit (w, 1)  // emit(key, value)
  • 17. Reduce  "Reduce" step: After the map step, the master node collects the answers to all the sub-problems and groups the values by key. Each unique key and its values are sent to a distinct reducer, so each reducer processes a distinct key and its values independently of the others.
      function reduce(String word, Iterator partialCounts):
        // word: a word
        // partialCounts: a list of aggregated partial counts
        sum = 0
        for each pc in partialCounts:
          sum += ParseInt(pc)
        emit (word, sum)
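The word-count pseudocode from the Map and Reduce slides can be run on a single machine as a sketch. The `shuffle` step and the driver `word_count` are additions assumed here to stand in for what the framework does between the two phases; nothing below is Hadoop API.

```python
from collections import defaultdict


def map_phase(name, document):
    """Emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, partial_counts):
    """Sum the partial counts collected for one word."""
    return (word, sum(partial_counts))


def word_count(documents):
    # Run map over every document, then reduce each key independently.
    pairs = []
    for name, text in documents.items():
        pairs.extend(map_phase(name, text))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())
```

In a real cluster each call to `map_phase` and `reduce_phase` could run on a different worker, since neither depends on the others' state.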
  • 19. Hadoop  Hadoop is a scalable fault-tolerant distributed system for data storage and processing.  Core Hadoop has two main components: (1) HDFS: Hadoop Distributed File System is a reliable, self-healing, redundant, high-bandwidth, distributed file system optimized for large files. (2) MapReduce: Fault-tolerant distributed processing  Operates on unstructured and structured data  Open source under the friendly Apache License
  • 21. HDFS  Inspired by Google File System  Scalable, distributed, portable file system written in Java for the Hadoop framework  Primary distributed storage used by Hadoop applications  HDFS can be part of a Hadoop cluster or can be a stand-alone general purpose distributed file system  An HDFS cluster primarily consists of  a NameNode that manages file system metadata  DataNodes that store the actual data  Stores very large files in blocks across machines in a large cluster  Reliability and fault tolerance ensured by replicating data across multiple hosts
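The last two points, files stored as blocks and blocks replicated across hosts, can be sketched as below. The round-robin placement is an assumption chosen for simplicity and is not HDFS's actual rack-aware placement policy; the function names, sizes and DataNode names are hypothetical.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size would occupy."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(block_count, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes (round robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    placement = {}
    for block in range(block_count):
        placement[block] = [
            datanodes[(block + i) % len(datanodes)] for i in range(replication)
        ]
    return placement
```

The mapping returned by `place_replicas` is the kind of metadata a NameNode keeps, while the DataNodes hold the block contents themselves.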
  • 23. Running Analytical Queries over Offline Big Data  Hadoop is great for large-data processing!  But writing Java programs for everything is verbose and slow  Not everyone wants to (or can) write Java code  Solution: develop higher-level data processing languages  Hive: HQL is like SQL  Pig: Pig Latin is a bit like Perl
  • 24. Hive  A system for querying and managing structured data built on top of Hadoop  Uses Map-Reduce for execution  HDFS for storage – but any system that implements Hadoop FS API  Key Building Principles:  Structured data with rich data types (structs, lists and maps)  Directly query data from different formats (text/binary) and file formats (Flat/Sequence)  SQL as a familiar programming tool and for standard analytics  Allow embedded scripts for extensibility and for …
  • 25. Hive - Example
      SELECT foo FROM sample WHERE ds='2012-02-24';
      -- Create HDFS dir for the output
      INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT * FROM sample WHERE ds='2012-02-24';
      -- Create local dir for the output
      INSERT OVERWRITE LOCAL DIRECTORY '/tmp/hive-sample-out' SELECT * FROM sample;
      SELECT MAX(foo) FROM sample;
      SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
      -- Hive allows the FROM clause to come first. Store the results into a table:
      FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar;
  • 26. Pig  A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.  Compiles down to MapReduce jobs  Developed by Yahoo!  Open-source language  Common design patterns as key words (joins, distinct, counts)  Data flow analysis  A script can map to multiple map-reduce jobs  Can be used in interactive mode  Issue commands and get results
  • 27. Pig - Example
      -- Read file from HDFS; the input format is text, tab delimited; define run-time schema
      raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query);
      -- Filter the rows on predicates
      clean1 = FILTER raw BY id > 20 AND id < 100;
      -- For each row, do some transformation
      clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) as query;
      -- Grouping of records
      user_groups = GROUP clean2 BY (user, query);
      -- Compute aggregation for each group
      user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time);
      -- Store the output in a file; text, comma delimited
      STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');
  • 28. RealTime Data Processing  In contrast, real time data processing involves a continual input, process and output of data.  The size of the data is not known at the beginning of the processing.  Data must be processed in a small time period (or near real time).  Storm is one of the most popular open source implementations of realtime data processing, used most notably by Twitter.
  • 29. Storm  Is a highly distributed realtime computation system.  Provides general primitives to do realtime computation.  Can be used with any programming language.  It's scalable and fault-tolerant.
  • 31. Architecture - Components  Nimbus  Master node (similar to Hadoop JobTracker)  Responsible for distributing code around the cluster, assigning tasks to machines, and monitoring for failures.  ZooKeeper  Used for cluster coordination between Nimbus and Supervisors.  Nimbus and Supervisors are fail-fast and stateless; all state is kept in ZooKeeper or on local disk.  Supervisor  Runs worker processes.  The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.
  • 32. Main Concepts – General Overview
  • 33. Main Concepts - Components  Streams  Unbounded sequence of tuples (input or output data)  Spouts  Source of streams  Such as reading data from the Twitter streaming API  Bolts  Consume input streams, do some processing and possibly emit new streams  Topologies  Network of spouts and bolts
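The relationship between streams, spouts, bolts and topologies can be sketched with Python generators. This is not the Storm API: the function names are hypothetical, the "stream" is a finite list rather than an unbounded source, and everything runs in one process instead of being distributed by Nimbus.

```python
def sentence_spout(sentences):
    """Spout: the source of the stream, emitting one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: consumes sentence tuples and emits one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keeps a running count and emits (word, count) as tuples arrive."""
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])


def run_topology(sentences):
    # Topology: the spout wired to the two bolts in sequence.
    return list(count_bolt(split_bolt(sentence_spout(sentences))))
```

Unlike the batch word count, results are emitted continuously as each tuple flows through, which is the essential difference between the realtime and batch models in these slides.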