Hadoop workshop
Cloud Connect Shanghai
Sep 15, 2013

Ari Flink – Operations Architect
Mac Fang – Manager, Hadoop development
Dean Zhu – Hadoop Developer
Agenda
1. Introductions (5 minutes)
2. Hadoop and Big Data Concepts (20 minutes)
3. Cisco Webex Hadoop architecture (10 minutes)
4. Cisco UCS Hadoop Common Platform Architecture (10 minutes)
5. Exercise 1 (30 minutes)
– Configure a Hadoop single node VM on a laptop

6. Hive and Impala concepts (15 minutes)
7. Exercise 2 (30 minutes)
– Analytics using Apache Hive and Cloudera Impala

8. Q & A
Cloud Connect 2013 Shanghai

© 2013 Cisco and/or its affiliates. All rights reserved.

Hadoop and Big Data Overview
– Enterprise data management and big data
– Problems, Opportunities and Use case examples
– Hadoop architecture concepts

What is Big Data?
For our purposes, big data refers to distributed computing
architectures specifically aimed at the “3 V’s” of data: Volume,
Velocity, and Variety

Traditional Enterprise Data Management

[Diagram: Operational (OLTP) systems -> ETL -> EDW -> BI/Reports (batch processing)]

OLTP: Online Transactional Processing
ETL: Extract, Transform, and Load
EDW: Enterprise Data Warehouse
BI: Business Intelligence
Traditional Business Intelligence Questions
Transactional Data (e.g. OLTP)
Real-time, but limited reporting/analytics
• What are the top 5 most active stocks traded in the last hour?
• How many new purchase orders have we received since noon?

Enterprise Data Warehouse
High value, structured, indexed, cleansed
• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?
• What were the top 10 most frequently backordered products over the past year?
So what has changed?
The Explosion of Unstructured Data
• 1.8 trillion gigabytes of data was created in 2011
• Approx. 500 quadrillion files
• More than 90% is unstructured data
• Quantity doubles every 2 years
• Most unstructured data is neither stored nor analyzed!

[Chart: GB of data (in billions), 0 to 10,000, for 2005 to 2015, structured vs. unstructured data. Source: Cloudera]
Enterprise Data Management with Big Data
[Diagram components: Operational (OLTP) systems, Web and Machine data sources, ETL, MPP EDW, Big Data (Hadoop, etc.), BI/Reports, Dashboards, In-memory analytics]
Traditional Business Intelligence Questions
Transactional Data (e.g. OLTP)
Fast data, real-time
• What are the top 5 most active stocks traded in the last hour?
• How many new purchase orders have we received since noon?

Enterprise Data Warehouse
High value, structured, indexed, cleansed
• How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?
• What were the top 10 most frequently backordered products over the past year?

Big Data
Lower value, semi-structured, multi-source, raw/"dirty"
• Which products do customers click on the most and/or spend the most time browsing without buying?
• How do we optimally set pricing for each product in each store for individual customers every day?
• Did the recent marketing launch generate the expected online buzz, and did that translate to sales?
Example: Web and Location Analytics
An iPhone searches Amazon for Vizio TVs in Electronics

1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET
http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… "Mozilla/5.0 (iPhone;
CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko)
Version/5.1 Mobile/9A405 Safari/7534.48.3"
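The proxy log line above can be turned into structured fields before it is loaded into Hadoop. The following is a minimal sketch, assuming the field order seen in that one line (timestamp, client IP, cache result/HTTP status, bytes, method, URL, quoted user agent); the regular expression and the `parse` helper are illustrative names, not part of the original deck, and the truncated URL is kept exactly as shown.

```python
import re
from datetime import datetime, timezone

# The sample line from the slide, with the elided URL left as-is.
LOG_LINE = (
    '1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET '
    'http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… '
    '"Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) '
    'AppleWebKit/534.46 (KHTML, like Gecko) '
    'Version/5.1 Mobile/9A405 Safari/7534.48.3"'
)

# Assumed layout: ts client result/status bytes method url "user agent"
PATTERN = re.compile(
    r'(?P<ts>\d+\.\d+) (?P<client>\S+) (?P<result>\w+)/(?P<status>\d{3}) '
    r'(?P<bytes>\d+) (?P<method>\S+) (?P<url>\S+) "(?P<agent>[^"]*)"'
)

def parse(line):
    """Return a dict of fields, or None if the line does not match."""
    match = PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the epoch timestamp to a UTC datetime.
    record["time"] = datetime.fromtimestamp(float(record["ts"]), tz=timezone.utc)
    # Very rough device classification based on the user agent string.
    record["device"] = "iPhone" if "iPhone" in record["agent"] else "other"
    return record

record = parse(LOG_LINE)
print(record["client"], record["status"], record["device"], record["time"].date())
```

Fields such as client address, device type, and search URL are the raw material for the web and location analytics this slide describes.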

Big Data and Key Infrastructure Attributes
(What big data isn't)
• Usually not blade servers (not enough local storage)
• Usually not virtualized (hypervisor only adds overhead)
• Usually not highly oversubscribed (significant east-west traffic)
• Usually not SAN/NAS

Low-cost, DAS-based, scale-out clustered filesystem
Move the compute to the storage
Cost, Performance, and Capacity
HW:SW $ split 30:70
• Structured data, enterprise database (relational database): $20K/TB
• Structured data, massive scale-out column store: $10K/TB
• Unstructured data, Hadoop and NoSQL: $300-$1K/TB
– Machine logs, web click stream, call data records, satellite feeds, GPS data, sensor readings, sales data, blogs, emails, video
HW:SW $ split 70:30
Big Data Software Architectures

Three basic big data software architectures
Real-time NoSQL
Fast key-value store/retrieve
• HBase (part of Apache Hadoop)*
• DataStax (Cassandra)*
• Oracle NoSQL*
• Amazon Dynamo

Batch-oriented Hadoop
Heavy lifting, processing
• Cloudera*
• MapR*
• Intel Hadoop*
• Pivotal HD*

MPP Relational Database
Scale-out BI/DW
• Greenplum DB (Pivotal DB)*
• ParAccel*
• Vertica
• Netezza
• Teradata

*Cisco Partners
What Is Hadoop?

Hadoop is a distributed, fault-tolerant framework for storing and analyzing data.
Its two primary components are the Hadoop Distributed File System (HDFS) and the MapReduce application engine.
Hadoop Components and Operations
Hadoop Distributed File System (HDFS)
• Scalable and fault tolerant
• Filesystem is distributed, stored across all data nodes in the cluster
• Files are divided into multiple large blocks: 64MB default, typically 128MB to 512MB
• Data is stored reliably. Each block is replicated 3 times by default
• Types of node functions
– Name Node: manages HDFS
– Job Tracker: manages MapReduce jobs
– Data Node/Task Tracker: stores blocks/does work
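The block size and replication factor listed above determine how many blocks a file produces and how much raw disk it consumes. A minimal sketch of that arithmetic, using the defaults named on this slide (64MB blocks, 3 replicas); the function name is illustrative only.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Return (blocks, block replicas, raw MB consumed) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)   # last block may be partial
    replicas = blocks * replication                     # copies tracked by the Name Node
    raw_mb = file_size_mb * replication                 # partial blocks use only their real size
    return blocks, replicas, raw_mb

# A 1 GB file with the default settings.
print(hdfs_footprint(1024))
# The same file with 128MB blocks.
print(hdfs_footprint(1024, block_size_mb=128))
```

Larger blocks mean fewer entries for the Name Node to track, which is one reason 128MB or more is typical in practice.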
[Diagram: a file split into blocks 1 to 6 and spread across data nodes 1 to 13 behind three ToR FEX/switches, alongside a Name Node and a Job Tracker]
HDFS Architecture
[Diagram: data nodes 1 to 15 behind three ToR FEX/switches under one switch; each of blocks 1 to 4 is stored on three different data nodes]

Name Node metadata:
/usr/sean/foo.txt:blk_1,blk_2
/usr/jacob/bar.txt:blk_3,blk_4
Data node 1:blk_1
Data node 2:blk_2, blk_3
Data node 3:blk_3
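The Name Node entries above are effectively two lookup tables: file to blocks, and data node to blocks. A minimal sketch of how a read could resolve a path to block locations, using only the mappings shown on the slide; the dictionaries and the `locate` function are illustrative and not the Name Node's real data structures.

```python
from collections import defaultdict

# File -> ordered list of blocks (as listed on the slide).
file_blocks = {
    "/usr/sean/foo.txt": ["blk_1", "blk_2"],
    "/usr/jacob/bar.txt": ["blk_3", "blk_4"],
}

# Data node -> blocks it holds (only the entries shown on the slide).
node_blocks = {
    "Data node 1": ["blk_1"],
    "Data node 2": ["blk_2", "blk_3"],
    "Data node 3": ["blk_3"],
}

def locate(path):
    """Return {block: [data nodes holding it]} for every block of a file."""
    block_nodes = defaultdict(list)
    for node, blocks in node_blocks.items():
        for block in blocks:
            block_nodes[block].append(node)
    return {block: block_nodes.get(block, []) for block in file_blocks[path]}

print(locate("/usr/sean/foo.txt"))
print(locate("/usr/jacob/bar.txt"))
```

A client asks the Name Node only for this metadata and then reads the block contents directly from the data nodes.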
Rack Awareness
[Diagram: data nodes 1 to 15 grouped into logical “Rack” 1, “Rack” 2, and “Rack” 3, with replicas of blocks 1 to 4 spread across the nodes]
• Rack Awareness gives Hadoop the optional ability to group nodes into logical “racks” (i.e. failure domains)
• Logical “racks” may or may not correspond to physical data center racks
• Blocks are distributed across different “racks” so that the failure of a single “rack” does not take out every replica
• It can also lessen block movement between “racks”
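The placement idea behind Rack Awareness can be sketched in a few lines. This is a simplified, deterministic illustration of the default HDFS policy (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack); real HDFS picks nodes randomly and checks load and free space. The topology dictionary and node names are hypothetical.

```python
def place_replicas(writer, topology):
    """Pick three nodes for one block.

    topology maps a logical rack name to its list of nodes. Assumes at
    least two racks and at least two nodes in the chosen remote rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    # First rack that is not the writer's rack (real HDFS chooses randomly).
    remote_rack = next(rack for rack in topology if rack != local_rack)
    second, third = topology[remote_rack][:2]
    return [writer, second, third]

topology = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn6", "dn7", "dn8"],
    "rack3": ["dn11", "dn12", "dn13"],
}
print(place_replicas("dn1", topology))
```

With this layout the block survives the loss of either rack, while only one copy has to cross the rack boundary on write.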

MapReduce Example: Word Count
Input (three splits)
• the quick brown fox
• the fox ate the mouse
• how now brown cow

Map (one mapper per split emits a count of 1 per word)
• the, 1; quick, 1; brown, 1; fox, 1
• the, 1; fox, 1; ate, 1; the, 1; mouse, 1
• how, 1; now, 1; brown, 1; cow, 1

Shuffle & Sort (groups the pairs by word and routes each word to one reducer)

Reduce / Output
• Reducer 1: brown, 2; fox, 2; how, 1; now, 1; the, 3
• Reducer 2: ate, 1; cow, 1; mouse, 1; quick, 1
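The word count flow can be imitated in plain Python to make the three phases concrete. This is a single-process sketch, not Hadoop code: the function names are illustrative, and a real job would run the map and reduce functions in parallel across Task Trackers.

```python
from collections import defaultdict

def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle_and_sort(pairs):
    """Group values by key and order the keys, as the framework does."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return dict(sorted(groups.items()))

def reduce_phase(groups):
    """Sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle_and_sort(pairs))
print(counts)
```

The mapper and reducer never see the whole data set, only one split or one key group at a time, which is what lets the framework spread them across the cluster.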
MapReduce Architecture
[Diagram: Task Trackers 1 to 15 behind three ToR FEX/switches under one switch, plus a Job Tracker; map tasks (M1 to M3) and reduce tasks (R1, R2) run on individual Task Trackers]

Job Tracker task assignments:
Job1:TT1:Mapper1,Mapper2
Job1:TT4:Mapper3,Reducer1
Job2:TT6:Reducer2
Job2:TT7:Mapper1,Mapper3

Cisco Webex Cloud and
Hadoop Architecture

• Global scale: 13 datacenters & iPoPs around the globe
• Dedicated network: dual path 10G circuits between DCs
• Multi-tenant: 95k sites
• Real-time collaboration: voice, desktop sharing, video, chat

[Map legend: Datacenter / PoP; Leased network link]
People make mistakes
Hardware fails
Software fails
Even failovers sometimes fail
[Diagram: Webex data ingestion into Hadoop]
• Unstructured/semi-structured data: Syslog, HTTP/REST, Thrift, Log4j, AMQP, Avro
• Structured data: RDBMS, application state & APIs, File, MySQL
• Ingestion: Flume (HDFS sink, Solr sink, other sinks) and Sqoop
• Storage: HDFS (raw data) and SolrCloud (Solr index)
• Hardware: Cisco UCS C240 M3 servers, 12 x 3TB = 36 TB / server
Cisco UCS and Big Data

Building a big data cluster with the UCS
Common Platform Architecture (CPA)
CPA Networking
CPA Sizing and Scaling

The evolution of big data deployments
General Purpose IT Data Center
IT infrastructure: generic IT servers shared by SAP, VMware, web, and Big Data
• Experimental use of Big Data
• Deployed into IT Ops mandated infrastructures
• “Skunk works”
• Small to medium clusters

Dedicated “Pod” for Big Data
x86 servers dedicated to Big Data
• App team mandated infrastructure
• Purpose built for Big Data
• Big Data has established business value
• Performance matters
• Large or small clusters

Hadoop Hardware Evolving in the Enterprise
Hadoop Hardware Evolving in the Enterprise
Typical 2009 Hadoop node
• 1RU server
• 4 x 1TB 3.5” spindles
• 2 x 4-core CPU
• 1 x GE
• 24 GB RAM
• Single PSU
• Running Apache
• $

Economics favor “fat” nodes
• 6x-9x more data/node
• 3x-6x more IOPS/node
• Saturated gigabit, 10GE on the rise
• Fewer total nodes lowers licensing/support costs
• Increased significance of node and switch failure

Typical 2013 Hadoop node
• 2RU server
• 12 x 3TB 3.5” or 24 x 1TB 2.5” spindles
• 2 x 8-core CPU
• 1-2 x 10GE
• 128 GB RAM
• Dual PSU
• Running commercial/licensed distribution
• $$$
Cisco UCS Common Platform Architecture (CPA)
Building Blocks for Big Data

• UCS Manager
• UCS 6200 Series Fabric Interconnects
• Nexus 2232 Fabric Extenders
• UCS C240 M3 Servers
• LAN, SAN, Management
CPA Network Design for Big Data

CPA: Topology
Single wire for data and management
• 8 x 10GE uplinks per FEX = 2:1 oversubscription (16 servers/rack), no port-channel (static pinning)
• 2 x 10GE links per server for all traffic, data and management
CPA Recommended FEX Connectivity
2 FEXs and 2 FIs
• 2232 FEX has 4 buffer groups: ports 1-8, 9-16, 17-24, 25-32
• Distribute servers across port groups to maximize buffer performance and predictably distribute static pinning on uplinks
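One way to follow the recommendation to spread servers across the four 2232 buffer groups is a simple round-robin cabling plan. The sketch below is only an illustration of that idea: the port ranges come from the slide, while the `assign_ports` helper and the round-robin rule are assumptions rather than a Cisco-documented procedure.

```python
# The four buffer groups of a Nexus 2232 FEX, as listed on the slide.
BUFFER_GROUPS = [range(1, 9), range(9, 17), range(17, 25), range(25, 33)]

def assign_ports(num_servers):
    """Map server number -> FEX port, cycling through the buffer groups."""
    next_free = [0] * len(BUFFER_GROUPS)   # next unused index inside each group
    assignment = {}
    for server in range(1, num_servers + 1):
        group_index = (server - 1) % len(BUFFER_GROUPS)
        group = BUFFER_GROUPS[group_index]
        if next_free[group_index] >= len(group):
            raise ValueError("more servers than FEX host ports")
        assignment[server] = group[next_free[group_index]]
        next_free[group_index] += 1
    return assignment

ports = assign_ports(16)   # one rack of 16 servers
print(ports)
```

With 16 servers per rack, each buffer group ends up serving four servers instead of the first two groups serving all sixteen.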
Can Hadoop really push 10GE?
It can, depending on workload, so tune for it!
• Analytic workloads tend to be lighter on the network
• Transform workloads tend to be heavier on the network
• Hadoop has numerous parameters which affect the network
• Take advantage of 10GE CPA by tuning:
– mapred.reduce.slowstart.completed.maps
– dfs.balance.bandwidthPerSec
– mapred.reduce.parallel.copies
– mapred.reduce.tasks
– mapred.tasktracker.reduce.tasks.maximum
– mapred.compress.map.output
CPA Sizing and Scaling for Big Data

Cisco UCS Reference Configurations for Big Data

Full Rack UCS Solutions Bundle for Hadoop, NoSQL Performance
• 2 x UCS 6296
• 2 x Nexus 2232 PP
• 16 x C240 M3 (SFF), each with:
– 2 x E5-2665 (16 cores)
– 256GB
– 24 x 1TB 7.2K SAS

Full Rack UCS Solutions Bundle for Hadoop Capacity
• 2 x UCS 6296
• 2 x Nexus 2232 PP
• 16 x C240 M3 (LFF), each with:
– 2 x E5-2640 (12 cores)
– 128GB
– 12 x 3TB 7.2K SATA
Sizing
Part science, part art
• Start with current storage requirement
– Factor in replication (typically 3x) and compression (varies by data set)
– Factor in 20-30% free space for temp (Hadoop) or up to 50% for some NoSQL systems
– Factor in average daily/weekly data ingest rate
– Factor in expected growth rate (i.e. increase in ingest rate over time)
• If the I/O requirement is known, use the next table for guidance
• Most big data architectures scale close to linearly, so more nodes = more capacity and better performance
• Strike a balance between price/performance of individual nodes vs. total # of nodes
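The storage factors above can be combined into a rough capacity estimate. The following is a back-of-the-envelope sketch under stated assumptions: the formula (replicate, divide by the compression ratio, then leave a free-space fraction) is one reasonable reading of the bullets, the default values are placeholders, and 36 TB per node matches the 12 x 3TB server mentioned earlier in the deck.

```python
import math

def required_raw_tb(data_tb, replication=3, compression_ratio=1.0,
                    free_fraction=0.25, daily_ingest_tb=0.0, days=0):
    """Estimate raw disk needed for current data plus ingest over a period."""
    logical_tb = data_tb + daily_ingest_tb * days        # uncompressed user data
    stored_tb = logical_tb * replication / compression_ratio
    return stored_tb / (1 - free_fraction)               # keep headroom for temp space

def nodes_needed(raw_tb, tb_per_node=36):
    """Round up to whole servers."""
    return math.ceil(raw_tb / tb_per_node)

raw = required_raw_tb(100)            # 100 TB today, 3x replication, 25% free
print(raw, nodes_needed(raw))
grown = required_raw_tb(100, daily_ingest_tb=0.5, days=365)
print(round(grown, 1), nodes_needed(grown))
```

Growth in the ingest rate itself is not modeled here; in practice the calculation would be repeated per planning period with an updated rate.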
CPA sizing and application guidelines
                          Best Performance                    Best Price/TB
Server
  CPU                     2 x E5-2690      2 x E5-2665      2 x E5-2640
  Memory (GB)             256              256              128
  Disk Drives             24 x 600GB 10K   24 x 1TB 7.2K    12 x 3TB 7.2K
  IO Bandwidth (GB/Sec)   2.6              2.0              1.1
Rack-Level
  Cores                   256              256              192
  Memory (TB)             4                4                2
  Capacity (TB)           225              384              576
  IO Bandwidth (GB/Sec)   41.3             31.9             16.9
Applications              MPP DB, NoSQL    Hadoop, NoSQL    Hadoop
Scaling the CPA

• Single Rack: 16 servers
• Single Domain: up to 10 racks, 160 servers
• Multiple Domains: joined through L2/L3 switching
Scaling the Common Platform Architecture
Multiple domains based on 16 servers per rack and 2 x 2232 FEXs

Consider intra- and inter-domain bandwidth:
All values are per fabric; bandwidth is server-to-server in Gbits/sec.

Servers per domain   Available     Southbound   Northbound   Intra-domain   Inter-domain
(pair of Fabric      northbound    oversub-     oversub-     bandwidth      bandwidth
Interconnects)       10GE ports    scription    scription
160                  16            2:1          5:1          5              1
144                  24            2:1          3:1          5              1.67
128                  32            2:1          2:1          5              2.5
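The figures in the scaling table can be reproduced from the topology numbers given earlier (16 servers per rack, 8 x 10GE uplinks per FEX). The sketch below assumes a 96-port fabric interconnect (the UCS 6296 named in the reference configurations) whose ports not used for FEX uplinks are all available northbound; that assumption and the function name are mine, not the deck's.

```python
FI_PORTS = 96            # assumption: ports on one UCS 6296 fabric interconnect
SERVERS_PER_RACK = 16
FEX_UPLINKS = 8          # 10GE uplinks from each FEX to its fabric interconnect
LINK_GBPS = 10

def domain_bandwidth(servers):
    """Return (northbound ports, southbound ratio, northbound ratio,
    intra-domain Gbit/s, inter-domain Gbit/s), all per fabric."""
    racks = servers // SERVERS_PER_RACK
    fex_ports = racks * FEX_UPLINKS                 # FI ports consumed by FEX uplinks
    northbound_ports = FI_PORTS - fex_ports
    southbound = SERVERS_PER_RACK / FEX_UPLINKS     # 16 server links over 8 uplinks
    northbound = fex_ports / northbound_ports
    intra = LINK_GBPS / southbound
    inter = intra / northbound
    return northbound_ports, southbound, northbound, intra, round(inter, 2)

for servers in (160, 144, 128):
    print(servers, domain_bandwidth(servers))
```

Shrinking a domain frees fabric interconnect ports for uplinks, which is why the inter-domain bandwidth rises as the server count per domain falls.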
Multi-Domain CPA Customer Example
• 10 Gbits/sec intra-domain server-to-server network bandwidth
• 5 Gbits/sec inter-domain server-to-server network bandwidth
• Static pinning from FEX to FI (no port-channel)
Recommendations: UCS Domains and Racks
Single Domain Recommendation
Turn off or enable at physical rack level
• For simplicity and ease of use, leave Rack Awareness off
• Consider turning it on to limit the physical rack level fault domain (e.g. localized failures due to physical data center issues: water, power, cooling, etc.)

Multi Domain Recommendation
Create one Hadoop rack per UCS Domain
• With multiple domains, enable Rack Awareness such that each UCS Domain is its own Hadoop rack
• Provides HDFS data protection across domains
• Helps minimize cross-domain traffic
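Hadoop learns rack membership from a topology script that receives host addresses as arguments and prints one rack path per address. A minimal sketch of such a script for the multi-domain recommendation follows, assuming each UCS domain uses its own IP prefix; the prefixes and rack names are hypothetical, and the script would be referenced from the cluster's topology script setting (`topology.script.file.name` in Hadoop 1.x).

```python
import sys

# Hypothetical addressing plan: one IP prefix per UCS domain.
DOMAIN_OF_PREFIX = {
    "10.1.": "/ucs-domain-1",
    "10.2.": "/ucs-domain-2",
}
DEFAULT_RACK = "/default-rack"

def rack_for(host):
    """Return the Hadoop rack path for one host address."""
    for prefix, rack in DOMAIN_OF_PREFIX.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK

if __name__ == "__main__":
    # Hadoop passes one or more addresses and expects one rack per address.
    print(" ".join(rack_for(host) for host in sys.argv[1:]))
```

Because every server in a domain maps to the same rack path, HDFS places replicas in at least two UCS domains while keeping most traffic inside one.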
Exercise 1
• Set up a single node VM cluster on the laptop
– Step 1: copy files from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc

Hive

• An SQL-like interface to Hadoop
• Top level Apache project
– http://hive.apache.org/
• Hive history
– Created at Facebook to allow people to quickly and easily leverage Hadoop without the effort of writing Java MapReduce
– Currently used at many companies for log processing, business intelligence and analytics
Hive Components
• Shell: allows interactive queries
• Driver: session handles, fetch, execute
• Compiler: parse, plan, optimize
• Execution engine: DAG of stages (MR, HDFS, metadata)
• Metastore: schema, location in HDFS, SerDe
Data Model
• Tables
– Typed columns (int, float, string, boolean)
– Also list and map types (for JSON-like data)
• Partitions
– For example, range-partition tables by date
• Buckets
– Hash partitions within ranges (useful for sampling, join optimization)
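Bucketing can be pictured as hashing a column value and taking it modulo the number of buckets. The sketch below illustrates this for integer keys only, assuming an integer hashes to itself; the function names are illustrative and the snippet is not Hive's implementation.

```python
def bucket_for(key, num_buckets):
    """Return the bucket index for an integer key.

    Assumption: the hash of an integer is the integer itself; the mask
    keeps the value non-negative before the modulo.
    """
    return (key & 0x7FFFFFFF) % num_buckets

def sample_bucket(rows, key_index, bucket, num_buckets):
    """Keep only the rows that fall in one bucket (roughly 1/num_buckets of the data)."""
    return [row for row in rows if bucket_for(row[key_index], num_buckets) == bucket]

rows = [(user_id, "event") for user_id in range(1, 13)]   # hypothetical (id, payload) rows
print(bucket_for(10, 4))
print(sample_bucket(rows, key_index=0, bucket=0, num_buckets=4))
```

Because the same key always lands in the same bucket, two tables bucketed on the join column can be joined bucket by bucket, and a single bucket serves as a cheap sample.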
Hive
                DBMS                               Hive
Language        SQL-92 standard                    Subset of SQL-92 plus Hive extensions
Updates         INSERT, UPDATE, DELETE             INSERT OVERWRITE; no UPDATE or DELETE
Transactions    Yes                                No
Latency         Sub-second                         Minutes to hours
Indexes         Any number of indexes,             No indexes, data is always
                important to performance           scanned in parallel
Dataset size    TBs                                PBs
Metastore

• Database: namespace containing a set of tables
• Holds table definitions (column types, physical layout)
• Holds partitioning information
• Can be stored in Derby, MySQL, and other relational databases

Source: cc-licensed slide by Cloudera
Hive components
[Diagram components: Hive, SerDe, InputFormat, MetaStore, Hadoop cluster]

Source: cc-licensed slide by Cloudera
Hive MetaStore
[Diagram components: Beeline CLI, HiveServer2, Hive CLI, MetaStore, Impala, HCatalog, Pig, RDBMS]
Hive Physical Layout
• Warehouse directory in HDFS
– E.g., /user/hive/warehouse
• Tables stored in subdirectories of warehouse
– Partitions form subdirectories of tables
• Actual data stored in HDFS files
– E.g. text, SequenceFile, RCFile, Avro
– Arbitrary format with a custom SerDe
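The directory layout described above can be expressed as a small path-building function. This sketch assumes the usual `column=value` naming for partition subdirectories; the table name `weblogs` and the partition columns are hypothetical examples.

```python
import posixpath

WAREHOUSE = "/user/hive/warehouse"   # warehouse directory from the slide

def partition_path(table, partitions=()):
    """Build the HDFS directory for a table or one of its partitions.

    partitions is an ordered sequence of (column, value) pairs.
    """
    parts = [f"{column}={value}" for column, value in partitions]
    return posixpath.join(WAREHOUSE, table, *parts)

print(partition_path("weblogs"))
print(partition_path("weblogs", [("dt", "2013-09-15")]))
print(partition_path("weblogs", [("dt", "2013-09-15"), ("country", "CN")]))
```

A query that filters on the partition column only has to read the matching subdirectories, which is how partitioning cuts scan time.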
External and Hive managed tables
• Hive managed tables
– Data moved to location /user/hive/warehouse
– Can be stored in a more efficient format than text, e.g. RCFile
– If you drop the table, the raw data is lost
hive> CREATE TABLE test (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

• External tables
– Can overlay multiple tables all pointing to the same raw data
– To create an external table, point to the location of the data while creating the table
hive> CREATE EXTERNAL TABLE test (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/home/test/data';
Hive: Example
• Hive looks similar to an SQL database
• Relational join on two tables:
– Table of word counts from Shakespeare collection
– Table of word counts from the Bible
SELECT s.word, s.freq, k.freq FROM shakespeare s
JOIN bible k ON (s.word = k.word)
WHERE s.freq >= 1 AND k.freq >= 1
ORDER BY s.freq DESC LIMIT 5;

the    25848    62394
I      23031    8854
and    19671    38985
to     18038    13526
of     16700    34654
Impala

Impala
• General purpose MPP SQL query engine for Hadoop
– Query latency from milliseconds to hours, suited to interactive data exploration
– Runs on the existing Hadoop cluster, on existing HDFS files and hardware
• High performance
– Written in C++
– Direct access to HDFS and HBase data, no MapReduce
• Unified platform
– Uses existing Hive metadata and query language (HiveQL)
– Submit queries via ODBC or Thrift API
• Performance
– Disk throughput limited by hardware to 100MB/sec
– 3x to 90x faster than Hive, depending on the type of query
Impala Details
[Diagram: a SQL app connects over ODBC; shared services are the Hive Metastore (unified metadata, HiveQL interface), the HDFS NameNode, and statestored; each of three impalad processes contains a Query Planner, a Query Coordinator, and a Query Exec Engine and sits alongside an HDFS DataNode and HBase]
Impala Details
• Each impalad stays in contact with statestored to update its state and to receive metadata for query planning
Impala Details
• The query coordinator initiates execution on remote impalads
Impala Details
• Intermediate results are streamed between impalads, and query results are streamed back to the client
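The coordinator pattern in these Impala slides (plan once, run fragments next to the data, merge streamed partial results) can be mimicked in a few lines. This is a toy, single-process sketch of a distributed SUM with GROUP BY; the node names, rows, and function names are hypothetical and none of this is Impala's actual API.

```python
from collections import Counter

# Hypothetical rows of (word, freq) held locally by each impalad's data node.
LOCAL_ROWS = {
    "impalad-1": [("the", 3), ("fox", 1)],
    "impalad-2": [("the", 2), ("cow", 1)],
    "impalad-3": [("fox", 1), ("now", 1)],
}

def execute_fragment(rows):
    """Work done by one exec engine: aggregate only its local rows."""
    partial = Counter()
    for word, freq in rows:
        partial[word] += freq
    return partial

def coordinate(local_rows):
    """Work done by the coordinator: start fragments and merge their results."""
    total = Counter()
    for node, rows in local_rows.items():
        total.update(execute_fragment(rows))   # partial result streamed back
    return total.most_common()                  # ordered by total, descending

result = coordinate(LOCAL_ROWS)
print(result)
```

Only the small partial aggregates cross the network, not the raw rows, which is part of why this design avoids the latency of launching MapReduce jobs.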
Exercise 2
• Analytics with Hive and Impala
– Step 1: copy test dataset from USB memory stick
– Step 2: Mac & Dean to fill in …
– Step 3: Mac & Dean to fill in …
– etc

LINEAMIENTOS DE OPERACIÓN PARA TUTORÍAS Y ASESORÍAS
 
Sales Cloud Battle Card
Sales Cloud Battle CardSales Cloud Battle Card
Sales Cloud Battle Card
 
Alimentazione e sport
Alimentazione e sportAlimentazione e sport
Alimentazione e sport
 
Epica griega
Epica griegaEpica griega
Epica griega
 
Conservación de-los-mares-mexicanos
Conservación de-los-mares-mexicanosConservación de-los-mares-mexicanos
Conservación de-los-mares-mexicanos
 
LinkedTV - an added value enrichment solution for AV content providers
LinkedTV - an added value enrichment solution for AV content providersLinkedTV - an added value enrichment solution for AV content providers
LinkedTV - an added value enrichment solution for AV content providers
 
LOGISTICA Y DISTRIBUCIÓN
LOGISTICA Y DISTRIBUCIÓNLOGISTICA Y DISTRIBUCIÓN
LOGISTICA Y DISTRIBUCIÓN
 
Línea del tiempo de 1521 a 1810)a
Línea del tiempo de 1521 a 1810)aLínea del tiempo de 1521 a 1810)a
Línea del tiempo de 1521 a 1810)a
 
Elasticsearch Workshop
Elasticsearch WorkshopElasticsearch Workshop
Elasticsearch Workshop
 
Perfeccion Cristiana - Los Dos Pablos de Romanos 7
Perfeccion Cristiana - Los Dos Pablos de Romanos 7Perfeccion Cristiana - Los Dos Pablos de Romanos 7
Perfeccion Cristiana - Los Dos Pablos de Romanos 7
 
Company Profile -Webizz_India
Company Profile -Webizz_IndiaCompany Profile -Webizz_India
Company Profile -Webizz_India
 
Proyecto de vida de avigail baena
Proyecto de vida de avigail baenaProyecto de vida de avigail baena
Proyecto de vida de avigail baena
 
Proyecto Cuidadoras en Red
Proyecto Cuidadoras en RedProyecto Cuidadoras en Red
Proyecto Cuidadoras en Red
 
36 07 cantar de los cantares www.gftaognosticaespiritual.org
36 07 cantar  de los cantares www.gftaognosticaespiritual.org36 07 cantar  de los cantares www.gftaognosticaespiritual.org
36 07 cantar de los cantares www.gftaognosticaespiritual.org
 

Similaire à Hadoop workshop

Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaJeffrey T. Pollock
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
 
Big data oracle_introduccion
Big data oracle_introduccionBig data oracle_introduccion
Big data oracle_introduccionFran Navarro
 
Big Data Infrastructure
Big Data InfrastructureBig Data Infrastructure
Big Data InfrastructureTrivadis
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeongYousun Jeong
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Sumeet Singh
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIAlluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreAlluxio, Inc.
 
Oracle Cloud Infrastructure Data Science 概要資料(20200406)
Oracle Cloud Infrastructure Data Science 概要資料(20200406)Oracle Cloud Infrastructure Data Science 概要資料(20200406)
Oracle Cloud Infrastructure Data Science 概要資料(20200406)オラクルエンジニア通信
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?samthemonad
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream AnalyticsMarco Parenzan
 
The Last Frontier- Virtualization, Hybrid Management and the Cloud
The Last Frontier-  Virtualization, Hybrid Management and the CloudThe Last Frontier-  Virtualization, Hybrid Management and the Cloud
The Last Frontier- Virtualization, Hybrid Management and the CloudKellyn Pot'Vin-Gorman
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraAlluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesDataWorks Summit
 

Similaire à Hadoop workshop (20)

Webinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafkaWebinar future dataintegration-datamesh-and-goldengatekafka
Webinar future dataintegration-datamesh-and-goldengatekafka
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Big data oracle_introduccion
Big data oracle_introduccionBig data oracle_introduccion
Big data oracle_introduccion
 
Big Data Infrastructure
Big Data InfrastructureBig Data Infrastructure
Big Data Infrastructure
 
Uotm workshop
Uotm workshopUotm workshop
Uotm workshop
 
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014Dataiku  - hadoop ecosystem - @Epitech Paris - janvier 2014
Dataiku - hadoop ecosystem - @Epitech Paris - janvier 2014
 
Stsg17 speaker yousunjeong
Stsg17 speaker yousunjeongStsg17 speaker yousunjeong
Stsg17 speaker yousunjeong
 
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
Hadoop Summit San Jose 2015: What it Takes to Run Hadoop at Scale Yahoo Persp...
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Unified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AIUnified Data API for Distributed Cloud Analytics and AI
Unified Data API for Distributed Cloud Analytics and AI
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & MoreMeetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
Meetup at AI NextCon 2019: In-Stream data process, Data Orchestration & More
 
Oracle Cloud Infrastructure Data Science 概要資料(20200406)
Oracle Cloud Infrastructure Data Science 概要資料(20200406)Oracle Cloud Infrastructure Data Science 概要資料(20200406)
Oracle Cloud Infrastructure Data Science 概要資料(20200406)
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
 
Azure Stream Analytics
Azure Stream AnalyticsAzure Stream Analytics
Azure Stream Analytics
 
The Last Frontier- Virtualization, Hybrid Management and the Cloud
The Last Frontier-  Virtualization, Hybrid Management and the CloudThe Last Frontier-  Virtualization, Hybrid Management and the Cloud
The Last Frontier- Virtualization, Hybrid Management and the Cloud
 
Data Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud EraData Orchestration for the Hybrid Cloud Era
Data Orchestration for the Hybrid Cloud Era
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
What it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! PerspectivesWhat it takes to run Hadoop at Scale: Yahoo! Perspectives
What it takes to run Hadoop at Scale: Yahoo! Perspectives
 
BlueData DataSheet
BlueData DataSheetBlueData DataSheet
BlueData DataSheet
 

Dernier

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 

Dernier (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 

Hadoop workshop

  • 1. Hadoop workshop, Cloud Connect Shanghai, Sep 15, 2013. Ari Flink – Operations Architect; Mac Fang – Manager, Hadoop development; Dean Zhu – Hadoop Developer
  • 2. Agenda
      1. Introductions (5 minutes)
      2. Hadoop and Big Data Concepts (20 minutes)
      3. Cisco Webex Hadoop architecture (10 minutes)
      4. Cisco UCS Hadoop Common Platform Architecture (10 minutes)
      5. Exercise 1 (30 minutes) – Configure a Hadoop single node VM on a laptop
      6. Hive and Impala concepts (15 minutes)
      7. Exercise 2 (30 minutes) – Analytics using Apache Hive and Cloudera Impala
      8. Q & A
      Cloud Connect 2013 Shanghai. © 2013 Cisco and/or its affiliates. All rights reserved.
  • 3. Hadoop and Big Data Overview
      – Enterprise data management and big data
      – Problems, opportunities, and use case examples
      – Hadoop architecture concepts
  • 4. What is Big Data? For our purposes, big data refers to distributed computing architectures specifically aimed at the “3 V’s” of data: Volume, Velocity, and Variety.
  • 5. Traditional Enterprise Data Management
      [Diagram: three Operational (OLTP, Online Transactional Processing) systems feed ETL (Extract, Transform, and Load), which loads the EDW (Enterprise Data Warehouse), which serves BI/Reports (Business Intelligence); labeled "(batch processing)"]
  • 6. Traditional Business Intelligence Questions
      Transactional Data (e.g. OLTP): real-time, but limited reporting/analytics
        – What are the top 5 most active stocks traded in the last hour?
        – How many new purchase orders have we received since noon?
      Enterprise Data Warehouse: high value, structured, indexed, cleansed
        – How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?
        – What were the top 10 most frequently backordered products over the past year?
  • 7. So what has changed? The Explosion of Unstructured Data
      1.8 trillion gigabytes of data was created in 2011…
      – Approx. 500 quadrillion files
      – More than 90% is unstructured data
      – Quantity doubles every 2 years
      – Most unstructured data is neither stored nor analyzed!
      [Chart: GB of data (in billions), 0 to 10,000, for 2005 to 2015; series: unstructured data and structured data. Source: Cloudera]
  • 8. Enterprise Data Management with Big Data
      [Diagram labels: In-memory analytics; Operational (OLTP), three times; BI/Reports; ETL; MPP EDW; Web; Dashboards; Big Data (Hadoop, etc.); ETL; Machine]
  • 9. Traditional Business Intelligence Questions
      Transactional Data (e.g. OLTP): fast data, real-time
        – What are the top 5 most active stocks traded in the last hour?
        – How many new purchase orders have we received since noon?
      Enterprise Data Warehouse: high value, structured, indexed, cleansed
        – How many more hurricane windows are sold in Gulf-area stores during hurricane season vs. the rest of the year?
        – What were the top 10 most frequently backordered products over the past year?
      Big Data: lower value, semi-structured, multi-source, raw/“dirty”
        – Which products do customers click on the most and/or spend the most time browsing without buying?
        – How do we optimally set pricing for each product in each store for individual customers every day?
        – Did the recent marketing launch generate the expected online buzz, and did that translate to sales?
  • 10. Example: Web and Location Analytics
      An iPhone searches Amazon for Vizio TVs in Electronics. The proxy log entry:
      1336083635.130 10.8.8.158 TCP_MISS/200 8400 GET http://www.amazon.com/gp/aw/s/ref=is_box_?k=Visio+tv… "Mozilla/5.0 (iPhone; CPU iPhone OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Version/5.1 Mobile/9A405 Safari/7534.48.3"
  • 11. Big Data and Key Infrastructure Attributes (What big data isn’t)
      – Usually not blade servers (not enough local storage)
      – Usually not virtualized (hypervisor only adds overhead)
      – Usually not highly oversubscribed (significant east-west traffic)
      – Usually not SAN/NAS
      Low-cost, DAS-based, scale-out clustered filesystem. Move the compute to the storage.
  • 12. Cost, Performance, and Capacity
      – Enterprise Database (HW:SW $ split 30:70)
      – Structured Data: Relational Database, $20K/TB
      – Massive Scale-Out Column Store, $10K/TB
      – Unstructured Data: Hadoop, NoSQL, $300 to $1K/TB (HW:SW $ split 70:30)
      – Example sources: Machine Logs, Web Click Stream, Call Data Records, Satellite Feeds, GPS Data, Sensor Readings, Sales Data, Blogs, Emails, Video
  • 13. Big Data Software Architectures
  • 14. Three basic big data software architectures (* = Cisco Partners)
      Real-time NoSQL: fast key-value store/retrieve
        – HBase (part of Apache Hadoop)*
        – DataStax (Cassandra)*
        – Oracle NoSQL*
        – Amazon Dynamo
      Batch-oriented Hadoop: heavy lifting, processing
        – Cloudera*
        – MapR*
        – Intel Hadoop*
        – Pivotal HD*
      MPP Relational Database: scale-out BI/DW
        – Greenplum DB (Pivotal DB)*
        – ParAccel*
        – Vertica
        – Netezza
        – Teradata
  • 15. What Is Hadoop? Hadoop is a distributed, fault-tolerant framework for storing and analyzing data. Its two primary components are the Hadoop filesystem (HDFS) and the MapReduce application engine.
  • 16. Hadoop Components and Operations: Hadoop Distributed File System (HDFS)
      – Scalable & Fault Tolerant
      – Filesystem is distributed, stored across all data nodes in the cluster
      – Files are divided into multiple large blocks: 64 MB default, typically 128 MB to 512 MB
      – Data is stored reliably. Each block is replicated 3 times by default (3 copies in total)
      – Types of Node Functions:
          Name Node: manages HDFS
          Job Tracker: manages MapReduce jobs
          Data Node/Task Tracker: stores blocks/does work
      [Diagram: a file split into Blocks 1 to 6 and spread over Data nodes 1 to 13, with the Name Node and Job Tracker, under three ToR FEX/switches]
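The block and replication figures on this slide can be turned into a quick sizing calculation. The following is a minimal sketch, not part of the original deck: the function name `hdfs_footprint` and the 1 GB example file are assumptions for illustration, while the 64 MB block size and replication factor of 3 are the defaults quoted on the slide.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=64, replication=3):
    """Estimate how a file is laid out in HDFS (illustrative helper, not a Hadoop API)."""
    # A file is cut into fixed-size blocks; the last block may be partially filled.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times across the data nodes.
    block_replicas = blocks * replication
    # A partial last block only occupies its real size, so raw usage is size x replication.
    raw_storage_mb = file_size_mb * replication
    return blocks, block_replicas, raw_storage_mb

# Hypothetical 1 GB file with the slide's default settings.
blocks, block_replicas, raw_storage_mb = hdfs_footprint(1024)
print(blocks, block_replicas, raw_storage_mb)
```

With the same assumptions, raising the block size to 128 MB halves the number of blocks the Name Node has to track while leaving the raw storage unchanged.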
  • 17. HDFS Architecture
      [Diagram: a switch above three ToR FEX/switches connecting Data nodes 1 to 15, with numbered block copies (1 to 4) placed on several data nodes]
      Name Node metadata shown in the diagram:
      – /usr/sean/foo.txt: blk_1, blk_2
      – /usr/jacob/bar.txt: blk_3, blk_4
      – Data node 1: blk_1
      – Data node 2: blk_2, blk_3
      – Data node 3: blk_3
  • 18. Rack Awareness
      – Rack Awareness gives Hadoop the optional ability to group nodes together in logical “racks” (i.e. failure domains)
      – Logical “racks” may or may not correspond to physical data center racks
      – Hadoop distributes blocks across different “racks” so that the failure of a single “rack” does not take out every copy of a block
      – It can also lessen block movement between “racks”
      [Diagram: “Rack” 1 (Data nodes 1 to 5), “Rack” 2 (Data nodes 6 to 10), “Rack” 3 (Data nodes 11 to 15), with numbered block copies (1 to 4) placed on several data nodes]
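To make the rack-awareness idea concrete, here is a small sketch of a rack-aware placement rule. It is a simplified illustration rather than Hadoop's actual code: the `place_replicas` function and the node names are hypothetical, and the rule used (first copy in the writer's rack, second and third copies on two different nodes of one other rack) is an assumption modeled on how the default HDFS policy is commonly described.

```python
import random

# Logical "racks" from the slide: three racks of five data nodes each (names are illustrative).
TOPOLOGY = {
    "rack1": [f"datanode{i}" for i in range(1, 6)],
    "rack2": [f"datanode{i}" for i in range(6, 11)],
    "rack3": [f"datanode{i}" for i in range(11, 16)],
}

def place_replicas(topology, writer_rack, rng):
    """Pick three (rack, node) locations for one block so that two racks are always used."""
    # First replica stays in the writer's rack to keep the initial write local.
    first = rng.choice(topology[writer_rack])
    # Second replica goes to a different rack, so a whole-rack failure cannot lose the block.
    remote_rack = rng.choice([rack for rack in topology if rack != writer_rack])
    second = rng.choice(topology[remote_rack])
    # Third replica shares the remote rack but must land on a different node.
    third = rng.choice([node for node in topology[remote_rack] if node != second])
    return [(writer_rack, first), (remote_rack, second), (remote_rack, third)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
placement = place_replicas(TOPOLOGY, "rack1", rng)
print(placement)
```

Keeping two of the three copies in one remote rack is the trade-off the last bullet hints at: the block survives a rack failure, yet only one copy has to cross between racks during the write.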
  • 19. MapReduce Example: Word Count
      Stages: Input, Map, Shuffle & Sort, Reduce, Output
      Input (one line per map task):
      – the quick brown fox
      – the fox ate the mouse
      – how now brown cow
      Map: each map task emits a (word, 1) pair for every word in its line
      Output of the two reduce tasks:
      – brown, 2; fox, 2; how, 1; now, 1; the, 3
      – ate, 1; cow, 1; mouse, 1; quick, 1
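The word-count flow on this slide can be imitated in a few lines of plain Python. This is a single-process sketch of the map, shuffle & sort, and reduce stages, not Hadoop code: the function names are made up for illustration, and a real job would spread the map and reduce calls over many task trackers.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle & sort: group all counts by word, with the words in sorted order."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(sorted(grouped.items()))

def reduce_phase(word, counts):
    """Reduce: sum the counts collected for a single word."""
    return word, sum(counts)

# The three input lines from the slide.
lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_and_sort(mapped)
word_counts = dict(reduce_phase(word, counts) for word, counts in grouped.items())
print(word_counts)
```

The sketch uses one logical reducer; the slide splits the same result across two reducers, which only changes which task sums which words.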
  • 20. MapReduce Architecture
      [Diagram: a switch above three ToR FEX/switches connecting Task Trackers 1 to 15 and the Job Tracker; map tasks (M1, M2, M3) and reduce tasks (R1, R2) run on individual task trackers]
      Job Tracker assignments shown in the diagram:
      – Job1: TT1: Mapper1, Mapper2
      – Job1: TT4: Mapper3, Reducer1
      – Job2: TT6: Reducer2
      – Job2: TT7: Mapper1, Mapper3
  • 21. Cisco Webex Cloud and Hadoop Architecture
  • 22. Global Scale: 13 datacenters & iPoPs around the globe
      – Dedicated network: dual path 10G circuits between DCs
      – Multi-tenant: 95k sites
      – Real-time collaboration: voice, desktop sharing, video, chat
      [Map legend: Datacenter / PoP; Leased network link]
      C97-717209-00 © 2012 Cisco and/or its affiliates. All rights reserved.
  • 23. People make mistakes
      – Hardware fails
      – Software fails
      – Even failovers sometimes fail
      [Map legend: Datacenter / PoP; Leased network link]
  • 24. [Diagram labels, on Cisco UCS C240 M3 servers with 12 x 3TB = 36 TB / server]
      – Unstructured/semi-structured data: Syslog, HTTP/REST, Thrift, Log4j, AMQP, Avro
      – Structured data: RDBMS, Application state & APIs, File, MySQL
      – Flume, with HDFS Sink, Solr Sink, and Other Sinks
      – Sqoop
      – SolrCloud: Solr index
      – HDFS: Raw data
  • 25. Cisco UCS and Big Data
      – Building a big data cluster with the UCS Common Platform Architecture (CPA)
      – CPA Networking
      – CPA Sizing and Scaling
  • 26. The evolution of big data deployments
      Diagram labels: General Purpose IT Data Center; Dedicated “Pod” for Big Data; IT Infrastructure; Generic IT servers; SAP; VMware; WEB; X86 servers; Big Data
      – Experimental use of Big Data
      – App team mandated infrastructure
      – Deployed into IT Ops mandated infrastructures
      – Purpose built for Big Data
      – “Skunk works”
      – Big Data has established business value
      – Small to medium clusters
      – Performance matters
      – Large or small clusters
  • 27. Hadoop Hardware Evolving in the Enterprise
    Typical 2009 Hadoop node:
    • 1RU server
    • 4 x 1TB 3.5” spindles
    • 2 x 4-core CPU
    • 1 x GE
    • 24 GB RAM
    • Single PSU
    • Running Apache
    • $
    Economics favor “fat” nodes:
    • 6x-9x more data/node
    • 3x-6x more IOPS/node
    • Saturated gigabit, 10GE on the rise
    • Fewer total nodes lowers licensing/support costs
    • Increased significance of node and switch failure
    Typical 2013 Hadoop node:
    • 2RU server
    • 12 x 3TB 3.5” or 24 x 1TB 2.5” spindles
    • 2 x 8-core CPU
    • 1-2 x 10GE
    • 128 GB RAM
    • Dual PSU
    • Running commercial/licensed distribution
    • $$$
  • 28. Cisco UCS Common Platform Architecture (CPA): Building Blocks for Big Data
    • UCS Manager
    • UCS 6200 Series Fabric Interconnects (LAN, SAN, Management)
    • Nexus 2232 Fabric Extenders
    • UCS C240 M3 Servers
  • 29. CPA Network Design for Big Data
  • 30. CPA: Topology
    • Single wire for data and management
    • 8 x 10GE uplinks per FEX = 2:1 oversubscription (16 servers/rack), no port-channel (static pinning)
    • 2 x 10GE links per server for all traffic, data and management
  • 31. CPA Recommended FEX Connectivity
    2 FEXs and 2 FIs
    • 2232 FEX has 4 buffer groups: ports 1-8, 9-16, 17-24, 25-32
    • Distribute servers across port groups to maximize buffer performance and predictably distribute static pinning on uplinks
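The port-group advice can be illustrated with a small sketch. It assumes a plain round-robin placement, which the slide does not prescribe; `assign_ports` and the zero-based server index are hypothetical names used only for this illustration, and static pinning of uplinks is not modeled.

```python
# Buffer groups as listed on the slide: four groups of eight ports on a 2232 FEX.
PORT_GROUPS = [range(1, 9), range(9, 17), range(17, 25), range(25, 33)]


def assign_ports(num_servers):
    """Round-robin servers over the buffer groups (an assumed scheme).

    Returns a dict mapping a zero-based server index to a FEX port number.
    """
    total_ports = sum(len(group) for group in PORT_GROUPS)
    if num_servers > total_ports:
        raise ValueError("more servers than FEX ports")
    free_ports = [iter(group) for group in PORT_GROUPS]
    return {
        server: next(free_ports[server % len(PORT_GROUPS)])
        for server in range(num_servers)
    }


assignment = assign_ports(16)  # 16 servers per rack, as in the CPA topology
servers_per_group = [
    sum(1 for port in assignment.values() if port in group)
    for group in PORT_GROUPS
]
```

With 16 servers, each of the four buffer groups ends up with four servers instead of two groups carrying eight each.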
  • 32. Can Hadoop really push 10GE?
    It can, depending on workload, so tune for it!
    • Analytic workloads tend to be lighter on the network
    • Transform workloads tend to be heavier on the network
    • Hadoop has numerous parameters which affect network
    • Take advantage of 10GE CPA:
        – mapred.reduce.slowstart.completed.maps
        – dfs.balance.bandwidthPerSec
        – mapred.reduce.parallel.copies
        – mapred.reduce.tasks
        – mapred.tasktracker.reduce.tasks.maximum
        – mapred.compress.map.output
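As a rough sketch of how these parameters end up in a cluster's configuration, the snippet below renders them as Hadoop-style `<configuration>` XML. Every value is a placeholder chosen for illustration, not a recommendation from the slides, and the split by prefix (mapred.* versus dfs.*) assumes the usual mapred-site.xml and hdfs-site.xml layout. `render_site_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Placeholder values for illustration only; the slides give no values.
SETTINGS = {
    "mapred.reduce.slowstart.completed.maps": "0.80",
    "mapred.reduce.parallel.copies": "20",
    "mapred.reduce.tasks": "32",
    "mapred.tasktracker.reduce.tasks.maximum": "8",
    "mapred.compress.map.output": "true",
    "dfs.balance.bandwidthPerSec": "104857600",  # bytes per second
}


def render_site_xml(settings, prefix):
    """Render the settings whose name starts with `prefix` as configuration XML."""
    root = ET.Element("configuration")
    for name, value in sorted(settings.items()):
        if not name.startswith(prefix):
            continue
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


mapred_xml = render_site_xml(SETTINGS, "mapred.")  # assumed target: mapred-site.xml
hdfs_xml = render_site_xml(SETTINGS, "dfs.")       # assumed target: hdfs-site.xml
```

Keeping the values in one dictionary makes it easier to review a tuning change before it is pushed to every node.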
  • 33. CPA Sizing and Scaling for Big Data
  • 34. Cisco UCS Reference Configurations for Big Data
    Full Rack UCS Solutions Bundle for Hadoop, NoSQL (Performance):
    • 2 x UCS 6296
    • 2 x Nexus 2232 PP
    • 16 x C240 M3 (SFF)
    • 2x E5-2665 (16 cores)
    • 256GB
    • 24 x 1TB 7.2K SAS
    Full Rack UCS Solutions Bundle for Hadoop (Capacity):
    • 2 x UCS 6296
    • 2 x Nexus 2232 PP
    • 16 x C240 M3 (LFF)
    • E5-2640 (12 cores)
    • 128GB
    • 12x 3TB 7.2K SATA
  • 35. Sizing
    Part science, part art
    • Start with current storage requirement
        – Factor in replication (typically 3x) and compression (varies by data set)
        – Factor in 20-30% free space for temp (Hadoop) or up to 50% for some NoSQL systems
        – Factor in average daily/weekly data ingest rate
        – Factor in expected growth rate (i.e. increase in ingest rate over time)
    • If I/O requirement known, use next table for guidance
    • Most big data architectures are very linear, so more nodes = more capacity and better performance
    • Strike a balance between price/performance of individual nodes vs. total # of nodes
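The sizing factors above can be combined into a back-of-the-envelope calculation. The sketch below is one possible interpretation, not a formula from the slides: it assumes compression is expressed as a ratio (2.0 means data shrinks to half), that the free-space reserve is a fraction of raw capacity, and that ingest grows by a fixed percentage per month. The function names and the example figures (100 TB, 10 TB/month) are hypothetical; the 36 TB per node comes from the 12 x 3TB server configuration.

```python
import math


def projected_data_tb(current_tb, monthly_ingest_tb, months, monthly_growth=0.0):
    """Logical data volume after `months`, with ingest growing each month."""
    total = current_tb
    ingest = monthly_ingest_tb
    for _ in range(months):
        total += ingest
        ingest *= 1 + monthly_growth
    return total


def raw_capacity_tb(data_tb, replication=3, compression_ratio=1.0, free_fraction=0.25):
    """Raw disk needed once replication, compression and free space are factored in."""
    stored = data_tb * replication / compression_ratio
    return stored / (1.0 - free_fraction)


def nodes_needed(data_tb, node_raw_tb=36.0, **factors):
    """Number of nodes, rounding up, for a given raw capacity per node."""
    return math.ceil(raw_capacity_tb(data_tb, **factors) / node_raw_tb)


today = nodes_needed(100, compression_ratio=2.0)
in_a_year = nodes_needed(projected_data_tb(100, 10, 12), compression_ratio=2.0)
```

Under these assumptions, 100 TB of data needs 200 TB of raw disk (6 nodes), and the same cluster a year later, at 220 TB of data, needs 13 nodes.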
  • 36. CPA sizing and application guidelines (columns run from Best Performance on the left to Best Price/TB on the right)
    Server
        CPU: 2 x E5-2690 | 2 x E5-2665 | 2 x E5-2640
        Memory (GB): 256 | 256 | 128
        Disk Drives: 24 x 600GB 10K | 24 x 1TB 7.2K | 12 x 3TB 7.2K
        IO Bandwidth (GB/Sec): 2.6 | 2.0 | 1.1
    Rack-Level
        Cores: 256 | 256 | 192
        Memory (TB): 4 | 4 | 2
        Capacity (TB): 225 | 384 | 576
        IO Bandwidth (GB/Sec): 41.3 | 31.9 | 16.9
    Applications: MPP DB, NoSQL | Hadoop, NoSQL | Hadoop
  • 37. Scaling the CPA
    • Single Rack: 16 servers
    • Single Domain: up to 10 racks, 160 servers
    • Multiple Domains: connected through L2/L3 switching
  • 38. Scaling the Common Platform Architecture
    Multiple domains based on 16 servers per rack and 2 x 2232 FEXs. Consider intra- and inter-domain bandwidth:
        Servers per domain (pair of Fabric Interconnects): 160 | 144 | 128
        Available northbound 10GE ports (per fabric): 16 | 24 | 32
        Southbound oversubscription (per fabric): 2:1 | 2:1 | 2:1
        Northbound oversubscription (per fabric): 5:1 | 3:1 | 2:1
        Intra-domain server-to-server bandwidth (per fabric, Gbits/sec): 5 | 5 | 5
        Inter-domain server-to-server bandwidth (per fabric, Gbits/sec): 1 | 1.67 | 2.5
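The figures in this table can be reproduced with simple arithmetic. The sketch below assumes each fabric interconnect has 96 ports (consistent with the UCS 6296 named in the reference configurations) and that every port not used for FEX uplinks is available northbound; both are assumptions that happen to match the slide's numbers. `domain_bandwidth` is a hypothetical helper.

```python
FI_PORTS = 96            # assumption: ports per fabric interconnect (UCS 6296)
SERVERS_PER_RACK = 16    # from the slide
FEX_UPLINKS = 8          # 10GE uplinks per FEX, from the topology slide
LINK_GBPS = 10           # each server has one 10GE link per fabric


def domain_bandwidth(racks):
    """Per-fabric oversubscription and bandwidth for a domain of `racks` racks."""
    fex_uplinks = racks * FEX_UPLINKS           # FI ports consumed by FEX uplinks
    northbound_ports = FI_PORTS - fex_uplinks   # assumed: all remaining ports
    southbound = SERVERS_PER_RACK / FEX_UPLINKS
    northbound = fex_uplinks / northbound_ports
    intra_gbps = LINK_GBPS / southbound
    return {
        "servers": racks * SERVERS_PER_RACK,
        "northbound_ports": northbound_ports,
        "southbound_oversub": southbound,
        "northbound_oversub": northbound,
        "intra_domain_gbps": intra_gbps,
        "inter_domain_gbps": intra_gbps / northbound,
    }


table = [domain_bandwidth(racks) for racks in (10, 9, 8)]
```

Dropping from 10 racks to 8 frees 16 more northbound ports per fabric, which is what raises inter-domain bandwidth from 1 to 2.5 Gbits/sec per server.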
  • 39. Multi-Domain CPA Customer Example
    • 10 Gbits/sec intra-domain server-to-server network bandwidth
    • 5 Gbits/sec inter-domain server-to-server network bandwidth
    • Static pinning from FEX to FI (no port-channel)
  • 40. Recommendations: UCS Domains and Racks
    Single Domain Recommendation: turn Rack Awareness off, or enable it at physical rack level
    • For simplicity and ease of use, leave Rack Awareness off
    • Consider turning it on to limit physical rack level fault domain (e.g. localized failures due to physical data center issues such as water, power, cooling, etc.)
    Multi Domain Recommendation: create one Hadoop rack per UCS Domain
    • With multiple domains, enable Rack Awareness such that each UCS Domain is its own Hadoop rack
    • Provides HDFS data protection across domains
    • Helps minimize cross-domain traffic
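Hadoop's Rack Awareness is typically driven by a topology script that receives host names or IP addresses as arguments and prints one rack path per argument. A minimal sketch of such a script for the multi-domain case follows; the IP addresses and the `/ucs-domain-N` rack names are hypothetical, and a real deployment would more likely derive the mapping from an inventory file or subnet scheme.

```python
import sys

# Hypothetical mapping: every server in a UCS domain reports the same Hadoop rack.
DOMAIN_OF_HOST = {
    "10.1.1.11": "/ucs-domain-1",
    "10.1.1.12": "/ucs-domain-1",
    "10.1.2.11": "/ucs-domain-2",
    "10.1.2.12": "/ucs-domain-2",
}
DEFAULT_RACK = "/default-rack"  # fallback for hosts missing from the mapping


def resolve(hosts):
    """Return the rack path for each host, in the same order as the input."""
    return [DOMAIN_OF_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # One rack path per argument, separated by whitespace.
    print(" ".join(resolve(sys.argv[1:])))
```

Because both servers in `/ucs-domain-1` share a rack path, HDFS treats the whole domain as one rack when it spreads block replicas.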
  • 41. Exercise 1
    • Set up a single node VM cluster on the laptop
        – Step 1: copy files from USB memory stick
        – Step 2: Mac & Dean to fill in …
        – Step 3: Mac & Dean to fill in …
        – etc
  • 43. Hive
    • An SQL-like interface to Hadoop
    • Top level Apache project
        – http://hive.apache.org/
    • Hive history
        – Created at Facebook to allow people to quickly and easily leverage Hadoop without the effort of writing Java MapReduce
        – Currently used at many companies for log processing, business intelligence and analytics
  • 44. Hive Components
    • Shell: allows interactive queries
    • Driver: session handles, fetch, execute
    • Compiler: parse, plan, optimize
    • Execution engine: DAG of stages (MR, HDFS, metadata)
    • Metastore: schema, location in HDFS, SerDe
  • 45. Data Model
    • Tables
        – Typed columns (int, float, string, boolean)
        – Also list and map types (for JSON-like data)
    • Partitions
        – For example, range-partition tables by date
    • Buckets
        – Hash partitions within ranges (useful for sampling, join optimization)
  • 46. Hive compared with a DBMS (DBMS | Hive)
        Language: SQL-92 standard | Subset of SQL-92 plus Hive extensions
        Updates: INSERT, UPDATE, DELETE | INSERT OVERWRITE; no UPDATE or DELETE
        Transactions: Yes | No
        Latency: Sub-second | Minutes to hours
        Indexes: Any number of indexes, important to performance | No indexes, data is always scanned in parallel
        Dataset size: TBs | PBs
  • 47. Metastore
    • Database: namespace containing a set of tables
    • Holds table definitions (column types, physical layout)
    • Holds partitioning information
    • Can be stored in Derby, MySQL, and other relational databases
    Source: cc-licensed slide by Cloudera
  • 48. Hive components
    [Diagram labels: Hive, SerDe, InputFormat, Hadoop cluster, MetaStore]
    Source: cc-licensed slide by Cloudera
  • 49. Hive MetaStore
    [Diagram labels: Beeline CLI, HiveServer2, Hive CLI, MetaStore, Impala, RDBMS, HCatalog, Pig]
  • 50. Hive Physical Layout
    • Warehouse directory in HDFS
        – E.g., /user/hive/warehouse
    • Tables stored in subdirectories of warehouse
        – Partitions form subdirectories of tables
    • Actual data stored in HDFS files
        – E.g. text, SequenceFile, RCFile, Avro
        – Arbitrary format with a custom SerDe
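The directory layout above can be made concrete with a short sketch that builds the HDFS path for a table partition. It assumes a table in the default database and Hive's `column=value` naming for partition directories; the table name `weblogs` and the partition column `dt` are hypothetical, and `table_path` is an illustrative helper rather than a Hive API.

```python
import posixpath

WAREHOUSE_DIR = "/user/hive/warehouse"  # warehouse directory from the slide


def table_path(table, partitions=None, warehouse=WAREHOUSE_DIR):
    """Build the HDFS directory for a table or for one of its partitions.

    `partitions` maps partition column names to values, in partitioning order.
    """
    partition_dirs = [f"{column}={value}" for column, value in (partitions or {}).items()]
    return posixpath.join(warehouse, table, *partition_dirs)


whole_table = table_path("weblogs")
one_day = table_path("weblogs", {"dt": "2013-09-15"})
```

Here `one_day` is `/user/hive/warehouse/weblogs/dt=2013-09-15`, so a query restricted to one date only needs to read files under that subdirectory.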
  • 51. External and Hive managed tables
    • Hive managed tables
        – Data moved to location /user/hive/warehouse
        – Can be stored in a more efficient format than text, e.g. RCFile
        – If you drop the table, the raw data is lost
      hive> CREATE TABLE test (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
    • External tables
        – Can overlay multiple tables all pointing to the same raw data
        – To create an external table, use the EXTERNAL keyword and point to the location of the data while creating the table
      hive> CREATE EXTERNAL TABLE test (id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/home/test/data';
  • 52. Hive: Example
    • Hive looks similar to an SQL database
    • Relational join on two tables:
        – Table of word counts from Shakespeare collection
        – Table of word counts from the Bible
      SELECT s.word, s.freq, k.freq FROM shakespeare s JOIN bible k ON (s.word = k.word) WHERE s.freq >= 1 AND k.freq >= 1 ORDER BY s.freq DESC LIMIT 5;
    Result (word | s.freq | k.freq):
        the | 25848 | 62394
        I | 23031 | 8854
        and | 19671 | 38985
        to | 18038 | 13526
        of | 16700 | 34654
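To see what the join does without a Hadoop cluster, the same query can be run against a tiny in-memory SQLite database. This is only a stand-in for Hive: SQLite executes the statement locally, whereas Hive would compile it into MapReduce jobs. The table contents are limited to the five words and counts shown in the slide's result, inserted out of order so that ORDER BY has something to do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shakespeare (word TEXT, freq INTEGER)")
conn.execute("CREATE TABLE bible (word TEXT, freq INTEGER)")

# Counts taken from the result shown on the slide.
conn.executemany(
    "INSERT INTO shakespeare VALUES (?, ?)",
    [("of", 16700), ("and", 19671), ("the", 25848), ("to", 18038), ("I", 23031)],
)
conn.executemany(
    "INSERT INTO bible VALUES (?, ?)",
    [("to", 13526), ("the", 62394), ("I", 8854), ("of", 34654), ("and", 38985)],
)

# Same statement as on the slide.
rows = conn.execute(
    """
    SELECT s.word, s.freq, k.freq
    FROM shakespeare s JOIN bible k ON (s.word = k.word)
    WHERE s.freq >= 1 AND k.freq >= 1
    ORDER BY s.freq DESC
    LIMIT 5
    """
).fetchall()
conn.close()
```

The rows come back ordered by the Shakespeare frequency, with the matching Bible frequency alongside each word.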
  • 53. Impala
  • 54. Impala
    • General purpose MPP SQL query engine for Hadoop
        – Query latency from milliseconds to hours, interactive data exploration
        – Runs on the existing Hadoop cluster, on existing HDFS files and hardware
    • High performance
        – C++
        – Direct access to HDFS and HBase data, no MapReduce
    • Unified platform
        – Use existing Hive metadata and query language (HiveQL)
        – Submit queries via ODBC or Thrift API
    • Performance
        – Disk throughput limited by hardware to 100MB/sec
        – 3x to 90x faster than Hive, depending on the type of the query
  • 55. Impala Details
    [Diagram: SQL App (ODBC, HiveQL interface), Hive Metastore (unified metadata), HDFS NN, StateStored, and three impalad nodes, each with a Query Planner, Query Coordinator, Query Exec Engine, HDFS DN, and HBase]
  • 56. Impala Details
    [Diagram as on slide 55]
    • Each impalad stays in contact with StateStored to update its state and to receive metadata for query planning
  • 57. Impala Details
    [Diagram as on slide 55]
    • The query coordinator initiates execution on remote impalads
  • 58. Impala Details
    [Diagram as on slide 55]
    • Intermediate results are streamed between impalads, and query results are streamed back to the client
  • 59. Exercise 2
    • Analytics with Hive and Impala
        – Step 1: copy test dataset from USB memory stick
        – Step 2: Mac & Dean to fill in …
        – Step 3: Mac & Dean to fill in …
        – etc

Editor's notes

  1. Summary slides after each model Hadoop, NoSQL and MPP. 3 bullets on actual implementation to tie back. To later section.
  2. Sean to 27
  3. Hadoop is optimized for large streaming reads, not for low latency or fast writes.
    • HDFS is optimized for fewer, larger files (> 100 MB), with a 128MB block size or higher
    • Files are write-once currently (append support is available in 0.21, but mainly for HBase; otherwise not recommended)
    • Blocks are replicated 3x by default, on three different data nodes
    • NameNode stores file metadata in fsimage.txt: /usr/sean/foo.txt:blk_1,blk_2,blk_3, but it doesn't know which data nodes own those blocks until they report in
    • Blocks are just files on the underlying filesystem (ext3, etc.), e.g. blk_1234
    • No metadata on the slave node describes the data contained on that slave (or any other)
    • When NameNode starts up, it starts in safe mode, and won't leave safe mode until it knows where at least one copy of 99.999% of blocks is (configurable) based on block reports; it then waits 30 seconds and exits safe mode
    • NameNode block map is solely based on slave block reports, always cached in memory, nothing persistent
    • All data nodes heartbeat into NameNode every 3 seconds; NameNode will evict if no heartbeat after 5 minutes, and re-replicate “lost” blocks if no heartbeat after 10 minutes
    • As blocks are written, checksums are calculated and stored with the block (blk_1234.meta). Upon read, the calculated checksum is compared with the stored checksum
    • To avoid bit rot, a daemon runs to check the checksum every 3 weeks after a given block was written
  5. JobTracker assigns map or reduce tasks to TaskTracker slaves (data nodes) with available “slots”.
    • For map tasks, JobTracker attempts to assign work on local blocks to avoid expensive shipping of blocks across the network
    • Each task (mapper or reducer) runs in its own child JVM on the slave node. The TaskTracker process kicks off its child tasks based on a preconfigured number of task slots
    • Each child task JVM eats up a chunk of RAM, placing a limit on the total # of slots
    • Rule of thumb: 25-30% of space set aside for temp storage, outside of HDFS, to hold intermediate map output data before sending to reducers
    • If a child JVM dies, TaskTracker will remove it and report to the JobTracker that it died; JobTracker will attempt to reassign that task to a different TaskTracker
    • If any specific task fails 4 times, the whole job fails
    • If a TaskTracker reports a high # of failed tasks, it'll get blacklisted for that job
    • If a TaskTracker gets blacklisted for multiple jobs, it gets put on a global blacklist for 24 hours
  6. As of Feb 2013
  7. As of Feb 2013
  8. CEP: Complex Event Processing
  9. Big data projects often start out co-mingled within existing general purpose data center infrastructure, but eventually outgrow it and need to move to a dedicated “pod”. This is usually where we come in.