Big Data Fundamentals 6.6.18

© Cloudera, Inc. All rights reserved.
Big data fundamentals
Understanding the optimization choices in big data components

Presentation goals
✓ Teach you something
✓ Help you see the potential of Big Data beyond MapReduce
✓ Be fair to Cloudera’s competitors
✓ Inspire you to learn more
If something doesn’t make sense, please ask.

Notification
• The information in this document is proprietary to Cloudera. No part of this document may be reproduced,
copied or transmitted in any form for any purpose without the express prior written permission of Cloudera.
• This document is a preliminary version and not subject to your license agreement or any other agreement
with Cloudera. This document contains only intended strategies, developments and functionalities of
Cloudera products and is not intended to be binding upon Cloudera to any particular course of business,
product strategy and/or development. Please note that this document is subject to change and may be
changed by Cloudera at any time without notice.
• Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant
the accuracy or completeness of the information, text, graphics, links or other items contained within this
material. This document is provided without a warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement.
• Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect
or consequential damages that may result from the use of these materials. The limitation shall not apply in
cases of gross negligence.

Agenda
• Open source software
• Data storage and stewardship
• Data integration
• Data engineering
• Data analytics
• Life after Lambda architectures and IoT
• Data science at scale
• Big data in the clouds
• Cybersecurity as a Big Data problem
• Cluster management and security
• Customer success stories
• Question and answers

Open source software
Optimizing to benefit from community innovation

3 reasons open source is good for companies
Free evaluation
Install, test, inspect, and evaluate open source code in perpetuity, with no financial obligation
Freedom from lock-in
Multiple vendors supporting the same core technology makes it easier to move
Scalable innovation
The collective work of a global, passionate community keeps the code base evolving
These benefits derive from use of the permissive Apache License

3 reasons open source adds risk for companies
Not business focus
Company assets should be working on core competency
Real cost hard to measure
Time developers spend solving problems or adding features often isn’t visible
Multiple projects
Each project is managed by a separate committee, and there is not necessarily an overriding design

“Open source software is free
like a puppy is free”
- Scott McNealy
CEO Sun Microsystems

What if you got a dog for a
reason?
• Can take years to mature
• Months of intensive training (when your
attention should be elsewhere)
• Dog becomes very bonded to the
handler (and vice versa)
• Poor training results in a misbehaving
dog
Developers don’t want to be tied to one system
You don’t want your developers tied to one system

What is a distribution?

Benefits of using a distribution
Stability
Each Apache project has its own dependencies and release cycle. Getting them to work together requires effort and thorough testing.
Regular upgrades
Code in open source changes constantly. Cloudera provides a new feature release every quarter that is tested and supported.
24x7 support and bug fixes
Distribution vendors should employ open source committers who can make sure fixes are added to the open source base.

More benefits of using a distribution
Faster to market
With a distribution, you can start developing applications right away. Building an environment from scratch would take months.
Minimize risk
With a distribution, you know what it will cost and you know that it will work. Building an environment from scratch provides no such guarantees.
Focus on business problems
Building an environment from scratch would require the focus of a few of your best developers. Get them working on the real problem.

The big data ecosystem vendors
(Spark) (Kafka)
Comprehensive distributions
Single+ project specialists
Proprietary + Hadoop in the gaps
(Cassandra)
Google Cloud Dataproc

Apache software foundation
ASF board of directors (appoints the PMC chair)
For each project:
• Project management committee (PMC) chair – ensures the project complies with ASF requirements
• PMC members – decide the architecture, feature set and direction of the project; usually also committers
• Committers – have write access to the code, although contributions are approved by the PMC
• Developers (aka contributors) – anyone may propose changes to the code or documentation, but those changes have to be picked up and used by a committer
• Users – provide feedback, bug reports and feature suggestions

Apache project requirements
• Must be Apache licensed (may include compatibly licensed elements)
• Free to download and use for any purpose
• Branding requirements and restrictions
• Source code must be open and available on the ASF website
• Must provide sufficient documentation to use the project on website
• Releases must follow the ASF PMC voting policies
• Corporations may not directly contribute – only individuals
• Must govern themselves independently of undue commercial influence
• Must not discourage new contributions from competing vendors
• Low diversity may incur ‘extra scrutiny’ from the board
However, there are NO requirements to:
• Have more than one commercial entity involved (random community members are ok)
• Contribute to an existing project when there is overlap in functionality (competitive projects are ok)
• Contribute modifications or enhancements back to the project
• Employ Committers or PMC members if you are a commercial vendor

Cloudera’s commitment to our customers
Anything that stores your data
Any APIs your applications call
Uses open source code
Our contributions and fixes go back
to open source first
When possible, use projects
supported by multiple commercial
vendors
Keeping your cluster running
Cloudera express edition
No limit to number of servers
Managing your applications
Employ* committers, if not PMC
members, on the projects we
support
* People manage their own careers. Temporary gaps may exist
High availability features
Ensure your success
Open source
License expiration won’t stop
the cluster
Free to use forever Provide enterprise value
RBAC over your data
24x7 support
Minimize your risk
Rolling upgrades
Data governance and lineage
Automated backup and recovery
Full disk encryption
Multi-tenant usage reports

Data storage and stewardship
Optimizing for inexpensive, reliable storage accessed by
multiple execution engines

Anatomy of a big data cluster
[Diagram: a cluster administered by Cloudera Manager, with an optional Cloudera Director and cloud plugin. Every host runs a CM Agent.
Masters: Name Node, Secondary Name Node, YARN, Zookeeper, HMaster, Kudu Master, HiveServer, Impala Catalog Store, Impala Statestore, Sentry Server, HUE Server, Oozie Server and metadata database(s).
Workers: each runs a Data Node and YARN resource pool(s), plus either an HBase Region Server and Search or a Kudu Tablet Server and Impala Daemon.
Gateway(s): CDSW hosts running CDSW sessions and user apps.]

HDFS
[Diagram: a Name Node, with Secondary and Standby Name Nodes, tracks FileQ, which is split into blocks BX, BY and BZ. Each block is stored as three replicas (BX1 to BX3, BY1 to BY3, BZ1 to BZ3) spread over Data Nodes A to D on Rack1 to Rack3.]
Default block size = 256 MB
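As a rough illustration of the block layout on this slide, the sketch below estimates how many blocks and replicas a single file produces. It assumes the 256 MB block size stated above and three replicas per block, as drawn in the diagram; the function name and the 700 MB example file are hypothetical.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=256, replication=3):
    """Return (logical blocks, stored replicas, raw MB on disk) for one file."""
    # A file is cut into fixed-size blocks; the last block may be partly filled.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is stored `replication` times on different Data Nodes.
    replicas = blocks * replication
    # A partly filled block only occupies its actual size on disk.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb

# Hypothetical 700 MB file: 3 blocks (like BX, BY, BZ), 9 replicas in total.
blocks, replicas, raw_mb = hdfs_footprint(700)
print(blocks, replicas, raw_mb)
```

The point of the arithmetic is that raw capacity planning has to account for the replication factor, not just the logical data size.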

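HDFS Snapshots
[Diagram: a directory tree (…, user, hive, tables, sales, subscriptions) containing Data1.parquet and Data2.parquet, with a .snapshot directory holding snap1 that references the same two files. The Name Node tracks the snapshot while the underlying blocks (BX1, BX2, BY1, BY2, BZ1, BZ2) stay on the Data Nodes.]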

Public cloud blob storage
Public clouds offer low-cost, highly available storage
Designed for access inside and outside of Hadoop
• Amazon Simple Storage Service (S3)
Uses ‘bucket’ paradigm
Requires S3Guard (Apache open source) to achieve consistency
Use protocol s3a://<bucket name>/<filename>
• Microsoft Azure Data Lake Store (ADLS)
‘Feels’ more like a normal (POSIX) file system
Use protocol adl://<directory>/<directory>/filename

Compute over storage
[Diagram: compute engines (Spark, Impala, MapReduce, Search, Hive, Pig) layered over storage and filesystem options (Kudu, HBase, HDFS, S3, ADLS)]

Schema on write or ‘structured data’
1. Define schema
2. Create table(s)
3. Map known fields
4. Discard unknown fields

Schema on read or ‘unstructured data’
1. Write whole record(s) to filesystem (compressed)
2. Register schema with metastore
3. Query engine applies schema to data
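The difference between the two approaches can be sketched in a few lines. This is a minimal illustration, not how any particular engine works: the field names, the in-memory "table" and "files" lists, and the function names are all hypothetical.

```python
import json

SCHEMA = ("id", "name")  # hypothetical schema defined up front

def write_with_schema(record, table):
    """Schema on write: map known fields, discard unknown fields."""
    table.append({field: record.get(field) for field in SCHEMA})

def write_raw(record, files):
    """Schema on read: store the whole record untouched."""
    files.append(json.dumps(record))

def read_with_schema(files, schema):
    """Schema on read: the schema is applied only when the data is queried."""
    return [{field: json.loads(line).get(field) for field in schema}
            for line in files]

record = {"id": 1, "name": "Ann", "email": "ann@example.com"}

table, files = [], []
write_with_schema(record, table)   # the email field is lost here
write_raw(record, files)           # the email field is still in the raw record

print(table)
print(read_with_schema(files, ("id", "email")))
```

With schema on write, a field that was not anticipated is gone for good. With schema on read, a new schema can be registered later and applied to data that is already stored.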

Popular file format options
XML, JSON Files
Can’t be both split and compressed
Text/Delimited/CSV/JSON Records
Usable everywhere
Schema on read
Poor performance, poor compression
Avro
Contain schema, but also allow schema on read
Usable inside and outside of Hadoop
Parquet
Columnar, splittable, query performance benefits, excellent compression
Support schema evolution (adding columns)
Skips columns well during scans
ORC (not supported by Cloudera, HDP Hive only)
Similar to Parquet, with higher compression but poorer data skipping
Hortonworks is working on ACID transactions and secondary indexes
File type                        Example size
Uncompressed CSV                 1.8 GB
Avro                             1.5 GB
Avro w/ snappy compression       750 MB
Parquet w/ snappy compression    300 MB
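To show why a columnar format such as Parquet "skips columns well during scans", the sketch below pivots a few records from a row layout into a column layout and counts how many values a single-column aggregate has to touch in each. It is a toy model in plain Python, not the Parquet file format; the sample records and names are made up.

```python
rows = [  # hypothetical records in a row-oriented layout
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "EU", "amount": 20.5},
    {"id": 3, "country": "US", "amount": 5.5},
]

def to_columns(records):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# SUM(amount) over a row layout reads every field of every record ...
values_read_row_layout = sum(len(record) for record in rows)
# ... while a column layout reads only the one column it needs.
values_read_column_layout = len(columns["amount"])

total = sum(columns["amount"])
print(total, values_read_row_layout, values_read_column_layout)
```

Storing similar values next to each other is also what lets columnar files compress well, which is consistent with the size comparison in the table above.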

Raw and formatted data copies
• Keep the raw version if there is a chance that information will be lost in translation
• Use columnar storage for formatted data to improve analytic performance
• Think about a metadata tagging policy (e.g. Cloudera Navigator) to assist with data stewardship

Big data pipelines
Data ingestion: Capture, Stream, Move
Data engineering: Cleanse, Conform, Transform, Enrich
Data stewardship: Store, Secure, Govern, Tag
Data science: Model, Score, Enrich, Predict
Data analytics: BI, Online, APIs

Which do you want?
Data lake Data hub

Data lake to a data hub
• Comprehensive, planned and enforced data hierarchy
• Carefully administered versioning and retention policies
• Comprehensive, unified security, governance and
lineage
• Encourage and support metadata
• Establish standards for data, metadata and analytic
models
• Maximize reuse of data without making copies
• Balanced with security and performance concerns – don’t be an
ideologue!
• Plan staffing around new roles

Data integration
Optimizing for data ingestion with volume, velocity and variety

Apache Flume
[Diagram: tiers of Flume agents collect data, then filter, transform, compress and encrypt it on the way into HDFS]
• Pre-process data before storing
– Such as transform, scrub or enrich
• Store in any format
– Text, compressed, binary, or custom sink
• Collect data as it is produced
– Files, syslogs, stdout or custom source
• Process in place
– Such as encrypt or compress
• Write in parallel
– Scalable throughput

Apache Kafka
[Diagram: TopicA is split into Partition0 on Broker1, Partition1 on Broker2 and Partition2 on Broker3. Two producers write to the brokers. ConsumerA and ConsumerB, members of one consumer group, and a separate consumer read from them.]
Producers push to Kafka. Consumers pull from Kafka.
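The push/pull model and the role of partitions can be sketched with a small in-memory model. This is a toy, not the Kafka client API: the `Topic` and `ConsumerGroup` classes are hypothetical, CRC32 stands in for Kafka's real key partitioner, and the round-robin assignment is only one possible strategy.

```python
import zlib
from collections import defaultdict

class Topic:
    """A topic is a set of append-only partition logs."""
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        """Producer side: push a record to the partition chosen by the key hash."""
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append((key, value))
        return index

class ConsumerGroup:
    """Members of a group split the partitions between them."""
    def __init__(self, topic, members):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)  # next unread position per partition
        self.assignment = defaultdict(list)
        for partition in range(len(topic.partitions)):
            self.assignment[members[partition % len(members)]].append(partition)

    def poll(self, member):
        """Consumer side: pull everything new from the member's partitions."""
        records = []
        for partition in self.assignment[member]:
            log = self.topic.partitions[partition]
            records.extend(log[self.offsets[partition]:])
            self.offsets[partition] = len(log)
        return records

topic = Topic("TopicA", num_partitions=3)
for i in range(9):
    topic.send(f"sensor-{i % 4}", f"reading-{i}")

group = ConsumerGroup(topic, ["ConsumerA", "ConsumerB"])
seen = group.poll("ConsumerA") + group.poll("ConsumerB")
print(len(seen))
```

Because each partition is read by exactly one member of the group, every record is delivered to the group once, and records with the same key stay in order within their partition.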

Kafka redundancy
Broker3
TopicA- Partition2
TopicA- Partition0 -Replica
Broker3
TopicA- Partition1
Broker3
TopicA- Partition0

Apache Sqoop
RDBMS
HDFS
▪ Rapidly moves large amounts of data between relational databases and HDFS
– Import tables (or partial tables) from an RDBMS into HDFS
– Export data from HDFS to a database table
▪ Uses JDBC to connect to the database
– Works with virtually all standard RDBMSs
▪ Custom “connectors” for some RDBMSs provide much higher throughput
– Available for certain databases, such as Teradata and Oracle

Data engineering
Optimizing for parallel processing of big data with minimum code

Directed acyclic graph (DAG)

Directed acyclic graph (DAG)
✔
✖
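One way to make the "directed" and "acyclic" requirements concrete is a check that a set of edges contains no cycle. The sketch below uses Kahn's topological-sort algorithm; the function name and the example graphs are illustrative and not taken from the slides.

```python
from collections import deque

def is_dag(edges):
    """Return True if the directed graph given as (source, target) pairs has no cycle."""
    nodes = {node for edge in edges for node in edge}
    indegree = {node: 0 for node in nodes}
    successors = {node: [] for node in nodes}
    for source, target in edges:
        successors[source].append(target)
        indegree[target] += 1

    # Repeatedly remove nodes that have no remaining incoming edges.
    ready = deque(node for node in nodes if indegree[node] == 0)
    removed = 0
    while ready:
        node = ready.popleft()
        removed += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # If a cycle exists, its nodes never reach indegree 0 and are never removed.
    return removed == len(nodes)

print(is_dag([("A", "B"), ("B", "C"), ("A", "C")]))  # a valid DAG
print(is_dag([("A", "B"), ("B", "C"), ("C", "A")]))  # contains a cycle
```

The order in which nodes are removed is also a valid execution order, which is why a processing engine can schedule a DAG of steps without deadlock.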

Resilient Distributed Dataset (RDD)
An RDD is an immutable, distributed collection of elements of your data, partitioned
across the nodes in your cluster, that can be operated on in parallel through an API that
offers transformations and actions.
map (function)
filter (predicate)
sortBy (function)
join (RDD2)
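The idea of immutable, partitioned data with lazy transformations and eager actions can be sketched without a cluster. The `ToyRDD` class below is a hypothetical stand-in written for illustration; it is not the Spark API, and it processes partitions one after another rather than in parallel on different nodes.

```python
class ToyRDD:
    """Immutable, partitioned collection with lazy transformations (toy model)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = tuple(tuple(p) for p in partitions)  # never mutated
        self._lineage = lineage  # pending per-partition steps, in order

    def map(self, function):
        """Transformation: returns a new ToyRDD, computes nothing yet."""
        step = lambda part: [function(x) for x in part]
        return ToyRDD(self._partitions, self._lineage + (step,))

    def filter(self, predicate):
        """Transformation: returns a new ToyRDD, computes nothing yet."""
        step = lambda part: [x for x in part if predicate(x)]
        return ToyRDD(self._partitions, self._lineage + (step,))

    def collect(self):
        """Action: run the recorded steps on every partition and gather results."""
        result = []
        for part in self._partitions:  # a real engine would run these in parallel
            data = list(part)
            for step in self._lineage:
                data = step(data)
            result.extend(data)
        return result

    def count(self):
        """Action: number of elements after all transformations."""
        return len(self.collect())

numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(even_squares.collect())
print(numbers.collect())  # the source collection is unchanged
```

Because each derived collection only records how to compute itself from its parent, a lost partition can be rebuilt by replaying the recorded steps, which is the "resilient" part of the name.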

Apache Spark
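[Diagram: RDD lineage graph. One branch runs through RDDA, RDDB and RDDC using map and groupBy; a second branch runs through RDDD, RDDE and RDDF using filter and map; a join combines the branches into RDDG.]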

Spark stages
[Diagram: the two branches of the lineage graph (RDDA, RDDB, RDDC with map and groupBy; RDDD, RDDE, RDDF with filter and map) grouped into stages]

Spark stages
[Diagram: the same two branches grouped into stages, now including the join that produces RDDG]

Spark caching
[Diagram: the same lineage graph (RDDA to RDDG, with map, groupBy, filter, map and join), illustrating caching]

Evolution of the RDD API
DataFrame (Spark 1.3)
• Untyped, for R and Python
• Adds concept of ‘Schema’ to describe the data
• Uses RDDs underneath
• Allows Spark engine to perform some optimizations
• Avoids use of Java serialization, uses off heap storage
• Required different API than RDDs
RDD (Spark 1.0)
• Can be strongly typed in Java, Scala
• Uses RDDs underneath
• Catch compile-time errors
Dataset (Spark 2.x)
• Unified API
• Typed and untyped

Spark Streaming
TCP/IP

Other DAG/streaming processors
(not supported by Cloudera)

Spark ecosystem
Spark core
Spark SQL
Spark Streaming
Spark ML GraphX
Standalone Mesos
(not included in CDH)
YARN

Spark SQL
+ Static typing (optional)
+ Storage and processing efficiencies

ETL into EDW
[Diagram: data sources feed an ETL step that loads the EDW. The EDW serves an archive, data marts, canned reports, dashboards/analytic applications, non-SQL workloads and self-service BI/ad hoc queries.]

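EL-T into EDW
[Diagram: data sources are extracted and loaded (EL) into the EDW, and the transform (T) step runs afterwards. The EDW serves an archive, data marts, canned reports, dashboards/analytic applications, non-SQL workloads and self-service BI/ad hoc queries.]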

Modern data warehouse landscape
[Diagram: data sources feed a Modern Data Platform made up of an analytic database, an operational database, and data science & engineering over a shared data layer. Together with an EDW, it serves fixed reports, dashboards/analytic applications, non-SQL workloads, self-service BI/ad hoc queries and flexible reporting.]
Cloudera’s featured data engineering partners
Hadoop Native Solution

Data analytics
Optimizing the engine to match the use case

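Apache Hive
[Diagram: clients connect through JDBC, ODBC, the Thrift service or the Beeline CLI to HiveServer2, where each session (SessionA, SessionB) has its own driver, compiler and executor. The Hive Metastore records the location, schema, SerDe and file format of data held in HDFS, blob storage or other storage.]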

Apache Hive
✓ Spins up processes under the control of YARN
- shares resources well on the cluster
- but there is a lot of overhead to create these processes
✓ Can handle the failure of a machine during the query
- but recovery takes many seconds
✓ Will overflow join data to HDFS
- can handle very large joins
- but HDFS writes data 3 times, so this takes time
Don’t forget
who won the
race, Bucko!
Hive on Spark (Cloudera, MapR, Databricks)
✓ Improves speed due to efficiencies of Spark
Live Long and Process (Hortonworks)
✓ Improves speed by using pre-allocated JVMs w/ caching
Presto (Facebook)
✓ Improves speed by optimizing data transfers for SQL and using data streaming instead of HDFS for intermediate data
But all of these solutions are still JVM based

Apache Impala
✓ Written in C++
- avoids issues of the JVM
✓ Uses the Hive metastore
- better integration for security and administration
✓ Uses pre-allocated processes on worker nodes
- no process spin up time
- but still builds an execution plan for each query
✓ Employs algorithms from MPP databases
But I left you
in the dust at
the starting
line, Grandpa!
If a machine fails during a query that only takes 1 second to run, you will just retry the query.
Adopted by:
(the fastest of the antelopes)

So which engine should I choose?
"If the only tool you have is a hammer, you tend to
see every problem as a nail."
- Abraham Maslow
Psychologist
Author of ‘Maslow’s Hierarchy of Needs’
[Diagram: compute engines (Spark, Impala, MapReduce, Search, Hive, Pig) layered over storage and filesystem options (Kudu, HBase, HDFS, S3, ADLS)]

Other SQL engines
LLAP
Stinger.next
Cube
Hive ++
aka Live Long And Process
For JSON lovers
Tied to proprietary front/back
Layer over HBase
SQL engines ‘from scratch’
Low Latency Analytical Processing
(not supported by Cloudera)
IBM Big SQL
OLAP

How to interpret benchmark tests
Standard test? How many of the queries were
run?
What is the criterion for excluding a query?
Single-user or multi-user? Data size?
Allow modifications to the queries?
"There are three kinds of lies:
lies, damned lies, and statistics."
-Benjamin Disraeli
Prime Minister of Britain

Life after lambda architectures and IoT
Optimizing for time series and changing data

Updates or analytics
[Chart: two storage engines, shown as logos, compared on analytics (scans) and online (random access) workloads; each is fast at one and slow at the other]
(but not both at the same time)
Write once, read many. No updates, but can append (sort of)
Optimized for batch inserts and scans
Read, write, update individual rows
Optimized row-based access, sparse columns

Lambda architectures
(named for the simple shape)

Lambda architectures
(not so simple in practice)
Source: http://horicky.blogspot.com/2014/08/lambda-architecture-principles.html

Kudu design goals
[Chart: the same fast/slow comparison of analytics (scans) and online (random access) workloads, now including Kudu]
High throughput for big scans
Goal: Close to Parquet on HDFS
Low-latency for short accesses (primary key indexes
and quorum design)
Goal: 1ms read/write on SSD
Database-like semantics (initially single-row ACID)
Relational data model
SQL query
“NoSQL” style scan/insert/update (Java client)

Why are updates important?
Right to forget
ETL mistakes/corrections
Analytic enrichment

Life without Lambda
with
BI
Online

Kudu use cases
Kudu is best for use cases requiring a
simultaneous combination of sequential and
random reads and writes
● Time series data
○ Examples: Stream market data; fraud detection
and prevention; risk monitoring
○ Workload: Inserts, updates, scans, lookups
● Machine data analysis
○ Examples: Network threat detection
○ Workload: Inserts, scans, lookups
● Online reporting
○ Examples: ODS
○ Workload: Inserts, updates, scans, lookups

Data science
Optimizing to detect complex patterns over time

Ask bigger questions

Data science is a big data problem
“It’s not who has the best algorithm that wins. It’s
who has the most data.”
Banko and Brill, 2001

Notebooks
What was our revenue last year?
RDBMS
$14,325,874,321.07
What will our revenue be next year?
• Assumptions
• Algorithms
• Source Data
• Methodology
Your code tells a story
• Tell it with pictures & results
• Allow someone to re-run the numbers
• Pass it to someone who may use it as
the basis for a new/different story

Notebook challenges
Access
For sensitive data, secure clusters are
difficult to access. And IT typically doesn’t
want random packages installed on a secure
cluster.
Popular open source tools don’t easily
connect to these environments, or always
support Hadoop data formats.
Scale
Laptops rarely have capacity for
medium, let alone big data. This leads
to a lot of sampling.
Popular frameworks don’t easily
parallelize on a cluster. Typically code
has to get rewritten for production.
Developer Experience
Notebooks, while awesome, don’t easily
support virtual environment and
dependency management, especially for
teams. This makes sharing and
reproducibility hard.
Notebooks are also challenging to “put
into production.”

‘Dependency hell’
Or ’I am my own Grandpa’
X (1.0.0)
Y (1.0.0)
MyApp
X (1.0.0)
Y (1.0.0)
MyApp
X (1.1.0)
Upgrade
Dependency Graph for Hadoop Java Client
www.visioneye.com

Cloudera Data Science Workbench
▪ Team-based
▪ R, Python, Scala
▪ SDLC
▪ Secure
▪ Containerized
▪ Integrated into the cluster

The importance of an open ecosystem
Open ecosystem Black box

Containers
[Diagram: two stacks compared. VM: hardware, host OS and hypervisor (optional), then a separate guest OS and set of libs for each of AppA1, AppA2 and AppB. Container: hardware, host OS and container daemon, with one set of libs shared by AppA1 to AppA3 and another shared by AppB1 to AppB4.]
• Use less memory than VMs
• You get to use more of the machine you pay for
• Provide isolation between apps
• Can share libraries between similar apps
• Provide abstraction of the OS, not of the HW
• Get you out of ‘Dependency Hell’ against other applications

Scaling data science for big data
[Diagram: a user logs in from a web browser and starts a session on a CDSW gateway host, where Kubernetes runs the CDSW sessions. The sessions work against the cluster's master(s) (Name Node, YARN) and workers (each with a Data Node and YARN resource pool(s)).]

Machine learning pipeline in Spark
Training: Load learning data frame, Clean/process data, Extract and transform features, Vectorize features, Fit and access model, Save model
Testing: Load test data frame, Test model, Test results
Scoring: Load scoring data frame, Score data, Save results, Scoring results
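The train, test and score steps named on this slide can be walked through with a very small model. The sketch below uses plain Python and a nearest-centroid classifier purely as an illustration; it is not Spark ML, and the feature names (`height`, `weight`), labels and sample records are invented.

```python
import json

def clean(rows):
    """Clean/process: drop records that have a missing value."""
    return [row for row in rows if None not in row.values()]

def vectorize(row):
    """Extract and vectorize the two hypothetical features."""
    return [float(row["height"]), float(row["weight"])]

def fit(rows):
    """Fit: compute one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for row in rows:
        vec = vectorize(row)
        acc = sums.setdefault(row["label"], [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[row["label"]] = counts.get(row["label"], 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(model, row):
    """Score one record: pick the label whose centroid is closest."""
    vec = vectorize(row)
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Load learning data, clean it, fit the model, then save it.
learning = [
    {"height": 1, "weight": 1, "label": "small"},
    {"height": 2, "weight": 1, "label": "small"},
    {"height": 9, "weight": 9, "label": "large"},
    {"height": 10, "weight": 8, "label": "large"},
    {"height": None, "weight": 5, "label": "large"},  # removed by clean()
]
saved_model = json.dumps(fit(clean(learning)))

# Load the saved model and test it on labelled data held back from training.
model = json.loads(saved_model)
test = [
    {"height": 1, "weight": 2, "label": "small"},
    {"height": 9, "weight": 10, "label": "large"},
]
accuracy = sum(predict(model, row) == row["label"] for row in test) / len(test)

# Score new, unlabelled data with the same model.
scoring = [{"height": 2, "weight": 2}, {"height": 8, "weight": 9}]
results = [predict(model, row) for row in scoring]
print(accuracy, results)
```

Saving the fitted model between the training and scoring phases is what lets the scoring job run later, and on different data, without retraining.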

Big Data in the Clouds
Optimizing for a variety of operational choices

My organization
is moving to the cloud,
why should we
consider ?

Traditional applications
[Diagram: five separate stacks (Data Exploration, SQL & BI Analytics, Operational Real-Time DB, ETL & Data Processing, Custom Functions), each with its own storage, security, governance, workload management and data catalog, and most with their own ingest & replication]
Many data silos, each with its own proprietary tools and infrastructure
Different vendors, products, and services on-premises versus in cloud
A fragmented approach is difficult, expensive, and risky

Multiple compute engines, same data
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
CORE
SERVICES
STORAGE
SERVICES

Common metadata
Data Catalog, Security, Governance, Lineage, Metadata Tags
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
CORE
SERVICES
STORAGE
SERVICES
METADATA
SERVICES

Data Silos 2.0
DW Cluster
DW Service
Source Data
A B
D
C

Deployed anywhere
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
PRIVATE CLOUD / BARE METAL / INFRASTRUCTURE SERVICES
CORE
SERVICES
STORAGE
SERVICES
METADATA
SERVICES
DEPLOYMENT
OPTIONS

But that still
wasn’t quite right ….

How do we deal with hybrid clouds?
• Shared catalog
• Unified security
• Consistent governance
• Easy workload management
• Flexible ingest and replication

Cloudera Enterprise
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
CORE
SERVICES
STORAGE
SERVICES
PRIVATE CLOUD / BARE METAL / INFRASTRUCTURE SERVICES
DEPLOYMENT
OPTIONS
The modern platform for machine learning and analytics optimized for the cloud

Deployment & management options
Bare Metal          Private Cloud       Cloud IaaS          Cloud PaaS
Applications        Applications        Applications        Applications
Clusters            Clusters            Clusters            Clusters
Operating System    Operating System    Operating System    Operating System
Network             Network             Network             Network
Storage             Storage             Storage             Storage
Servers             Servers             Servers             Servers
Customer managed Vendor managed
Manager Director Altus

• Easy
• Agile
• Unified

Altus service architecture
● Runs in Cloudera’s secured and monitored environment
● Manages CDH clusters in customer cloud account
● Customer data does not pass* to Cloudera
* Workload Analytics requires opt-in log data transfer to Cloudera

Keep your encryption keys outside of the cloud

Cloudera usage based pricing option
Pay per use
✓ Cheaper for transient clusters
✓ Cheaper for small machine types
✓ Pay as you go or discounted credits
Node based pricing
✓ Cheaper for persistent or long-running clusters
✓ Volume & enterprise discounts

Hot-Warm-Cold Data
Store partitions from the same table in different storage types
[Diagram: days of ‘hot’ data (0, 1, 3, 14) plotted against AWS instance types (m4.4xlarge, i2.2xlarge, d2.4xlarge) and storage (EBS, S3), with preload and serve steps and AWS instance premiums of 200% and 320%]

BDR to Blob Storage
✓ Minimum storage cost
✓ No backup cluster costs (servers or subscription)
✓ RPO unaffected
✓ Cloud provider manages regional locality
✗ RTO longer
[Diagram: the user / sales / contracts / North America directory, holding Contract1.txt and Contract2.txt plus a .snapshots entry (snap 4-21-17) with the same two files, is backed up to AWS S3 or ADLS]

Cybersecurity
Optimizing to detect complex attacks over longer periods of time

Cybersecurity is a big data problem
Explosion of data
Popular cyber platforms cannot cost-effectively scale to the volume and variety of modern data
Limited enterprise visibility
Only a partial view of the enterprise limits analytics and slows investigations
Limited analytic processing
Difficult to deploy advanced machine learning detection capabilities
[Charts: data access falls (100%, 50%, 1%) as data volume grows (1 TB, 1 PB, 10 PB); user, network and endpoint data over time, archived versus emerging; rule-based detection of the form IF (X) AND (Y) THEN (Z)]

Open Data Models:
Enterprise Visibility, Support For Multiple Workloads
Endpoint User
Network
DIVERSE DATA SOURCES SINGLE ACCESS
Source: Momentum Partners Cybersecurity Snapshot April 2016

The value of Apache Spot
• Detect advanced threats faster with a full complement of analytic frameworks for all cyber workloads
• Faster time to incident investigation and response with comprehensive enterprise visibility
• Change the economics of cybersecurity with an open source platform that supports multiple LOB workloads

Many applications on one shared data set and architecture
Packaged
Visualization & machine learning applications can share a common data set & infrastructure
Open Source
The Spot community is developing machine learning (e.g. network threat detection)
Custom
Build custom applications & analytics using Cloudera without having to buy new infrastructure

But I already have Splunk …
Go beyond Splunk’s SPL
• Share enriched data across multiple analytic processing engines
• Simple search, SQL, Python, R, Scala
Data flexibility
• Faster, more agile, full-fidelity data acquisition
• Data portability: open data model and open storage
Cost-effective scalability
• Elastic scale on-prem or in the cloud
• Cloud-native pay-per-use and transience
• Proven at big data scale
Hybrid
• Runs across multi-clouds & on-prem
• Multi-storage over S3, HDFS, Kudu, Isilon, etc.

Management
Optimizing for reliable uptime and optimal resource utilization

Big data and the administrator
Get up and running
Monitor and maintain
Troubleshoot and resolve
Grow and adapt

Get up and running
Cloudera manager
service
Cloudera archives
Cloudera manager
agent
Packages
Templates
RoleC
RoleB
RoleA
Cluster member

Monitor and maintain
Services
Hosts
Applications Resources

Troubleshoot and resolve
Add your own customized
charts
See performance and resource utilization at a glance
Select historical time period for charts

Grow and adapt
• Utilization by tenant
• Project future needs
• Prioritize pre-emption

Backup and disaster recovery (BDR)
➢ Distributed (uses distcp)
➢ Work done by target cluster
➢ Secure (can have different encryption keys on each side, encrypted in motion)
➢ Bandwidth limited (optional)
[Diagram: two federated clusters, each with a user / sales / contracts tree containing North America and EMEA directories, the files Contract1.txt to Contract3.txt, and a .snapshots entry (snap 4-21-17)]

Information security
Optimizing for minimum risk

Big data security
Authentication, authorization, audit and compliance
Perimeter
Guarding access to the cluster itself
Technical concepts: authentication, network isolation
Cloudera Manager
Access
Defining what users and applications can do with data
Technical concepts: permissions, authorization
Apache Sentry & RecordService
Visibility
Reporting on where data came from and how it’s being used
Technical concepts: auditing, lineage
Cloudera Navigator
Data
Protecting data in the cluster from unauthorized visibility
Technical concepts: encryption, tokenization, data masking
Navigator Encrypt & Key Trustee | Partners

Active directory and Kerberos
Perimeter
Active directory
• Manages users, groups, and services
• Provides username/password authentication
• Group membership determines service access
Kerberos
• Trusted and standard third party
• Authenticated users receive “tickets”
• “Tickets” gain access to services
[Diagram: user ssmith enters a password and authenticates to AD; the authenticated user gets a Kerberos ticket; the ticket grants access to services, e.g. Impala]

Apache Sentry
• Apache Sentry is an authorization module for Hadoop
• Apache-licensed project
• Supported by multiple vendors
• Used in many industries
• Used by Hive, Impala, Search & Spark
• Syncs with HDFS ACLs
• Supports ease of administration through role-based authorization (RBAC)
[Diagram label: Spark bindings, Spark]
Access
Centralized role-based access control
[Diagram: the Sentry role “U.S. Customer Transaction Analysis” is shown with two Sentry permissions (read access to Transactions.Date… where Country = US, and read access to Customers.CustomerID… where Country = US) and with the groups “Tier 1 Customer Support Reps” (Sam Smith) and “Tier 1 Broker Analysts” (Martha Jones). Sample Customers (Cust. ID, SSN, Phone, Country) and Transactions (Date/Time, Cust. ID, Trade, Country) tables contain both US and EU rows.]
Access
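The chain from user to group to role to permission can be sketched as a few lookups. This is a toy model and not the Sentry API: the dictionaries, the `read` function and the sample rows are hypothetical, only the column names that are legible on the slide are used, and granting the role to both groups is an assumption about what the diagram shows.

```python
# Permissions bundled into a role: table, readable columns, and a row filter.
ROLE_PERMISSIONS = {
    "U.S. Customer Transaction Analysis": [
        {"table": "Customers", "columns": {"CustomerID"}, "country": "US"},
        {"table": "Transactions", "columns": {"Date"}, "country": "US"},
    ],
}
# Roles are granted to groups, never to individual users (assumed mapping).
GROUP_ROLES = {
    "Tier 1 Customer Support Reps": ["U.S. Customer Transaction Analysis"],
    "Tier 1 Broker Analysts": ["U.S. Customer Transaction Analysis"],
}
USER_GROUPS = {
    "Sam Smith": ["Tier 1 Customer Support Reps"],
    "Martha Jones": ["Tier 1 Broker Analysts"],
}

def read(user, table, rows):
    """Return only the rows and columns the user's roles allow."""
    visible = []
    for group in USER_GROUPS.get(user, []):
        for role in GROUP_ROLES.get(group, []):
            for perm in ROLE_PERMISSIONS.get(role, []):
                if perm["table"] != table:
                    continue
                for row in rows:
                    if row["Country"] == perm["country"]:
                        visible.append({col: row[col] for col in sorted(perm["columns"])})
    return visible

customers = [  # made-up sample rows; SSN values are placeholders
    {"CustomerID": 6758493, "SSN": "000-00-0001", "Country": "US"},
    {"CustomerID": 1234567, "SSN": "000-00-0002", "Country": "EU"},
    {"CustomerID": 5768459, "SSN": "000-00-0003", "Country": "US"},
]

print(read("Sam Smith", "Customers", customers))
print(read("Unknown User", "Customers", customers))
```

Centralizing the rules in roles means that changing what an analyst may see is a matter of editing one role, not every user account.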

Auditing
Track, understand, and
protect access to
sensitive data
• Auditing needs to happen automatically
• Audit logs need to be immutable
• Need to be able to drill down on events
to the original events/data
Visibility

Governance
Faceted search
Natural language
Incremental filters
Drill down links
Visibility
Used to facilitate research and the ability to find groups of similar assets
Jump to
application log

Metadata
Automatic collection
• No need to create XML files or manage manual controls
Complete aggregation
• Full coverage across all platform components
Simple accessibility
• Integrated user interface with full-text search

Visibility
Enterprise metadata
The foundation for data management and governance
Metadata enables you to put context and meaning to data to
answer the important questions
Technical Managed Custom
Unified metadata repository
Who are the high-value customers?
How do we define that?
How is high value calculated?
Where is customer data stored and used?
Is the data reliable and accurate?

Lineage
• Where did the data come from?
• Who ran the process that created
the data?
• What code was used to generate
the values?
• Which files and columns were
used to derive the values?
Visibility

Is it encrypted?
Data written to HDFS ✓
Metadata in RDBMS ✗
Spill-over files ✗
Data

Cloudera Navigator Encrypt
Transparent layer between application
and file system
• Compliance-ready
• Massively scalable
• High performance: Optimized for Intel
• Separation of duties
• Key management with Navigator Key
Trustee
Data

Cloudera Navigator Key Trustee
“Virtual safe-deposit box” for managing encryption keys or
other Hadoop security artifacts
• Separates Keys from Encrypted Data
• Centralized Management with Audit
Controls
• Integration with HSMs
• Roadmap: Management of SSL
certificates, SSH keys, tokens,
passwords, Kerberos Keytab Files,
and more
Data

Redacted Log Files
SELECT * FROM customers
WHERE ssn='123-45-6789'
hive.server2.logging.operation.log.location
HUE Saved Queries
Audit Logs
• Credit card numbers
• Social security numbers
• Email addresses
• Server host names / IP
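A minimal sketch of how such values might be masked before a line reaches a log file, assuming redaction is done with regular-expression rules. The patterns and replacement strings below are simplified illustrations, not the rules any Cloudera component ships with, and they will miss many real-world formats.

```python
import re

# (pattern, replacement) pairs, applied in order. All are simplified examples.
REDACTION_RULES = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "XXX-XX-XXXX"),               # social security numbers
    (re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"), "XXXX-XXXX-XXXX-XXXX"),  # 16-digit card numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "<email>"),        # email addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),                # IPv4 addresses
]

def redact(line):
    """Replace every match of every rule before the line is written to a log."""
    for pattern, replacement in REDACTION_RULES:
        line = pattern.sub(replacement, line)
    return line

print(redact("SELECT * FROM customers WHERE ssn='123-45-6789'"))
print(redact("user sam@example.com from 10.0.0.12 paid with 4111 1111 1111 1111"))
```

Applying the rules at the point where log lines, saved queries and audit records are written keeps sensitive literals out of places that are usually less protected than the data itself.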

Thank you
The modern platform for machine learning and
analytics, optimized for the cloud
