1© Cloudera, Inc. All rights reserved.
Big data fundamentals
Understanding the optimization choices in big data components
Presentation goals
• Teach you something
• Help you see the potential of Big Data beyond MapReduce
• Be fair to Cloudera’s competitors
• Inspire you to learn more
If something doesn’t make sense, please ask.
Notification
• The information in this document is proprietary to Cloudera. No part of this document may be reproduced,
copied or transmitted in any form for any purpose without the express prior written permission of Cloudera.
• This document is a preliminary version and not subject to your license agreement or any other agreement
with Cloudera. This document contains only intended strategies, developments and functionalities of
Cloudera products and is not intended to be binding upon Cloudera to any particular course of business,
product strategy and/or development. Please note that this document is subject to change and may be
changed by Cloudera at any time without notice.
• Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant
the accuracy or completeness of the information, text, graphics, links or other items contained within this
material. This document is provided without a warranty of any kind, either express or implied, including but
not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement.
• Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect
or consequential damages that may result from the use of these materials. The limitation shall not apply in
cases of gross negligence.
Agenda
• Open source software
• Data storage and stewardship
• Data integration
• Data engineering
• Data analytics
• Life after Lambda architectures and IoT
• Data science at scale
• Big data in the clouds
• Cybersecurity as a Big Data problem
• Cluster management and security
• Customer success stories
• Questions and answers
Big data fundamentals
Open source software
Optimizing to benefit from community innovation
Free evaluation
Install, test, inspect, and evaluate open
source code in perpetuity, with no
financial obligation
Freedom from lock-in
Multiple vendors supporting same core
technology makes it easier to move
Scalable innovation
The collective work of a global,
passionate community keeps the code
base evolving
3 Reasons open source is good for companies
[1] [2] [3]
These benefits derive from use of the permissive
Apache License
Not business focus
Company assets should be working on
core competency
Real cost hard to measure
Time developers spend solving
problems or adding features often isn’t
visible
Multiple projects
Each project is managed by a separate
committee and there is not necessarily
an overriding design
3 Reasons open source adds risk for companies
[1] [2] [3]
“Open source software is free
like a puppy is free”
- Scott McNealy
CEO Sun Microsystems
What if you got a dog for a
reason?
• Can take years to mature
• Months of intensive training (when your
attention should be elsewhere)
• Dog becomes very bonded to the
handler (and vice versa)
• Poor training results in a misbehaving
dog
Developers don’t want to be tied to one system
You don’t want your developers tied to one system
What is a distribution?
Benefits of using a distribution
Stability
Each Apache project has its own
dependencies and release cycle.
Getting them to work together
requires effort and thorough testing.
Regular upgrades
Code in Open Source changes
constantly. Cloudera provides a
new feature release every quarter
that is tested and supported.
24x7 Support and bug fixes
Distribution Vendors should employ
Open Source Committers that can
make sure fixes are added to the
Open Source base.
More benefits of using a distribution
Faster to market
With a Distribution, you can start
developing applications right away.
Building an environment from
scratch would take months.
Minimize risk
With a distribution, you know what it
will cost and you know that it will
work. Building an environment from
scratch provides no such
guarantees.
Focus on business problems
Building an environment from
scratch would require the focus of a
few of your best developers. Get
them working on the real problem.
The big data ecosystem vendors
(Spark) (Kafka)
Comprehensive distributions
Single+ project specialists
Proprietary + Hadoop in the gaps
(Cassandra)
Google Cloud Dataproc
Apache software foundation
ASF board of directors
Project management committee chair – ensures the project complies with ASF requirements
PMC members – decide the architecture, feature set and direction of the project, usually are also
Committers
Committers – have write access to the code, although contributions are approved by the PMC
Developers (aka contributors) – anyone may propose changes to the code or
documentation, but those changes have to be picked up and used by a committer
Users – provide feedback, bug reports and feature suggestions
appoints
For each project
Apache project requirements
• Must be Apache licensed (may include compatibly licensed elements)
• Free to download and use for any purpose
• Branding requirements and restrictions
• Source code must be open and available on the ASF website
• Must provide sufficient documentation to use the project on website
• Releases must follow the ASF PMC voting policies
• Corporations may not directly contribute – only individuals
• Must govern themselves independently of undue commercial influence
• Must not discourage new contributions from competing vendors
• Low diversity may incur ‘extra scrutiny’ from the board
However, there are NO requirements to:
• Have more than one commercial entity involved (random community members are ok)
• Contribute to an existing project when there is overlap in functionality (competitive projects are ok)
• Contribute modifications or enhancements back to the project
• Employ Committers or PMC members if you are a commercial vendor
Cloudera’s commitment to our customers
Anything that stores your data
Any APIs your applications call
Uses open source code
Our contributions and fixes go back
to open source first
When possible, use projects
supported by multiple commercial
vendors
Keeping your cluster running
Cloudera express edition
No limit to number of servers
Managing your applications
Employ* committers, if not PMC
members, on the projects we
support
* People manage their own careers. Temporary gaps may exist
High availability features
Ensure your success
Open source
License expiration won’t stop
the cluster
Free to use forever Provide enterprise value
RBAC over your data
24x7 support
Minimize your risk
Rolling upgrades
Data governance and lineage
Automated backup and recovery
Full disk encryption
Multi-tenant usage reports
Big data fundamentals
Data storage and stewardship
Optimizing for inexpensive, reliable storage accessed by
multiple execution engines
Anatomy of a big data cluster Masters
Workers Gateway(s)
Cloudera
Manager
Data Node
HBase
Region Server
Search
YARN Resource
Pool(s)
CM Agent
Data Node
HBase
Region Server
Search
YARN Resource
Pool(s)
CM Agent
Data Node
HBase
Region Server
Search
YARN Resource
Pool(s)
CM Agent
Data Node
Kudu Tablet
Server
Impala
Daemon
YARN Resource
Pool(s)
CM Agent
Data Node
Kudu Tablet
Server
Impala
Daemon
YARN Resource
Pool(s)
CM Agent
Data Node
Kudu Tablet
Server
Impala
Daemon
YARN Resource
Pool(s)
CM Agent
HMaster
CM Agent
HUE Server
Zookeeper
Name Node
YARN
Kudu Master
⭐️ Zookeeper
Secondary Name
Node
Impala Catalog
Store
Kudu Master⭐️
HMaster
CM Agent
Sentry Server
Zookeeper
HiveServer
Impala Statestore
Kudu Master
HMaster
CM Agent
Oozie Server
CM Agent
CDSW
User App
User App
Metadata
Database(s)
CM Agent
CDSW
CDSW Session
CDSW Session
CDSW Session
CDSW Session
CDSW Session
Cloud Plugin
Cloudera
Director
(optional)
HDFS
Name Node
Secondary
Name Node
Standby
Name Node
Data NodeA Data NodeB Data NodeC Data NodeD
FileQ
BX BY BZ
BX1 BX2 BX3
BY1 BY3 BY2
BZ3 BZ2 BZ1
Rack1 Rack2 Rack3
Default block size
= 256 MB
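The arithmetic behind the diagram above can be sketched as follows. This is a minimal illustration, assuming the 256 MB block size shown on the slide and three replicas per block (as drawn for BX, BY and BZ); the function name and the 700 MB example file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 256   # block size stated on the slide
REPLICATION = 3       # three replicas per block, as drawn in the diagram


def hdfs_footprint(file_size_mb: float) -> tuple[int, int, float]:
    """Return (blocks, block replicas, raw MB stored) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    replicas = blocks * REPLICATION
    # The final block only occupies as much disk as the data it holds,
    # so raw storage is the file size times the replication factor.
    raw_mb = file_size_mb * REPLICATION
    return blocks, replicas, raw_mb


# Hypothetical 700 MB file: split into blocks, each stored three times.
blocks, replicas, raw_mb = hdfs_footprint(700)
print(blocks, replicas, raw_mb)
```

The point of the sketch is that replication multiplies raw storage, while the block count determines how many pieces can be read in parallel.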
HDFS Snapshots
…
user
hive
tables
sales
subscriptions
Data1.parquet
Data2.parquet
.snapshot
Data Node
BX1
Name Node
BY1 BZ1
BY2 BX2 BZ2
BY1 BX2 BY2
BX1 BZ1 BZ2
BX1 BY1 BY2
BX2 BZ1 BZ2
snap1
Data1.parquet
Data2.parquet
Public cloud blob storage
Public clouds are offering low cost, highly available storage
Designed for access inside and outside of Hadoop
Amazon Simple Storage Service (S3)
Uses ‘bucket’ paradigm
Requires S3 Guard (Apache Open Source) to achieve consistency
Use protocol s3a://<bucket name>/<filename>
• Microsoft Azure Data Lake Store (ADLS)
‘Feels’ more like a normal (POSIX) file system
Use protocol adl://<directory>/<directory>/filename
Compute over storage
Spark Impala MapReduce Search
Hive Pig
ADLS
Kudu HDFS
Compute
Storage
Filesystem
S3
HBase
Schema on write or ‘structured data’
1. Define schema
2. Create table(s)
3. Map known fields
4. Discard unknown fields
Schema on read or ‘unstructured data’
1. Write whole record(s) to
filesystem (compressed)
2. Register schema with metastore
3. Query engine applies
schema to data
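The contrast between schema on write and schema on read can be sketched in a few lines. This is a toy illustration only: the records, the field names and both helper functions are hypothetical, and a real cluster would use a table definition and a metastore rather than Python dictionaries.

```python
import json

# Hypothetical raw records; "coupon" is a field the original schema does not know.
raw_lines = [
    '{"id": 1, "amount": "19.99", "coupon": "SPRING"}',
    '{"id": 2, "amount": "5.00"}',
]

SCHEMA = {"id": int, "amount": float}  # illustrative schema defined up front


def schema_on_write(line: str) -> dict:
    """Map known fields at load time and discard everything else."""
    record = json.loads(line)
    return {name: cast(record[name]) for name, cast in SCHEMA.items()}


def schema_on_read(line: str, schema: dict) -> dict:
    """Keep the whole record on disk; apply a schema only when querying."""
    record = json.loads(line)
    return {name: cast(record[name]) for name, cast in schema.items() if name in record}


stored_rows = [schema_on_write(line) for line in raw_lines]  # coupon is discarded
stored_raw = list(raw_lines)                                 # whole records are kept

# A schema registered later can still reach the field that schema on write dropped.
later_schema = {"id": int, "coupon": str}
print([schema_on_read(line, later_schema) for line in stored_raw])
```

With schema on write the unknown field is gone for good; with schema on read it stays available to any schema applied afterwards.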
Popular file format options
XML, JSON Files
Can’t be both split and compressed
Text/Delimited/CSV/JSON Records
Usable everywhere
Schema on read
Poor performance, poor compression
Avro
Contain schema, but also allow schema on read
Usable inside and outside of Hadoop
Parquet
Columnar, splittable, query performance benefits, excellent compression
Support schema evolution (adding columns)
Skips columns well during scans
ORC (not supported by Cloudera; HDP Hive only)
Similar to Parquet, with higher compression but poorer data skipping
Hortonworks working on ACID transactions, secondary indexes
File type                        Example size
Uncompressed CSV                 1.8 GB
Avro                             1.5 GB
Avro w/ snappy compression       750 MB
Parquet w/ snappy compression    300 MB
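To make the size comparison concrete, the example sizes above can be turned into ratios against the uncompressed CSV. This is only arithmetic on the slide's own numbers, taking 1.8 GB as 1800 MB; it says nothing about how other data sets would compress.

```python
# Example sizes from the slide, in MB (1.8 GB taken as 1800 MB).
sizes_mb = {
    "Uncompressed CSV": 1800,
    "Avro": 1500,
    "Avro w/ snappy compression": 750,
    "Parquet w/ snappy compression": 300,
}

baseline = sizes_mb["Uncompressed CSV"]

# How many times smaller each format is than the CSV baseline.
ratios = {name: baseline / size for name, size in sizes_mb.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x smaller than CSV")
```

On these figures, Snappy-compressed Parquet takes one sixth of the space of the CSV, and Snappy-compressed Avro takes a little under half.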
Raw and formatted data copies
• Keep the raw version if there is a chance that information
will be lost in translation
• Use Columnar storage on formatted data to improve analytic
performance immensely
• Think about a metadata tagging policy (e.g. Cloudera
Navigator) to assist with Data stewardship
Big data pipelines
Data ingestion Data engineering Data stewardship Data science Data analytics
Move
Cleanse
Conform
Transform
Enrich
Store
Secure
Govern
Tag
Model
Score
Enrich
Predict
BI
Online
APIs
Capture
Stream
Which do you want?
Data lake Data hub
Data lake to a data hub
• Comprehensive, planned and enforced data hierarchy
• Carefully administered versioning and retention policies
• Comprehensive, unified security, governance and
lineage
• Encourage and support metadata
• Establish standards for data, metadata and analytic
models
• Maximize reuse of data without making copies
• Balanced with security and performance concerns – don’t be an
ideologue!
• Plan staffing around new roles
Big data fundamentals
Data integration
Optimizing for data ingestion with volume, velocity and variety
Apache Flume
HDFS
Flume Agent
Flume Agent(s)
Compress
Flume Agent Flume Agent Flume Agent
Flume Agent Flume Agent
Filter Transform
Flume Agent
Encrypt
Flume Agent
•Pre-process data before storing
• Such as transform, scrub or
enrich
• Store in any format
• Text, compressed, binary, or
custom sink
•Collect data as it is produced
• Files, syslogs, stdout or
custom source
•Process in place
• Such as encrypt or
compress
• Write in parallel
• Scalable throughput
Apache Kafka
Broker1
TopicA- Partition0
Broker2
TopicA- Partition1
Broker3
TopicA- Partition2
Producer
Producer
ConsumerA
Consumer
Consumer Group
ConsumerB
Producers push to Kafka Consumers pull from Kafka
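How records land on partitions, and how a consumer group shares those partitions, can be sketched without a Kafka client. This is a toy model: the CRC32 hash and the round-robin assignment are stand-ins chosen for illustration, and the real Kafka client uses its own partitioner and assignment strategies. The key "sensor-42" is hypothetical.

```python
import zlib

PARTITIONS = 3  # TopicA has partitions 0, 1 and 2 in the diagram


def partition_for(key: str, partitions: int = PARTITIONS) -> int:
    """Pick a partition from a stable hash of the key (illustrative hash only)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def assign(partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions round-robin across the consumers of one group."""
    assignment: dict[str, list[int]] = {consumer: [] for consumer in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


# Within a group, each partition is read by exactly one consumer.
group = assign(PARTITIONS, ["ConsumerA", "ConsumerB"])
print(group)

# The same key always maps to the same partition, which preserves per-key order.
print(partition_for("sensor-42") == partition_for("sensor-42"))
```

A consumer outside the group, like the standalone Consumer in the diagram, would read all three partitions on its own.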
Kafka redundancy
Broker3
TopicA- Partition2
TopicA- Partition0 -Replica
TopicA- Partition1 -Replica
Broker2
TopicA- Partition1
TopicA- Partition0 -Replica
TopicA- Partition2 -Replica
Broker1
TopicA- Partition0
TopicA- Partition1 -Replica
TopicA- Partition2 -Replica
Apache Sqoop
RDBMS
HDFS
▪ Rapidly moves large amounts of data
between relational databases and HDFS
– Import tables (or partial tables)
from an RDBMS into HDFS
– Export data from HDFS to a database table
▪ Uses JDBC to connect to the database
– Works with virtually all standard RDBMSs
▪ Custom “connectors” for some RDBMSs provide much higher throughput
– Available for certain databases, such as Teradata and Oracle
Big data fundamentals
Data engineering
Optimizing for parallel processing of big data with minimum code
Directed acyclic graph (DAG)
Directed acyclic graph (DAG)
✔
✖
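What makes a graph acceptable (✔) or not (✖) as a DAG is simply the absence of a cycle. A minimal check using Python's standard `graphlib` module (Python 3.9+) might look like this; the node names and both example graphs are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+


def is_dag(predecessors: dict[str, set[str]]) -> bool:
    """Return True when the graph has no cycle.

    Each key maps a node to the set of nodes that must come before it.
    """
    try:
        # A topological order exists only if the graph is acyclic.
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False


valid = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}   # A feeds B and C, which feed D
invalid = {"B": {"A"}, "C": {"B"}, "A": {"C"}}      # A -> B -> C -> A loops back

print(is_dag(valid), is_dag(invalid))
```

Because a DAG always has a topological order, an engine can schedule each step once all of its inputs are complete.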
Resilient Distributed Dataset (RDD)
An RDD is an immutable distributed collection of elements of your data, partitioned
across nodes in your cluster that can be operated in parallel with an API that
offers transformations and actions.
map (function)
filter (predicate)
sortBy (function)
join (RDD2)
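The definition above (immutable, partitioned, with transformations and actions) can be mimicked by a small toy class. This is not the Spark API: `MiniRDD` and its methods are hypothetical names, and the sketch runs in a single process purely to show that transformations return new objects and nothing executes until an action is called.

```python
from typing import Callable, Iterable


class MiniRDD:
    """Toy stand-in for an RDD: immutable partitions plus lazy transformations."""

    def __init__(self, partitions: Iterable[Iterable], ops: tuple = ()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._ops = ops  # recorded, not executed, until an action runs

    def map(self, fn: Callable) -> "MiniRDD":
        """Transformation: returns a new MiniRDD and leaves this one untouched."""
        return MiniRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, pred: Callable) -> "MiniRDD":
        """Transformation: returns a new MiniRDD and leaves this one untouched."""
        return MiniRDD(self._partitions, self._ops + (("filter", pred),))

    def collect(self) -> list:
        """Action: run the recorded transformations partition by partition."""
        out = []
        for part in self._partitions:  # on a cluster, each partition could run on its own node
            items = list(part)
            for kind, fn in self._ops:
                if kind == "map":
                    items = [fn(x) for x in items]
                else:
                    items = [x for x in items if fn(x)]
            out.extend(items)
        return out


numbers = MiniRDD([[1, 2, 3], [4, 5, 6]])          # two partitions
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(evens_squared.collect())
print(numbers.collect())  # the original collection is unchanged
```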
Apache Spark
RDDA RDDB RDDC
RDDD RDDE RDDF
RDDG
map groupBy
filter map
join
Spark stages
RDDA RDDB RDDC
RDDD RDDE RDDF
map groupBy
filter map
Spark stages
RDDA RDDB RDDC
RDDD RDDE RDDF
map groupBy
filter map
RDDG
join
Spark caching
RDDA RDDB RDDC
RDDD RDDE RDDF
map groupBy
filter map
RDDG
join
Evolution of the RDD API
DataFrame (Spark 1.3)
• Untyped, for R and Python
• Adds concept of ‘Schema’ to describe the data
• Uses RDDs underneath
• Allows Spark engine to perform some optimizations
• Avoids use of Java serialization, uses off heap storage
• Required different API than RDDs
RDD (Spark 1.0)
• Can be strongly typed in Java, Scala
• Uses RDDs underneath
• Catch compile-time errors
Dataset (Spark 2.x)
• Unified API
• Typed and untyped
Spark Streaming
TCP/IP
Other DAG/streaming processors
(not supported by Cloudera)
Spark ecosystem
Spark core
Spark SQL
Spark
Streaming
Spark ML GraphX
Standalone Mesos
(not included in CDH)
Yarn
Spark SQL
+ Static typing (optional)
+ Storage and processing efficiencies
ETL into EDW
Data
sources
ETL
EDW
Archive
Data
marts
Canned
reports
Dashboards/
analytic
applications
Non-SQL
workloads
Self-service
BI/ad hoc
EDW
EL-T into EDW
Data
sources
EL
EDW
Archive
Data
marts
Canned
reports
Dashboards/
analytic
applications
Non-SQL
workloads
Self-service
BI/ad hoc
T
Modern data warehouse landscape
Data
sources
Analytic
database
Operational
database
Data Science &
engineering
Shared data
layer
Modern Data Platform
Fixed
reports
Dashboards/
analytic applications
Non-SQL
workloads
Self-service
BI/ad hoc
Flexible
reporting
EDW
Cloudera’s featured data engineering partners
Hadoop Native Solution
Big data fundamentals
Data analytics
Optimizing the engine to match the use case
Apache Hive
Hive Metastore
HDFS BLOB Other Storage
Location
Schema
SerDe
File format
HiveServer2
Thrift Service Beeline CLI
JDBC
ODBC
Driver
Compiler
Executor
Driver
Compiler
Executor
SessionA SessionB
or
Apache Hive
✓ Spins up processes under the control of YARN
- shares resources well on the cluster
- but there is a lot of overhead to create these processes
✓ Can handle the failure of a machine during the query
- but recovery takes many seconds
✓ Will overflow join data to HDFS
- can handle very large joins
- but HDFS writes data 3 times, so this takes time
Don’t forget
who won the
race, Bucko!
Hive on Spark (Cloudera, MapR, Databricks) ✓ Improves speed due to efficiencies of Spark
Live Long and Process (Hortonworks) ✓ Improves speed by using pre-allocated JVMs w/ caching
Presto (Facebook) ✓ Improves speed by optimizing data transfers for SQL and
using data streaming instead of HDFS for intermediate data
But all of these solutions are still JVM based
Apache Impala
✓ Written in C++
- avoids issues of the JVM
✓ Uses the Hive metastore
- better integration for security and administration
✓ Uses pre-allocated processes on worker nodes
- no process spin up time
- but still builds an execution plan for each query
✓ Employs algorithms from MPP databases
But I left you
in the dust at
the starting
line, Grandpa!
If a machine fails during a query that only takes 1 second to run, you will just retry the query.
Adopted by:
(the fastest of the antelopes)
So which engine should I choose?
"If the only tool you have is a hammer, you tend to
see every problem as a nail."
- Abraham Maslow
Psychologist
Author of ‘Maslow’s Hierarchy of Needs’
Spark Impala MapReduce Search
Hive Pig
ADLS
Kudu HDFS
Filesystem
S3
HBase
Other SQL engines
LLAPStinger.next
CubeHive ++
aka Live Long And Process
For JSON lovers
Tied to proprietary front/back
Layer over HBase
SQL engines ‘from scratch’
Low Latency Analytical Processing
(not supported by Cloudera)
IBM Big SQL
OLAP
How to interpret benchmark tests
Standard test? How many of the queries were
run?
What is the criterion for excluding a query?
Single-user or multi-user? Data size?
Allow modifications to the queries?
"There are three kinds of lies:
lies, damned lies, and statistics."
-Benjamin Disraeli
Prime Minister of Britain
Big data fundamentals
Life after lambda architectures and IoT
Optimizing for time series and changing data
Updates or analytics
using
Analytics(Scans)
Online (Random Access)
slow
slow
fast
fast
(but not both at the same time)
Write once, read many. No updates, but can append (sort of)
Optimized for batch inserts and scans
Read, write, update individual rows
Optimized row-based access, sparse columns
Lambda architectures
(named for the simple shape)
Lambda architectures
(not so simple in practice)
Source: http://horicky.blogspot.com/2014/08/lambda-architecture-principles.html
Kudu design goals
using
Analytics(Scans)
Online (Random Access)
slow
slow
fast
fast
High throughput for big scans
Goal: Close to Parquet on HDFS
Low-latency for short accesses (primary key indexes
and quorum design)
Goal: 1ms read/write on SSD
Database-like semantics (initially single-row ACID)
Relational data model
SQL query
“NoSQL” style scan/insert/update (Java client)
Why are updates important?
Right to forget
ETL mistakes/corrections
Analytic enrichment
Life without Lambda
with
BI
Online
Kudu use cases
Kudu is best for use cases requiring a
simultaneous combination of sequential and
random reads and writes
● Time series data
○ Examples: Stream market data; fraud detection
and prevention; risk monitoring
○ Workload: Insert, updates, scans, lookups
● Machine data analysis
○ Examples: Network threat detection
○ Workload: Inserts, scans, lookups
● Online reporting
○ Examples: ODS
○ Workload: Inserts, updates, scans, lookups
Big data fundamentals
Data science
Optimizing to detect complex patterns over time
Ask bigger questions
Data science is a big data problem
“It’s not who has the best algorithm that wins. It’s
who has the most data.”
Banko and Brill, 2001
Notebooks
What was our revenue last year?
RDBMS
$14,325,874,321.07
What will our revenue be next year?
• Assumptions
• Algorithms
• Source Data
• Methodology
Your code tells a story
• Tell it with pictures & results
• Allow someone to re-run the numbers
• Pass it to someone who may use it as
the basis for a new/different story
Notebook challenges
Access
For sensitive data, secure clusters are
difficult to access. And IT typically doesn’t
want random packages installed on a secure
cluster.
Popular open source tools don’t easily
connect to these environments, or always
support Hadoop data formats.
Scale
Laptops rarely have capacity for
medium, let alone big data. This leads
to a lot of sampling.
Popular frameworks don’t easily
parallelize on a cluster. Typically code
has to get rewritten for production.
Developer Experience
Notebooks, while awesome, don’t easily
support virtual environment and
dependency management, especially for
teams. This makes sharing and
reproducibility hard.
Notebooks are also challenging to “put
into production.”
‘Dependency hell’
Or ’I am my own Grandpa’
X (1.0.0)
Y (1.0.0)
MyApp
X (1.0.0)
Y (1.0.0)
MyApp
X (1.1.0)
Upgrade
Dependency Graph for Hadoop Java Client
www.visioneye.com
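The upgrade problem in the X/Y/MyApp diagram can be sketched as a version-conflict check. This is an assumption-laden toy: it supposes that Y itself depends on X 1.0.0, which the diagram suggests but does not state, and `find_conflicts` is a hypothetical helper rather than any package manager's API.

```python
# Hypothetical pins loosely based on the diagram: MyApp uses X and Y,
# and Y is assumed to need X as well.
requirements = {
    "MyApp": {"X": "1.1.0", "Y": "1.0.0"},  # MyApp after upgrading X
    "Y": {"X": "1.0.0"},                    # Y is still pinned to the old X
}


def find_conflicts(reqs: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Return every library that is requested at more than one version."""
    wanted: dict[str, set[str]] = {}
    for deps in reqs.values():
        for lib, version in deps.items():
            wanted.setdefault(lib, set()).add(version)
    return {lib: versions for lib, versions in wanted.items() if len(versions) > 1}


# X is requested at two versions, so only one of MyApp and Y can be satisfied.
print(find_conflicts(requirements))
```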
Cloudera Data Science Workbench
• Team-based
• R, Python, Scala
• SDLC
• Secure
• Containerized
• Integrated into the cluster
The importance of an open ecosystem
Open ecosystem Black box
Containers
Hardware
Host OS
Hypervisor
(Optional)
GuestOS
GuestOS
GuestOS
Libs Libs Libs
AppA1 AppA2 AppB
VM
Hardware
Host OS
Container
Daemon
Libs Libs
AppA1
AppA3
AppA2
AppB1
AppB3
AppB2
AppB4
Container
Containers
• Use less memory than VMs
• You get to use more of the machine you pay for
• Provide isolation between apps
• Can share libraries between similar apps
• Provide abstraction of the OS, not of the HW
• Get you out of ‘Dependency Hell’ against other applications
Scaling data science for big data
Master(s)
Workers
Gateway(s)
Name Node
YARN
CDSW
CDSW
CDSW Session
CDSW Session
CDSW Session
CDSW Session
CDSW Session
Data Node
YARN Resource
Pool(s)
Data Node
YARN Resource
Pool(s)
Data Node
YARN Resource
Pool(s)
Data Node
YARN Resource
Pool(s)
Web browser
login
Start session
CDSW Session
CDSW Session
Kubernetes
Machine learning pipeline in Spark
Load learning data frame
Clean/process data
Extract and transform features
Vectorize features
Save model
Scoring results
Test model
Fit and access model
Load test data frame
Test results
Load scoring data frame
Score data
Save results
Big data fundamentals
Big Data in the Clouds
Optimizing for a variety of operational choices
My organization
is moving to the cloud,
why should we
consider ?
Traditional applications
Data
Exploration
STORAGE
SECURITY
GOVERNANCE
WORKLOAD MGMT
INGEST &
REPLICATION
DATA CATALOG
SQL & BI
Analytics
STORAGE
SECURITY
GOVERNANCE
WORKLOAD MGMT
INGEST & REPLICATION
DATA CATALOG
Operational
Real-Time DB
STORAGE
SECURITY
GOVERNANCE
WORKLOAD MGMT
INGEST & REPLICATION
DATA CATALOG
ETL & Data
Processing
STORAGE
SECURITY
GOVERNANCE
WORKLOAD MGMT
INGEST &
REPLICATION
DATA CATALOG
Custom
Functions
STORAGE
SECURITY
GOVERNANCE
WORKLOAD MGMT
INGEST & REPLICATION
DATA CATALOG
Many data silos, each with its own proprietary tools and infrastructure
Different vendors, products, and services on-premises versus in cloud
A fragmented approach is difficult, expensive, and risky
Multiple compute engines, same data
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
CORE
SERVICES
STORAGE
SERVICES
Common metadata
Data Catalog, Security, Governance, Lineage, Metadata Tags
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
CORE
SERVICES
STORAGE
SERVICES
METADATA
SERVICES
Data Silos 2.0
DW Cluster
DW Service
Source Data
A B
D
C
Deployed anywhere
Data Catalog, Security, Governance, Lineage, Metadata Tags
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
PRIVATE CLOUD BARE METAL INFRASTRUCTURE SERVICES
CORE
SERVICES
STORAGE
SERVICES
METADATA
SERVICES
DEPLOYMENT
OPTIONS
But that still
wasn’t quite right ….
How do we deal with hybrid clouds?
• Shared catalog
• Unified security
• Consistent governance
• Easy workload management
• Flexible ingest and replication
Cloudera Enterprise
Data Catalog, Security, Governance, Lineage, Metadata Tags
OPERATIONAL
DATABASE
DATA
ENGINEERING
ANALYTIC
DATABASE
DATA
SCIENCE
HDFS Kudu S3 ADLS
Data Storage
CORE
SERVICES
STORAGE
SERVICES
PRIVATE CLOUD BARE METAL INFRASTRUCTURE SERVICES
DEPLOYMENT
OPTIONS
The modern platform for machine learning and analytics optimized for the cloud
Deployment & management options
Bare Metal Private Cloud Cloud IaaS Cloud PaaS
Applications Applications Applications Applications
Clusters Clusters Clusters Clusters
Operating System Operating System Operating System Operating System
Network Network Network Network
Storage Storage Storage Storage
Servers Servers Servers Servers
Customer managed Vendor managed
Manager Director Altus
• Easy
• Agile
• Unified
Altus service architecture
● Runs in Cloudera’s secured and monitored environment
● Manages CDH clusters in customer cloud account
● Customer data does not pass* to Cloudera
* Workload Analytics requires opt-in log data transfer to Cloudera
Keep your encryption keys outside of the cloud
Cloudera usage based pricing option
Pay per use
• Cheaper for transient clusters
• Cheaper for small machine types
• Pay as you go or discounted credits
Node based pricing
• Cheaper for persistent or long-running clusters
• Volume & enterprise discounts
Hot-Warm-Cold Data
Store partitions from the same table in different storage types
m4.4xlarge m4.4xlarge i2.2xlarge
serve serve
preload
serve preload serve
d2.4xlarge
serve
0 1 3 14
Days of ‘Hot’ Data
AWS Instance premium – 200% AWS Instance premium – 320%
preload
S3
S3
EBS
S3 S3
BDR to Blob Storage
✓ Minimum Storage Cost
✓ No Backup Cluster Costs
(servers or subscription)
✓ RPO unaffected
✓ Cloud provider manages
regional locality
✗ RTO longer
user
sales
contracts
North America
.snapshots
snap 4-21-17
Contract1.txt
Contract2.txt
Contract1.txt
Contract2.txt
AWS S3
ADLS
Big data fundamentals
Cybersecurity
Optimizing to detect complex attacks over longer periods of time
Cybersecurity is a big data problem
Popular cyber platforms cannot
cost-effectively scale to
the volume and variety of
modern data
Only partial view of the
enterprise limits analytics and
slows investigations
Difficult to deploy advanced
machine learning detection
capabilities
Explosion of data Limited enterprise visibility Limited analytic processing
DataAccess
1% 50% 100%
DataVolume
10PB 1PB 1TB
IF (X) AND (Y)
THEN (Z)
Time
User
Network
Endpoint
Archived
data
Emerging
data
Open Data Models:
Enterprise Visibility, Support For Multiple Workloads
Endpoint User
Network
DIVERSE DATA SOURCES SINGLE ACCESS
Source: Momentum Partners Cybersecurity Snapshot April 2016
Detect advanced threats faster
with a full complement of analytic
frameworks for all cyber
workloads
Faster time to incident
investigation and response with
comprehensive enterprise
visibility
Change the economics of
cybersecurity with an open
source platform that supports
multiple LOB workloads
The value of Apache Spot
Many applications on one shared data set and architecture
Visualization & machine
learning applications can share
common data set &
infrastructure
Custom Packaged
Spot community is building
out machine learning (e.g.
network threat detection)
Open Source
Build custom applications &
analytics using Cloudera
without having to buy new
infrastructure
But I already have Splunk …
Go Beyond Splunk’s SPL
• Share enriched data across
multiple analytic processing
engines
• Simple search, SQL, Python,
R, Scala
Data flexibility
• Faster, more agile, full-
fidelity data acquisition
• Data portability: Open data
model and open storage
Cost-effective scalability
• Elastic scale on-prem or in
the cloud
• Cloud-native pay-per-use and
transience
• Proven at big data scale
Hybrid
• Runs across multi-clouds &
on-prem
• Multi-storage over S3, HDFS,
Kudu, Isilon, etc
¢¢¢
Big data fundamentals
Management
Optimizing for reliable uptime and optimal resource utilization
Big data and the administrator
Get up and running
Monitor and maintain
Troubleshoot and resolve
Grow and adapt
Get up and running
Cloudera manager
service
Cloudera archives
Cloudera manager
agent
Packages
Templates
RoleC
RoleB
RoleA
Cluster member
Monitor and maintain
Services
Hosts
Applications Resources
Troubleshoot and resolve
Add your own customized
charts
See performance and resource utilization at a glance
Select historical time period for charts
Grow and adapt
• Utilization by tenant
• Project future needs
• Prioritize pre-emption
Backup and disaster recovery (BDR)
• Distributed (uses distcp)
• Work done by target cluster
• Secure (can have different
encryption keys on each side,
encrypted in motion)
• Bandwidth limited (optional)
user
sales
contracts
North America
.snapshots
EMEA
snap 4-21-17
Contract1.txt
Contract2.txt
Contract1.txt
Contract2.txt
Contract3.txt
user
sales
contracts
North America
EMEA
Contract1.txt
Contract2.txt
Contract3.txt
.snapshots
snap 4-21-17
Contract3.txt
Federated clusters
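The snapshot diagram above hints at how an incremental replication run can decide what to ship: compare the listing captured in the last snapshot with the live listing and copy only what changed. A minimal Python sketch, assuming listings are plain dicts of path to modification time. The function `snapshot_diff`, the `Diff` type and the sample paths are hypothetical, and this is not a description of how BDR or distcp works internally.

```python
from typing import Dict, List, NamedTuple


class Diff(NamedTuple):
    created: List[str]
    modified: List[str]
    deleted: List[str]


def snapshot_diff(snapshot: Dict[str, int], current: Dict[str, int]) -> Diff:
    """Compare a snapshot listing with the live listing (path -> mtime)."""
    created = sorted(p for p in current if p not in snapshot)
    deleted = sorted(p for p in snapshot if p not in current)
    modified = sorted(
        p for p in current if p in snapshot and current[p] != snapshot[p]
    )
    return Diff(created, modified, deleted)


# Hypothetical listings modelled on the slide: Contract3.txt arrives
# after the snapshot was taken.
snap = {
    "/user/sales/contracts/EMEA/Contract1.txt": 100,
    "/user/sales/contracts/EMEA/Contract2.txt": 100,
}
live = dict(snap)
live["/user/sales/contracts/EMEA/Contract3.txt"] = 200

diff = snapshot_diff(snap, live)
to_copy = diff.created + diff.modified  # only these files need to move
print(to_copy)
```

Only the new file ends up in `to_copy`; the two unchanged contracts are skipped, which is the point of replicating from snapshots.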
108© Cloudera, Inc. All rights reserved.
Big data fundamentals
Information security
Optimizing for minimum risk
109© Cloudera, Inc. All rights reserved.
Big data security
Authentication, authorization, audit and compliance
Perimeter
Guarding access to the cluster itself
Technical concepts: Authentication, Network isolation
Cloudera Manager
Access
Defining what users and applications can do with data
Technical concepts: Permissions, Authorization
Apache Sentry & RecordService
Visibility
Reporting on where data came from and how it’s being used
Technical concepts: Auditing, Lineage
Cloudera Navigator
Data
Protecting data in the cluster from unauthorized visibility
Technical concepts: Encryption, tokenization, data masking
Navigator Encrypt & Key Trustee | Partners
110© Cloudera, Inc. All rights reserved.
Active Directory and Kerberos
Perimeter
Active Directory
• Manages users, groups, and services
• Provides username / password authentication
• Group membership determines service access
Kerberos
• Trusted, standard third-party authentication
• Authenticated users receive “Tickets”
• “Tickets” gain access to services
User authenticates to AD
Authenticated user gets Kerberos ticket
Ticket grants access to services, e.g. Impala
User [ssmith]
Password [*****]
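The three-step flow on this slide (authenticate, receive a ticket, present the ticket to a service) can be illustrated with a toy model. This is a sketch under loud assumptions: it is not the Kerberos protocol and does not talk to Active Directory. The in-memory `DIRECTORY`, the `SIGNING_KEY`, the group name `analysts` and the functions `login` and `access` are all hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Optional, Set

# Toy stand-ins for a directory and a ticket issuer (NOT real Kerberos or AD).
DIRECTORY: Dict[str, Dict[str, object]] = {
    "ssmith": {"password": "s3cret", "groups": {"analysts"}},
}
SERVICE_GROUPS: Dict[str, Set[str]] = {"impala": {"analysts"}}
SIGNING_KEY = b"demo-only-key"


def sign(payload: str) -> str:
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()


def login(user: str, password: str, now: float, lifetime: int = 3600) -> Optional[str]:
    """Steps 1 and 2: check credentials, then issue a signed, expiring ticket."""
    entry = DIRECTORY.get(user)
    if entry is None or not hmac.compare_digest(str(entry["password"]), password):
        return None
    payload = f"{user}|{int(now) + lifetime}"
    return f"{payload}|{sign(payload)}"


def access(ticket: str, service: str, now: float) -> bool:
    """Step 3: a service accepts a valid, unexpired ticket from an allowed group."""
    try:
        user, expiry, signature = ticket.split("|")
    except ValueError:
        return False
    if not hmac.compare_digest(signature, sign(f"{user}|{expiry}")):
        return False  # forged or altered ticket
    if now > int(expiry):
        return False  # ticket has expired
    groups = DIRECTORY[user]["groups"]
    return bool(groups & SERVICE_GROUPS.get(service, set()))


t0 = 1_000_000.0
ticket = login("ssmith", "s3cret", now=t0)
print(access(ticket, "impala", now=t0 + 60))
```

The password is used once, at login; after that the service only ever sees the ticket, and group membership decides which services honour it.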
111© Cloudera, Inc. All rights reserved.
Apache Sentry
• Apache Sentry is an authorization module for Hadoop
• Apache Licensed project
• Supported by multiple vendors
• Used in many industries
• Used by Hive, Impala, Search & Spark
• Syncs with HDFS ACLs
• Supports ease of administration through role-based authorization (RBAC)
Access
Spark Bindings
Spark
112© Cloudera, Inc. All rights reserved.
Centralized role-based access control
Sentry Perm.: Read access to Transactions.Date… Where Country = US
Sentry Perm.: Read access to Customers.CustomerID … Where Country = US
Sentry Role: U.S. Customer Transaction Analysis
Group: Tier 1 Customer Support Reps
Sam Smith
Group: Tier 1 Broker Analysts
Martha Jones
Cust. ID SSN Phone Country
6758493 329-44-9847 US
09:22:03 16-Feb-2015 344-22-9876 EU
5768459 585-11-2345 US
Date/Time Cust. ID Trade Country
11:33:01 16-Feb-2015 Sell US
09:22:03 16-Feb-2015 344-22-9876 EU
13:45:24 16-Feb-2015 Buy US
Access
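The chain on this slide (users belong to groups, groups are granted a role, the role carries permissions limited to certain columns and to rows where Country = US) can be modelled in a few lines. This is a toy sketch, not Sentry's API or policy syntax. The column lists, the group and user identifiers, the EU customer ID and the function `read` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Row = Dict[str, str]


@dataclass(frozen=True)
class Permission:
    table: str
    columns: Tuple[str, ...]           # columns the role may read
    row_filter: Callable[[Row], bool]  # rows the role may read


def is_us(row: Row) -> bool:
    return row["country"] == "US"


ROLE = "us_customer_transaction_analysis"
ROLE_PERMS: Dict[str, List[Permission]] = {
    ROLE: [
        # Column lists are hypothetical; the slide elides them.
        Permission("transactions", ("date", "trade", "country"), is_us),
        Permission("customers", ("customer_id", "country"), is_us),
    ],
}
GROUP_ROLES: Dict[str, Set[str]] = {
    "tier1_support_reps": {ROLE},
    "tier1_broker_analysts": {ROLE},
}
USER_GROUPS: Dict[str, Set[str]] = {
    "ssmith": {"tier1_support_reps"},
    "mjones": {"tier1_broker_analysts"},
}


def read(user: str, table: str, rows: List[Row]) -> List[Row]:
    """Return only the rows and columns the user's roles allow."""
    perms = [
        perm
        for group in USER_GROUPS.get(user, set())
        for role in GROUP_ROLES.get(group, set())
        for perm in ROLE_PERMS.get(role, [])
        if perm.table == table
    ]
    visible = []
    for row in rows:
        for perm in perms:
            if perm.row_filter(row):
                visible.append({col: row[col] for col in perm.columns})
                break
    return visible


CUSTOMERS = [
    {"customer_id": "6758493", "ssn": "329-44-9847", "country": "US"},
    {"customer_id": "1111111", "ssn": "344-22-9876", "country": "EU"},  # hypothetical ID
    {"customer_id": "5768459", "ssn": "585-11-2345", "country": "US"},
]
print(read("ssmith", "customers", CUSTOMERS))
```

Permissions are attached to the role rather than to individual users, so adding an analyst to a group is the only administrative step needed to grant the same filtered view.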
113© Cloudera, Inc. All rights reserved.
Auditing
Track, understand, and protect access to sensitive data
• Auditing needs to happen automatically
• Audit logs need to be immutable
• Need to be able to drill down on events to the original events/data
Visibility
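One common way to make an audit log tamper-evident is to chain entries by hash, so that altering any past event invalidates every later entry. The sketch below illustrates that general technique only; it is an assumption for teaching purposes and not a description of how Cloudera Navigator stores audit events. The functions `append_event` and `verify` and the sample events are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash: str, event: Dict) -> str:
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()


def append_event(log: List[Dict], event: Dict) -> None:
    """Append an event whose hash covers the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append(
        {"event": event, "prev": prev_hash, "hash": entry_hash(prev_hash, event)}
    )


def verify(log: List[Dict]) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[Dict] = []
append_event(audit_log, {"user": "ssmith", "op": "SELECT", "table": "customers"})
append_event(audit_log, {"user": "mjones", "op": "SELECT", "table": "transactions"})
print(verify(audit_log))

# Tampering with an earlier event is detected on the next verification.
audit_log[0]["event"]["user"] = "someone_else"
print(verify(audit_log))
```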
114© Cloudera, Inc. All rights reserved.
Governance
Faceted search
Natural language
Incremental filters
Drill down links
Visibility
Used to facilitate research and the ability to find groups of similar assets
Jump to application log
115© Cloudera, Inc. All rights reserved.
Metadata
Automatic collection
• No need to create XML files or manage manual controls
Complete aggregation
• Full coverage across all platform components
Simple accessibility
• Integrated user interface with full-text search
116© Cloudera, Inc. All rights reserved.
Visibility
Enterprise metadata
The foundation for data management and governance
Metadata lets you attach context and meaning to data so you can answer the important questions
Technical Managed Custom
Unified metadata repository
Who are the high-value customers?
How do we define that?
How is high value calculated?
Where is customer data stored and used?
Is the data reliable and accurate?
117© Cloudera, Inc. All rights reserved.
Lineage
• Where did the data come from?
• Who ran the process that created the data?
• What code was used to generate the values?
• Which files and columns were used to derive the values?
Visibility
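The lineage questions above amount to walking a dependency graph backwards from a derived asset to its sources. A minimal sketch, assuming lineage is available as a mapping from each asset to the assets it was derived from. The asset names and the function `upstream` are hypothetical and do not reflect any particular lineage tool's data model.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: derived asset -> assets it was built from.
DERIVED_FROM: Dict[str, List[str]] = {
    "report.high_value_customers": ["warehouse.customer_scores"],
    "warehouse.customer_scores": ["raw.customers", "raw.transactions"],
}


def upstream(asset: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk that collects every ancestor of an asset."""
    seen: Set[str] = set()
    queue = deque([asset])
    while queue:
        node = queue.popleft()
        for parent in edges.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream("report.high_value_customers", DERIVED_FROM)))
```

Attaching the user and the code version to each edge would extend the same walk to answer who ran the process and what code produced the values.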
118© Cloudera, Inc. All rights reserved.
Is it encrypted?
Data written to HDFS ✓
Metadata in RDBMS ✗
Spill-over files ✗
Data
119© Cloudera, Inc. All rights reserved.
Cloudera Navigator Encrypt
Transparent layer between application and file system
• Compliance-ready
• Massively scalable
• High performance: Optimized for Intel
• Separation of duties
• Key management with Navigator Key Trustee
Data
120© Cloudera, Inc. All rights reserved.
Cloudera Navigator Key Trustee
“Virtual safe-deposit box” for managing encryption keys or other Hadoop security artifacts
• Separates keys from encrypted data
• Centralized management with audit controls
• Integration with HSMs
• Roadmap: Management of SSL certificates, SSH keys, tokens, passwords, Kerberos keytab files, and more
Data
121© Cloudera, Inc. All rights reserved.
Redacted Log Files
SELECT * FROM customers
WHERE ssn='123-45-6789'
hive.server2.logging.operation.log.location
HUE Saved Queries
Audit Logs
• Credit card numbers
• Social security numbers
• Email addresses
• Server host names / IP addresses
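Redaction of this kind is typically done by running each log line through a set of pattern rules before it is written. The sketch below shows the idea with Python's `re` module. The patterns are deliberately simple assumptions (they will miss some formats and do not cover host names), and `PATTERNS` and `redact` are hypothetical names, not a Cloudera configuration.

```python
import re

# Each rule is (compiled pattern, replacement). Order matters: the more
# specific formats run before the generic dotted-number rule.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "XXX-XX-XXXX"),                 # SSN
    (re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"), "XXXX-XXXX-XXXX-XXXX"),    # card
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "<email>"),          # email
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),                  # IPv4
]


def redact(line: str) -> str:
    """Apply every redaction rule to a single log line."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line


print(redact("SELECT * FROM customers WHERE ssn='123-45-6789'"))
```

Applying the rules at write time means the sensitive literal never reaches the query log, saved queries or audit trail in clear text.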
122© Cloudera, Inc. All rights reserved.
Thank you
The modern platform for machine learning and analytics, optimized for the cloud

Contenu connexe

Tendances

Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera, Inc.
 
Comment développer une stratégie Big Data dans le cloud public avec l'offre P...
Comment développer une stratégie Big Data dans le cloud public avec l'offre P...Comment développer une stratégie Big Data dans le cloud public avec l'offre P...
Comment développer une stratégie Big Data dans le cloud public avec l'offre P...Cloudera, Inc.
 
Turning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformTurning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformCloudera, Inc.
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18Cloudera, Inc.
 
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...Cloudera, Inc.
 
How Cloudera SDX can aid GDPR compliance
How Cloudera SDX can aid GDPR complianceHow Cloudera SDX can aid GDPR compliance
How Cloudera SDX can aid GDPR complianceCloudera, Inc.
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissanceCloudera, Inc.
 
Data Drive Applications_Webinar
Data Drive Applications_WebinarData Drive Applications_Webinar
Data Drive Applications_WebinarSean Spediacci
 
Cloudera training secure your cloudera cluster 7.10.18
Cloudera training secure your cloudera cluster 7.10.18Cloudera training secure your cloudera cluster 7.10.18
Cloudera training secure your cloudera cluster 7.10.18Cloudera, Inc.
 
Big data journey to the cloud maz chaudhri 5.30.18
Big data journey to the cloud   maz chaudhri 5.30.18Big data journey to the cloud   maz chaudhri 5.30.18
Big data journey to the cloud maz chaudhri 5.30.18Cloudera, Inc.
 
Secure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersSecure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersCloudera, Inc.
 
Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester WebinarCloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Cloudera, Inc.
 
Customer Best Practices: Optimizing Cloudera on AWS
Customer Best Practices: Optimizing Cloudera on AWSCustomer Best Practices: Optimizing Cloudera on AWS
Customer Best Practices: Optimizing Cloudera on AWSCloudera, Inc.
 
Machine Learning Models: From Research to Production 6.13.18
Machine Learning Models: From Research to Production 6.13.18Machine Learning Models: From Research to Production 6.13.18
Machine Learning Models: From Research to Production 6.13.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 
How to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsHow to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsCloudera, Inc.
 
Driving Better Products with Customer Intelligence

Driving Better Products with Customer Intelligence
Driving Better Products with Customer Intelligence

Driving Better Products with Customer Intelligence
Cloudera, Inc.
 

Tendances (20)

Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for Analytics
 
Comment développer une stratégie Big Data dans le cloud public avec l'offre P...
Comment développer une stratégie Big Data dans le cloud public avec l'offre P...Comment développer une stratégie Big Data dans le cloud public avec l'offre P...
Comment développer une stratégie Big Data dans le cloud public avec l'offre P...
 
Turning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data PlatformTurning Data into Business Value with a Modern Data Platform
Turning Data into Business Value with a Modern Data Platform
 
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud   rohit pujari 5.30.18Big data journey to the cloud   rohit pujari 5.30.18
Big data journey to the cloud rohit pujari 5.30.18
 
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
 
How Cloudera SDX can aid GDPR compliance
How Cloudera SDX can aid GDPR complianceHow Cloudera SDX can aid GDPR compliance
How Cloudera SDX can aid GDPR compliance
 
Preparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity RenaissancePreparing for the Cybersecurity Renaissance
Preparing for the Cybersecurity Renaissance
 
Data Drive Applications_Webinar
Data Drive Applications_WebinarData Drive Applications_Webinar
Data Drive Applications_Webinar
 
Cloudera training secure your cloudera cluster 7.10.18
Cloudera training secure your cloudera cluster 7.10.18Cloudera training secure your cloudera cluster 7.10.18
Cloudera training secure your cloudera cluster 7.10.18
 
Big data journey to the cloud maz chaudhri 5.30.18
Big data journey to the cloud   maz chaudhri 5.30.18Big data journey to the cloud   maz chaudhri 5.30.18
Big data journey to the cloud maz chaudhri 5.30.18
 
Secure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game ChangersSecure Data - Why Encryption and Access Control are Game Changers
Secure Data - Why Encryption and Access Control are Game Changers
 
Cloudera SDX
Cloudera SDXCloudera SDX
Cloudera SDX
 
Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester Webinar
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18Introducing Workload XM 8.7.18
Introducing Workload XM 8.7.18
 
Customer Best Practices: Optimizing Cloudera on AWS
Customer Best Practices: Optimizing Cloudera on AWSCustomer Best Practices: Optimizing Cloudera on AWS
Customer Best Practices: Optimizing Cloudera on AWS
 
Machine Learning Models: From Research to Production 6.13.18
Machine Learning Models: From Research to Production 6.13.18Machine Learning Models: From Research to Production 6.13.18
Machine Learning Models: From Research to Production 6.13.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 
How to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of ThingsHow to Build Continuous Ingestion for the Internet of Things
How to Build Continuous Ingestion for the Internet of Things
 
Driving Better Products with Customer Intelligence

Driving Better Products with Customer Intelligence
Driving Better Products with Customer Intelligence

Driving Better Products with Customer Intelligence

 

Similaire à Big Data Fundamentals 6.6.18

The 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: ExposedThe 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: ExposedCloudera, Inc.
 
NOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science Workbench
NOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science WorkbenchNOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science Workbench
NOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science WorkbenchNOVA DATASCIENCE
 
Seeking Cybersecurity--Strategies to Protect the Data
Seeking Cybersecurity--Strategies to Protect the DataSeeking Cybersecurity--Strategies to Protect the Data
Seeking Cybersecurity--Strategies to Protect the DataCloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Cloud expo 10 myths rex wang oracle ss
Cloud expo 10 myths rex wang oracle ssCloud expo 10 myths rex wang oracle ss
Cloud expo 10 myths rex wang oracle ssRex Wang
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and ManufacturingCloudera, Inc.
 
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...TheInevitableCloud
 
Cw13 big data and apache hadoop by amr awadallah-cloudera
Cw13 big data and apache hadoop by amr awadallah-clouderaCw13 big data and apache hadoop by amr awadallah-cloudera
Cw13 big data and apache hadoop by amr awadallah-clouderainevitablecloud
 
Machine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to ImplementationMachine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to ImplementationDataWorks Summit
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Cloudera, Inc.
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015hadooparchbook
 
AWS May Webinar Series - Industry Trends and Best Practices for Cloud Adoption
AWS May Webinar Series - Industry Trends and Best Practices for Cloud AdoptionAWS May Webinar Series - Industry Trends and Best Practices for Cloud Adoption
AWS May Webinar Series - Industry Trends and Best Practices for Cloud AdoptionAmazon Web Services
 
Edge to ai analytics from edge to cloud with efficient movement of machine data
Edge to ai  analytics from edge to cloud with efficient movement of machine dataEdge to ai  analytics from edge to cloud with efficient movement of machine data
Edge to ai analytics from edge to cloud with efficient movement of machine dataTimothy Spann
 
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Cloudera, Inc.
 
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)Cloudera, Inc.
 
Implementing cloud based devops for distributed agile projects
Implementing cloud based devops for distributed agile projectsImplementing cloud based devops for distributed agile projects
Implementing cloud based devops for distributed agile projectsTom Stiehm
 
How to Migrate to Cloud with Complete Confidence and Trust
How to Migrate to Cloud with Complete Confidence and TrustHow to Migrate to Cloud with Complete Confidence and Trust
How to Migrate to Cloud with Complete Confidence and TrustApcera
 
Optimize your cloud strategy for machine learning and analytics
Optimize your cloud strategy for machine learning and analyticsOptimize your cloud strategy for machine learning and analytics
Optimize your cloud strategy for machine learning and analyticsCloudera, Inc.
 

Similaire à Big Data Fundamentals 6.6.18 (20)

The 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: ExposedThe 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: Exposed
 
NOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science Workbench
NOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science WorkbenchNOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science Workbench
NOVA Data Science Meetup 2-21-2018 Presentation Cloudera Data Science Workbench
 
Seeking Cybersecurity--Strategies to Protect the Data
Seeking Cybersecurity--Strategies to Protect the DataSeeking Cybersecurity--Strategies to Protect the Data
Seeking Cybersecurity--Strategies to Protect the Data
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Cloud expo 10 myths rex wang oracle ss
Cloud expo 10 myths rex wang oracle ssCloud expo 10 myths rex wang oracle ss
Cloud expo 10 myths rex wang oracle ss
 
Hadoop and Manufacturing
Hadoop and ManufacturingHadoop and Manufacturing
Hadoop and Manufacturing
 
MultiValue Gets SaaS-y
MultiValue Gets SaaS-yMultiValue Gets SaaS-y
MultiValue Gets SaaS-y
 
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
 
Cw13 big data and apache hadoop by amr awadallah-cloudera
Cw13 big data and apache hadoop by amr awadallah-clouderaCw13 big data and apache hadoop by amr awadallah-cloudera
Cw13 big data and apache hadoop by amr awadallah-cloudera
 
Machine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to ImplementationMachine Learning Model Deployment: Strategy to Implementation
Machine Learning Model Deployment: Strategy to Implementation
 
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl...
 
Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015Hadoop Application Architectures tutorial at Big DataService 2015
Hadoop Application Architectures tutorial at Big DataService 2015
 
AWS May Webinar Series - Industry Trends and Best Practices for Cloud Adoption
AWS May Webinar Series - Industry Trends and Best Practices for Cloud AdoptionAWS May Webinar Series - Industry Trends and Best Practices for Cloud Adoption
AWS May Webinar Series - Industry Trends and Best Practices for Cloud Adoption
 
Edge to ai analytics from edge to cloud with efficient movement of machine data
Edge to ai  analytics from edge to cloud with efficient movement of machine dataEdge to ai  analytics from edge to cloud with efficient movement of machine data
Edge to ai analytics from edge to cloud with efficient movement of machine data
 
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
Introducing Cloudera Navigator Optimizer: Offload Assessments and Active Data...
 
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
Securing the Data Hub--Protecting your Customer IP (Technical Workshop)
 
Implementing cloud based devops for distributed agile projects
Implementing cloud based devops for distributed agile projectsImplementing cloud based devops for distributed agile projects
Implementing cloud based devops for distributed agile projects
 
How to Migrate to Cloud with Complete Confidence and Trust
How to Migrate to Cloud with Complete Confidence and TrustHow to Migrate to Cloud with Complete Confidence and Trust
How to Migrate to Cloud with Complete Confidence and Trust
 
Optimize your cloud strategy for machine learning and analytics
Optimize your cloud strategy for machine learning and analyticsOptimize your cloud strategy for machine learning and analytics
Optimize your cloud strategy for machine learning and analytics
 
Big Data: Myths and Realities
Big Data: Myths and RealitiesBig Data: Myths and Realities
Big Data: Myths and Realities
 

Plus de Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18Cloudera, Inc.
 
When SAP alone is not enough
When SAP alone is not enoughWhen SAP alone is not enough
When SAP alone is not enoughCloudera, Inc.
 
Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18Cloudera, Inc.
 

Plus de Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18Cloud Data Warehousing with Cloudera Altus 7.24.18
Cloud Data Warehousing with Cloudera Altus 7.24.18
 
When SAP alone is not enough
When SAP alone is not enoughWhen SAP alone is not enough
When SAP alone is not enough
 
Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18Multi task learning stepping away from narrow expert models 7.11.18
Multi task learning stepping away from narrow expert models 7.11.18
 

Big Data Fundamentals 6.6.18

  • 1. © Cloudera, Inc. All rights reserved. Big data fundamentals: Understanding the optimization choices in big data components
  • 2. Presentation goals: teach you something; help you see the potential of Big Data beyond MapReduce; be fair to Cloudera’s competitors; inspire you to learn more. If something doesn’t make sense, please ask.
  • 3. Notification: The information in this document is proprietary to Cloudera. No part of this document may be reproduced, copied or transmitted in any form for any purpose without the express prior written permission of Cloudera. This document is a preliminary version and not subject to your license agreement or any other agreement with Cloudera. This document contains only intended strategies, developments and functionalities of Cloudera products and is not intended to be binding upon Cloudera to any particular course of business, product strategy and/or development. Please note that this document is subject to change and may be changed by Cloudera at any time without notice. Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant the accuracy or completeness of the information, text, graphics, links or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement. Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect or consequential damages that may result from the use of these materials. The limitation shall not apply in cases of gross negligence.
  • 4. Agenda: Open source software; Data storage and stewardship; Data integration; Data engineering; Data analytics; Life after Lambda architectures and IoT; Data science at scale; Big data in the clouds; Cybersecurity as a Big Data problem; Cluster management and security; Customer success stories; Questions and answers
  • 5. Big data fundamentals. Open source software: Optimizing to benefit from community innovation
  • 6. 3 reasons open source is good for companies. [1] Free evaluation: install, test, inspect, and evaluate open source code in perpetuity, with no financial obligation. [2] Freedom from lock-in: multiple vendors supporting the same core technology makes it easier to move. [3] Scalable innovation: the collective work of a global, passionate community keeps the code base evolving. These benefits derive from use of the permissive Apache License.
  • 7. 3 reasons open source adds risk for companies. [1] Not business focus: company assets should be working on core competency. [2] Real cost hard to measure: time developers spend solving problems or adding features often isn’t visible. [3] Multiple projects: each project is managed by a separate committee and there is not necessarily an overriding design.
  • 8. “Open source software is free like a puppy is free” - Scott McNealy, CEO, Sun Microsystems
  • 9. What if you got a dog for a reason? Can take years to mature; months of intensive training (when your attention should be elsewhere); dog becomes very bonded to the handler (and vice versa); poor training results in a misbehaving dog. Developers don’t want to be tied to one system. You don’t want your developers tied to one system.
  • 10. What is a distribution?
  • 11. Benefits of using a distribution. Stability: each Apache project has its own dependencies and release cycle. Getting them to work together requires effort and thorough testing. Regular upgrades: code in Open Source changes constantly. Cloudera provides a new feature release every quarter that is tested and supported. 24x7 support and bug fixes: distribution vendors should employ Open Source Committers that can make sure fixes are added to the Open Source base.
  • 12. More benefits of using a distribution. Faster to market: with a distribution, you can start developing applications right away. Building an environment from scratch would take months. Minimize risk: with a distribution, you know what it will cost and you know that it will work. Building an environment from scratch provides no such guarantees. Focus on business problems: building an environment from scratch would require the focus of a few of your best developers. Get them working on the real problem.
  • 13. The big data ecosystem vendors [logo slide]. Categories: comprehensive distributions; single+ project specialists; proprietary + Hadoop in the gaps. Other labels on the slide: (Spark), (Kafka), (Cassandra), Google Cloud Dataproc.
  • 14. Apache Software Foundation. The ASF board of directors appoints, for each project, a project management committee (PMC) chair, who ensures the project complies with ASF requirements. PMC members decide the architecture, feature set and direction of the project, and usually are also Committers. Committers have write access to the code, although contributions are approved by the PMC. Developers (aka contributors): anyone may propose changes to the code or documentation, but those changes have to be picked up and used by a committer. Users provide feedback, bug reports and feature suggestions.
  • 15. Apache project requirements: must be Apache licensed (may include compatibly licensed elements); free to download and use for any purpose; branding requirements and restrictions; source code must be open and available on the ASF website; must provide sufficient documentation on the website to use the project; releases must follow the ASF PMC voting policies; corporations may not directly contribute, only individuals; must govern themselves independently of undue commercial influence; must not discourage new contributions from competing vendors; low diversity may incur ‘extra scrutiny’ from the board. However, there are NO requirements to: have more than one commercial entity involved (random community members are ok); contribute to an existing project when there is overlap in functionality (competitive projects are ok); contribute modifications or enhancements back to the project; employ Committers or PMC members if you are a commercial vendor.
  • 16. Cloudera’s commitment to our customers [matrix slide; labels in extraction order]: Anything that stores your data; Any APIs your applications call; Uses open source code; Our contributions and fixes go back to open source first; When possible, use projects supported by multiple commercial vendors; Keeping your cluster running; Cloudera express edition; No limit to number of servers; Managing your applications; Employ* committers, if not PMC members, on the projects we support (* People manage their own careers. Temporary gaps may exist); High availability features; Ensure your success; Open source; License expiration won’t stop the cluster; Free to use forever; Provide enterprise value; RBAC over your data; 24x7 support; Minimize your risk; Rolling upgrades; Data governance and lineage; Automated backup and recovery; Full disk encryption; Multi-tenant usage reports
  • 17. Big data fundamentals. Data storage and stewardship: Optimizing for inexpensive, reliable storage accessed by multiple execution engines
  • 18. Anatomy of a big data cluster [diagram]. Masters: HMaster, ZooKeeper, Name Node, Secondary Name Node, YARN, Kudu Master, Impala Catalog Store, Impala Statestore, HiveServer, Sentry Server, HUE Server, Oozie Server. Workers: Data Node, HBase Region Server, Search and YARN resource pool(s) on some nodes; Data Node, Kudu Tablet Server, Impala Daemon and YARN resource pool(s) on others. Gateway(s): user apps, CDSW and CDSW sessions. Also shown: Cloudera Manager with a CM Agent on every node, metadata database(s), and the optional Cloudera Director cloud plugin.
  • 19. HDFS [diagram]: Name Node, Secondary Name Node and Standby Name Node; Data Nodes A to D spread across Rack1, Rack2 and Rack3. FileQ is split into blocks BX, BY and BZ, and each block is stored as three replicas (BX1 to BX3, BY1 to BY3, BZ1 to BZ3). Default block size = 128 MB
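The block-and-replica idea on the HDFS slide can be sketched in a few lines of Python. This is a toy model, not HDFS code: the 128 MB block size is just an illustrative setting (block size is configurable), the node names A to D come from the slide, and the round-robin `place_replicas` helper is a hypothetical stand-in for the real rack-aware placement policy.

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the byte size of each block needed to hold a file."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds what is left
    return sizes

def place_replicas(num_blocks, nodes, replication=3):
    """Toy round-robin placement; real HDFS placement is rack-aware."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

# A 300 MB file with an assumed 128 MB block size and 3 replicas per block.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(blocks), ["A", "B", "C", "D"])
raw_bytes = sum(blocks) * 3  # storage consumed across the cluster
```

The file occupies three blocks (two full, one partial), and with three replicas it consumes three times its logical size in raw disk.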
  • 20. HDFS Snapshots [diagram]: directory tree … / user / hive / tables, with tables sales and subscriptions holding Data1.parquet and Data2.parquet; a .snapshot directory containing snap1 with its own Data1.parquet and Data2.parquet entries; the Name Node and the Data Nodes holding the underlying blocks (BX1, BX2, BY1, BY2, BZ1, BZ2).
  • 21. Public cloud blob storage. Public clouds are offering low-cost, highly available storage, designed for access inside and outside of Hadoop. Amazon Simple Storage Service (S3): uses the ‘bucket’ paradigm; requires S3Guard (Apache open source) to achieve consistency; use protocol s3a://<bucket name>/<filename>. Microsoft Azure Data Lake Store (ADLS): ‘feels’ more like a normal (POSIX) file system; use protocol adl://<account name>.azuredatalakestore.net/<directory>/<filename>.
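As a small illustration of the s3a:// addressing scheme on the slide, the sketch below splits such a URI into bucket and object key using only the Python standard library. The function name `parse_s3a_uri` and the example bucket and path are hypothetical.

```python
from urllib.parse import urlsplit

def parse_s3a_uri(uri):
    """Split s3a://<bucket name>/<filename> into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3a":
        raise ValueError(f"expected an s3a:// URI, got scheme {parts.scheme!r}")
    # The authority part is the bucket; the rest of the path is the object key.
    return parts.netloc, parts.path.lstrip("/")

# Hypothetical bucket and object names, for illustration only.
bucket, key = parse_s3a_uri("s3a://example-bucket/sales/2018/data1.parquet")
```

Unlike a POSIX path, the "directories" in the key are only a naming convention inside the bucket.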
  • 22. Compute over storage [diagram]. Compute: Spark, Impala, MapReduce, Search, Hive, Pig. Storage: Kudu, HBase, and the filesystems HDFS, S3 and ADLS.
  • 23. Schema on write or ‘structured data’: 1. Define schema; 2. Create table(s); 3. Map known fields; 4. Discard unknown fields
  • 24. Schema on read or ‘unstructured data’: 1. Write whole record(s) to filesystem (compressed); 2. Register schema with metastore; 3. Query engine applies schema to data
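The difference between the two approaches can be shown with a minimal Python sketch. Everything here is illustrative: the `SCHEMA` mapping, the field names and the sample record are assumptions, and a real metastore and query engine do far more than this.

```python
import json

SCHEMA = {"id": int, "amount": float}  # hypothetical table schema

def write_with_schema(record, schema=SCHEMA):
    """Schema on write: keep and cast known fields, discard unknown ones."""
    return {name: cast(record[name]) for name, cast in schema.items()}

def read_with_schema(raw_lines, schema):
    """Schema on read: raw records stay intact; a schema is applied per query."""
    for line in raw_lines:
        record = json.loads(line)
        yield {name: cast(record[name])
               for name, cast in schema.items() if name in record}

raw = ['{"id": "1", "amount": "9.5", "coupon": "SPRING"}']

# Schema on write: the unknown "coupon" field is lost at load time.
stored = write_with_schema(json.loads(raw[0]))

# Schema on read: a later query can still ask for "coupon" from the raw data.
later = list(read_with_schema(raw, {"id": int, "coupon": str}))
```

The trade-off is visible in the last two lines: schema on write gives clean, typed rows up front, while schema on read keeps information that nobody knew to ask for at load time.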
  • 25. Popular file format options. XML, JSON files: can’t be both split and compressed. Text/delimited/CSV/JSON records: usable everywhere; schema on read; poor performance, poor compression. Avro: contains schema, but also allows schema on read; usable inside and outside of Hadoop. Parquet: columnar, splittable, query performance benefits, excellent compression; supports schema evolution (adding columns); skips columns well during scans. ORC (not supported by Cloudera, HDP Hive only): similar to Parquet, with higher compression but poor data skipping; Hortonworks is working on ACID transactions and secondary indexes. Example sizes by file type: uncompressed CSV 1.8 GB; Avro 1.5 GB; Avro with Snappy compression 750 MB; Parquet with Snappy compression 300 MB.
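Why a columnar format "skips columns well during scans" can be seen in a toy model. The sketch below is not Parquet; it only contrasts how many values a scan has to touch in a row layout versus a column layout. The sample rows and helper names are made up for illustration.

```python
rows = [
    {"id": 1, "region": "EU", "amount": 10.0},
    {"id": 2, "region": "US", "amount": 20.0},
    {"id": 3, "region": "EU", "amount": 5.0},
]

def to_columns(rows):
    """Pivot row records into one list per column (a toy columnar layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def values_read_row_layout(rows):
    """A row-oriented scan touches every field of every row."""
    return sum(len(row) for row in rows)

def values_read_column_layout(columns, wanted):
    """A columnar scan touches only the requested columns."""
    return sum(len(columns[name]) for name in wanted)

columns = to_columns(rows)
total = sum(columns["amount"])  # SELECT SUM(amount) reads one column only
```

Storing each column contiguously also places similar values next to each other, which is one reason columnar files tend to compress better than row-oriented ones.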
  • 26. Raw and formatted data copies: keep the raw version if there is a chance that information will be lost in translation; use columnar storage on formatted data to improve analytic performance immensely; think about a metadata tagging policy (e.g. Cloudera Navigator) to assist with data stewardship
  • 27. Big data pipelines [diagram]. Data ingestion: capture, stream, move. Data engineering: cleanse, conform, transform, enrich. Data stewardship: store, secure, govern, tag. Data science: model, score, enrich, predict. Data analytics: BI, online, APIs.
  • 28. Which do you want? Data lake or data hub
  • 29. Data lake to a data hub: comprehensive, planned and enforced data hierarchy; carefully administered versioning and retention policies; comprehensive, unified security, governance and lineage; encourage and support metadata; establish standards for data, metadata and analytic models; maximize reuse of data without making copies; balance this with security and performance concerns, and don’t be an ideologue!; plan staffing around new roles
  • 30. Big data fundamentals. Data integration: Optimizing for data ingestion with volume, velocity and variety
  • 31. Apache Flume [diagram: chains of Flume agents that filter, transform, compress and encrypt events on their way to HDFS]. Collect data as it is produced: files, syslogs, stdout or custom source. Process in place: such as encrypt or compress. Pre-process data before storing: such as transform, scrub or enrich. Write in parallel: scalable throughput. Store in any format: text, compressed, binary, or custom sink.
  • 32. Apache Kafka [diagram]: TopicA has three partitions, with Partition0 on Broker1, Partition1 on Broker2 and Partition2 on Broker3. Producers push to Kafka; consumers (ConsumerA and ConsumerB in a consumer group, plus a standalone consumer) pull from Kafka.
  • 33. Kafka redundancy [diagram]: each broker holds one partition of TopicA plus replicas of the other two. Broker1: Partition0, with replicas of Partition1 and Partition2. Broker2: Partition1, with replicas of Partition0 and Partition2. Broker3: Partition2, with replicas of Partition0 and Partition1.
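The partition, replica and consumer-group ideas from the two Kafka slides can be sketched as a toy model in Python. This does not use any Kafka client library: the CRC32 hash is a stand-in and not the hash Kafka clients use, and the three helper functions are hypothetical names chosen for this illustration.

```python
import zlib

def partition_for(key, num_partitions):
    """Pick a partition from the key so equal keys share a partition.
    CRC32 is a stand-in here; it is not the hash Kafka clients use."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def replica_layout(num_partitions, brokers, replication_factor):
    """Map each partition to its brokers; the first broker listed leads."""
    layout = {}
    for p in range(num_partitions):
        leader = p % len(brokers)
        layout[p] = [brokers[(leader + i) % len(brokers)]
                     for i in range(replication_factor)]
    return layout

def assign_to_group(partitions, consumers):
    """Within a group, each partition is read by exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

layout = replica_layout(3, ["Broker1", "Broker2", "Broker3"], 3)
group = assign_to_group([0, 1, 2], ["ConsumerA", "ConsumerB"])
same = partition_for("sensor-42", 3) == partition_for("sensor-42", 3)
```

With three brokers and a replication factor of three, every broker ends up with a copy of every partition, which matches the redundancy picture on the slide.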
  • 34. Apache Sqoop [diagram: RDBMS ↔ HDFS]. Rapidly moves large amounts of data between relational databases and HDFS: import tables (or partial tables) from an RDBMS into HDFS; export data from HDFS to a database table. Uses JDBC to connect to the database: works with virtually all standard RDBMSs. Custom “connectors” for some RDBMSs provide much higher throughput: available for certain databases, such as Teradata and Oracle.
  • 35. Big data fundamentals. Data engineering: Optimizing for parallel processing of big data with minimum code
  • 36. Directed acyclic graph (DAG)
  • 37. Directed acyclic graph (DAG) [diagram: a valid acyclic graph (✔) next to a graph that contains a cycle (✖)]
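A quick way to see why the "acyclic" part matters: a set of steps can only be put into an execution order if the dependency graph has no cycle. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the step names and the `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

def execution_order(dependencies):
    """dependencies maps each step to the set of steps it depends on.
    Returns a valid run order, or None if the graph contains a cycle."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        return None

# Acyclic: load -> clean -> join can be scheduled.
ok = execution_order({"clean": {"load"}, "join": {"clean"}})

# Cyclic: a depends on b and b depends on a, so no order exists.
bad = execution_order({"a": {"b"}, "b": {"a"}})
```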
  • 38. Resilient Distributed Dataset (RDD): An RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster, that can be operated on in parallel with an API that offers transformations and actions. Example transformations: map (function), filter (predicate), sortBy (function), join (RDD2)
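The split between lazy transformations and actions can be illustrated without a cluster. The `ToyRDD` class below is a hypothetical, single-process teaching model and not the Spark API: transformations return a new immutable object that only records what to do, and nothing runs until an action such as `collect` is called.

```python
class ToyRDD:
    """Immutable, partitioned collection with lazy transformations.
    A teaching sketch only; this is not the Spark API."""

    def __init__(self, partitions, ops=()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._ops = tuple(ops)  # recorded, not yet executed

    # Transformations: return a new ToyRDD, do no work yet.
    def map(self, fn):
        return ToyRDD(self._partitions, self._ops + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self._partitions, self._ops + (("filter", predicate),))

    def _run_partition(self, partition):
        items = list(partition)
        for kind, fn in self._ops:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    # Actions: trigger the recorded work, partition by partition.
    def collect(self):
        result = []
        for partition in self._partitions:
            result.extend(self._run_partition(partition))
        return result

    def count(self):
        return len(self.collect())

base = ToyRDD([[1, 2, 3], [4, 5, 6]])          # two partitions
even_squares = base.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
```

Because each partition is processed independently in `_run_partition`, the same recorded operations could in principle run on different machines, which is the property the slide describes.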
  • 39. Apache Spark [diagram]: RDDs A to G linked by map, groupBy, filter and join transformations
  • 40. Spark stages [diagram]: RDDs A to F linked by map, groupBy and filter
  • 41. Spark stages [diagram]: RDDs A to F linked by map, groupBy and filter, then joined into RDD G
  • 42. Spark caching [diagram]: RDDs A to F linked by map, groupBy and filter, then joined into RDD G
  • 43. Evolution of the RDD API. DataFrame (Spark 1.3): untyped, for R and Python; adds concept of ‘Schema’ to describe the data; uses RDDs underneath; allows Spark engine to perform some optimizations; avoids use of Java serialization, uses off-heap storage; required a different API than RDDs. RDD (Spark 1.0): can be strongly typed in Java, Scala; uses RDDs underneath; catches compile-time errors. Dataset (Spark 2.x): unified API; typed and untyped.
  • 44. Spark Streaming [diagram; labelled input: TCP/IP]
  • 45. Other DAG/streaming processors (not supported by Cloudera)
  • 46. Spark ecosystem: Spark core; Spark SQL; Spark Streaming; Spark ML; GraphX. Cluster managers: Standalone; Mesos (not included in CDH); YARN
  • 47. Spark SQL: + Static typing (optional); + Storage and processing efficiencies
  • 48. ETL into EDW [diagram]: data sources feed ETL into the EDW, with an archive and data marts; outputs are canned reports, dashboards/analytic applications, non-SQL workloads and self-service BI/ad hoc
  • 49. EL-T into EDW [diagram]: data sources are extracted and loaded (EL) into the EDW, with the transform (T) step shown separately; also shown are an archive and data marts; outputs are canned reports, dashboards/analytic applications, non-SQL workloads and self-service BI/ad hoc
  • 50. Modern data warehouse landscape [diagram]: data sources feed a modern data platform (analytic database, operational database, data science & engineering, over a shared data layer) alongside the EDW; outputs are fixed reports, flexible reporting, dashboards/analytic applications, non-SQL workloads and self-service BI/ad hoc
  • 51. Cloudera’s featured data engineering partners [logo slide]; Hadoop Native Solution
  • 52. Big data fundamentals. Data analytics: Optimizing the engine to match the use case
  • 53. Apache Hive [diagram]: the Beeline CLI or a JDBC/ODBC driver connects to the HiveServer2 Thrift service; each session (SessionA, SessionB) has its own driver, compiler and executor; the Hive Metastore records location, schema, SerDe and file format; data lives in HDFS, BLOB or other storage.
  • 54. Apache Hive. ✓ Spins up processes under the control of YARN: shares resources well on the cluster, but there is a lot of overhead to create these processes. ✓ Can handle the failure of a machine during the query, but recovery takes many seconds. ✓ Will overflow join data to HDFS: can handle very large joins, but HDFS writes data 3 times, so this takes time. “Don’t forget who won the race, Bucko!” Hive on Spark (Cloudera, MapR, Databricks): improves speed due to efficiencies of Spark. Live Long and Process (Hortonworks): improves speed by using pre-allocated JVMs with caching. Presto (Facebook): improves speed by optimizing data transfers for SQL and using data streaming instead of HDFS for intermediate data. But all of these solutions are still JVM based.
  • 55. Apache Impala (the fastest of the antelopes). ✓ Written in C++: avoids issues of the JVM. ✓ Uses the Hive metastore: better integration for security and administration. ✓ Uses pre-allocated processes on worker nodes: no process spin-up time, but still builds an execution plan for each query. ✓ Employs algorithms from MPP databases. “But I left you in the dust at the starting line, Grandpa!” If a machine fails during a query that only takes 1 second to run, you will just retry the query. Adopted by: [logos]
  • 56. So which engine should I choose? "If the only tool you have is a hammer, you tend to see every problem as a nail." - Abraham Maslow, psychologist, author of ‘Maslow’s Hierarchy of Needs’ [diagram: Spark, Impala, MapReduce, Search, Hive and Pig over HDFS, Kudu, HBase, S3 and ADLS]
  • 57. Other SQL engines (not supported by Cloudera) [logo slide; labels in extraction order]: LLAP; Stinger.next; Cube; Hive++; aka Live Long And Process; For JSON lovers; Tied to proprietary front/back; Layer over HBase; SQL engines ‘from scratch’; Low Latency Analytical Processing; IBM Big SQL; OLAP
  • 58. How to interpret benchmark tests: Standard test? How many of the queries were run? What is the criterion for excluding a query? Single-user or multi-user? Data size? Allow modifications to the queries? "There are three kinds of lies: lies, damned lies, and statistics." - Benjamin Disraeli, Prime Minister of Britain
  • 59. Big data fundamentals. Life after Lambda architectures and IoT: Optimizing for time series and changing data
  • 60. Updates or analytics (but not both at the same time) [chart: analytics (scans) versus online (random access), each rated fast or slow]. One option is fast for scans and slow for random access: write once, read many; no updates, but can append (sort of); optimized for batch inserts and scans. The other is fast for random access and slow for scans: read, write, update individual rows; optimized row-based access, sparse columns.
  • 61. Lambda architectures (named for the simple shape)
  • 62. Lambda architectures (not so simple in practice). Source: http://horicky.blogspot.com/2014/08/lambda-architecture-principles.html
  • 63. Kudu design goals [chart: analytics (scans) versus online (random access), as on slide 60]. High throughput for big scans (goal: close to Parquet on HDFS). Low latency for short accesses, using primary key indexes and quorum design (goal: 1 ms read/write on SSD). Database-like semantics (initially single-row ACID). Relational data model: SQL query; “NoSQL” style scan/insert/update (Java client).
  • 64. Why are updates important? Right to be forgotten; ETL mistakes/corrections; analytic enrichment
  • 65. Life without Lambda [diagram]: BI and online access
  • 66. Kudu use cases. Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes. Time series data: examples are streaming market data, fraud detection and prevention, and risk monitoring; workload is inserts, updates, scans, lookups. Machine data analysis: example is network threat detection; workload is inserts, scans, lookups. Online reporting: example is ODS; workload is inserts, updates, scans, lookups.
  • 67. Big data fundamentals. Data science: Optimizing to detect complex patterns over time
  • 68. Ask bigger questions
  • 69. Data science is a big data problem: “It’s not who has the best algorithm that wins. It’s who has the most data.” Banko and Brill, 2001
  • 70. Notebooks. “What was our revenue last year?” can be answered by an RDBMS: $14,325,874,321.07. “What will our revenue be next year?” depends on assumptions, algorithms, source data and methodology. Your code tells a story: tell it with pictures and results; allow someone to re-run the numbers; pass it to someone who may use it as the basis for a new/different story.
  • 71. Notebook challenges. Access: for sensitive data, secure clusters are difficult to access, and IT typically doesn’t want random packages installed on a secure cluster. Popular open source tools don’t easily connect to these environments, or always support Hadoop data formats. Scale: laptops rarely have capacity for medium, let alone big data. This leads to a lot of sampling. Popular frameworks don’t easily parallelize on a cluster. Typically code has to get rewritten for production. Developer experience: notebooks, while awesome, don’t easily support virtual environments and dependency management, especially for teams. This makes sharing and reproducibility hard. Notebooks are also challenging to “put into production.”
  • 72. ‘Dependency hell’ or ‘I am my own Grandpa’ [diagram]: MyApp depends on X (1.0.0) and Y (1.0.0); after an upgrade, X (1.1.0) appears in the graph alongside X (1.0.0) and Y (1.0.0). Also shown: dependency graph for the Hadoop Java client (www.visioneye.com)
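The upgrade problem on the slide can be reproduced with a tiny dependency walk. The sketch below is illustrative only: the `find_conflicts` helper and the assumption that Y (1.0.0) itself requires X (1.0.0) are made up to show how one package can end up required at two versions.

```python
from collections import defaultdict

def find_conflicts(root, requires):
    """Walk the dependency graph from root and report every package
    that is required at more than one version.
    requires maps (name, version) -> list of (name, version)."""
    seen = set()
    versions = defaultdict(set)
    stack = [root]
    while stack:
        package = stack.pop()
        if package in seen:
            continue
        seen.add(package)
        name, version = package
        versions[name].add(version)
        stack.extend(requires.get(package, []))
    return {n: sorted(v) for n, v in versions.items() if len(v) > 1}

# Assumed for illustration: Y 1.0.0 itself depends on X 1.0.0.
y_needs = {("Y", "1.0.0"): [("X", "1.0.0")]}

before = find_conflicts(("MyApp", "1"), {
    ("MyApp", "1"): [("X", "1.0.0"), ("Y", "1.0.0")], **y_needs})

# MyApp upgrades X to 1.1.0, but Y still pulls in X 1.0.0.
after = find_conflicts(("MyApp", "2"), {
    ("MyApp", "2"): [("X", "1.1.0"), ("Y", "1.0.0")], **y_needs})
```

Before the upgrade there is no conflict; afterwards X is required at both 1.0.0 and 1.1.0, which is the situation the slide calls dependency hell.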
  • 73. 73© Cloudera, Inc. All rights reserved. Cloudera Data Science Workbench  Team-based  R, Python, Scala  SDLC  Secure  Containerized  Integrated into the cluster
  • 74. 74© Cloudera, Inc. All rights reserved. The importance of an open ecosystem Open ecosystem Black box
  • 75. 75© Cloudera, Inc. All rights reserved. Containers Hardware Host OS Hypervisor (Optional) GuestOS GuestOS GuestOS Libs Libs Libs AppA1 AppA2 AppB VM Hardware Host OSContainer Daemon Libs Libs AppA1 AppA3 AppA2 AppB1 AppB3 AppB2 AppB4 Container Containers • Use less memory than VMs • You get to use more of the machine you pay for • Provide isolation between apps • Can share libraries between similar apps • Provide abstraction of the OS, not of the HW • Get you out of ‘Dependency Hell’ against other applications
  • 76. 76© Cloudera, Inc. All rights reserved. Scaling data science for big data Master(s) Workers Gateway(s) Name Node YARN CDSW CDSW CDSW Session CDSW Session CDSW Session CDSW Session CDSW Session Data Node YARN Resource Pool(s) Data Node YARN Resource Pool(s) Data Node YARN Resource Pool(s) Data Node YARN Resource Pool(s) Web browser login Start session CDSW Session CDSW Session Kubernetes
  • 77. Machine learning pipeline in Spark: Load learning data frame; Clean/process data; Extract and transform features; Vectorize features; Save model; Scoring results; Test model; Fit and access model; Load test data frame; Test results; Load scoring data frame; Score data; Save results
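The stages named on this slide can be walked through with a toy example. The sketch below is plain Python, not the Spark API: the nearest-centroid model, the field names (`amount`, `visits`, `label`), and the sample records are all assumptions chosen to keep the example self-contained.

```python
from statistics import mean

def clean(rows):
    """Clean/process data: drop records with missing values."""
    return [r for r in rows if None not in r.values()]

def vectorize(row):
    """Vectorize features: turn a record into a numeric vector."""
    return [float(row["amount"]), float(row["visits"])]

def fit(rows):
    """Fit model: one mean vector (centroid) per label."""
    by_label = {}
    for r in rows:
        by_label.setdefault(r["label"], []).append(vectorize(r))
    return {label: [mean(col) for col in zip(*vecs)] for label, vecs in by_label.items()}

def predict(model, row):
    """Score one record: pick the label whose centroid is closest."""
    vec = vectorize(row)
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

def accuracy(model, rows):
    """Test model: fraction of test records predicted correctly."""
    return sum(predict(model, r) == r["label"] for r in rows) / len(rows)

# Load learning data (hypothetical records; one has a missing value).
learning = [
    {"amount": 10, "visits": 1, "label": "low"},
    {"amount": 12, "visits": 2, "label": "low"},
    {"amount": 90, "visits": 9, "label": "high"},
    {"amount": 110, "visits": 11, "label": "high"},
    {"amount": None, "visits": 3, "label": "low"},
]
model = fit(clean(learning))

# Load test data and check the model.
test_rows = [
    {"amount": 15, "visits": 2, "label": "low"},
    {"amount": 95, "visits": 8, "label": "high"},
]
test_accuracy = accuracy(model, test_rows)

# Load scoring data (no labels) and score it.
scoring = [{"amount": 80, "visits": 7}, {"amount": 9, "visits": 1}]
results = [predict(model, r) for r in clean(scoring)]
print(test_accuracy, results)
```

In a real Spark job each function would map onto a DataFrame transformation or a pipeline stage, and the fitted model would be saved between the training and scoring runs.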
  • 78. 78© Cloudera, Inc. All rights reserved. Big data fundamentals Big Data in the Clouds Optimizing for a variety of operational choices
  • 79. 79© Cloudera, Inc. All rights reserved. My organization is moving to the cloud, why should we consider ?
  • 80. Traditional applications. Five separate application stacks (Data Exploration; SQL & BI Analytics; Operational Real-Time DB; ETL & Data Processing; Custom Functions), each with its own STORAGE, SECURITY, GOVERNANCE, WORKLOAD MGMT, INGEST & REPLICATION, and DATA CATALOG layers. Many data silos, each with its own proprietary tools and infrastructure. Different vendors, products, and services on-premises versus in cloud. A fragmented approach is difficult, expensive, and risky.
  • 81. 81© Cloudera, Inc. All rights reserved. Multiple compute engines, same data OPERATIONAL DATABASE DATA ENGINEERING ANALYTIC DATABASE DATA SCIENCE HDFS Kudu S3 ADLS Data Storage CORE SERVICES STORAGE SERVICES
  • 82. 82© Cloudera, Inc. All rights reserved. Common metadata Data Catalog, Security, Governance, Lineage, Metadata Tags OPERATIONAL DATABASE DATA ENGINEERING ANALYTIC DATABASE DATA SCIENCE HDFS Kudu S3 ADLS Data Storage CORE SERVICES STORAGE SERVICES METADATA SERVICES
  • 83. 83© Cloudera, Inc. All rights reserved. Data Silos 2.0 DW Cluster DW Service Source Data A B D C
  • 84. Deployed anywhere Data Catalog, Security, Governance, Lineage, Metadata Tags OPERATIONAL DATABASE DATA ENGINEERING ANALYTIC DATABASE DATA SCIENCE HDFS Kudu S3 ADLS Data Storage PRIVATE CLOUD BARE METAL INFRASTRUCTURE SERVICES CORE SERVICES STORAGE SERVICES METADATA SERVICES DEPLOYMENT OPTIONS
  • 85. 85© Cloudera, Inc. All rights reserved. But that still wasn’t quite right ….
  • 86. 86© Cloudera, Inc. All rights reserved. How do we deal with hybrid clouds? • Shared catalog • Unified security • Consistent governance • Easy workload management • Flexible ingest and replication
  • 87. Cloudera Enterprise Data Catalog, Security, Governance, Lineage, Metadata Tags OPERATIONAL DATABASE DATA ENGINEERING ANALYTIC DATABASE DATA SCIENCE HDFS Kudu S3 ADLS Data Storage CORE SERVICES STORAGE SERVICES PRIVATE CLOUD BARE METAL INFRASTRUCTURE SERVICES DEPLOYMENT OPTIONS The modern platform for machine learning and analytics, optimized for the cloud
  • 88. 88© Cloudera, Inc. All rights reserved. Deployment & management options Bare Metal Private Cloud Cloud IaaS Cloud PaaS Applications Applications Applications Applications Clusters Clusters Clusters Clusters Operating System Operating System Operating System Operating System Network Network Network Network Storage Storage Storage Storage Servers Servers Servers Servers Customer managed Vendor managed Manager Director Altus
  • 89. 89© Cloudera, Inc. All rights reserved. • Easy • Agile • Unified
  • 90. 90© Cloudera, Inc. All rights reserved. Altus service architecture ● Runs in Cloudera’s secured and monitored environment ● Manages CDH clusters in customer cloud account ● Customer data does not pass* to Cloudera * Workload Analytics requires opt-in log data transfer to Cloudera
  • 91. 91© Cloudera, Inc. All rights reserved. Keep your encryption keys outside of the cloud
  • 92. 92© Cloudera, Inc. All rights reserved. Cloudera usage based pricing option Pay per use Node based pricing  Cheaper for transient clusters  Cheaper for small machine types  Pay as you go or discounted credits  Cheaper for persistent or long-running clusters  Volume & enterprise discounts
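The trade-off between the two pricing options comes down to a break-even point in cluster hours. The sketch below works through that arithmetic; the rates are made-up placeholders, not Cloudera prices, and the function names are hypothetical.

```python
def break_even_hours(hourly_rate, monthly_node_rate):
    """Hours per month at which pay-per-use costs the same as node-based pricing."""
    return monthly_node_rate / hourly_rate

def cheaper_option(hours_per_month, nodes, hourly_rate, monthly_node_rate):
    """Compare the monthly cost of both options for one cluster."""
    pay_per_use = hours_per_month * nodes * hourly_rate
    node_based = nodes * monthly_node_rate
    choice = "pay per use" if pay_per_use < node_based else "node based"
    return choice, pay_per_use, node_based

# Placeholder rates: 0.50 per node-hour versus 200.00 per node per month.
HOURLY_RATE = 0.5
MONTHLY_NODE_RATE = 200.0

print(break_even_hours(HOURLY_RATE, MONTHLY_NODE_RATE))        # hours per month
print(cheaper_option(40, 10, HOURLY_RATE, MONTHLY_NODE_RATE))   # transient cluster
print(cheaper_option(720, 10, HOURLY_RATE, MONTHLY_NODE_RATE))  # always-on cluster
```

With these placeholder rates, a cluster that runs fewer than 400 hours a month favors pay per use, and an always-on cluster favors node-based pricing, which matches the slide's split between transient and persistent clusters.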
  • 93. Hot-Warm-Cold Data Store partitions from the same table in different storage types m4.4xlarge m4.4xlarge i2.2xlarge serve serve preload serve preload serve d2.4xlarge serve 0 1 3 14 Days of ‘Hot’ Data AWS Instance premium – 200% AWS Instance premium – 320% preload S3 S3 EBS S3 S3
  • 94. BDR to Blob Storage  Minimum Storage Cost  No Backup Cluster Costs (servers or subscription)  RPO unaffected  Cloud provider manages regional locality ✗ RTO longer user sales contracts North America .snapshots snap 4-21-17 Contract1.txt Contract2.txt Contract1.txt Contract2.txt AWS S3 ADLS
  • 95. 95© Cloudera, Inc. All rights reserved. Big data fundamentals Cybersecurity Optimizing to detect complex attacks over longer periods of time
  • 96. Cybersecurity is a big data problem. Explosion of data: Popular cyber platforms cannot cost-effectively scale to the volume and variety of modern data. Limited enterprise visibility: Only a partial view of the enterprise limits analytics and slows investigations. Limited analytic processing: Difficult to deploy advanced machine learning detection capabilities. [Figure: data access (1%, 50%, 100%) against data volume (1 TB, 1 PB, 10 PB) over time, from archived data to emerging data; User, Network, and Endpoint sources; rule form IF (X) AND (Y) THEN (Z)]
  • 97. 97© Cloudera, Inc. All rights reserved. Open Data Models: Enterprise Visibility, Support For Multiple Workloads Endpoint User Network DIVERSE DATA SOURCES SINGLE ACCESS Source: Momentum Partners Cybersecurity Snapshot April 2016
  • 98. The value of Apache Spot. Detect advanced threats faster with a full complement of analytic frameworks for all cyber workloads. Faster time to incident investigation and response with comprehensive enterprise visibility. Change the economics of cybersecurity with an open source platform that supports multiple LOB workloads.
  • 99. 99© Cloudera, Inc. All rights reserved. Many applications on one shared data set and architecture Visualization & machine learning applications can share common data set & infrastructure CustomPackaged Spot community is developing out machine learning (e.g. network threat detection) Open Source Build custom applications & analytics using Cloudera without having to buy new infrastructure
  • 100. But I already have Splunk … Go beyond Splunk’s SPL • Share enriched data across multiple analytic processing engines • Simple search, SQL, Python, R, Scala Data flexibility • Faster, more agile, full-fidelity data acquisition • Data portability: Open data model and open storage Cost-effective scalability • Elastic scale on-prem or in the cloud • Cloud-native pay-per-use and transience • Proven at big data scale Hybrid • Runs across multiple clouds & on-prem • Multi-storage over S3, HDFS, Kudu, Isilon, etc.
  • 101. 101© Cloudera, Inc. All rights reserved. Big data fundamentals Management Optimizing for reliable uptime and optimal resource utilization
  • 102. 102© Cloudera, Inc. All rights reserved. Big data and the administrator Get up and running Monitor and maintain Troubleshoot and resolve Grow and adapt
  • 103. 103© Cloudera, Inc. All rights reserved. Get up and running Cloudera manager service Cloudera archives Cloudera manager agent Packages Templates RoleC RoleB RoleA Cluster member
  • 104. 104© Cloudera, Inc. All rights reserved. Monitor and maintain Services Hosts Applications Resources
  • 105. 105© Cloudera, Inc. All rights reserved. Troubleshoot and resolve Add your own customized charts See performance and resource utilization at a glance Select historical time period for charts
  • 106. 106© Cloudera, Inc. All rights reserved. Grow and adapt • Utilization by tenant • Project future needs • Prioritize pre-emption
  • 107. 107© Cloudera, Inc. All rights reserved. Backup and disaster recovery (BDR)  Distributed (uses distcp)  Work done by target cluster  Secure (can have different encryption keys on each side, encrypted in motion)  Bandwidth Limited (optional) user sales contracts North America .snapshots EMEA snap 4-21-17 Contract1.txt Contract2.txt Contract1.txt Contract2.txt Contract3.txt user sales contracts North America EMEA Contract1.txt Contract2.txt Contract3.txt .snapshots snap 4-21-17 Contract3.txt Federated clusters
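The directory trees on this slide suggest how a snapshot keeps incremental replication small: only files that changed since the last snapshot need to move. The sketch below illustrates that idea with the file names from the slide; `snapshot_diff` is a hypothetical helper and not how BDR or distcp is actually implemented.

```python
def snapshot_diff(snapshot_files, current_files):
    """Compare a snapshot listing with the live directory listing.

    Returns the files to copy to the target (new since the snapshot) and the
    files to delete on the target (removed since the snapshot).
    """
    snapshot, current = set(snapshot_files), set(current_files)
    return {
        "copy": sorted(current - snapshot),
        "delete": sorted(snapshot - current),
    }

# Listing captured in .snapshots/snap 4-21-17 on the source cluster.
snap_4_21_17 = ["Contract1.txt", "Contract2.txt"]

# Live directory on the source cluster after a new contract arrives.
live = ["Contract1.txt", "Contract2.txt", "Contract3.txt"]

plan = snapshot_diff(snap_4_21_17, live)
print(plan)
```

A real replication job would also compare sizes or checksums to catch files modified in place; this sketch looks at names only.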
  • 108. 108© Cloudera, Inc. All rights reserved. Big data fundamentals Information security Optimizing for minimum risk
  • 109. 109© Cloudera, Inc. All rights reserved. Big data security Authentication, authorization, audit and compliance Access Defining what users and applications can do with data Technical concepts: Permissions Authorization Data Protecting data in the cluster from unauthorized visibility Technical concepts: Encryption, tokenization, Data masking Visibility Reporting on where data came from and how it’s being used Technical concepts: Auditing Lineage Cloudera Manager Apache Sentry & RecordService Cloudera Navigator Navigator Encrypt & Key Trustee | Partners Perimeter Guarding access to the cluster itself Technical concepts: Authentication Network isolation
  • 110. 110© Cloudera, Inc. All rights reserved. Active directory and Kerberos Perimeter • Manages Users, Groups, and Services • Provides username / password authentication • Group membership determines service access Active directory • Trusted and standard third-party • Authenticated users receive “Tickets” • “Tickets” gain access to services Kerberos User authenticates to AD Authenticated user gets Kerberos Ticket Ticket grants access to Services e.g. Impala User [ssmith] Password[***** ]
  • 111. 111© Cloudera, Inc. All rights reserved. Apache Sentry • Apache Sentry is an authorization module for Hadoop • Apache Licensed project • Supported by multiple vendors • Used in many industries • Used by Hive, Impala, Search & Spark • Syncs with HDFS ACL • Supports ease of administration through role-based authorization (RBAC) Access Spark Bindings Spark
  • 112. Centralized role-based access control Sentry Perm. Read access to Transactions.Date… Where Country = US Sentry Perm. Read access to Customers.CustomerID … Where Country = US Sentry Role U.S. Customer Transaction Analysis Group Tier 1 Customer Support Reps Sam Smith Group Tier 1 Broker Analysts Martha Jones Cust. ID SSN Phone Country 6758493 329-44-9847 US 09:22:03 16-Feb-2015 344-22-9876 EU 5768459 585-11-2345 US Date/Time Cust. ID Trade Country 11:33:01 16-Feb-2015 Sell US 09:22:03 16-Feb-2015 344-22-9876 EU 13:45:24 16-Feb-2015 Buy US Access
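The user, group, role, and permission chain on this slide can be modeled in a few lines. This is a plain-Python illustration of the idea, not the Apache Sentry API: the user names, the assumption that both groups hold the role, and the sample rows are hypothetical, and only the columns the slide names explicitly are listed (the slide elides the rest).

```python
# user -> groups (user names are hypothetical)
USER_GROUPS = {
    "ssmith": ["Tier 1 Customer Support Reps"],
    "mjones": ["Tier 1 Broker Analysts"],
}

# group -> roles (assumes both groups are granted the same role)
GROUP_ROLES = {
    "Tier 1 Customer Support Reps": ["U.S. Customer Transaction Analysis"],
    "Tier 1 Broker Analysts": ["U.S. Customer Transaction Analysis"],
}

# role -> permissions: readable columns plus a row filter
ROLE_PERMISSIONS = {
    "U.S. Customer Transaction Analysis": [
        {"table": "Transactions", "columns": ["Date"], "where": ("Country", "US")},
        {"table": "Customers", "columns": ["CustomerID"], "where": ("Country", "US")},
    ],
}

def read(user, table, rows):
    """Return only the rows and columns the user's roles allow for `table`."""
    visible = []
    for group in USER_GROUPS.get(user, []):
        for role in GROUP_ROLES.get(group, []):
            for perm in ROLE_PERMISSIONS.get(role, []):
                if perm["table"] != table:
                    continue
                column, value = perm["where"]
                for row in rows:
                    if row.get(column) == value:
                        visible.append({c: row[c] for c in perm["columns"]})
    return visible

# Hypothetical customer rows with made-up identifiers.
customers = [
    {"CustomerID": 1, "SSN": "000-00-0001", "Country": "US"},
    {"CustomerID": 2, "SSN": "000-00-0002", "Country": "EU"},
    {"CustomerID": 3, "SSN": "000-00-0003", "Country": "US"},
]

print(read("ssmith", "Customers", customers))
print(read("unknown_user", "Customers", customers))
```

Because permissions attach to the role rather than to individuals, adding a new analyst only requires putting them in the right group.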
  • 113. 113© Cloudera, Inc. All rights reserved. Auditing Track, understand, and protect access to sensitive data • Auditing needs to happen automatically • Audit logs need to be immutable • Need to be able to drill down on events to the original events/data Visibility
  • 114. 114© Cloudera, Inc. All rights reserved. Governance Faceted search Natural language Incremental filters Drill down links Visibility Used to facilitate research and the ability to find groups of similar assets Jump to application log
  • 115. Metadata Automatic collection • No need to create XML files or manage manual controls Complete aggregation • Full coverage across all platform components Simple accessibility • Integrated user interface with full-text search
  • 116. 116© Cloudera, Inc. All rights reserved. Visibility Enterprise metadata The foundation for data management and governance Metadata enables you to put context and meaning to data to answer the important questions Technical Managed Custom Unified metadata repository Who are the high-value customers? How do we define that? How is high value calculated? Where is customer data stored and used? Is the data reliable and accurate?
  • 117. 117© Cloudera, Inc. All rights reserved. Lineage • Where did the data come from? • Who ran the process that created the data? • What code was used to generate the values? • Which files and columns were used to derive the values? Visibility
  • 118. Is it encrypted? Data written to HDFS ✓ Metadata in RDBMS ✗ Spill-over files ✗ Data
  • 119. 119© Cloudera, Inc. All rights reserved. Cloudera navigator encrypt Transparent layer between application and file system • Compliance-ready • Massively scalable • High performance: Optimized for Intel • Separation of duties • Key management with Navigator Key Trustee Data
  • 120. Cloudera Navigator Key Trustee “Virtual safe-deposit box” for managing encryption keys or other Hadoop security artifacts • Separates Keys from Encrypted Data • Centralized Management with Audit Controls • Integration with HSMs • Roadmap: Management of SSL certificates, SSH keys, tokens, passwords, Kerberos Keytab Files, and more Data
  • 121. Redacted Log Files SELECT * FROM customers WHERE ssn='123-45-6789' hive.server2.logging.operation.log.location HUE Saved Queries Audit Logs • Credit card numbers • Social security numbers • Email addresses • Server host names / IP
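Redacting the categories listed on this slide can be sketched with regular expressions. The patterns and replacement tokens below are simplified assumptions for illustration, not the rules Cloudera ships; real redaction needs stricter validation (for example card checksum tests) and host-name handling, which this sketch omits.

```python
import re

# Ordered (pattern, replacement) pairs; card numbers are handled before SSNs.
REDACTION_RULES = [
    (re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"), "[CARD]"),       # 16-digit card numbers
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),              # US social security numbers
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),     # email addresses
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),         # IPv4 addresses
]

def redact(line):
    """Apply every redaction rule to one log line."""
    for pattern, replacement in REDACTION_RULES:
        line = pattern.sub(replacement, line)
    return line

print(redact("SELECT * FROM customers WHERE ssn='123-45-6789'"))
print(redact("card 4111-1111-1111-1111 used by bob@example.com from 10.0.0.12"))
```

Running the rules over query logs, saved queries, and audit logs before they are written keeps the sensitive literal out of every copy, which is the point the slide makes with the `ssn` query.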
  • 122. 122© Cloudera, Inc. All rights reserved. Thank you The modern platform for machine learning and analytics, optimized for the cloud