Big Data Fundamentals 6.6.18

Discover the origins of big data, discuss existing and new projects, share common use cases for those projects, and explain how you can modernize your architecture using data analytics, data operations, data engineering and data science.

Big Data Fundamentals is your prerequisite to building a modern platform for machine learning and analytics optimized for the cloud.

We’ll close out with a live Q&A with some of our technical experts as well.

Big Data Fundamentals 6.6.18

  1. 1. 1© Cloudera, Inc. All rights reserved. Big data fundamentals Understanding the optimization choices in big data components
  2. 2. 2© Cloudera, Inc. All rights reserved. Presentation goals • Teach you something • Help you see the potential of Big Data beyond MapReduce • Be fair to Cloudera’s competitors • Inspire you to learn more If something doesn’t make sense, please ask.
  3. 3. 3© Cloudera, Inc. All rights reserved. Notification • The information in this document is proprietary to Cloudera. No part of this document may be reproduced, copied or transmitted in any form for any purpose without the express prior written permission of Cloudera. • This document is a preliminary version and not subject to your license agreement or any other agreement with Cloudera. This document contains only intended strategies, developments and functionalities of Cloudera products and is not intended to be binding upon Cloudera to any particular course of business, product strategy and/or development. Please note that this document is subject to change and may be changed by Cloudera at any time without notice. • Cloudera assumes no responsibility for errors or omissions in this document. Cloudera does not warrant the accuracy or completeness of the information, text, graphics, links or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose or non-infringement. • Cloudera shall have no liability for damages of any kind including without limitation direct, special, indirect or consequential damages that may result from the use of these materials. The limitation shall not apply in cases of gross negligence.
  4. 4. 4© Cloudera, Inc. All rights reserved. Agenda • Open source software • Data storage and stewardship • Data integration • Data engineering • Data analytics • Life after Lambda architectures and IoT • Data science at scale • Big data in the clouds • Cybersecurity as a Big Data problem • Cluster management and security • Customer success stories • Question and answers
  5. 5. 5© Cloudera, Inc. All rights reserved. Big data fundamentals Open source software Optimizing to benefit from community innovation
  6. 6. 6© Cloudera, Inc. All rights reserved. Free evaluation Install, test, inspect, and evaluate open source code in perpetuity, with no financial obligation Freedom from lock-in Multiple vendors supporting same core technology makes it easier to move Scalable innovation The collective work of a global, passionate community keeps the code base evolving 3 Reasons open source is good for companies [1] [2] [3] These benefits derive from use of the permissive Apache License
  7. 7. 7© Cloudera, Inc. All rights reserved. Not business focus Company assets should be working on core competency Real cost hard to measure Time developers spend solving problems or adding features often isn’t visible Multiple projects Each project is managed by a separate committee and there is not necessarily an overriding design 3 Reasons open source adds risk for companies [1] [2] [3]
  8. 8. 8© Cloudera, Inc. All rights reserved. “Open source software is free like a puppy is free” - Scott McNealy CEO Sun Microsystems
  9. 9. 9© Cloudera, Inc. All rights reserved. What if you got a dog for a reason? • Can take years to mature • Months of intensive training (when your attention should be elsewhere) • Dog becomes very bonded to the handler (and vice versa) • Poor training results in a misbehaving dog Developers don’t want to be tied to one system You don’t want your developers tied to one system
  10. 10. 10© Cloudera, Inc. All rights reserved. What is a distribution?
  11. 11. 11© Cloudera, Inc. All rights reserved. Each Apache project has its own dependencies and release cycle. Getting them to work together requires effort and thorough testing. Code in Open Source changes constantly. Cloudera provides a new feature release every quarter that is tested and supported. Distribution Vendors should employ Open Source Committers that can make sure fixes are added to the Open Source base. Benefits of using a distribution Stability Regular upgrades 24x7 Support and bug fixes
  12. 12. 12© Cloudera, Inc. All rights reserved. With a Distribution, you can start developing applications right away. Building an environment from scratch would take months. With a distribution, you know what it will cost and you know that it will work. Building an environment from scratch provides no such guarantees. Building an environment from scratch would require the focus of a few of your best developers. Get them working on the real problem. More benefits of using a distribution Faster to market Minimize risk Focus on business problems
  13. 13. 13© Cloudera, Inc. All rights reserved. The big data ecosystem vendors (Spark) (Kafka) Comprehensive distributions Single+ project specialists Proprietary + Hadoop in the gaps (Cassandra) Google Cloud Dataproc
  14. 14. 14© Cloudera, Inc. All rights reserved. Apache software foundation ASF board of directors Project management committee chair – ensures the project complies with ASF requirements PMC members – decide the architecture, feature set and direction of the project, usually are also Committers Committers – have write access to the code, although contributions are approved by the PMC Developers (aka contributors) – anyone may propose changes to the code or documentation, but those changes have to be picked up and used by a committer Users – provide feedback, bug reports and feature suggestions appoints For each project
  15. 15. 15© Cloudera, Inc. All rights reserved. Apache project requirements • Must be Apache licensed (may include compatibly licensed elements) • Free to download and use for any purpose • Branding requirements and restrictions • Source code must be open and available on the ASF website • Must provide sufficient documentation to use the project on website • Releases must follow the ASF PMC voting policies • Corporations may not directly contribute – only individuals • Must govern themselves independently of undue commercial influence • Must not discourage new contributions from competing vendors • Low diversity may incur ‘extra scrutiny’ from the board However, there are NO requirements to: • Have more than one commercial entity involved (random community members are ok) • Contribute to an existing project when there is overlap in functionality (competitive projects are ok) • Contribute modifications or enhancements back to the project • Employ Committers or PMC members if you are a commercial vendor
  16. 16. 16© Cloudera, Inc. All rights reserved. Cloudera’s commitment to our customers Anything that stores your data Any APIs your applications call Uses open source code Our contributions and fixes go back to open source first When possible, use projects supported by multiple commercial vendors Keeping your cluster running Cloudera express edition No limit to number of servers Managing your applications Employ* committers, if not PMC members, on the projects we support * People manage their own careers. Temporary gaps may exist High availability features Ensure your success Open source License expiration won’t stop the cluster Free to use forever Provide enterprise value RBAC over your data 24x7 support Minimize your risk Rolling upgrades Data governance and lineage Automated backup and recovery Full disk encryption Multi-tenant usage reports
  17. 17. 17© Cloudera, Inc. All rights reserved. Big data fundamentals Data storage and stewardship Optimizing for inexpensive, reliable storage accessed by multiple execution engines
  18. 18. 18© Cloudera, Inc. All rights reserved. Anatomy of a big data cluster Masters Workers Gateway(s) Cloudera Manager Data Node HBase Region Server Search YARN Resource Pool(s) CM Agent Data Node HBase Region Server Search YARN Resource Pool(s) CM Agent Data Node HBase Region Server Search YARN Resource Pool(s) CM Agent Data Node Kudu Tablet Server Impala Daemon YARN Resource Pool(s) CM Agent Data Node Kudu Tablet Server Impala Daemon YARN Resource Pool(s) CM Agent Data Node Kudu Tablet Server Impala Daemon YARN Resource Pool(s) CM Agent HMaster CM Agent HUE Server Zookeeper Name Node YARN Kudu Master ⭐️ Zookeeper Secondary Name Node Impala Catalog Store Kudu Master⭐️ HMaster CM Agent Sentry Server Zookeeper HiveServer Impala Statestore Kudu Master HMaster CM Agent Oozie Server CM Agent CDSW User App User App Metadata Database(s) CM Agent CDSW CDSW Session CDSW Session CDSW Session CDSW Session CDSW Session Cloud Plugin Cloudera Director (optional)
  19. 19. 19© Cloudera, Inc. All rights reserved. HDFS Name Node Secondary Name Node Standby Name Node Data NodeA Data NodeB Data NodeC Data NodeD FileQ BX BY BZ BX1 BX2 BX3 BY1 BY3 BY2 BZ3 BZ2 BZ1 Rack1 Rack2 Rack3 Default block size = 128 MB
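As a rough illustration of the layout on this slide, the sketch below splits a file into fixed-size blocks and places three replicas of each block on data nodes in different racks. The function names, the node-to-rack map, and the rotating placement rule are assumptions made for illustration; the real HDFS placement policy is more involved, and the block size is a configurable value that is simply passed in here.

```python
import math

MB = 1024 * 1024
BLOCK_SIZE = 256 * MB  # example value only; the block size is configurable

# Hypothetical cluster layout: which data nodes sit in which rack.
NODES_BY_RACK = {
    "Rack1": ["DataNodeA", "DataNodeB"],
    "Rack2": ["DataNodeC"],
    "Rack3": ["DataNodeD"],
}

def count_blocks(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (the last block may be partial)."""
    return math.ceil(file_size_bytes / block_size_bytes)

def place_replicas(num_blocks, nodes_by_rack, replication=3):
    """Toy placement: rotate through the racks so the replicas of a block land
    on different racks where possible. This is not the real HDFS policy."""
    racks = sorted(nodes_by_rack)
    placement = {}
    for block in range(num_blocks):
        chosen = []
        for r in range(replication):
            rack = racks[(block + r) % len(racks)]
            nodes = nodes_by_rack[rack]
            chosen.append(nodes[block % len(nodes)])
        placement[block] = chosen
    return placement

num_blocks = count_blocks(600 * MB, BLOCK_SIZE)   # a 600 MB example file
placement = place_replicas(num_blocks, NODES_BY_RACK)
```

With these example values the file needs three blocks, and each block ends up with one replica in each of the three racks, so losing a single rack still leaves two copies of every block.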
  20. 20. 20© Cloudera, Inc. All rights reserved. HDFS Snapshots … user hive tables sales subscriptions Data1.parquet Data2.parquet .snapshot Data Node BX1 Name Node BY1 BZ1 BY2 BX2 BZ2 BY1 BX2 BY2 BX1 BZ1 BZ2 BX1 BY1 BY2 BX2 BZ1 BZ2 snap1 Data1.parquet Data2.parquet
  21. 21. 21© Cloudera, Inc. All rights reserved. Public cloud blob storage Public clouds are offering low cost, highly available storage Designed for access inside and outside of Hadoop • Amazon Simple Storage Service (S3) Uses ‘bucket’ paradigm Requires S3Guard (Apache open source) to achieve consistency Use protocol s3a://<bucket name>/<filename> • Microsoft Azure Data Lake Store (ADLS) ‘Feels’ more like a normal (POSIX) file system Use protocol adl://<directory>/<directory>/filename
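To make the URI conventions on this slide concrete, here is a minimal sketch that splits a storage URI into its scheme, container, and path using only the Python standard library. The helper name `split_storage_uri` and the bucket and path names are hypothetical; an actual Hadoop client resolves these URIs through its own filesystem connectors.

```python
from urllib.parse import urlparse

def split_storage_uri(uri):
    """Split a storage URI into (scheme, container, path).
    For s3a:// the container is the bucket; for hdfs:// it is the name node address."""
    parts = urlparse(uri)
    return parts.scheme, parts.netloc, parts.path

# Hypothetical locations of the same file on two storage layers.
s3_location = split_storage_uri("s3a://example-bucket/sales/Data1.parquet")
hdfs_location = split_storage_uri("hdfs://namenode:8020/user/hive/tables/sales/Data1.parquet")
```

The point of the sketch is that only the scheme and container change between storage layers; the path below them can stay the same, which is what lets several engines read the same data.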
  22. 22. 22© Cloudera, Inc. All rights reserved. Compute over storage Spark Impala MapReduce Search Hive Pig ADLS Kudu HDFS Compute Storage Filesystem S3 HBase
  23. 23. 23© Cloudera, Inc. All rights reserved. Schema on write or ‘structured data’ 1. Define schema 2. Create table(s) 3. Map known fields 4. Discard unknown fields
  24. 24. 24© Cloudera, Inc. All rights reserved. Schema on read or ‘unstructured data’ 1. Write whole record(s) to filesystem (compressed) 2. Register schema with metastore 3. Query engine applies schema to data
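The difference between schema on write and schema on read can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular engine implements it: the sample records, field names, and both function names are invented for the example.

```python
import json

# Hypothetical raw records; the second one carries a field nobody planned for.
RAW_RECORDS = [
    '{"id": 1, "name": "Ada", "plan": "gold"}',
    '{"id": 2, "name": "Lin", "referrer": "partner-x"}',
]

def schema_on_write(raw_lines, columns):
    """Parse at load time and keep only the known columns.
    Fields outside the schema are discarded for good."""
    table = []
    for line in raw_lines:
        record = json.loads(line)
        table.append({col: record.get(col) for col in columns})
    return table

def schema_on_read(raw_lines, columns):
    """Leave the stored lines untouched and apply a schema only at query time."""
    for line in raw_lines:
        record = json.loads(line)
        yield tuple(record.get(col) for col in columns)

loaded = schema_on_write(RAW_RECORDS, ["id", "name"])
late_query = list(schema_on_read(RAW_RECORDS, ["id", "referrer"]))
```

After the schema-on-write load, the `referrer` field is gone. With schema on read, a schema registered later can still reach it because the whole record was kept.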
  25. 25. 25© Cloudera, Inc. All rights reserved. Popular file format options XML, JSON Files Can’t be both split and compressed Text/Delimited/CSV/JSON Records Usable everywhere Schema on read Poor performance, poor compression Avro Contain schema, but also allow schema on read Usable inside and outside of Hadoop Parquet Columnar, splittable, query performance benefits, excellent compression Support schema evolution (adding columns) Skips columns well during scans ORC (not supported by Cloudera, HDP Hive only) Similar to Parquet, with higher compression but poor data skipping Hortonworks working on ACID transactions, secondary indexes Example sizes by file type: Uncompressed CSV 1.8 GB; Avro 1.5 GB; Avro w/ snappy compression 750 MB; Parquet w/ snappy compression 300 MB
  26. 26. 26© Cloudera, Inc. All rights reserved. Raw and formatted data copies • Keep the raw version if there is a chance that information will be lost in the translation • Use columnar storage on formatted data to improve analytic performance immensely • Think about a metadata tagging policy (e.g. Cloudera Navigator) to assist with data stewardship
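A small sketch can show why a columnar layout helps scans and compression. It pivots rows into columns, sums one column without touching the others, and run-length encodes a repetitive column. The sample rows and function names are invented, and run-length encoding is used only as one simple example of why similar values stored together compress well; it is not a description of Parquet's actual file layout.

```python
from itertools import groupby

# Hypothetical order records in row form.
ROWS = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "EMEA", "amount": 80.0},
    {"order_id": 3, "region": "APAC", "amount": 200.0},
]

def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

columns = to_columns(ROWS)
total_amount = sum(columns["amount"])              # reads one column only
encoded_regions = run_length_encode(columns["region"])
```

The aggregate reads a single list instead of every field of every row, which is the "skips columns well during scans" benefit in miniature.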
  27. 27. 27© Cloudera, Inc. All rights reserved. Big data pipelines Data ingestion Data engineering Data stewardship Data science Data analytics Move Cleanse Conform Transform Enrich Store Secure Govern Tag Model Score Enrich Predict BI Online APIs Capture Stream
  28. 28. 28© Cloudera, Inc. All rights reserved. Which do you want? Data lake Data hub
  29. 29. 29© Cloudera, Inc. All rights reserved. Data lake to a data hub • Comprehensive, planned and enforced data hierarchy • Carefully administered versioning and retention policies • Comprehensive, unified security, governance and lineage • Encourage and support metadata • Establish standards for data, metadata and analytic models • Maximize reuse of data without making copies • Balanced with security and performance concerns – don’t be an ideologue! • Plan staffing around new roles
  30. 30. 30© Cloudera, Inc. All rights reserved. Big data fundamentals Data integration Optimizing for data ingestion with volume, velocity and variety
  31. 31. 31© Cloudera, Inc. All rights reserved. Apache Flume HDFS Flume Agent Flume Agent(s) Compress Flume Agent Flume Agent Flume Agent Flume Agent Flume Agent Filter Transform Flume Agent Encrypt Flume Agent •Pre-process data before storing • Such as transform, scrub or enrich • Store in any format • Text, compressed, binary, or custom sink •Collect data as it is produced • Files, syslogs, stdout or custom source •Process in place • Such as encrypt or compress • Write in parallel • Scalable throughput
  32. 32. 32© Cloudera, Inc. All rights reserved. Apache Kafka Broker1 TopicA-Partition0 Broker2 TopicA-Partition1 Broker3 TopicA-Partition2 Producer Producer ConsumerA Consumer Consumer Group ConsumerB Producers push to Kafka Consumers pull from Kafka
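The producer and consumer-group roles on this slide can be sketched without a Kafka client. The code below maps record keys to partitions and spreads partitions across the consumers of one group. Both functions are illustrative stand-ins: `zlib.crc32` replaces the hash that Kafka's own partitioner uses, and the round-robin assignment is only one of several possible strategies.

```python
import zlib

def choose_partition(key, num_partitions):
    """Map a record key to a partition so that equal keys share a partition.
    zlib.crc32 is a stand-in hash, not the one Kafka itself uses."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign_partitions(partitions, consumers):
    """Toy round-robin assignment of partitions to the consumers in one group.
    Each partition is read by exactly one consumer of the group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

first = choose_partition("user-42", 3)
second = choose_partition("user-42", 3)
group_assignment = assign_partitions([0, 1, 2], ["ConsumerA", "ConsumerB"])
```

Because a key always hashes to the same partition, records for one key keep their order, while the group as a whole reads the three partitions in parallel.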
  33. 33. 33© Cloudera, Inc. All rights reserved. Kafka redundancy Broker3 TopicA-Partition2 TopicA-Partition0-Replica TopicA-Partition1-Replica Broker2 TopicA-Partition1 TopicA-Partition0-Replica TopicA-Partition2-Replica Broker1 TopicA-Partition0 TopicA-Partition1-Replica TopicA-Partition2-Replica
  34. 34. 34© Cloudera, Inc. All rights reserved. Apache Sqoop RDBMS HDFS ▪ Rapidly moves large amounts of data between relational databases and HDFS – Import tables (or partial tables) from an RDBMS into HDFS – Export data from HDFS to a database table ▪ Uses JDBC to connect to the database – Works with virtually all standard RDBMSs ▪ Custom “connectors” for some RDBMSs provide much higher throughput – Available for certain databases, such as Teradata and Oracle
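One reason a table import can move data quickly is that the work is divided among parallel tasks, each reading a slice of a key range. The sketch below shows that idea in isolation. It is a simplified illustration, not Sqoop's code: the function names, the `id` column, and the even split are assumptions.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous,
    roughly equal inclusive sub-ranges, at most one per mapper."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # more mappers than keys: leave this mapper idle
        ranges.append((start, start + size - 1))
        start += size
    return ranges

def where_clauses(column, ranges):
    """Build one SQL filter per sub-range, one for each parallel task."""
    return [f"{column} >= {lo} AND {column} <= {hi}" for lo, hi in ranges]

even_split = split_ranges(1, 100, 4)
uneven_split = split_ranges(1, 10, 3)
filters = where_clauses("id", even_split)
```

Each task then issues its own query over JDBC with its own filter, so the sub-ranges are read from the database at the same time.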
  35. 35. 35© Cloudera, Inc. All rights reserved. Big data fundamentals Data engineering Optimizing for parallel processing of big data with minimum code
  36. 36. 36© Cloudera, Inc. All rights reserved. Directed acyclic graph (DAG)
  37. 37. 37© Cloudera, Inc. All rights reserved. Directed acyclic graph (DAG) ✔ ✖
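The property that separates a valid DAG from an invalid one is the absence of cycles, which is also what makes a run order possible. A minimal sketch using Python's standard `graphlib` module (Python 3.9 or later) is shown below; the step names are invented for the example.

```python
from graphlib import TopologicalSorter, CycleError

def execution_order(dependencies):
    """Return a valid run order for a dependency graph, or None if it has a cycle.
    `dependencies` maps each step to the set of steps it depends on."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError:
        return None

# Acyclic: every step only depends on earlier steps.
valid_graph = {"load": set(), "clean": {"load"}, "join": {"clean"}}
# Cyclic: each step waits for the other, so nothing can ever start.
invalid_graph = {"a": {"b"}, "b": {"a"}}

valid_order = execution_order(valid_graph)
invalid_order = execution_order(invalid_graph)
```

An engine can schedule the acyclic graph step by step, while the cyclic one has no starting point, which is why processing frameworks require the graph to be acyclic.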
  38. 38. 38© Cloudera, Inc. All rights reserved. Resilient Distributed Dataset (RDD) An RDD is an immutable distributed collection of elements of your data, partitioned across nodes in your cluster that can be operated in parallel with an API that offers transformations and actions. map (function) filter (predicate) sortBy (function) join (RDD2)
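The ideas of immutability, partitioning, lazy transformations, and actions can be sketched with a small teaching class. `ToyRDD` is a hypothetical name and is not Spark's API: it runs in a single process and only mimics the way transformations are recorded first and executed when an action is called.

```python
class ToyRDD:
    """Immutable, partitioned, lazily evaluated collection (teaching toy only)."""

    def __init__(self, partitions, lineage=()):
        self._partitions = tuple(tuple(p) for p in partitions)
        self._lineage = tuple(lineage)  # recorded transformations, not yet run

    def map(self, fn):
        """Transformation: returns a new ToyRDD and computes nothing yet."""
        return ToyRDD(self._partitions, self._lineage + (("map", fn),))

    def filter(self, predicate):
        """Transformation: returns a new ToyRDD and computes nothing yet."""
        return ToyRDD(self._partitions, self._lineage + (("filter", predicate),))

    def collect(self):
        """Action: run the recorded lineage over every partition."""
        out = []
        for part in self._partitions:
            items = list(part)
            for kind, fn in self._lineage:
                if kind == "map":
                    items = [fn(x) for x in items]
                else:
                    items = [x for x in items if fn(x)]
            out.extend(items)
        return out

numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])          # two partitions
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
result = even_squares.collect()
```

Each partition is processed independently inside `collect`, which is the part a real engine would run in parallel on different nodes. The original `numbers` object is never modified.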
  39. 39. 39© Cloudera, Inc. All rights reserved. Apache Spark RDDA RDDB RDDC RDDD RDDE RDDF RDDG map groupBy filter map join
  40. 40. 40© Cloudera, Inc. All rights reserved. Spark stages RDDA RDDB RDDC RDDD RDDE RDDF map groupBy filter map
  41. 41. 41© Cloudera, Inc. All rights reserved. Spark stages RDDA RDDB RDDC RDDD RDDE RDDF map groupBy filter map RDDG join
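Stage boundaries fall where data has to be redistributed across the cluster. The sketch below groups a linear chain of operations into stages, starting a new stage at every operation assumed to need a shuffle. It is a simplification: real lineage graphs branch (as the join on this slide does), and the set of shuffle operations listed here is an assumption for the example.

```python
# Operations assumed, for this sketch, to redistribute data across nodes.
SHUFFLE_OPERATIONS = {"groupBy", "join", "sortBy"}

def split_into_stages(operations):
    """Group a linear chain of operations into stages.
    A new stage begins at each shuffle operation; the narrow operations
    that follow it are pipelined into the same stage."""
    stages = [[]]
    for op in operations:
        if op in SHUFFLE_OPERATIONS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages

stages = split_into_stages(["map", "groupBy", "filter", "map"])
```

For the chain on the slide, the first `map` forms its own stage, and `groupBy`, `filter`, and the second `map` run together in the next stage once the shuffle has finished.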
  42. 42. 42© Cloudera, Inc. All rights reserved. Spark caching RDDA RDDB RDDC RDDD RDDE RDDF map groupBy filter map RDDG join
  43. 43. 43© Cloudera, Inc. All rights reserved. Evolution of the RDD API DataFrame (Spark 1.3) • Untyped, for R and Python • Adds concept of ‘Schema’ to describe the data • Uses RDDs underneath • Allows Spark engine to perform some optimizations • Avoids use of Java serialization, uses off heap storage • Required different API than RDDs RDD (Spark 1.0) • Can be strongly typed in Java, Scala • Uses RDDs underneath • Catch compile-time errors Dataset (Spark 2.x) • Unified API • Typed and untyped
  44. 44. 44© Cloudera, Inc. All rights reserved. Spark Streaming TCP/IP
  45. 45. 45© Cloudera, Inc. All rights reserved. Other DAG/streaming processors (not supported by Cloudera)
  46. 46. 46© Cloudera, Inc. All rights reserved. Spark ecosystem Spark core Spark SQL Spark Streaming Spark ML GraphX Standalone Mesos (not included in CDH) Yarn
  47. 47. 47© Cloudera, Inc. All rights reserved. Spark SQL + Static typing (optional) + Storage and processing efficiencies
  48. 48. 48© Cloudera, Inc. All rights reserved. ETL into EDW Data sources ETL EDW Archive Data marts Canned reports Dashboards/ analytic applications Non-SQL workloads Self-service BI/ad hoc EDW
  49. 49. 49© Cloudera, Inc. All rights reserved. EL-T into EDW Data sources EL EDW Archive Data marts Canned reports Dashboards/ analytic applications Non-SQL workloads Self-service BI/ad hoc T
  50. 50. 50© Cloudera, Inc. All rights reserved. Modern data warehouse landscape Data sources Analytic database Operational database Data Science & engineering Shared data layer Modern Data Platform Fixed reports Dashboards/ analytic applications Non-SQL workloads Self-service BI/ad hoc Flexible reporting EDW
  51. 51. 51© Cloudera, Inc. All rights reserved. Cloudera’s featured data engineering partners Hadoop Native Solution
  52. 52. 52© Cloudera, Inc. All rights reserved. Big data fundamentals Data analytics Optimizing the engine to match the use case
  53. 53. 53© Cloudera, Inc. All rights reserved. Apache Hive Hive Metastore HDFS BLOB OtherStorage Location Schema SerDe File format HiveServer2 Thrift Service Beeline CLI JDBC ODBC Driver Compiler Executor Driver Compiler Executor SessionA SessionB or
  54. 54. 54© Cloudera, Inc. All rights reserved. Apache Hive ✓ Spins up processes under the control of YARN - shares resources well on the cluster - but there is a lot of overhead to create these processes ✓ Can handle the failure of a machine during the query - but recovery takes many seconds ✓ Will overflow join data to HDFS - can handle very large joins - but HDFS writes data 3 times, so this takes time Don’t forget who won the race, Bucko! Hive on Spark (Cloudera, MapR, Databricks) ✓Improves speed due to efficiencies of Spark Live Long and Process (Hortonworks) ✓Improves speed by using pre-allocated JVMs w/ caching Presto (Facebook) ✓Improves speed by optimizing data transfers for SQL and using data streaming instead of HDFS for intermediate data But all of these solutions are still JVM based
  55. 55. 55© Cloudera, Inc. All rights reserved. Apache Impala ✓ Written in C++ - avoids issues of the JVM ✓ Uses the Hive metastore - better integration for security and administration ✓ Uses pre-allocated processes on worker nodes - no process spin up time - but still builds an execution plan for each query ✓ Employs algorithms from MPP databases But I left you in the dust at the starting line, Grandpa! If a machine fails during a query that only takes 1 second to run, you will just retry the query. Adopted by: (the fastest of the antelopes)
  56. 56. 56© Cloudera, Inc. All rights reserved. So which engine should I choose? "If the only tool you have is a hammer, you tend to see every problem as a nail." - Abraham Maslow Psychologist Author of ‘Maslow’s Hierarchy of Needs’ Spark Impala MapReduce Search Hive Pig ADLS Kudu HDFS Filesystem S3 HBase
  57. 57. 57© Cloudera, Inc. All rights reserved. Other SQL engines LLAP Stinger.next Cube Hive ++ aka Live Long And Process For JSON lovers Tied to proprietary front/back Layer over HBase SQL engines ‘from scratch’ Low Latency Analytical Processing (not supported by Cloudera) IBM Big SQL OLAP
  58. 58. 58© Cloudera, Inc. All rights reserved. How to interpret benchmark tests Standard test? How many of the queries were run? What is the criterion for excluding a query? Single-user or multi-user? Data size? Allow modifications to the queries? "There are three kinds of lies: lies, damned lies, and statistics." -Benjamin Disraeli Prime Minister of Britain
  59. 59. 59© Cloudera, Inc. All rights reserved. Big data fundamentals Life after lambda architectures and IoT Optimizing for time series and changing data
  60. 60. 60© Cloudera, Inc. All rights reserved. Updates or analytics using Analytics (Scans) Online (Random Access) slow slow fast fast (but not both at the same time) Write once, read many. No updates, but can append (sort of) Optimized for batch inserts and scans Read, write, update individual rows Optimized row-based access, sparse columns
  61. 61. 61© Cloudera, Inc. All rights reserved. Lambda architectures (named for the simple shape)
  62. 62. 62© Cloudera, Inc. All rights reserved. Lambda architectures (not so simple in practice) Source: http://horicky.blogspot.com/2014/08/lambda-architecture-principles.html
  63. 63. 63© Cloudera, Inc. All rights reserved. Kudu design goals using Analytics (Scans) Online (Random Access) slow slow fast fast High throughput for big scans Goal: Close to Parquet on HDFS Low-latency for short accesses (primary key indexes and quorum design) Goal: 1ms read/write on SSD Database-like semantics (initially single-row ACID) Relational data model SQL query “NoSQL” style scan/insert/update (Java client)
  64. 64. 64© Cloudera, Inc. All rights reserved. Why are updates important? Right to forget ETL mistakes/corrections Analytic enrichment
  65. 65. 65© Cloudera, Inc. All rights reserved. Life without Lambda with BI Online
  66. 66. 66© Cloudera, Inc. All rights reserved. Kudu use cases Kudu is best for use cases requiring a simultaneous combination of sequential and random reads and writes ● Time series data ○ Examples: Stream market data; fraud detection and prevention; risk monitoring ○ Workload: Insert, updates, scans, lookups ● Machine data analysis ○ Examples: Network threat detection ○ Workload: Inserts, scans, lookups ● Online reporting ○ Examples: ODS ○ Workload: Inserts, updates, scans, lookups
  67. 67. 67© Cloudera, Inc. All rights reserved. Big data fundamentals Data science Optimizing to detect complex patterns over time
  68. 68. 68© Cloudera, Inc. All rights reserved. Ask bigger questions
  69. 69. 69© Cloudera, Inc. All rights reserved. Data science is a big data problem “It’s not who has the best algorithm that wins. It’s who has the most data.” Banko and Brill, 2001
  70. 70. 70© Cloudera, Inc. All rights reserved. Notebooks What was our revenue last year? RDBMS $14,325,874,321.07 What will our revenue be next year? • Assumptions • Algorithms • Source Data • Methodology Your code tells a story • Tell it with pictures & results • Allow someone to re-run the numbers • Pass it to someone who may use it as the basis for a new/different story
  71. 71. 71© Cloudera, Inc. All rights reserved. Notebook challenges Access For sensitive data, secure clusters are difficult to access. And IT typically doesn’t want random packages installed on a secure cluster. Popular open source tools don’t easily connect to these environments, or always support Hadoop data formats. Scale Laptops rarely have capacity for medium, let alone big data. This leads to a lot of sampling. Popular frameworks don’t easily parallelize on a cluster. Typically code has to get rewritten for production. Developer Experience Notebooks, while awesome, don’t easily support virtual environment and dependency management, especially for teams. This makes sharing and reproducibility hard. Notebooks are also challenging to “put into production.”
  72. 72. 72© Cloudera, Inc. All rights reserved. ‘Dependency hell’ Or ’I am my own Grandpa’ X (1.0.0) Y (1.0.0) MyApp X (1.0.0) Y (1.0.0) MyApp X (1.1.0) Upgrade Dependency Graph for Hadoop Java Client www.visioneye.com
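The upgrade problem on this slide can be expressed as a check for packages that are required at more than one exact version. The sketch below is illustrative only: the function name is invented, and the assumption that Y itself depends on X 1.0.0 is added here to make the conflict visible, since the slide's diagram does not spell it out in text.

```python
def find_conflicts(requirements):
    """`requirements` maps each component to the exact versions it needs.
    Return the packages that are required at more than one version,
    together with who asked for each version."""
    wanted = {}
    for component, deps in requirements.items():
        for package, version in deps.items():
            wanted.setdefault(package, {}).setdefault(version, []).append(component)
    return {pkg: versions for pkg, versions in wanted.items() if len(versions) > 1}

# Before the upgrade: everyone agrees on X 1.0.0.
before = {"MyApp": {"X": "1.0.0", "Y": "1.0.0"}, "Y": {"X": "1.0.0"}}
# After MyApp upgrades X, Y (assumed to still need X 1.0.0) disagrees.
after = {"MyApp": {"X": "1.1.0", "Y": "1.0.0"}, "Y": {"X": "1.0.0"}}

conflicts_before = find_conflicts(before)
conflicts_after = find_conflicts(after)
```

On a shared classpath only one version of X can be loaded, so either MyApp or Y ends up running against a version it was not built for.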
  73. 73. 73© Cloudera, Inc. All rights reserved. Cloudera Data Science Workbench • Team-based • R, Python, Scala • SDLC • Secure • Containerized • Integrated into the cluster
  74. 74. 74© Cloudera, Inc. All rights reserved. The importance of an open ecosystem Open ecosystem Black box
  75. 75. 75© Cloudera, Inc. All rights reserved. Containers Hardware Host OS Hypervisor (Optional) GuestOS GuestOS GuestOS Libs Libs Libs AppA1 AppA2 AppB VM Hardware Host OS Container Daemon Libs Libs AppA1 AppA3 AppA2 AppB1 AppB3 AppB2 AppB4 Container Containers • Use less memory than VMs • You get to use more of the machine you pay for • Provide isolation between apps • Can share libraries between similar apps • Provide abstraction of the OS, not of the HW • Get you out of ‘Dependency Hell’ against other applications
  76. 76. 76© Cloudera, Inc. All rights reserved. Scaling data science for big data Master(s) Workers Gateway(s) Name Node YARN CDSW CDSW CDSW Session CDSW Session CDSW Session CDSW Session CDSW Session Data Node YARN Resource Pool(s) Data Node YARN Resource Pool(s) Data Node YARN Resource Pool(s) Data Node YARN Resource Pool(s) Web browser login Start session CDSW Session CDSW Session Kubernetes
  77. 77. 77© Cloudera, Inc. All rights reserved. Machine learning pipeline in Spark Load learning data frame Clean/process data Extract and transform features Vectorize features Save model Scoring results Test model Fit and assess model Load test data frame Test results Load scoring data frame Score Data Save Results
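The flow of the pipeline (clean, fit, test, save, score) can be walked through with a deliberately tiny model in plain Python. This sketch does not use Spark or any machine learning library: the data, the one-parameter threshold "model", and every function name are invented to show the order of the steps only.

```python
import json

def clean(rows):
    """Drop rows with a missing feature value."""
    return [r for r in rows if r["amount"] is not None]

def fit(rows):
    """'Train' a one-parameter model: the midpoint between the class means."""
    pos = [r["amount"] for r in rows if r["label"] == 1]
    neg = [r["amount"] for r in rows if r["label"] == 0]
    return {"threshold": (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2}

def predict(model, row):
    return 1 if row["amount"] > model["threshold"] else 0

def accuracy(model, rows):
    return sum(predict(model, r) == r["label"] for r in rows) / len(rows)

# Learning data, including one dirty row that cleaning removes.
train = clean([
    {"amount": 10.0, "label": 0}, {"amount": 20.0, "label": 0},
    {"amount": None, "label": 1},
    {"amount": 80.0, "label": 1}, {"amount": 90.0, "label": 1},
])
model = fit(train)
saved = json.dumps(model)            # save the model
restored = json.loads(saved)         # load it again before testing and scoring

test_rows = [{"amount": 15.0, "label": 0}, {"amount": 85.0, "label": 1}]
test_accuracy = accuracy(restored, test_rows)
scores = [predict(restored, {"amount": a}) for a in (5.0, 70.0)]
```

The same separation applies at scale: the model is fitted once on learning data, checked on held-out test data, persisted, and then applied to new scoring data.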
  78. 78. 78© Cloudera, Inc. All rights reserved. Big data fundamentals Big Data in the Clouds Optimizing for a variety of operational choices
  79. 79. 79© Cloudera, Inc. All rights reserved. My organization is moving to the cloud, why should we consider ?
  80. 80. 80© Cloudera, Inc. All rights reserved. Traditional applications Data Exploration STORAGE SECURITY GOVERNANCE WORKLOAD MGMT INGEST & REPLICATION DATA CATALOG SQL & BI Analytics STORAGE SECURITY GOVERNANCE WORKLOAD MGMT INGEST & REPLICATION DATA CATALOG Operational Real-Time DB STORAGE SECURITY GOVERNANCE WORKLOAD MGMT INGEST & REPLICATION DATA CATALOG ETL & Data Processing STORAGE SECURITY GOVERNANCE WORKLOAD MGMT INGEST & REPLICATION DATA CATALOG Custom Functions STORAGE SECURITY GOVERNANCE WORKLOAD MGMT INGEST & REPLICATION DATA CATALOG Many data silos, each with its own proprietary tools and infrastructure Different vendors, products, and services on-premises versus in cloud A fragmented approach is difficult, expensive, and risky
  81. 81. 81© Cloudera, Inc. All rights reserved. Multiple compute engines, same data OPERATIONAL DATABASE DATA ENGINEERING ANALYTIC DATABASE DATA SCIENCE HDFS Kudu S3 ADLS Data Storage CORE SERVICES STORAGE SERVICES
  82. 82. 82© Cloudera, Inc. All rights reserved. Common metadata Data Catalog, Security, Governance, Lineage, Metadata Tags OPERATIONAL DATABASE DATA ENGINEERING ANALYTIC DATABASE DATA SCIENCE HDFS Kudu S3 ADLS Data Storage CORE SERVICES STORAGE SERVICES METADATA SERVICES
  83. 83. 83© Cloudera, Inc. All rights reserved. Data Silos 2.0 DW Cluster DW Service Source Data A B D C
  84. 84. 84© Cloudera, Inc. All rights reserved. Deployed anywhere Data Catalog, Security, Governance, Lineage, Metadata Tags OPERATIONAL DATABASE DATA ENGINEERING ANALYTIC DATABASE DATA SCIENCE HDFS Kudu S3 ADLS Data Storage PRIVATE CLOUD BARE METAL INFRASTRUCTURE SERVICES CORE SERVICES STORAGE SERVICES METADATA SERVICES DEPLOYMENT OPTIONS
  85. 85. 85© Cloudera, Inc. All rights reserved. But that still wasn’t quite right ….
  86. 86. 86© Cloudera, Inc. All rights reserved. How do we deal with hybrid clouds? • Shared catalog • Unified security • Consistent governance • Easy workload management • Flexible ingest and replication
  87. 87. 87© Cloudera, Inc. All rights reserved. Cloudera Enterprise Data Catalog, Security, Governance, Lineage, Metadata Tags OPERATIONAL DATABASE DATA ENGINEERING ANALYTIC DATABASE DATA SCIENCE HDFS Kudu S3 ADLS Data Storage CORE SERVICES STORAGE SERVICES PRIVATE CLOUD BARE METAL INFRASTRUCTURE SERVICES DEPLOYMENT OPTIONS The modern platform for machine learning and analytics optimized for the cloud
  88. 88. 88© Cloudera, Inc. All rights reserved. Deployment & management options Bare Metal Private Cloud Cloud IaaS Cloud PaaS Applications Applications Applications Applications Clusters Clusters Clusters Clusters Operating System Operating System Operating System Operating System Network Network Network Network Storage Storage Storage Storage Servers Servers Servers Servers Customer managed Vendor managed Manager Director Altus
  89. 89. 89© Cloudera, Inc. All rights reserved. • Easy • Agile • Unified
  90. 90. 90© Cloudera, Inc. All rights reserved. Altus service architecture ● Runs in Cloudera’s secured and monitored environment ● Manages CDH clusters in customer cloud account ● Customer data does not pass* to Cloudera * Workload Analytics requires opt-in log data transfer to Cloudera
  91. 91. 91© Cloudera, Inc. All rights reserved. Keep your encryption keys outside of the cloud
  92. 92. 92© Cloudera, Inc. All rights reserved. Cloudera usage based pricing option Pay per use Node based pricing • Cheaper for transient clusters • Cheaper for small machine types • Pay as you go or discounted credits • Cheaper for persistent or long-running clusters • Volume & enterprise discounts
  93. 93. 93© Cloudera, Inc. All rights reserved. Hot-Warm-Cold Data Store partitions from the same table in different storage types m4.4xlarge m4.4xlarge i2.2xlarge serve serve preload serve preload serve d2.4xlarge serve 0 1 3 14 Days of ‘Hot’ Data AWS Instance premium – 200% AWS Instance premium – 320% preload S3 S3 EBS S3 S3
  94. 94. 94© Cloudera, Inc. All rights reserved. BDR to Blob Storage • Minimum Storage Cost • No Backup Cluster Costs (servers or subscription) • RPO unaffected • Cloud provider manages regional locality ✗ RTO longer user sales contracts North America .snapshots snap 4-21-17 Contract1.txt Contract2.txt Contract1.txt Contract2.txt AWS S3 ADLS
  95. 95. 95© Cloudera, Inc. All rights reserved. Big data fundamentals Cybersecurity Optimizing to detect complex attacks over longer periods of time
  96. 96. 96© Cloudera, Inc. All rights reserved. Cybersecurity is a big data problem Popular cyber platforms cannot cost-effectively scale to the volume and variety of modern data A partial view of the enterprise limits analytics and slows investigations Difficult to deploy advanced machine learning detection capabilities Explosion of data Limited enterprise visibility Limited analytic processing Data Access 1% 50% 100% Data Volume 10PB 1PB 1TB IF (X) AND (Y) THEN (Z) Time User Network Endpoint Archived data Emerging data
  97. 97. 97© Cloudera, Inc. All rights reserved. Open Data Models: Enterprise Visibility, Support For Multiple Workloads Endpoint User Network DIVERSE DATA SOURCES SINGLE ACCESS Source: Momentum Partners Cybersecurity Snapshot April 2016
  98. 98. 98© Cloudera, Inc. All rights reserved. Detect advanced threats faster with full complement of analytic frameworks for all cyber workloads Faster time to incident investigation and response with comprehensive enterprise visibility Change the economics of cybersecurity with an open source platform that supports multiple LOB workloads The value of Apache Spot
  99. 99. 99© Cloudera, Inc. All rights reserved. Many applications on one shared data set and architecture Visualization & machine learning applications can share common data set & infrastructure Custom Packaged Spot community is building out machine learning (e.g. network threat detection) Open Source Build custom applications & analytics using Cloudera without having to buy new infrastructure
  100. 100. 100© Cloudera, Inc. All rights reserved. But I already have Splunk … Go Beyond Splunk’s SPL • Share enriched data across multiple analytic processing engines • Simple search, SQL, Python, R, Scala Data flexibility • Faster, more agile, full-fidelity data acquisition • Data portability: Open data model and open storage Cost-effective scalability • Elastic scale on-prem or in the cloud • Cloud-native pay-per-use and transience • Proven at big data scale Hybrid • Runs across multi-clouds & on-prem • Multi-storage over S3, HDFS, Kudu, Isilon, etc.
  101. 101. Big data fundamentals: Management. Optimizing for reliable uptime and optimal resource utilization
  102. 102. Big data and the administrator • Get up and running • Monitor and maintain • Troubleshoot and resolve • Grow and adapt
  103. 103. Get up and running. [Diagram labels: Cloudera Manager service; Cloudera archives; Cloudera Manager agent; packages; templates; roles A, B, and C; cluster member]
  104. 104. Monitor and maintain • Services • Hosts • Applications • Resources
  105. 105. Troubleshoot and resolve • See performance and resource utilization at a glance • Add your own customized charts • Select a historical time period for charts
  106. 106. Grow and adapt • Utilization by tenant • Project future needs • Prioritize pre-emption
  107. 107. Backup and disaster recovery (BDR) • Distributed (uses distcp) • Work done by target cluster • Secure (can have different encryption keys on each side, encrypted in motion) • Bandwidth limited (optional). [Diagram: federated clusters, each holding user/sales/contracts with North America and EMEA directories, files Contract1.txt, Contract2.txt, and Contract3.txt, and a .snapshots directory containing snap 4-21-17]
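The snapshot idea on the BDR slide can be illustrated with a toy diff. This is only a sketch, not how BDR or distcp is implemented: it assumes the diagram shows a snapshot taken on 4-21-17 that holds Contract1.txt and Contract2.txt, with Contract3.txt added afterwards, and it ignores modified files. The function names are hypothetical.

```python
def files_to_copy(snapshot: set, current: set) -> list:
    """Files present now but absent from the last snapshot."""
    return sorted(current - snapshot)


def files_to_delete(snapshot: set, current: set) -> list:
    """Files present in the last snapshot but removed since."""
    return sorted(snapshot - current)


# Assumed reading of the slide's diagram (illustrative only).
snap_4_21_17 = {"Contract1.txt", "Contract2.txt"}
current_listing = {"Contract1.txt", "Contract2.txt", "Contract3.txt"}

print(files_to_copy(snap_4_21_17, current_listing))    # → ['Contract3.txt']
print(files_to_delete(snap_4_21_17, current_listing))  # → []
```

Comparing against a snapshot means only the difference has to cross the network, which is why snapshots pair naturally with a bandwidth-limited copy.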
  108. 108. Big data fundamentals: Information security. Optimizing for minimum risk
  109. 109. Big data security: authentication, authorization, audit and compliance. Perimeter: guarding access to the cluster itself. Technical concepts: authentication, network isolation. (Cloudera Manager) Access: defining what users and applications can do with data. Technical concepts: permissions, authorization. (Apache Sentry & RecordService) Visibility: reporting on where data came from and how it’s being used. Technical concepts: auditing, lineage. (Cloudera Navigator) Data: protecting data in the cluster from unauthorized visibility. Technical concepts: encryption, tokenization, data masking. (Navigator Encrypt & Key Trustee | Partners)
  110. 110. Active Directory and Kerberos (Perimeter). Active Directory: • Manages users, groups, and services • Provides username/password authentication • Group membership determines service access. Kerberos: • Trusted and standard third party • Authenticated users receive “tickets” • “Tickets” gain access to services. Flow: user authenticates to AD (e.g. user [ssmith], password [*****]); authenticated user gets a Kerberos ticket; ticket grants access to services, e.g. Impala.
  111. 111. Apache Sentry (Access) • Apache Sentry is an authorization module for Hadoop • Apache-licensed project • Supported by multiple vendors • Used in many industries • Used by Hive, Impala, Search & Spark • Syncs with HDFS ACLs • Supports ease of administration through role-based access control (RBAC). [Diagram label: Spark bindings]
  112. 112. Centralized role-based access control (Access). Sentry role: U.S. Customer Transaction Analysis. Sentry permissions: read access to Transactions.Date… where Country = US; read access to Customers.CustomerID… where Country = US. Groups: Tier 1 Customer Support Reps (Sam Smith); Tier 1 Broker Analysts (Martha Jones). [Tables: Customers (Cust. ID, SSN, Phone, Country) and Transactions (Date/Time, Cust. ID, Trade, Country), each with sample US and EU rows]
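The user, group, role, and permission chain on this slide can be sketched as a few lookups. This is a minimal model, not Sentry's API: the usernames, the assumption that both groups hold the same role, and the way the "Country = US" condition is applied directly as a row filter are all illustrative. The sample rows are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Permission:
    table: str    # table the permission applies to
    country: str  # row filter: only rows with this country are readable


# Hypothetical assignments, loosely following the slide.
USER_GROUPS = {
    "ssmith": ["Tier 1 Customer Support Reps"],
    "mjones": ["Tier 1 Broker Analysts"],
}
GROUP_ROLES = {
    "Tier 1 Customer Support Reps": ["U.S. Customer Transaction Analysis"],
    "Tier 1 Broker Analysts": ["U.S. Customer Transaction Analysis"],
}
ROLE_PERMISSIONS = {
    "U.S. Customer Transaction Analysis": [
        Permission("Customers", "US"),
        Permission("Transactions", "US"),
    ],
}


def allowed_countries(user: str, table: str) -> set:
    """Resolve user -> groups -> roles -> permissions for one table."""
    return {
        perm.country
        for group in USER_GROUPS.get(user, [])
        for role in GROUP_ROLES.get(group, [])
        for perm in ROLE_PERMISSIONS.get(role, [])
        if perm.table == table
    }


def visible_rows(user: str, table: str, rows: list) -> list:
    """Return only the rows the user's roles allow them to read."""
    allowed = allowed_countries(user, table)
    return [row for row in rows if row["country"] in allowed]


# Illustrative rows, not taken from the slide.
customers = [
    {"cust_id": 1, "country": "US"},
    {"cust_id": 2, "country": "EU"},
    {"cust_id": 3, "country": "US"},
]
print(visible_rows("ssmith", "Customers", customers))
```

Because permissions hang off roles rather than users, adding a new analyst is a group-membership change and needs no new grants.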
  113. 113. Auditing (Visibility). Track, understand, and protect access to sensitive data • Auditing needs to happen automatically • Audit logs need to be immutable • Need to be able to drill down on events to the original events/data
  114. 114. Governance (Visibility). Faceted search: used to facilitate research and the ability to find groups of similar assets • Natural language • Incremental filters • Drill-down links • Jump to application log
  115. 115. Metadata. Automatic collection • No need to create XML files or manage manual controls. Complete aggregation • Full coverage across all platform components. Simple accessibility • Integrated user interface with full-text search
  116. 116. Enterprise metadata (Visibility). The foundation for data management and governance. Metadata enables you to put context and meaning to data to answer the important questions: Who are the high-value customers? How do we define that? How is high value calculated? Where is customer data stored and used? Is the data reliable and accurate? Unified metadata repository: technical, managed, custom.
  117. 117. Lineage (Visibility) • Where did the data come from? • Who ran the process that created the data? • What code was used to generate the values? • Which files and columns were used to derive the values?
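The lineage questions above amount to walking a dependency graph backwards. The following is a small sketch of that idea, assuming lineage is stored as a mapping from each derived dataset to its inputs, the user who ran the job, and the code used. All dataset, user, and file names are hypothetical, and this does not reflect how any particular lineage tool stores its data.

```python
# Hypothetical lineage records: each derived dataset lists its inputs,
# who ran the job that produced it, and the code that was used.
LINEAGE = {
    "high_value_customers": {
        "inputs": ["customers", "transactions"],
        "run_by": "etl_user",
        "code": "score_customers.sql",
    },
    "transactions": {
        "inputs": ["raw_trades"],
        "run_by": "ingest_user",
        "code": "load_trades.py",
    },
}


def upstream(dataset: str, lineage: dict = LINEAGE) -> set:
    """Return every dataset that the given dataset was derived from."""
    seen = set()
    stack = [dataset]
    while stack:
        node = stack.pop()
        for parent in lineage.get(node, {}).get("inputs", []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream("high_value_customers")))
# → ['customers', 'raw_trades', 'transactions']
print(LINEAGE["high_value_customers"]["run_by"])  # who ran the process
print(LINEAGE["high_value_customers"]["code"])    # what code was used
```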
  118. 118. Is it encrypted? (Data) • Data written to HDFS ✓ • Metadata in RDBMS ✗ • Spill-over files ✗
  119. 119. Cloudera Navigator Encrypt (Data). Transparent layer between application and file system • Compliance-ready • Massively scalable • High performance: optimized for Intel • Separation of duties • Key management with Navigator Key Trustee
  120. 120. Cloudera Navigator Key Trustee (Data). “Virtual safe-deposit box” for managing encryption keys or other Hadoop security artifacts • Separates keys from encrypted data • Centralized management with audit controls • Integration with HSMs • Roadmap: management of SSL certificates, SSH keys, tokens, passwords, Kerberos keytab files, and more
  121. 121. Redacted. Where: log files (e.g. SELECT * FROM customers WHERE ssn='123-45-6789'; log location set by hive.server2.logging.operation.log.location), HUE saved queries, audit logs. What: • Credit card numbers • Social security numbers • Email addresses • Server host names / IP
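Redaction of the items listed on this slide can be sketched with a few regular expressions. The patterns below are simplified assumptions for illustration and are not Cloudera's actual redaction rules; real rules would need stricter validation (for example, card checksums) and host name handling. The placeholder labels such as `[SSN]` are also hypothetical.

```python
import re

# Simplified, illustrative patterns; applied in order.
PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),
]


def redact(text: str) -> str:
    """Replace each sensitive-looking match with a placeholder label."""
    for pattern, label in PATTERNS:
        text = pattern.sub(label, text)
    return text


query = "SELECT * FROM customers WHERE ssn='123-45-6789'"
print(redact(query))
# → SELECT * FROM customers WHERE ssn='[SSN]'
```

Applying the same function to log lines, saved queries, and audit records keeps the sensitive literal out of every place the query text would otherwise be stored.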
  122. 122. Thank you. The modern platform for machine learning and analytics, optimized for the cloud

