Big Data Everywhere Chicago: The Big Data Imperative -- Discovering & Protecting Sensitive Data in Hadoop (Dataguise)

Download to read offline

Today, no industry is immune from a potential data breach and the havoc it can create. According to a 2013 Global Data Breach study by the Ponemon Institute, the average cost of data loss exceeds $5.4 million per breach, and the average per person cost of lost data is approaching $200 per record in the US. Protecting sensitive data in Hadoop is now the imperative for IT and business. With the emergence of Hadoop as a business-critical data platform, Hadoop offers organizations opportunities to improve performance, better understand customers and develop a competitive advantage. But reaching these desirable analytic outcomes depends on the ability to use data without exposing the organization to unnecessary risk. This presentation will cover best practices for a data-centric security, compliance and data governance approach, with a particular focus on two customer use cases within the financial services and insurance industries. You'll learn how these companies are reducing their security exposure through automated data-centric protection of sensitive data in Hadoop.

  1. Discovering & Protecting Sensitive Data in Hadoop
     jeremy@dataguise.com
     © 2014 Dataguise Inc. All rights reserved.
  2. Goals For Today
     • Big Data for banking, healthcare, tech, govt, education, etc. needs data security (but few have workable approaches in production today)
     • Hadoop security approaches (what works and doesn't work from the past, challenges in the present)
     • Real-world case studies (data-centric protection):
       » Credit card security
       » Healthcare data lake (Data-as-a-Service)
       » Product analytics in the cloud
  3. Market Overview
  4. Data Growth
     • 100% growth and 80% unstructured data by 2015 … finding and classifying sensitive data will get harder
     (chart scale: exabytes)
  5. Real-World Unstructured Data Scenarios
     • Web comment fields and customer surveys, CRM data
     • Patient and doctor medical data in emails, PDFs, doctor's notes
     • Voice-to-txt files in Hadoop for customer service optimization
     • Log data from wellheads and oil drilling sensors
     • Web e-commerce pay systems
  6. The Importance of Automation
     • From 2012 to 2020 (the next 6-8 yrs), enterprise Big Data will grow 7500%
     • IT headcount for Big Data will grow 1.5x
  7. Why Security in Big Data: use cases by vertical (Refine, Explore, Enrich)
     • Retail & Web: Log Analysis / Site Optimization; Social Network Analysis; Dynamic Pricing; Session & Content Optimization
     • Retail: Loyalty Program Optimization; Brand & Sentiment Analysis; Dynamic Pricing / Targeted Offer
     • Intelligence: Threat Identification; Person of Interest Discovery; Cross-Jurisdiction Queries
     • Finance: Risk Modeling & Fraud Identification; Trade Performance Analytics; Surveillance & Fraud Detection; Customer Risk Analysis; Real-time upsell and cross-sales marketing offers
     • Energy: Smart Grid Production Optimization; Grid Failure Prevention; Smart Meters; Individual Power Grid
     • Manufacturing: Supply Chain Optimization; Customer Churn Analysis; Dynamic Delivery; Replacement Parts
     • Healthcare & Payer: Electronic Medical Records (EMPI); Clinical Trials Analysis; Insurance Premium Determination
  8. Why Security in Big Data (continued)
     • The same vertical use-case matrix as the previous slide, overlaid with the sensitive data each use case touches: Privacy data, PCI or Financial data, and Personal Health (PHI)
  9. Three Critical Considerations
     1. Ensuring compliance
        • The Big Ps (PCI, HIPAA, Privacy), data residency, FERPA, FISMA, FERC, etc.
        • 1,200 laws in 63 countries
     2. Reducing breach risk
     3. Quantifying both
        1. How much sensitive data? ("un-announced")
        2. Who is adding? (ad hoc user directories)
        3. Who is accessing? (sharing, selling, re-purposing)
  10. The Evolution of Hadoop Projects
      • Lab Project: Hadoop as R&D; strictly data science; zero $$$ or selection of distribution; zero recognition of sensitive data or exposure
      • Proof Stage: achieving value; data-lake cost savings; line-of-business ownership; nodal expansion; security elements? (unknown to InfoSec)
      • ROI Validity: ROI and TCO validity; distribution selection and purchase; the security "A-Ha" moment; solved with legacy or penalty-box Hadoop
      • On-Demand Hadoop: full-scale production; ad hoc new uses; go faster (Spark, Kafka); security sanctified
  11. On-Demand Hadoop
      • Without adequate sensitive data protection, customers are left to "penalty boxing" Hadoop
        » "Security zones" imposed by InfoSec
        » Slows business; costly and cumbersome
      • Data-centric protection can set those assets free
  12. Data Protection in Hadoop
  13. Security in Hadoop, In Summary
      • Like Cloud, Mobile, Virtualization… Big Data drives fundamentally new rules in security
        » Ad hoc computing, wide-open data sets
        » Extended users and usages, sharing and selling
        » 3 Vs moving to 6 Vs (automation, non-blocking)
      • Problem #1 is compliance
        » Reporting/auditing/monitoring is as important as (or more important than) data security
      • Data-centric protection can help
  14. Hadoop Security Framework
      • The 4 approaches to address security within Hadoop:
        » Perimeter: guarding access to the cluster itself. Technical concepts: authentication, network isolation
        » Data: protecting data in the cluster from unauthorized visibility. Technical concepts: encryption, tokenization, data masking
        » Access: defining what users and applications can do with data. Technical concepts: permissions, authorization
        » Visibility: reporting on where data came from and how it's being used. Technical concepts: auditing, lineage
      • Dataguise discovers & protects at the data layer and provides visibility for audit reporting and data lineage
  15. Kerberos on Hadoop
      • Kerberos (developed at MIT) has been the de-facto standard for strong authentication/authz
        » Protects against user and service spoofing attacks, and allows enforcement of user HDFS access permissions
      • What does Kerberos do?
        » Establishes identity for clients, hosts, and services
        » Prevents impersonation; passwords are never sent over the wire
        » Tickets grant cryptographic "permissions" to resources
      • Kerberos has been at the core of authentication in native Apache Hadoop since 2010
        » Used for access to ecosystem services (HDFS, JT, Oozie) and for server-to-server traffic auth, etc. BUT complex to manage!
        » Lots of steps; for example: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.3.0/CDH4-Security-Guide/cdh4sg_topic_3.html
  16. MapR Improvements on Auth/Authz
      • Vastly simpler
        » No requirement for Kerberos in core
        » Identity represented using a ticket issued by MapR CLDB servers (Container Location DataBase)
        » Core services secured by default
      • Easier integration
        » User identity independent of host or operating system
        » Local to MapR (no external Kerberos required)
      • Faster
        » Leverages Intel-accelerated hardware crypto
  17. Elements of Data-Centric Protection
      1. Identify which elements you want to protect via:
         » Delimiters (structured data), name-value pairs (semi-structured), or a data discovery service (unstructured)
      2. Automatically apply protection via:
         » Format-preserving encryption (FPE)
         » Masking (replace, randomize, intellimask, static)
         » Redaction (nullify)
      3. Audit strategy
         » Sensitive data protection/access/lineage
  18. Discovery
      • Within HDFS
        » Search for sensitive data per company policy: PII, PCI, …
        » Handle complex data types such as addresses
        » Process incrementally (the default) to handle only new content
      • In-flight
        » Process data on the fly as it is ingested into Hadoop HDFS
        » Plug-in solution for FTP, Flume, Sqoop
        » Search for sensitive data per policy: PII, PCI, HIPAA, …
        » NEXT UP: Kafka
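The incremental ("only the new content") behavior described above is commonly implemented with a watermark: persist the timestamp of the last completed scan and visit only files modified since then. A minimal local-filesystem sketch under that assumption (this is not Dataguise's implementation, and the `.dg_last_scan` state file is hypothetical; a production version would walk HDFS via its API rather than `os.walk`):

```python
import os
import time

WATERMARK_FILE = ".dg_last_scan"  # hypothetical per-directory state file

def load_watermark(root: str) -> float:
    """Return the timestamp of the last completed scan (0.0 if none)."""
    path = os.path.join(root, WATERMARK_FILE)
    try:
        with open(path) as f:
            return float(f.read().strip())
    except (OSError, ValueError):
        return 0.0

def incremental_scan(root: str, scan_file) -> int:
    """Scan only files modified since the previous run, then advance the watermark."""
    watermark = load_watermark(root)
    started = time.time()
    scanned = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name == WATERMARK_FILE:
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) > watermark:
                scan_file(path)  # delegate to the sensitive-data scanner
                scanned += 1
    with open(os.path.join(root, WATERMARK_FILE), "w") as f:
        f.write(str(started))  # record when this pass began, not when it ended
    return scanned
```

Recording the start time (rather than the end time) as the new watermark keeps files that were written mid-scan eligible for the next pass.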
  19. How Discovery Works
      • MapReduce or Flume/FTP/Sqoop agent
        » Root directories and drill-downs
        » Can scan the entire dataset or run incrementally (watermarking)
      • Runs pattern, logic, context, algorithm, and ontology filters
      • Can utilize white/black lists and reference sets
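The combination of pattern and algorithm filters mentioned above can be illustrated with credit card detection: a regex proposes candidate digit sequences, and a Luhn checksum (the algorithm filter) discards random numbers that merely look like cards. A generic sketch of the idea, not the product's actual detector:

```python
import re

# Pattern filter: 13-16 digit sequences, optionally separated by spaces or dashes.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def luhn_valid(candidate: str) -> bool:
    """Algorithm filter: the Luhn checksum weeds out random digit strings."""
    digits = [int(c) for c in candidate if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Pattern filter proposes candidates; algorithm filter confirms them."""
    return [m.group() for m in CARD_PATTERN.finditer(text)
            if luhn_valid(m.group())]
```

Layering the checksum on top of the regex is what keeps false-positive rates workable on large unstructured datasets; context and ontology filters would narrow results further.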
  20. Protection Measures
      • A protection plan should start with cutting
        » What data can we delete/cut?
        » What data can be redacted?
      • Masking choices
        » Consistency
        » Realistic-looking data
        » Partial reveal (Intellimask), e.g. Credit Card # 4541 **** **** 3241
      • What data needs reversibility?
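The partial-reveal example above (4541 **** **** 3241) keeps the first and last four digits for matching and support workflows while hiding the middle. A small illustrative function, borrowing the slide's "Intellimask" name for clarity; this is not the vendor's code:

```python
def intellimask(value: str, keep_first: int = 4, keep_last: int = 4,
                mask_char: str = "*") -> str:
    """Mask the middle digits of a value, preserving separators and the
    first/last few digits for partial reveal."""
    digits = [c for c in value if c.isdigit()]
    n = len(digits)
    out = []
    di = 0  # index among digits only, so separators pass through untouched
    for c in value:
        if c.isdigit():
            if di < keep_first or di >= n - keep_last:
                out.append(c)          # revealed digit
            else:
                out.append(mask_char)  # masked digit
            di += 1
        else:
            out.append(c)              # keep spaces/dashes as-is
    return "".join(out)
```

Because separators are preserved, the masked value keeps the original format, which matters when downstream parsers expect card-shaped fields.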
  21. Encryption "vs" Masking
      • Encryption:
        + Reversible
        + Trusted, with security proofs
        + The first hammer
        + De-centralized architectures
        - Complex
        - Key management
        - Useless without robust authentication and authorization
        - Data value destruction
        - Needs both encrypt and decrypt tooling
      • Masking:
        + Highest security
        + Realistic data
        + Range- and value-preserving
        + Once and done
        + Scale-out and distributed
        + No performance impact on usage
        + Zero need for authentication, authorization, or key management
        - Not as well marketed
        - Not reversible
  22. Encryption "vs" Masking (continued)
      • Masking:
        + Highest security
        + Realistic data
        + Range- and value-preserving
        + Format-preserving and partial reveals
        + Scale-out and distributed
        + No performance impact on usage
        + Zero need for authentication, authorization, or key management
        - Not as well marketed
        - Not reversible
        - Perceived to grow data
      • Encryption:
        + Reversible
        + Trusted, with security proofs
        + Format-preserving and partial reveals
        + Scale-out and distributed
        + The first hammer
        + De-centralized architectures
        - Complex
        - Key management
        - Useless without robust authentication and authorization
        - Data value destruction
      • The fundamental decision between masking and encryption comes down to reversibility:
        » Some elements in analytics must resolve to the original (e.g. 66.249.22.145 or $34,332.12)
        » Some elements are ideal for pseudonyms: Social Security numbers, credit card numbers, names
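One way the consistency property claimed for masking is often achieved is a keyed one-way mapping such as an HMAC: the masking engine holds the secret, equal inputs always yield equal pseudonyms (so joins and distinct-counts on the masked data still work), and downstream consumers never need a key because the mapping is not meant to be reversed. A hypothetical sketch for SSNs, not the product's algorithm (the `SECRET` key and function name are illustrative):

```python
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical per-deployment masking key, held by the engine

def pseudonymize_ssn(ssn: str) -> str:
    """Deterministically map an SSN to an SSN-shaped pseudonym.

    Same input -> same output, so de-identified records remain joinable,
    but the mapping cannot be inverted without the key."""
    digest = hmac.new(SECRET, ssn.encode(), hashlib.sha256).digest()
    # Fold the digest down to 9 digits so the output keeps the SSN shape
    # (format-preserving in shape only, not true FPE).
    num = int.from_bytes(digest[:8], "big") % 10**9
    s = f"{num:09d}"
    return f"{s[:3]}-{s[3:5]}-{s[5:]}"
```

Note the trade-off the slide alludes to: folding to 9 digits admits rare collisions, which true format-preserving encryption avoids at the cost of key management.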
  23. Real-World Performance
      • Leveraging the power of MapReduce to run distributed encryption or masking
      • Data volume: 2.2 TB
      • Run time: 23 min
      • Sensitive data: 8 of 50 columns across 2.2 Bn rows
      • Run on a 360-node MapR system
      • In old-world database technology, this type of job would have taken days/week(s)
  24. Audit Strategy
      • Essential to all goals: compliance, breach protection, visibility and metrics
      • Avoids the "gotcha" moment
        » Show all sensitive elements (count, location)
        » Remediation applied
        » Dashboard for fast access to critical policies, with drill-downs for file and user action
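The "count, location" reporting described above can be as simple as rolling per-file detector hits into a per-policy summary that a dashboard can drill into. A hypothetical shape for that aggregation (the tuple format is an assumption, not Dataguise's actual output):

```python
from collections import defaultdict

def summarize_findings(findings):
    """Aggregate discovery results for audit reporting.

    findings: iterable of (policy, file_path, hit_count) tuples, e.g.
    produced by a discovery scan. Returns
    {policy: {"total": n, "files": {path: n}}} -- totals for the dashboard,
    per-file counts for the drill-down."""
    summary = defaultdict(lambda: {"total": 0, "files": defaultdict(int)})
    for policy, path, hits in findings:
        summary[policy]["total"] += hits
        summary[policy]["files"][path] += hits
    # Freeze the nested defaultdicts into plain dicts for callers.
    return {p: {"total": v["total"], "files": dict(v["files"])}
            for p, v in summary.items()}
```

Persisting such summaries per scan run is what lets an auditor show, over time, how much sensitive data existed and what remediation was applied.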
  25. How It Works: Detection and Protection, In-Flight or @Rest
      • Sources: RDBMS, transaction data, data warehouse, web site, FTP
      1. Data discovery and protection while being loaded into HDFS
         » DgFlume agent plug-in and DgSqoop agent: detect sensitive data, then protect by applying masking/encryption policies
      2. Data masked or encrypted in HDFS with a MapReduce job
         » DgHDFS agent (Hadoop API: discover/mask/encrypt): detect sensitive data, then protect by applying masking/encryption policies
      3. Users can now access data
         » DgHive, HDFS bulk decryption, or a Java app (Hadoop API): selective decryption based on user/role and policy
      • Outside the cluster, DgDiscover-Masker detects in-database (Oracle, SQL…, SharePoint, files) and protects by applying the same masking/encryption policies
  26. Case Studies
  27. Case Study: Protecting Sensitive Data in a Top Credit Card Firm
      • Source data: credit card transactions and Omniture files
      • Objectives:
        » Consolidate existing payment risk analysis inside high-scale, lower-cost Hadoop
        » Provide tiered access authorization for multiple business apps (fraud, risk, cross-sell)
      • Solution:
        » MapR Hadoop as a single, reliable, high-performance data analysis platform
        » Dataguise consistent masking enables analysis and unique index key values for de-identified data
        » Unique ability to output protected data in an adjacent column, or appended with a delimiter inside the existing column, protecting data while governing access via authorization rules
        » Incremental updates to HDFS are automatically protected; selective access to sensitive data is based on role and app
      • Results/Benefits:
        » Continuous real-time protection (the job runs every 5 mins on ingest)
        » Analytics draws on the secure purchasing data of 90 million credit card holders across 127 countries
  28. Case Study: Protecting Personal Health Info (PHI) in an Aggregate Data Lake
      • Ingest: health records and SQL data arrive over FTP through the DG FTP agent into HDFS; authorization is controlled through group membership in Active Directory
      • Objectives:
        » Reduce costly and preventable readmissions, decrease mortality rates, and improve the quality of life for patients
        » Internal data service model: DAaaS (Data Architecture as a Service)
      • Solution:
        » Protect structured and unstructured source data in database, data warehouse, and flat-file structures
        » Customer required customization of encryption and key management to fit their existing corporate infrastructure and security policies
        » Dataguise dashboard gives admins an easy way to identify directories/files containing sensitive data
      • Results/Benefits:
        » A cost-effective and easy way to determine where sensitive data resides within the cluster, and how it's been protected
        » Seamless access to encrypted data from a variety of data access methods (Hive, Pig, analytic tools)
  29. Case Study: Global Tech Product Analytics
      • Objectives:
        » Aggregate logging data (product, usage, user configuration) for all smartphones worldwide
        » De-identify personal user info to ensure privacy and compliance with European/US privacy regulations
      • Solution:
        » Customer routes all device logging data from smartphone device log collectors into 7 global AWS clouds (Korea, Singapore, US (3), UK, and Ireland)
        » Uses the Dataguise Flume agent (Apache Flume) to protect all sensitive data being written to Amazon S3
        » Runs virtualized DG Secure in AWS, and utilizes Dataguise EMR security agents to selectively decrypt for authorized analytics in AWS
      • Results/Benefits:
        » On-demand Hadoop for product analytics, user behavior, and supply chain optimization
        » High scale-out and high performance were paramount
        » 100% cloud-based security
  30. Hadoop Data-Protection Checklist
      ✓ Discover sensitive data
      ✓ Automate protective measures
      ✓ Integrate into Hadoop authorization
      ✓ With continuous real-time tracking
      ✓ Dashboards, reports
      ✓ Auditing
      ✓ Automated risk assessment/scoring
      ✓ Automated inference protection (roadmap)
  31. Thank You
      Jeremy Stieglitz, VP Products
      jeremy@dataguise.com
