SlideShare a Scribd company logo
1 of 37
Magnus.Runesson@svenskaspel.se
DataWorks Summit 2018-04-19
Practical experiences using Atlas and Ranger to
implement GDPR
2018-04-19 1
22018-04-19
Who is talking?
Magnus
Runesson
Data Engineer @ Svenska Spel
Developer
Ops
RDBMS
BigData
High performance
Gaming is for everyone´s enjoyment
42018-04-19
• This talk does not cover all we have done around GDPR
• This is NOT a way to say if you do this you are GDPR compliant.
• Some details are left out or simplified
Disclaimer
52018-04-19
• Why?
• Svenska Spel’s data warehouse
• Atlas & Ranger
• How did we implement it?
• Experiences and conclusions
Agenda
62018-04-19
GDPR requires
• clear purpose for PII data
• privacy by design
• clear consent or legal ground
• not to use/store PII if not needed
• people own their own data.
• penalty if not followed
Why?
72018-04-19
• Our customers and partners integrity is protected
• Users have only access to data aimed for current purpose
• Keep doing our required processing
• Adaptable for new requirements
• Maintainable solution
Goals
82018-04-19
Svenska Spel’s data warehouse
92018-04-19
• Moved from classic Cognos + Oracle
• HDP 2.6 using Hive
• Includes Personal Identifiable Information (PII)
• 300+ event streams in
• 150 published tables and views
Svenska Spel’s data warehouse
102018-04-19
• Used data are
• Understood
• Documented
• Modelled
• Modelled with Data Vault
• Oracle SQL Developer Data Modeler
• SQL code generated from model
Model based development
112018-04-19
• History tracking
• Uniquely linked
• Pattern based
• Easy to generate code
Data Vault
Link
Hub
Hub
Satellite
Satellite Satellite
12
CRM mart
ETL
Anonymization
DataLake
Integration
DataVault
Dimension
mart
ETL BI mart
Exasol
Tableau
Hadoop Presentation
Rolebasedaccess
CRM
Whitelisting
…
132018-04-19
Apache Atlas and Ranger
142018-04-19
• Metadata about resources
• Resource is
• Table
• Column
• Schema
• File on HDFS
• …
• Lineage
Apache Atlas
152018-04-19
• Tags have no meaning themselves
• Your business vocabulary define the meaning
• Example of tags:
• Business entity owning the data
• Indication of sensitive data
• The rules in Ranger enforces the policy
• Separate metadata from policy implementation
Atlas tags
162018-04-19
• Is user U allowed to do operation O on resource R?
• Access
• Row based filtering
• Masking
• Audit logging
• Resources referred with tags
Apache Ranger
172018-04-19
customer
Customer_id Name Postal_code Has_phone Marketing
1 Steve 12345 False False
2 Bill 54321 True False
3 Paul 54672 False True
Table in Hive before we started our work
182018-04-19
customer
Customer_id Name Postal_code Has_phone Marketing
1 Steve 12345 False False
2 Bill 54321 True False
3 Paul 54672 False True
PII_table
PII
Add PII tags on table and columns in Atlas.
No behaviour change.
PII
192018-04-19
customer
Customer_id Name Postal_code Has_phone Marketing
17 ABC 12345 False False
42 DEF 54321 True False
13 BDE 54672 False True
PII
We set a rule in Ranger to mask PII columns
Analyst view
PII_table
PII
202018-04-19
customer
Customer_id Name Postal_code Has_phone Marketing
3 Paul 54672 False True
PII
Ranger restrict our CRM user to only see rows with
Marketing = True
PII_table
PII
212018-04-19
How did we implement this?
222018-04-19
Development process
232018-04-19
• In-house tool
• Template based generation of SQL/HQL
• Generate files with tag-information
• Tables and columns respectively
HQL generator
HQL generator
CSV SQL
242018-04-19
schema;table;attribute;tags
dim_mart;customer_d;customer_id;PII,Sensitive
dim_mart;customer_d;has_phone;
Corresponding file for tables without attribute(column)
Tag file for columns
252018-04-19
• Hand coded of rules per tag
• Policy tool applies rule on all tables with the tag
• Can be different rules for different users
• Filter gets appended to where condition by Ranger
• Used for
• Row based filtering (access)
• Masking (anonymization)
• Catch all rule to deny access to tables not in our model
Ranger rules
262018-04-19
tag;groups;users;filter
PII_table;CRM_USERS;;EXISTS (SELECT 1
FROM customer_whitelist_crm whitelist_crm
WERE whitelist_crm.customer_id = $table.customer_id)
$table get replaced at deployment time
Example tag_row_policies.csv
272018-04-19
Deployment process
*.sql
table_tags.csv
column_tags.csv
tag_row_policies.csv
Policy tool - tag files
282018-04-19
• Makes it easy to manage
• Atlas tags
• Ranger policy rules
• Command line tool
• Consumes tags and policy CSV files
• Calls Atlas and Ranger API
• Less than 1000 rows of Python
Policytool
Policytool
CSV
292018-04-19
Put everything together
302018-04-19
Development process
*.sql
column_tags.csv
table_tags.csv
tag_row_policies.csv
312018-04-19
Deployment process
*.sql
column_tags.csv
table_tags.csv
tag_row_policies.csv
Policy tool - tag files
322018-04-19
Change in view of an Analyst
332018-04-19
• Simple and easy model
• Limited performance penalty
• Tag on table with masking rule => all columns masked
• Hard to understand API doc
• Restriction on Ranger row based filtering (not on tags)
• Row based filtering and masking not on direct file access
Experiences of Atlas and Ranger
342018-04-19
• Our customers and partners integrity is protected
• Users have only access to data aimed for current purpose
• Keep doing our required processing
• Adaptable for new requirements
• Maintainable solution
Reached Goals
352018-04-19
• Goals reached
• No SQL changes
• Scale when new datasets added
• Our data model is guaranteed in sync
• Better comments in Hive
• Minimal impact on ETL developers workflow
Conclusions
362018-04-19
• Make it as simple as possible
• Automate
• Know your tool
• Be clear on your authorization model
• Know your data
Takeaways
Magnus.Runesson@svenskaspel.se
@MRunesson
Thank you!
2018-04-19 37
karriar.svenskaspel.se

More Related Content

Similar to Practical experiences using Atlas and Ranger to implement GDPR - Dataworkssummit 2018

Data analytics and audit coverage guide
Data analytics and audit coverage guideData analytics and audit coverage guide
Data analytics and audit coverage guideAstalapulosListestos
 
Data analytics and audit coverage guide
Data analytics and audit coverage guideData analytics and audit coverage guide
Data analytics and audit coverage guideCenapSerdarolu
 
When SAP alone is not enough
When SAP alone is not enoughWhen SAP alone is not enough
When SAP alone is not enoughCloudera, Inc.
 
Integrate ERP and CRM Metadata into ER/Studio
Integrate ERP and CRM Metadata into ER/StudioIntegrate ERP and CRM Metadata into ER/Studio
Integrate ERP and CRM Metadata into ER/StudioDATAVERSITY
 
Melb nov17 Virtual Entity and auto number
Melb nov17 Virtual Entity and auto numberMelb nov17 Virtual Entity and auto number
Melb nov17 Virtual Entity and auto numberAndre Margono
 
Webinar Metalogix "Auf der Zielgeraden zur DSGVO!"
Webinar Metalogix "Auf der Zielgeraden zur DSGVO!"Webinar Metalogix "Auf der Zielgeraden zur DSGVO!"
Webinar Metalogix "Auf der Zielgeraden zur DSGVO!"Ragnar Heil
 
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...Amazon Web Services
 
20181212 AWS NL - Informatica Cloud Overview
20181212 AWS NL - Informatica Cloud Overview20181212 AWS NL - Informatica Cloud Overview
20181212 AWS NL - Informatica Cloud OverviewGreg Rakers
 
Accelerate Digital Transformation Through AI-powered Cloud Analytics Moderniz...
Accelerate Digital Transformation Through AI-powered Cloud Analytics Moderniz...Accelerate Digital Transformation Through AI-powered Cloud Analytics Moderniz...
Accelerate Digital Transformation Through AI-powered Cloud Analytics Moderniz...Amazon Web Services
 
Bilot SmartMDM breakfast session 8.3.2018
Bilot SmartMDM breakfast session 8.3.2018Bilot SmartMDM breakfast session 8.3.2018
Bilot SmartMDM breakfast session 8.3.2018Bilot
 
Sports alliance - Case Study: Data Driven Marketing in European Football
Sports alliance - Case Study: Data Driven Marketing in European FootballSports alliance - Case Study: Data Driven Marketing in European Football
Sports alliance - Case Study: Data Driven Marketing in European FootballBigDataExpo
 
001 More introduction to big data analytics
001   More introduction to big data analytics001   More introduction to big data analytics
001 More introduction to big data analyticsDendej Sawarnkatat
 
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...Amazon Web Services
 
Practical steps to GDPR compliance
Practical steps to GDPR compliance Practical steps to GDPR compliance
Practical steps to GDPR compliance Jean-Michel Franco
 
Company Profile - NPC with TIBCO Spotfire solution
Company Profile - NPC with TIBCO Spotfire solution  Company Profile - NPC with TIBCO Spotfire solution
Company Profile - NPC with TIBCO Spotfire solution Sirinporn Setworaya
 
Why an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business SuccessWhy an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business SuccessInformatica
 
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use CasesTatvic Analytics
 
Microsoft Dynamics strategy for small to medium size business: a new solution
Microsoft Dynamics strategy for small to medium size business: a new solutionMicrosoft Dynamics strategy for small to medium size business: a new solution
Microsoft Dynamics strategy for small to medium size business: a new solutionDXC Eclipse
 
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...DataWorks Summit
 

Similar to Practical experiences using Atlas and Ranger to implement GDPR - Dataworkssummit 2018 (20)

Data analytics and audit coverage guide
Data analytics and audit coverage guideData analytics and audit coverage guide
Data analytics and audit coverage guide
 
Data analytics and audit coverage guide
Data analytics and audit coverage guideData analytics and audit coverage guide
Data analytics and audit coverage guide
 
When SAP alone is not enough
When SAP alone is not enoughWhen SAP alone is not enough
When SAP alone is not enough
 
Integrate ERP and CRM Metadata into ER/Studio
Integrate ERP and CRM Metadata into ER/StudioIntegrate ERP and CRM Metadata into ER/Studio
Integrate ERP and CRM Metadata into ER/Studio
 
Melb nov17 Virtual Entity and auto number
Melb nov17 Virtual Entity and auto numberMelb nov17 Virtual Entity and auto number
Melb nov17 Virtual Entity and auto number
 
Webinar Metalogix "Auf der Zielgeraden zur DSGVO!"
Webinar Metalogix "Auf der Zielgeraden zur DSGVO!"Webinar Metalogix "Auf der Zielgeraden zur DSGVO!"
Webinar Metalogix "Auf der Zielgeraden zur DSGVO!"
 
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
Data Privacy & Governance in the Age of Big Data: Deploy a De-Identified Data...
 
Meetup Data-science OVH
Meetup Data-science OVHMeetup Data-science OVH
Meetup Data-science OVH
 
20181212 AWS NL - Informatica Cloud Overview
20181212 AWS NL - Informatica Cloud Overview20181212 AWS NL - Informatica Cloud Overview
20181212 AWS NL - Informatica Cloud Overview
 
Accelerate Digital Transformation Through AI-powered Cloud Analytics Moderniz...
Accelerate Digital Transformation Through AI-powered Cloud Analytics Moderniz...Accelerate Digital Transformation Through AI-powered Cloud Analytics Moderniz...
Accelerate Digital Transformation Through AI-powered Cloud Analytics Moderniz...
 
Bilot SmartMDM breakfast session 8.3.2018
Bilot SmartMDM breakfast session 8.3.2018Bilot SmartMDM breakfast session 8.3.2018
Bilot SmartMDM breakfast session 8.3.2018
 
Sports alliance - Case Study: Data Driven Marketing in European Football
Sports alliance - Case Study: Data Driven Marketing in European FootballSports alliance - Case Study: Data Driven Marketing in European Football
Sports alliance - Case Study: Data Driven Marketing in European Football
 
001 More introduction to big data analytics
001   More introduction to big data analytics001   More introduction to big data analytics
001 More introduction to big data analytics
 
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...
AWS Summit Singapore - Accelerate Digital Transformation through AI-powered C...
 
Practical steps to GDPR compliance
Practical steps to GDPR compliance Practical steps to GDPR compliance
Practical steps to GDPR compliance
 
Company Profile - NPC with TIBCO Spotfire solution
Company Profile - NPC with TIBCO Spotfire solution  Company Profile - NPC with TIBCO Spotfire solution
Company Profile - NPC with TIBCO Spotfire solution
 
Why an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business SuccessWhy an AI-Powered Data Catalog Tool is Critical to Business Success
Why an AI-Powered Data Catalog Tool is Critical to Business Success
 
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
[Webinar] Getting Started with BigQuery: Basics, Its Appilcations & Use Cases
 
Microsoft Dynamics strategy for small to medium size business: a new solution
Microsoft Dynamics strategy for small to medium size business: a new solutionMicrosoft Dynamics strategy for small to medium size business: a new solution
Microsoft Dynamics strategy for small to medium size business: a new solution
 
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
Artificial Intelligence and Analytic Ops to Continuously Improve Business Out...
 

Recently uploaded

RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 

Recently uploaded (20)

RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 

Practical experiences using Atlas and Ranger to implement GDPR - Dataworkssummit 2018

Editor's Notes

  1. Two weeks ago we had one of our regular syncs with Hortonworks. As many others; we discussed things around GDPR in different aspects. I did a short adhoc presentation about how we are using Apache Atlas and Ranger to support access control and anonymization in our GDPR work. Our sales representative Johan got very exited and asked me to come here and do a presentation. I will talk about how we managed to keep our development process for ETL developers intact and avoid to do any changes in our SQL-code thanks to Apache Atlas, Ranger and some integrations. We can now present different views of our data for different users. CRM users see only users opted in for marketing and analyst sees data that are anonymized. Atlas and Ranger in combination can feel very abstract and unclear how to apply them. My goal is to be as concrete as possible how we use the tools and that you feel it is time to get your hands dirty.
  2. Magnus Runesson 20+ years experience in Development, Operation and databases with high demand. Lately focus on big data and distributed scalable systems. Data Engineer @ Svenska Spel
  3. Svenska Spel is a gaming company owned by the state. In direct translation Swedish Games. We provide games on-line, in stores and in casinos in Sweden The state asks us and to deliver attractive games in a responsible manner and at the same time take a big responsibility on counteract gambling addiction. Svenska Spel´s vision is therefore gaming is for everyone´s enjoyment.
  4. Before I go into subject I want to point out that: This talk does not cover all we have done around GDPR This is NOT a way to say if you do this and you are GDPR compliant. Some details are left out or simplified
  5. I will start talk about Why we are doing this. To be able to understand what we have done I’ll give a introduction to our DW environment. Moving on to a short introduction to Atlas and Ranger. I assume most of you knows about it but to ensure we are on the same page. With that information as base I’ll talk about how we use Atlas and Ranger to control access to PII data at scale in a controlled and maintainable manner.
  6. The laws in Europe about how to handle personal information will on May 25th be harmonized within EU. There are several aspects how to handle PII that becomes stricter and we get a penalty if breaking them. PAUS
  7. The goals we sat up before starting this work were: Our customers and partners integrity is protected Users have only access to data granted for purpose of interest We can do our required processing Adaptable for new requirements Maintainable solution
  8. SvS coming from classical EDW. It started to become slow to load and we did not have the granularity of the event that we wanted. Decided to go towards Hadoop and with the final goal to do all BI work within the Hadoop echo system. We are heavy users of Hive. The advantage is that it is relatively simple for an ETL developer with SQL knowledge to be productive. As many EDW ours includes PII Input to the system we have more than 300 event streams. These are processed into more than 150 tables published in Tableau. Inevitably the amount of tables make us want to automate much of our development and deployment as possible. To do that we have taken a model based development approach.
  9. Model based development means for us that the data we consumes is well understood. We document this understanding as a step to support analysts. We are modelling input, output and intermediate steps using ER diagrams in Oracle SQL Developer Data Modeler. Or integration layer Is modeled using Data Vault methodology The model in Data Modeler are is used to generate all DDL code and some of our transformation logic. The generation works when we have simple business rules. Complex business rules is hand coded.
  10. The benefits of using Data Vault are: * History tracking – It is easy to see how relations between entities have changed * Uniquely linked – All links are 1-N * Pattern based – The former two gives us pattern how to formulate things * Easy to generate code – These patterns makes it easy to generate code to transform the D-V into a Kimball dimension model Join draw back
  11. Physically we have a multi layer architecture. All our in data from source systems goes into our Data Lake. The data in the Lake can be structured in many ways, sometimes direct copies of internal data structures from the source systems. We model this data using Data Vault in our Integration layer. To a start this was very much views from the lake tables but due to performance we have materialized most views today. Last step in the processing is to put our data into the dimension mart where the data is dimension modelled according to Kimball. This makes it easy to consume by any analysis tool. Earlier the dimension layer was directly copied over to our presentation layer, an in-memory database Exasol. With the requirements of GDPR we wanted to separate the data into different marts based on purpose. Nowdays we have these purpose based marts as an intermediate step. Example of two marts are our BI mart and CRM mart. BI mart is anonymized data for all our customers. On this we can do Business analysis to understand how our business perform. But we cannot identify the individual user. In the CRM mart we only include customers that have opted in for marketing and recommendations. We do also have a handful of other marts each with a different purposes. What data is going to what mart is controlled by our roll based access, handled by Ranger. This is how we govern the data to ensure each purpose based marts only get the data they should that is I will talk about the rest of the time. This is also where Apache Ranger and Atlas comes into play.
  12. Atlas handles metadata about our resources. It is not necessary limited to resources in Hadoop. Resources are Hive tables, columns, schemas. But also files on HDFS. We get metadata around creation, usage and deletion. As a user one can tag each resource with tags.
  13. These tags have no meaning them self. Your business vocabulary define the meaning Example of tags: Business entity owning the data Indication of sensitive data The rules in Ranger enforces the policy This separate metadata from policy implementation. It is easier to understand that a tag PII means it is personal identifiable information. How this is implemented is often not necessary to understand when modeling. Therefore the policies are implemented separately in Ranger. The rules can often be complex and not obvious why they are relevant for a table. The tags separates the problems from each other. Let us take a look on an example how this can be used. PAUS
  14. Let us take a look on what Ranger and Atlas are in this perspective. Is user U allowed to do operation O on resource R? Access Row based filtering Masking Audit logging 15 min
  15. We have a plain customer table in Hive with some information about our customers. We want to do two things with this table before we make it available to the user. Anonymize data that can identify a person. CRM are only allowed to see customers opted in for marketing (Most left column)
  16. We add three tags in Atlas to this table and its columns. Two tags on the two columns we consider PII. We also add a tag to the table so that we know it includes PII even if we not will use it right now. Users of the table will still see no change.
  17. We add a masking rule in Ranger using our own hashing algorithm with a good salt to anonymize our data. Only data that are tagged PII will be masked. When this is applied our user will see this table. We have now fulfilled our first goal of anonymize the data. This is the view an analyst sees that analysis how we perform in different areas.
  18. To implement our second goal that CRM shall only see customers that have opted in for marketing. We must add a second rule saying: CRM users are only allowed to see customers with Marketing=True This rule is an appended where condition to our queries. Now our table is ready for CRM to use. A challenge is that in Ranger you have to create one rule for every table. In our case with a very systematic naming we know that all tables with a column Marketing must have this rule. This may not be valid for for this example, but in other. We needed tooling support for this otherwise it is impossible to ensure the 150 tables are correct. We also wanted to take advantage having all tables described in Oracle Data Modeler. Why not add PII and Marketing info there too? We need tooling for this too. 18-20min
  19. We have now in general terms looked on the problem statement and how Apache Atlas and Ranger can solve it. We have also observed that it might be infeasible to scale for many tables and rules using Atlas and Ranger directly. Tooling is needed to do this in a maintainable manner.
  20. These insights of needed tooling lead us to the insight that we wanted a development process looking like this. The ETL developer add new tables, columns etc in Oracle Data modeler. Including information what is PII and some other metadata we want. The developer store the model in a GIT repository. He or She runs a tool to generate the SQL code and information which tables and columns has what tags. Some SQL may have to be written by hand, but no table creations. This is core to be able to tag our data. Also the generated code is checked into GIT. That way our GIT repository includes a full description of or ETL process. The code is pushed to Jenkins and automatically deployed on Hive according to our release process. Exceptionally new rules who can access what data must be added. We went with the decision to store these in the repository too, since it is our developers creating the rules. Others might have other requirements. This process is very similar for the developer to what they had before. Only new thing they normally need to do is tagging the data within Oracle Data Modeler. PAUS
  21. To generate the code we have an in-house tool called HQLgenerator. It reads the files from Data Modeler and generates SQL. The generation is based on templates and three files defining what to generate and how to generate transformations. These files includes source and destination tables and if any transformation is needed. We have had this tool for about 2 years. This was the natural place to also generate the tagging information for our tables and columns. Easy to change SQL dialect. If you think about it. Describing what tags you want on a table is very much the same as describing the create statement of the table. Both are about data definition.
  22. We added functionality to generate two CSV files. One for columns and one for tables. Here you see an example of a file for columns. It includes schema, table, column (called attribute) and what tags it shall have. One line per column. It is very simplistic. The files for tables looks almost the same. Independent of end tool. We have considered using for out in-memory database Exasol too.
  23. As earlier said. It is not enough to set tags on the data. We also need rules. We did not find any good way to have them in our model but we wanted to have them in our repository for auditing purposes. The ranger rules are hand coded. One line per tag, user and rule. The tags will be expanded by our tooling to all tables having that tag. These rules is then added to Raynor. We use it for both row based filtering and masking. A catch all rule to deny access to tables not in our Model
  24. The Ranger policy file is close to as simple as the tag files. One specify the tag we want to have a rule for. What groups and/or uses this rule applies and finally our rule. Note that we have $table in the rule. This will when inserted to raynor be changed to the table name. In Raynor we will get one rule per table having the tag PII_table. Note that we can do this this way since we know that each table with PII information have a customer_id thanks to our structured and pattern based generation of SQL. In this example you can see how we trust that we have a customer_id in all our PII_tables. We will change this in a second iteration, we made it too simple.
  25. Okay we have now created all files that are created or generated by our ETL developer. It is deployed and becomes part of a job in Airflow, our scheduler. This job is for create or modify tables. It does three steps. Create all tables from the SQL files. Take the tag files and tag our tables in Atlas. This is done with a small command line tool called policytool. Our policies are applied to Ranger using policytool too. The not so much magic is handled by the policytool we have developed. The two last steps is repeated every day to verify no manual changes have been done in neither Raynor nor Atlas.
  26. The purpose of the Policytool is to make it simple to add and manage tags and policies in Atlas and Ranger. We can thanks to it handle our tags and rules within GIT. It is easy to audit and find if any unintended changes have happened. It is very simple. It consumes our CSV files and applies them to the Atlas and Ranger API. In the case of policy rules they are expanded to one rule per table instead of one per tag. The tool warns us if there are unknown tables or columns. Normal user cannot access tables that does not exist in our data model, thanks to this. All in all the tool is right now less than 1000 rows of Python code. 30 MIN
  27. To wrap it up. We have an simple and clear development process. All tables are modeled in a graphical tool, Oracle SQL Developer Data Modeler. Even metadata such as PII is added there. The full description of our model inclusive, policies and metadata are stored in GIT. Very much generated from our model. It is automatically deployed when merged.
  28. The content of the GIT repo ends up in Airflow that schedules our runs. Our policies and tags are running daily. Table changes are only run when needed.
  29. The behavior of applying tags and rules can be show as this. The analysts view of our data is now restricted according to our Policies. From our earlier example. The analyst can still see all data but PII is anonymized. The analyst have no need of name or other PII. She is interested in grouping on postal code or any other market segment. The CRM user sees only users opted in for marketing. This reduces risk of sending marketing to customers not asking for it.
  30. So what do we think about Atlas and Ranger. Both tools have simple and easy models. The performance penalty feels fair. There are some pitfalls like: If you put a tag on a table and then add a masking policy to that tag. All fields in the table will be masked. The most negative thing with the tools is that the API doc is hard to understand. There are several versions of the APIs. One in ranger not documented at all, but turned out to work best. Easiest way to understand the API was to wiretap the GUI. For us the restriction on Ranger row based filtering not be able to select tables to filter based on a tag was a drawback. We solved it with our tooling. In the end we would probably wanted that tool anyway. Row based filtering and masking not on direct file access It is not a big surprise from a technical perspective. But that forced us to materialize our purposed based marts so we could import them to the in-memory database Exasol.
  31. Lets evaluate how we reached our goals. With the work I have described using Atlas, Ranger, our data model. Extension of our HQLgenerator and creation of our policytool we reached our goal. With this, and other measures we feel that we protect our customers and partners integrity. Unauthorized users can not access data they should not. We can still do our processing, without any changes in our jobs. We are confident that we can adapt to new requirements. The solution is easy to maintain.
  32. In total the impact was that we reached our goal. The solution scales when we add new data sets. The change in our existing workflow was minimal. We also got some advantages we did not foresaw when started. In our data model we have a lot of comments that are valuable documentation. Since this work ensure us that our model is in sync. Short cuts cannot be taken we now have better documentation within Hive of our datamodel. Being in sync between model and Hive also helps further development.
  33. Finally some takeaways I want to give you if you will do something similar: Make it as simple as possible. It will be complex enough anyway. Think automatization. It eases the burden when in production. Helps you scale. But it also increases your quality and auditability. The number of policy rules we create, need to review and audit is less than 1 of 100 of those we actually have in Ranger. Be sure you understand the tool and how the limitations will impact you. This comes in proximity to that you should have a clear authorization model. Avoid complex rules like: If this, but not that or that do this but only when etc. Of course the last is obvious. Know your data. Be systematic about it. Always call your customer id customer_id not sometimes c_id or just id.
  34. Thank you! Fell free to contact me if you have any questions. I hope we have a few minutes for questions left.