SlideShare a Scribd company logo
1 of 35
PROTECT YOUR PRIVATE DATA
WITH ORC COLUMN
ENCRYPTION
Owen O’Malley
owen@cloudera.com
September 2019
@owen_omalley
© 2019 Cloudera, Inc. All rights reserved. 2
WHO AM I?
• First committer added to Hadoop
 MapReduce
 Scaling
 Security
• Hive
 ACID transactions
• ORC
 Creator
SECURITY AND DATA PROTECTION
IN HADOOP
© 2019 Cloudera, Inc. All rights reserved. 4
EXAMPLE DATA LAKE SCENARIO
Marketing
Demographics
Electronic
medical records
CRM
POS
(Structured)(Structured) (Structured) (Structured) (Structured)
Cluster 1: Dublin Cluster 2: San Francisco
(Unstructured)(Unstructured)(Unstructured)
Cluster 3: Prague
(Structured)
On Premise Data Lakes
(Unstructured)(Structured) (Unstructured) (Structured)
Cloud Data Lakes
Social
Weblogs & Feeds
Transactional
Mobile
IoT
Personal Data
© 2019 Cloudera, Inc. All rights reserved. 5
DIFFERENCES IN THE BIG DATA CONTEXT
• Breaking down silos: fantastic for analytics, but security challenges
 Centralized data lake with multi-tenancy requires secure (and easy) authentication and fine-grained
authorization
• Data democratization and the Data Scientist role (often a data super user with
elevated privileges)
• Data is maintained over a long duration
• Cloud and Hybrid architectures spanning data center and (multiple) public clouds
further broaden the attack surface area and present novel authentication and
authorization challenges
• Along with adherence to security fundamentals and defense in-depth, a data-centric
approach to security becomes critical
Watch Towers
Limited Entry Points
Moat
Kerberos
Securing your data lake
High Hard Walls
Check Identity
Inner Walls
Firewall
Encryption, TLS, Key
Trustee, Navigator
Encrypt, Ranger KMS
LDAP/AD
Apache Knox: AuthN, API
Gateway, Proxy, SSO
Apache Ranger : ABAC
AuthZ, Audits,
Anonymization
Apache Metron:
Detection
© 2019 Cloudera, Inc. All rights reserved. 7
DATA PROTECTION IN HADOOP
• Must be applied in three different levels
Storage – encrypt data at rest
• HDFS encryption zones
• Volume encryption
Transmission – encrypt data in motion
• SSL
Upon access – apply restrictions when accessed
• Ranger dynamic masking and row filtering
Dynamic Row Filtering & Column Masking WithApache Ranger &Apache Hive
User 2: Ivanna
Location : EU
Group: HR
User 1: Joe
Location : US
Group: Analyst
Original Query:
SELECT country, nationalid,
ccnumber, mrn, name FROM
ww_customers
Country National ID CC No DOB MRN Name Policy ID
US 232323233 4539067047629850 9/12/1969 8233054331 John Doe nj23j424
US 333287465 5391304868205600 8/13/1979 3736885376 Jane Doe cadsd984
Germany T22000129 4532786256545550 3/5/1963 876452830A Ernie Schwarz KK-2345909
Country National ID CC No MRN Name
US xxxxx3233 4539 xxxx xxxx
xxxx
null John Doe
US xxxxx7465 5391 xxxx xxxx
xxxx
null Jane Doe
Ranger Policy Enforcement
Query Rewritten based on Dynamic Ranger
Policies:
Filter rows by region & apply relevant column
masking
Users from US Analyst group see data for
US persons with CC and National ID
(SSN) as masked values and MRN is
nullified
Country National ID Name MRN
Germany T22000129 Ernie Schwarz 876452830A
EU HR Policy Admins can see
unmasked but are restricted by
row filtering policies to see data
for EU persons only
Original Query:
SELECT country,
nationalid, name, mrn
FROM ww_customers
Analysts
HR Marketing
Ranger
WHAT ARE WE FIXING?
© 2019 Cloudera, Inc. All rights reserved. 10
FRAMING THE PROBLEM…
• Related data, different security requirements
• Authorization – who can see it
• Audit – track who read it
• Encrypt on disk – regulatory
• File-level (or blob) granularity isn’t enough
• File systems don’t understand columns
© 2019 Cloudera, Inc. All rights reserved. 11
REQUIREMENTS
• Readers should transparently decrypt data
If and only if the user has access to the key
The data must be decrypted locally
Old readers must not break!
• Columns are only decrypted as necessary
• Master keys must be managed securely
Support for Key Management Server & hardware
Support for key rolling
EARLIER WORKAROUNDS
© 2019 Cloudera, Inc. All rights reserved. 13
PARTIAL SOLUTION – HDFS ENCRYPTION ZONES
• Transparent HDFS Encryption
• Encryption zones
• HDFS directory trees
• Unique master key for each zone
• Client decrypts data
• Key Management via KeyProvider API
© 2019 Cloudera, Inc. All rights reserved. 14
HDFS ENCRYPTION ZONE LIMITATIONS
• Very coarse protection
• Only entire directory subtrees
• No ability to protect columns
• A lot of users need access to keys
• Moves between zones is painful
• When writing with Hive, data is moved multiple times per a query
© 2019 Cloudera, Inc. All rights reserved. 15
PARTIAL SOLUTION – HIVE SERVER 2
• Limit access to warehouse data to Hive
• Only “hive” user has HDFS access
• Breaks Hadoop’s multi-paradigm data access
• Many customers use both Hive & Spark
• JDBC isn’t a distributed protocol
• Funneling large data through a small pipe
• Spark Data Warehouse Connector to LLAP fixes this
© 2019 Cloudera, Inc. All rights reserved. 16
PARTIAL SOLUTION – SEPARATE TABLES
• Split private information out of tables
• Separate directories in HDFS
• HDFS and/or HS2 authorization
• Enables HDFS encryption
• Limitations
• Need to join with other tables
• Higher operational overhead
© 2019 Cloudera, Inc. All rights reserved. 17
PARTIAL SOLUTION – ENCRYPTION UDF
• Hive has user defined functions
• aes_encrypt and aes_decrypt
• Limitations
• Key management is problematic
• Encryption is not seeded
• Size of value leaks information
THE WINNER IS …
© 2019 Cloudera, Inc. All rights reserved. 19
COLUMNAR ENCRYPTION
• Columnar file formats, such as ORC and Parquet
• Write data in columns
• Column projection
• Better compression
• Encryption works really well
• Only encrypt bytes for column
• Can store multiple variants of data
© 2019 Cloudera, Inc. All rights reserved. 20
ORC FILE FORMAT
© 2019 Cloudera, Inc. All rights reserved. 21
USER EXPERIENCE
• Set table properties for encryption
• orc.encrypt = ”pii:ssn,email;credit:card_info”
• orc.mask = “sha256:card_info”
• Define where to get the encryption keys
• Configuration defines the key provider via URI
© 2019 Cloudera, Inc. All rights reserved. 22
KEY MANAGEMENT
• Create a master key for each use case
• “pii”, “pci”, or “hipaa”
• Each column in each file uses unique local key
• Allows audit of which users read which files
• Ranger policies limit access to keys
• Who, What, When, Where
© 2019 Cloudera, Inc. All rights reserved. 23
KEY PROVIDER API
• Provides limited access to encryption keys
• Encrypts or decrypts local keys
• Users are never given master keys
• Key versions and key rolling of master keys
• Allows 3rd party plugins
• Supports Cloud, Hadoop or Ranger KMS
© 2019 Cloudera, Inc. All rights reserved. 24
ENCRYPTION DATA FLOW
© 2019 Cloudera, Inc. All rights reserved. 25
ENCRYPTION FLOW
• Local key
• Random for each encrypted column in file
• Encrypted w/ master key by KMS
• Encrypted local key is stored in file metadata
• IV is generated to be unique
• Column, kind, stripe, & counter
© 2019 Cloudera, Inc. All rights reserved. 26
STATIC DATA MASKING
• What happens without key access?
• Define static masks
• Nullify – all values become null
• Redact – mask values ‘Xxxxx Xxxxx!’
• Can define ranges to unmask
• SHA256 – replace with SHA256
• Custom - user defined
© 2019 Cloudera, Inc. All rights reserved. 27
DATA ANONYMIZATION
• Anonymization is hard!
• AOL search logs
• Netflix prize datasets
• NYC taxi dataset
• Always evaluate security tradeoffs
• Tokenization is a useful technique
• Assigns arbitrary replacements
© 2019 Cloudera, Inc. All rights reserved. 28
KEY DISPOSAL
• Often need to keep data for 90 days
• Currently the data is written twice
• With column encryption:
• Roll keys daily
• Delete master key after 90 days
© 2019 Cloudera, Inc. All rights reserved. 29
ORC ENCRYPTION DESIGN
• Write both variants of streams
• Masked unencrypted
• Unmasked encrypted
• Encrypt both data and statistics
• Maintain compatibility for old readers
• Read unencrypted variant
• Preserve ability to seek in file
© 2019 Cloudera, Inc. All rights reserved. 30
ORC WRITE PIPELINE
• Streams go through pipeline
• Run length encoding
• Compression (zlib, snappy, lzo, lz4, zstd, or none)
• Encryption
• Encryption is AES/CTR
• Allows seek
• No padding
CONCLUSIONS
© 2019 Cloudera, Inc. All rights reserved. 32
CONCLUSIONS
• ORC column encryptions provides
Transparent encryption
Multi-paradigm column security
Compatible with old readers
Static masking
Audit logging (via KMS logging)
• Supports file merging
• Released in ORC 1.6
© 2019 Cloudera, Inc. All rights reserved. 33
INTEGRATION WITH OTHER TOOLS
• Hive & Spark
No change other than defining table properties
• Apache Hive’s LLAP
Cache and fast processing of SQL queries
Column encryption changes internal interfaces
Cache both encrypted and unencrypted variants
Ensure audit log reflects end-user and what they accessed
© 2019 Cloudera, Inc. All rights reserved. 34
LIMITATIONS
• Need encryption policy for write
Current Atlas & Ranger tags lag data
Auto-discovery requires pre-access
• Changes to masking policy
Need to re-write files
• Need additional data masks
Credit card, addresses, etc.
• Decrypted local keys could be saved
THANK YOU
Owen O’Malley
owen@cloudera.com
@owen_omalley

More Related Content

What's hot

Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019Timothy Spann
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseAldrin Piri
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFSEdureka!
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and visionStephan Ewen
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorFlink Forward
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMatillion
 
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGateContinuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGateMichael Rainey
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkTimo Walther
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsCloudera, Inc.
 
Seamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive WarehouseSeamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive WarehouseDataWorks Summit
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiLev Brailovskiy
 
Introduction to data flow management using apache nifi
Introduction to data flow management using apache nifiIntroduction to data flow management using apache nifi
Introduction to data flow management using apache nifiAnshuman Ghosh
 
An overview of snowflake
An overview of snowflakeAn overview of snowflake
An overview of snowflakeSivakumar Ramar
 

What's hot (20)

Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019
 
Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San JoseDataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
Dataflow with Apache NiFi - Apache NiFi Meetup - 2016 Hadoop Summit - San Jose
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Flink history, roadmap and vision
Flink history, roadmap and visionFlink history, roadmap and vision
Flink history, roadmap and vision
 
Introducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes OperatorIntroducing the Apache Flink Kubernetes Operator
Introducing the Apache Flink Kubernetes Operator
 
Master the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - SnowflakeMaster the Multi-Clustered Data Warehouse - Snowflake
Master the Multi-Clustered Data Warehouse - Snowflake
 
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGateContinuous Data Replication into Cloud Storage with Oracle GoldenGate
Continuous Data Replication into Cloud Storage with Oracle GoldenGate
 
CDC Stream Processing with Apache Flink
CDC Stream Processing with Apache FlinkCDC Stream Processing with Apache Flink
CDC Stream Processing with Apache Flink
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
HBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table SnapshotsHBaseCon 2013: Apache HBase Table Snapshots
HBaseCon 2013: Apache HBase Table Snapshots
 
Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoop
 
Seamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive WarehouseSeamless replication and disaster recovery for Apache Hive Warehouse
Seamless replication and disaster recovery for Apache Hive Warehouse
 
Nifi
NifiNifi
Nifi
 
Data ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFiData ingestion and distribution with apache NiFi
Data ingestion and distribution with apache NiFi
 
Introduction to data flow management using apache nifi
Introduction to data flow management using apache nifiIntroduction to data flow management using apache nifi
Introduction to data flow management using apache nifi
 
An overview of snowflake
An overview of snowflakeAn overview of snowflake
An overview of snowflake
 

Similar to Protect your private data with ORC column encryption

Fine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionFine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionOwen O'Malley
 
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionProtect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionDataWorks Summit
 
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...Denodo
 
Hadoop security implementationon 20171003
Hadoop security implementationon 20171003Hadoop security implementationon 20171003
Hadoop security implementationon 20171003lee tracie
 
Security implementation on hadoop
Security implementation on hadoopSecurity implementation on hadoop
Security implementation on hadoopWei-Chiu Chuang
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartchCloudera, Inc.
 
Transparent Encryption in HDFS
Transparent Encryption in HDFSTransparent Encryption in HDFS
Transparent Encryption in HDFSDataWorks Summit
 
Overcoming the Challenges of Architecting for the Cloud
Overcoming the Challenges of Architecting for the CloudOvercoming the Challenges of Architecting for the Cloud
Overcoming the Challenges of Architecting for the CloudZscaler
 
cisco csr1000v
cisco csr1000vcisco csr1000v
cisco csr1000vMing914298
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Asug84339 how to secure privacy data in a hybrid s4 hana landscape
Asug84339   how to secure privacy data in a hybrid s4 hana landscapeAsug84339   how to secure privacy data in a hybrid s4 hana landscape
Asug84339 how to secure privacy data in a hybrid s4 hana landscapeDharma Atluri
 
Where to Store the Cloud Encryption Keys - InterOp 2012
Where to Store the Cloud Encryption Keys - InterOp 2012Where to Store the Cloud Encryption Keys - InterOp 2012
Where to Store the Cloud Encryption Keys - InterOp 2012Trend Micro
 
Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionCloudera, Inc.
 
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...Cloudera, Inc.
 
IRJET- Survey of Cryptographic Techniques to Certify Sharing of Informati...
IRJET-  	  Survey of Cryptographic Techniques to Certify Sharing of Informati...IRJET-  	  Survey of Cryptographic Techniques to Certify Sharing of Informati...
IRJET- Survey of Cryptographic Techniques to Certify Sharing of Informati...IRJET Journal
 
CipherCloud for Any App
CipherCloud for Any AppCipherCloud for Any App
CipherCloud for Any AppCipherCloud
 
Fighting cyber fraud with hadoop
Fighting cyber fraud with hadoopFighting cyber fraud with hadoop
Fighting cyber fraud with hadoopNiel Dunnage
 
Fiware - communicating with ROS robots using Fast RTPS
Fiware - communicating with ROS robots using Fast RTPSFiware - communicating with ROS robots using Fast RTPS
Fiware - communicating with ROS robots using Fast RTPSJaime Martin Losa
 

Similar to Protect your private data with ORC column encryption (20)

Fine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionFine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column Encryption
 
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionProtect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
 
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
 
Hadoop security implementationon 20171003
Hadoop security implementationon 20171003Hadoop security implementationon 20171003
Hadoop security implementationon 20171003
 
Security implementation on hadoop
Security implementation on hadoopSecurity implementation on hadoop
Security implementation on hadoop
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
 
Transparent Encryption in HDFS
Transparent Encryption in HDFSTransparent Encryption in HDFS
Transparent Encryption in HDFS
 
LDSS for mobile cloud
LDSS for mobile cloud  LDSS for mobile cloud
LDSS for mobile cloud
 
Overcoming the Challenges of Architecting for the Cloud
Overcoming the Challenges of Architecting for the CloudOvercoming the Challenges of Architecting for the Cloud
Overcoming the Challenges of Architecting for the Cloud
 
cisco csr1000v
cisco csr1000vcisco csr1000v
cisco csr1000v
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Asug84339 how to secure privacy data in a hybrid s4 hana landscape
Asug84339   how to secure privacy data in a hybrid s4 hana landscapeAsug84339   how to secure privacy data in a hybrid s4 hana landscape
Asug84339 how to secure privacy data in a hybrid s4 hana landscape
 
Where to Store the Cloud Encryption Keys - InterOp 2012
Where to Store the Cloud Encryption Keys - InterOp 2012Where to Store the Cloud Encryption Keys - InterOp 2012
Where to Store the Cloud Encryption Keys - InterOp 2012
 
Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solution
 
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
 
IRJET- Survey of Cryptographic Techniques to Certify Sharing of Informati...
IRJET-  	  Survey of Cryptographic Techniques to Certify Sharing of Informati...IRJET-  	  Survey of Cryptographic Techniques to Certify Sharing of Informati...
IRJET- Survey of Cryptographic Techniques to Certify Sharing of Informati...
 
CipherCloud for Any App
CipherCloud for Any AppCipherCloud for Any App
CipherCloud for Any App
 
Fighting cyber fraud with hadoop
Fighting cyber fraud with hadoopFighting cyber fraud with hadoop
Fighting cyber fraud with hadoop
 
Fiware - communicating with ROS robots using Fast RTPS
Fiware - communicating with ROS robots using Fast RTPSFiware - communicating with ROS robots using Fast RTPS
Fiware - communicating with ROS robots using Fast RTPS
 

More from Owen O'Malley

Running An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemRunning An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemOwen O'Malley
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACIDOwen O'Malley
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetOwen O'Malley
 
Strata NYC 2018 Iceberg
Strata NYC 2018  IcebergStrata NYC 2018  Iceberg
Strata NYC 2018 IcebergOwen O'Malley
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetOwen O'Malley
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopOwen O'Malley
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersOwen O'Malley
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to HiveOwen O'Malley
 
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013Owen O'Malley
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File IntroductionOwen O'Malley
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduceOwen O'Malley
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroOwen O'Malley
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopOwen O'Malley
 

More from Owen O'Malley (19)

Running An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemRunning An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid Them
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACID
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
 
Strata NYC 2018 Iceberg
Strata NYC 2018  IcebergStrata NYC 2018  Iceberg
Strata NYC 2018 Iceberg
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache Hadoop
 
Data protection2015
Data protection2015Data protection2015
Data protection2015
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop Clusters
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to Hive
 
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
 
ORC Files
ORC FilesORC Files
ORC Files
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File Introduction
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 Intro
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in Hadoop
 

Recently uploaded

While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 

Recently uploaded (20)

While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 

Protect your private data with ORC column encryption

  • 1. PROTECT YOUR PRIVATE DATA WITH ORC COLUMN ENCRYPTION Owen O’Malley owen@cloudera.com September 2019 @owen_omalley
  • 2. © 2019 Cloudera, Inc. All rights reserved. 2 WHO AM I? • First committer added to Hadoop  MapReduce  Scaling  Security • Hive  ACID transactions • ORC  Creator
  • 3. SECURITY AND DATA PROTECTION IN HADOOP
  • 4. © 2019 Cloudera, Inc. All rights reserved. 4 EXAMPLE DATA LAKE SCENARIO Marketing Demographics Electronic medical records CRM POS (Structured)(Structured) (Structured) (Structured) (Structured) Cluster 1: Dublin Cluster 2: San Francisco (Unstructured)(Unstructured)(Unstructured) Cluster 3: Prague (Structured) On Premise Data Lakes (Unstructured)(Structured) (Unstructured) (Structured) Cloud Data Lakes Social Weblogs & Feeds Transactional Mobile IoT Personal Data
  • 5. © 2019 Cloudera, Inc. All rights reserved. 5 DIFFERENCES IN THE BIG DATA CONTEXT • Breaking down silos: fantastic for analytics, but security challenges  Centralized data lake with multi-tenancy requires secure (and easy) authentication and fine-grained authorization • Data democratization and the Data Scientist role (often a data super user with elevated privileges) • Data is maintained over a long duration • Cloud and Hybrid architectures spanning data center and (multiple) public clouds further broaden the attack surface area and present novel authentication and authorization challenges • Along with adherence to security fundamentals and defense in-depth, a data-centric approach to security becomes critical
  • 6. Watch Towers Limited Entry Points Moat Kerberos Securing your data lake High Hard Walls Check Identity Inner Walls Firewall Encryption, TLS, Key Trustee, Navigator Encrypt, Ranger KMS LDAP/AD Apache Knox: AuthN, API Gateway, Proxy, SSO Apache Ranger : ABAC AuthZ, Audits, Anonymization Apache Metron: Detection
  • 7. © 2019 Cloudera, Inc. All rights reserved. 7 DATA PROTECTION IN HADOOP • Must be applied in three different levels Storage – encrypt data at rest • HDFS encryption zones • Volume encryption Transmission – encrypt data in motion • SSL Upon access – apply restrictions when accessed • Ranger dynamic masking and row filtering
  • 8. Dynamic Row Filtering & Column Masking WithApache Ranger &Apache Hive User 2: Ivanna Location : EU Group: HR User 1: Joe Location : US Group: Analyst Original Query: SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers Country National ID CC No DOB MRN Name Policy ID US 232323233 4539067047629850 9/12/1969 8233054331 John Doe nj23j424 US 333287465 5391304868205600 8/13/1979 3736885376 Jane Doe cadsd984 Germany T22000129 4532786256545550 3/5/1963 876452830A Ernie Schwarz KK-2345909 Country National ID CC No MRN Name US xxxxx3233 4539 xxxx xxxx xxxx null John Doe US xxxxx7465 5391 xxxx xxxx xxxx null Jane Doe Ranger Policy Enforcement Query Rewritten based on Dynamic Ranger Policies: Filter rows by region & apply relevant column masking Users from US Analyst group see data for US persons with CC and National ID (SSN) as masked values and MRN is nullified Country National ID Name MRN Germany T22000129 Ernie Schwarz 876452830A EU HR Policy Admins can see unmasked but are restricted by row filtering policies to see data for EU persons only Original Query: SELECT country, nationalid, name, mrn FROM ww_customers Analysts HR Marketing Ranger
  • 9. WHAT ARE WE FIXING?
  • 10. © 2019 Cloudera, Inc. All rights reserved. 10 FRAMING THE PROBLEM… • Related data, different security requirements • Authorization – who can see it • Audit – track who read it • Encrypt on disk – regulatory • File-level (or blob) granularity isn’t enough • File systems don’t understand columns
  • 11. © 2019 Cloudera, Inc. All rights reserved. 11 REQUIREMENTS • Readers should transparently decrypt data If and only if the user has access to the key The data must be decrypted locally Old readers must not break! • Columns are only decrypted as necessary • Master keys must be managed securely Support for Key Management Server & hardware Support for key rolling
  • 13. © 2019 Cloudera, Inc. All rights reserved. 13 PARTIAL SOLUTION – HDFS ENCRYPTION ZONES • Transparent HDFS Encryption • Encryption zones • HDFS directory trees • Unique master key for each zone • Client decrypts data • Key Management via KeyProvider API
  • 14. © 2019 Cloudera, Inc. All rights reserved. 14 HDFS ENCRYPTION ZONE LIMITATIONS • Very coarse protection • Only entire directory subtrees • No ability to protect columns • A lot of users need access to keys • Moves between zones is painful • When writing with Hive, data is moved multiple times per a query
  • 15. © 2019 Cloudera, Inc. All rights reserved. 15 PARTIAL SOLUTION – HIVE SERVER 2 • Limit access to warehouse data to Hive • Only “hive” user has HDFS access • Breaks Hadoop’s multi-paradigm data access • Many customers use both Hive & Spark • JDBC isn’t a distributed protocol • Funneling large data through a small pipe • Spark Data Warehouse Connector to LLAP fixes this
  • 16. © 2019 Cloudera, Inc. All rights reserved. 16 PARTIAL SOLUTION – SEPARATE TABLES • Split private information out of tables • Separate directories in HDFS • HDFS and/or HS2 authorization • Enables HDFS encryption • Limitations • Need to join with other tables • Higher operational overhead
  • 17. © 2019 Cloudera, Inc. All rights reserved. 17 PARTIAL SOLUTION – ENCRYPTION UDF • Hive has user defined functions • aes_encrypt and aes_decrypt • Limitations • Key management is problematic • Encryption is not seeded • Size of value leaks information
  • 19. © 2019 Cloudera, Inc. All rights reserved. 19 COLUMNAR ENCRYPTION • Columnar file formats, such as ORC and Parquet • Write data in columns • Column projection • Better compression • Encryption works really well • Only encrypt bytes for column • Can store multiple variants of data
  • 20. © 2019 Cloudera, Inc. All rights reserved. 20 ORC FILE FORMAT
  • 21. © 2019 Cloudera, Inc. All rights reserved. 21 USER EXPERIENCE • Set table properties for encryption • orc.encrypt = ”pii:ssn,email;credit:card_info” • orc.mask = “sha256:card_info” • Define where to get the encryption keys • Configuration defines the key provider via URI
  • 22. © 2019 Cloudera, Inc. All rights reserved. 22 KEY MANAGEMENT • Create a master key for each use case • “pii”, “pci”, or “hipaa” • Each column in each file uses unique local key • Allows audit of which users read which files • Ranger policies limit access to keys • Who, What, When, Where
  • 23. © 2019 Cloudera, Inc. All rights reserved. 23 KEY PROVIDER API • Provides limited access to encryption keys • Encrypts or decrypts local keys • Users are never given master keys • Key versions and key rolling of master keys • Allows 3rd party plugins • Supports Cloud, Hadoop or Ranger KMS
  • 24. © 2019 Cloudera, Inc. All rights reserved. 24 ENCRYPTION DATA FLOW
  • 25. © 2019 Cloudera, Inc. All rights reserved. 25 ENCRYPTION FLOW • Local key • Random for each encrypted column in file • Encrypted w/ master key by KMS • Encrypted local key is stored in file metadata • IV is generated to be unique • Column, kind, stripe, & counter
  • 26. © 2019 Cloudera, Inc. All rights reserved. 26 STATIC DATA MASKING • What happens without key access? • Define static masks • Nullify – all values become null • Redact – mask values ‘Xxxxx Xxxxx!’ • Can define ranges to unmask • SHA256 – replace with SHA256 • Custom - user defined
  • 27. © 2019 Cloudera, Inc. All rights reserved. 27 DATA ANONYMIZATION • Anonymization is hard! • AOL search logs • Netflix prize datasets • NYC taxi dataset • Always evaluate security tradeoffs • Tokenization is a useful technique • Assigns arbitrary replacements
  • 28. © 2019 Cloudera, Inc. All rights reserved. 28 KEY DISPOSAL • Often need to keep data for 90 days • Currently the data is written twice • With column encryption: • Roll keys daily • Delete master key after 90 days
  • 29. © 2019 Cloudera, Inc. All rights reserved. 29 ORC ENCRYPTION DESIGN • Write both variants of streams • Masked unencrypted • Unmasked encrypted • Encrypt both data and statistics • Maintain compatibility for old readers • Read unencrypted variant • Preserve ability to seek in file
  • 30. © 2019 Cloudera, Inc. All rights reserved. 30 ORC WRITE PIPELINE • Streams go through pipeline • Run length encoding • Compression (zlib, snappy, lzo, lz4, zstd, or none) • Encryption • Encryption is AES/CTR • Allows seek • No padding
  • 32. © 2019 Cloudera, Inc. All rights reserved. 32 CONCLUSIONS • ORC column encryptions provides Transparent encryption Multi-paradigm column security Compatible with old readers Static masking Audit logging (via KMS logging) • Supports file merging • Released in ORC 1.6
  • 33. © 2019 Cloudera, Inc. All rights reserved. 33 INTEGRATION WITH OTHER TOOLS • Hive & Spark No change other than defining table properties • Apache Hive’s LLAP Cache and fast processing of SQL queries Column encryption changes internal interfaces Cache both encrypted and unencrypted variants Ensure audit log reflects end-user and what they accessed
  • 34. © 2019 Cloudera, Inc. All rights reserved. 34 LIMITATIONS • Need encryption policy for write Current Atlas & Ranger tags lag data Auto-discovery requires pre-access • Changes to masking policy Need to re-write files • Need additional data masks Credit card, addresses, etc. • Decrypted local keys could be saved