Fine Grained Access Control for Big Data: ORC Column Encryption
Owen O’Malley
owen@cloudera.com
@owen_omalley
March 2019
Srikanth Venkat
svenkat@cloudera.com
@srikvenk
2 © Hortonworks Inc. 2011 – 2019. All Rights Reserved
Who Are We?
• Owen
• Worked on Hadoop since Jan 2006
• MapReduce, Security, Hive, and ORC
• Founder & Technical Fellow
• Srikanth
• Senior Director, Product Management (Security & Governance portfolio)
• Apache Ranger, Apache Knox, Apache Atlas, ODPi
• Security, Data Stewardship, Metadata, Governance areas
Security & Data Protection in Hadoop
Example Data Lake Scenario
[Diagram: data sources (Marketing, Demographics, Electronic medical records, CRM, POS, Social, Weblogs & Feeds, Transactional, Mobile, IoT) held as structured and unstructured data in on-premise data lakes and cloud data lakes. Clusters shown: Cluster 1: Dublin, Cluster 2: San Francisco, Cluster 3: Prague. The diagram is labeled "Personal Data".]
What’s different about the Big Data context?
• Breaking down silos: fantastic for analytics, but leads to increased security challenges
• Centralized data lake with multi-tenancy requires secure (and easy) authentication and fine-grained authorization
• Data democratization and the Data Scientist role (often a data superuser with elevated privileges)
• Data is maintained over a long duration
• Cloud and hybrid architectures spanning data center and (multiple) public clouds further broaden the attack surface area and present novel authentication and authorization challenges
• Along with adherence to security fundamentals and defense in depth, a data-centric approach to security becomes critical
Securing your data lake
[Diagram: a castle metaphor for layered defenses. Castle elements: Watch Towers, Limited Entry Points, Moat, High Hard Walls, Check Identity, Inner Walls.]
• Kerberos
• Firewall
• LDAP/AD
• Encryption, TLS, Key Trustee, Navigator Encrypt, Ranger KMS
• Apache Knox: AuthN, API Gateway, Proxy, SSO
• Apache Ranger: ABAC AuthZ, Audits, Anonymization
• Apache Sentry: RBAC AuthZ
Data Protection in Hadoop
Data protection must be applied at three different layers in Apache Hadoop:
• Storage: encrypt data while it is at rest
• Transparent Data Encryption in HDFS, Navigator Key Trustee, Navigator Encrypt, Ranger KMS + HSM, Partner Products (HPE Voltage, Protegrity, Dataguise)
• Transmission: encrypt data as it is in motion
• Wire encryption (TLS, SASL,..)
• Upon Access: apply restrictions when accessed
• Apache Ranger (Dynamic Column Masking + Row Filtering), Partner Masking + Encryption
Encryption of Data in Hadoop
Volume encryption
• Protects data after physical theft or accidental loss of a disk volume.
• Entire volume is encrypted: very coarse-grained security.
• Does not protect against viruses or other attacks that occur while a system is running.
Application-level encryption
• Encryption within an application running on top of Hadoop.
• Supports a higher level of granularity and prevents "rogue admin" access.
• Adds a layer of complexity to the application architecture.
HDFS data-at-rest encryption
• Encrypts selected files and directories stored ("at rest") in HDFS.
• Uses specially designated HDFS directories known as "encryption zones."
• End-to-end encryption of data read from and written to HDFS.
• HDFS does not have access to unencrypted data or keys.
Dynamic Row Filtering & Column Masking With Apache Ranger & Apache Hive
Source table (ww_customers):
Country  National ID  CC No             DOB        MRN         Name           Policy ID
US       232323233    4539067047629850  9/12/1969  8233054331  John Doe       nj23j424
US       333287465    5391304868205600  8/13/1979  3736885376  Jane Doe       cadsd984
Germany  T22000129    4532786256545550  3/5/1963   876452830A  Ernie Schwarz  KK-2345909
Ranger Policy Enforcement
• Query rewritten based on dynamic Ranger policies: filter rows by region & apply relevant column masking
User 1: Joe (Location: US, Group: Analyst)
Original Query: SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers
Country  National ID  CC No                MRN   Name
US       xxxxx3233    4539 xxxx xxxx xxxx  null  John Doe
US       xxxxx7465    5391 xxxx xxxx xxxx  null  Jane Doe
• Users from the US Analyst group see data for US persons, with CC and National ID (SSN) as masked values and MRN nullified
User 2: Ivanna (Location: EU, Group: HR)
Original Query: SELECT country, nationalid, name, mrn FROM ww_customers
Country  National ID  Name           MRN
Germany  T22000129    Ernie Schwarz  876452830A
• EU HR policy admins can see unmasked values but are restricted by row filtering policies to see data for EU persons only
Groups shown in the diagram: Analysts, HR, Marketing
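The effect of row filtering plus column masking can be mimicked in a few lines of Python. This is only a sketch of the idea, not how Ranger or Hive implement it: the policy table, the group names, and the helpers (`POLICIES`, `run_query`, `show_last4`, `mask_card`) are hypothetical, and the set of EU countries is an assumption made for the example.

```python
# Rows from the slide's ww_customers table (only the queried columns).
ROWS = [
    {"country": "US", "nationalid": "232323233", "ccnumber": "4539067047629850",
     "mrn": "8233054331", "name": "John Doe"},
    {"country": "US", "nationalid": "333287465", "ccnumber": "5391304868205600",
     "mrn": "3736885376", "name": "Jane Doe"},
    {"country": "Germany", "nationalid": "T22000129", "ccnumber": "4532786256545550",
     "mrn": "876452830A", "name": "Ernie Schwarz"},
]

EU_COUNTRIES = {"Germany"}  # assumption: just enough for this example


def show_last4(value):
    """Mask everything except the last four characters."""
    return "x" * (len(value) - 4) + value[-4:]


def mask_card(value):
    """Keep the first four digits of a 16-digit card number, mask the rest."""
    return value[:4] + " xxxx xxxx xxxx"


def nullify(value):
    """Replace the value with null."""
    return None


# Hypothetical policies: one row filter and a set of column masks per group.
POLICIES = {
    "us_analyst": {
        "row_filter": lambda row: row["country"] == "US",
        "masks": {"nationalid": show_last4, "ccnumber": mask_card, "mrn": nullify},
    },
    "eu_hr": {
        "row_filter": lambda row: row["country"] in EU_COUNTRIES,
        "masks": {},
    },
}


def run_query(columns, group):
    """Apply the group's row filter, then mask each selected column."""
    policy = POLICIES[group]
    result = []
    for row in ROWS:
        if not policy["row_filter"](row):
            continue
        masked = {}
        for column in columns:
            mask = policy["masks"].get(column)
            masked[column] = mask(row[column]) if mask else row[column]
        result.append(masked)
    return result
```

Calling `run_query(["country", "nationalid", "ccnumber", "mrn", "name"], "us_analyst")` yields the two masked US rows, while the `"eu_hr"` group sees the unmasked German row only.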
Framing the Problem
• Related data, different security requirements
• Authorization – who can see it
• Audit – track who read it
• Encrypt on disk – regulatory
• File-level (or blob) granularity isn’t enough
• File systems don’t understand columns
Requirements
• Readers should transparently decrypt data
• If and only if the user has access to the key
• The data must be decrypted locally
• Columns are only decrypted as necessary
• Master keys must be managed securely
• Support for Key Management Server & hardware
• Support for key rolling
Partial Solutions
Partial Solution – HDFS Encryption
• Transparent HDFS Encryption
• Encryption zones
• HDFS directory trees
• Unique master key for each zone
• Client decrypts data
• Key Management via KeyProvider API
HDFS Encryption Limitations
• Very coarse protection
• Only entire directory subtrees
• No ability to protect columns
• A lot of users need access to keys
• Moving data between zones is painful
• When writing with Hive, data is moved multiple times per query
Hive Server 2 Limitations
• Limits access to Hive SQL
• Only user ‘hive’ has access
• Breaks Hadoop’s multi-paradigm data access
• Many customers use both Hive & Spark
• JDBC is not distributed
• New Spark to LLAP connector addresses this
Partial Solution – Separate tables
• Split private information out of tables
• Separate directories in HDFS
• HDFS and/or HS2 authorization
• Enables HDFS encryption
• Limitations
• Need to join with other tables
• Higher operational overhead
Partial Solution – Encryption UDF
• Hive has user defined functions
• aes_encrypt and aes_decrypt
• Limitations
• Key management is problematic
• Encryption is not seeded
• Size of value leaks information
Solution
Columnar Encryption
• Columnar file formats (e.g., ORC)
• Write data in columns
• Column projection
• Better compression
• Encryption works really well
• Only encrypt bytes for column
• Can store multiple variants of data
ORC File Format
[Diagram: an ORC file is a sequence of ~200 MB stripes followed by the File Footer and Postscript. Each stripe contains Index Data, Row Data, and a Stripe Footer. Index data and row data are laid out by column (Column 1 through Column 8), and a column's data is stored as streams (labeled Stream 2.1 through Stream 2.4).]
User Experience
• Set table properties for encryption
• orc.encrypt.pii = "ssn,email"
• orc.encrypt.credit = "card_info"
• Define where to get the encryption keys
• Configuration defines the key provider via URI
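One way to read these table properties is as a mapping from each encrypted column to the name of the master key that protects it. The sketch below assumes the property naming shown on the slide (`orc.encrypt.<key name> = <column list>`); the exact syntax may differ in the ORC release in use, and `columns_to_keys` is a hypothetical helper, not part of ORC or Hive.

```python
def columns_to_keys(table_properties):
    """Map each encrypted column to its master key name.

    Assumes properties of the form orc.encrypt.<key> = "col1,col2"
    as shown on the slide. Other properties are ignored.
    """
    prefix = "orc.encrypt."
    mapping = {}
    for prop, value in table_properties.items():
        if not prop.startswith(prefix):
            continue
        key_name = prop[len(prefix):]
        for column in value.split(","):
            column = column.strip()
            if not column:
                continue
            if column in mapping:
                # A column can only be encrypted with one key.
                raise ValueError(f"column {column!r} listed under two keys")
            mapping[column] = key_name
    return mapping


# Example properties: the two from the slide plus an unrelated one.
props = {
    "orc.encrypt.pii": "ssn,email",
    "orc.encrypt.credit": "card_info",
    "orc.compress": "ZLIB",
}
column_keys = columns_to_keys(props)
```

Here `column_keys` maps `ssn` and `email` to the `pii` key and `card_info` to the `credit` key, which is the information a writer needs before it can request local keys.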
Key Management
• Create a master key for each use case
• “pii”, “pci”, or “hipaa”
• Each column in each file uses a unique local key
• Allows audit of which users read which files
• Ranger policies limit access to keys
• Who, What, When, Where
KeyProvider API
• Provides limited access to encryption keys
• Encrypts or decrypts local keys
• Users are never given master keys
• Key versions and key rolling of master keys
• Allows 3rd party plugins
• Supports Cloud, Hadoop or Ranger KMS
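The division of labor described above (the key server wraps and unwraps local keys, while master keys never leave it) can be illustrated with a toy in-memory stand-in. Everything here is hypothetical: `ToyKeyProvider` and its methods are not the Hadoop KeyProvider API, the access list is a plain set rather than a Ranger policy, and an HMAC-SHA256 pad XORed with the local key stands in for real AES key wrapping. Do not use it to protect data.

```python
import hashlib
import hmac
import secrets


class ToyKeyProvider:
    """In-memory stand-in for a KMS. Master keys never leave this object."""

    def __init__(self):
        self._versions = {}   # key name -> list of master key versions
        self._acl = {}        # key name -> users allowed to use the key
        self.audit_log = []   # (user, operation, key name, version, allowed)

    def create_key(self, name, allowed_users):
        self._versions[name] = [secrets.token_bytes(32)]
        self._acl[name] = set(allowed_users)

    def roll_key(self, name):
        """Add a new master key version; old versions still decrypt old files."""
        self._versions[name].append(secrets.token_bytes(32))

    def _authorize(self, user, name, operation, version):
        allowed = user in self._acl[name]
        self.audit_log.append((user, operation, name, version, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not use key {name!r}")

    def _pad(self, name, version, nonce):
        # Stand-in for AES: a 16-byte pad derived from the master key.
        master = self._versions[name][version]
        return hmac.new(master, nonce, hashlib.sha256).digest()[:16]

    def generate_encrypted_key(self, user, name):
        """Return a fresh local key and its wrapped form for the file metadata."""
        version = len(self._versions[name]) - 1
        self._authorize(user, name, "generate", version)
        local_key = secrets.token_bytes(16)
        nonce = secrets.token_bytes(16)
        pad = self._pad(name, version, nonce)
        wrapped = bytes(a ^ b for a, b in zip(local_key, pad))
        return local_key, (version, nonce, wrapped)

    def decrypt_encrypted_key(self, user, name, encrypted):
        """Unwrap a local key read from file metadata, if the user is allowed."""
        version, nonce, wrapped = encrypted
        self._authorize(user, name, "decrypt", version)
        pad = self._pad(name, version, nonce)
        return bytes(a ^ b for a, b in zip(wrapped, pad))
```

Because every unwrap request passes through `_authorize`, the audit log records which user asked for which key version, including denied attempts.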
Encryption Data Flow
Encryption Flow
• Local key
• Random for each encrypted column in file
• Encrypted w/ master key by KMS
• Encrypted local key is stored in file metadata
• IV is generated to be unique
• Column, kind, stripe, & counter
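A unique IV can be built by packing the four fields named above into a single 16-byte value, so that no two streams encrypted under the same local key ever share a counter block. The sketch below is illustrative: the field widths (3 bytes for the column, 2 for the stream kind, 3 for the stripe, 8 for the counter) are assumptions chosen to fill one AES block, and `make_iv` is a hypothetical helper.

```python
def make_iv(column_id, stream_kind, stripe_id, counter=0):
    """Pack column, kind, stripe, and counter into a 16-byte IV.

    Field widths are assumptions for this sketch. int.to_bytes raises
    OverflowError if a value does not fit in its field.
    """
    return (
        column_id.to_bytes(3, "big")
        + stream_kind.to_bytes(2, "big")
        + stripe_id.to_bytes(3, "big")
        + counter.to_bytes(8, "big")
    )


# The same column and stream kind in two different stripes get different IVs.
iv_stripe0 = make_iv(column_id=2, stream_kind=1, stripe_id=0)
iv_stripe1 = make_iv(column_id=2, stream_kind=1, stripe_id=1)
```

Keeping the counter in the low-order bytes lets a reader compute the IV for any block of a stream directly from its position, which is what makes seeking possible.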
Static Data Masking
• What happens without key access?
• Define static masks
• Nullify – all values become null
• Redact – mask values ‘Xxxxx Xxxxx!’
• Can define ranges to unmask
• SHA256 – replace with SHA256
• Custom – user defined
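The built-in masks can be sketched as small functions. The redaction rule used here (uppercase letters become `X`, lowercase letters become `x`, digits become `9`, everything else is kept) is an assumption chosen so that the output matches the ‘Xxxxx Xxxxx!’ example; the function names and the `unmask_ranges` parameter are hypothetical.

```python
import hashlib


def nullify(value):
    """Nullify: every value becomes null."""
    return None


def redact(value, unmask_ranges=()):
    """Redact: replace characters by class, keeping any unmasked ranges.

    unmask_ranges is a sequence of (start, end) index pairs, end exclusive.
    """
    out = []
    for i, ch in enumerate(value):
        if any(start <= i < end for start, end in unmask_ranges):
            out.append(ch)          # position is explicitly unmasked
        elif ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)          # punctuation and spaces are kept
    return "".join(out)


def sha256_mask(value):
    """SHA256: replace the value with the hex digest of its UTF-8 bytes."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()
```

For example, `redact("Hello World!")` gives `"Xxxxx Xxxxx!"`, and passing `unmask_ranges=[(0, 4)]` keeps the first four characters of a value visible.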
Data Masking
• Anonymization is hard!
• AOL search logs
• Netflix prize datasets
• NYC taxi dataset
• Always evaluate security tradeoffs
• Tokenization is a useful technique
• Assign arbitrary replacements
Key Disposal
• Often need to keep data for 90 days
• Currently the data is written twice
• With column encryption:
• Roll keys daily
• Delete master key after 90 days
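With daily key rolling, disposal reduces to deleting master key versions that are older than the retention window: once a version is gone, the local keys it wrapped can no longer be decrypted. A minimal sketch of the selection step follows; the version naming scheme and `expired_key_versions` are hypothetical, and the actual deletion would go through whatever key server is in use.

```python
from datetime import date, timedelta

RETENTION_DAYS = 90


def expired_key_versions(created_on, today, retention_days=RETENTION_DAYS):
    """Return the key versions created more than retention_days before today.

    created_on maps a key version name to the date it was created.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(name for name, created in created_on.items() if created < cutoff)


# Example: one key version per day (hypothetical naming scheme).
versions = {
    "pii@2019-01-09": date(2019, 1, 9),
    "pii@2019-01-10": date(2019, 1, 10),
    "pii@2019-04-10": date(2019, 4, 10),
}
to_delete = expired_key_versions(versions, today=date(2019, 4, 10))
```

On 10 April 2019 the cutoff is 10 January 2019, so only the version created on 9 January is selected for deletion.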
ORC Encryption Design
• Write both variants of streams
• Masked unencrypted
• Unmasked encrypted
• Encrypt both data and statistics
• Maintain compatibility for old readers
• Read unencrypted variant
• Preserve ability to seek in file
ORC Write Pipeline
• Streams go through pipeline
• Run length encoding
• Compression (zlib, snappy, or lzo)
• Encryption
• Encryption is AES/CTR
• Allows seek
• No padding
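Counter mode allows seeking because the keystream for any byte depends only on the key, the IV, and the byte's position, so a reader can start decrypting in the middle of a stream. The sketch below shows that property with a stand-in: a SHA-256 hash of key, IV prefix, and block counter replaces the AES block cipher, since the Python standard library has no AES. It demonstrates the structure of CTR only and must not be used as real encryption; `keystream_block` and `ctr_xor` are hypothetical names.

```python
import hashlib
import zlib

BLOCK = 16  # AES block size in bytes


def keystream_block(key, iv_prefix, counter):
    """Stand-in for AES: derive one 16-byte keystream block. Not real AES."""
    data = key + iv_prefix + counter.to_bytes(8, "big")
    return hashlib.sha256(data).digest()[:BLOCK]


def ctr_xor(key, iv_prefix, data, offset=0):
    """Encrypt or decrypt data that begins at byte `offset` of the stream.

    XOR is its own inverse, so the same function does both directions.
    """
    out = bytearray()
    for i, byte in enumerate(data):
        position = offset + i
        block = keystream_block(key, iv_prefix, position // BLOCK)
        out.append(byte ^ block[position % BLOCK])
    return bytes(out)


# Pipeline order from the slide: compress first, then encrypt.
key = b"k" * 16
iv_prefix = b"\x00" * 8
compressed = zlib.compress(bytes(range(256)) * 4)
ciphertext = ctr_xor(key, iv_prefix, compressed)

# Seek: decrypt 20 bytes starting at offset 37 without touching earlier bytes.
piece = ctr_xor(key, iv_prefix, ciphertext[37:57], offset=37)
```

The ciphertext has exactly the same length as the compressed input, which is the "no padding" property, and `piece` equals `compressed[37:57]` even though decryption started mid-block.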
Conclusions
• ORC column encryption provides
• Transparent encryption
• Multi-paradigm column security
• Audit logging (via KMS logging)
• Static masking
• Supports file merging
• Different stripes can use different local keys
Integration with Other Tools
• Hive & Spark
• No change other than defining table properties
• Apache Hive’s LLAP
• Cache and fast processing of SQL queries
• Column encryption changes internal interfaces
• Cache both encrypted and unencrypted variants
• Ensure audit log reflects end-user and what they accessed
Limitations
• Need encryption policy for write
• Current Atlas & Ranger tags lag data
• Auto-discovery requires pre-access
• Changes to masking policy
• Need to re-write files
• Need additional data masks
• Credit card, addresses, etc.
• Decrypted local keys could be saved
Thank you!
Twitter: @owen_omalley
Email: owen@cloudera.com

ORC File Introduction
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 Intro
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in Hadoop
 

Fine Grain Access Control for Big Data: ORC Column Encryption

  • 1. Fine Grained Access Control for Big Data: ORC Column Encryption
      • Owen O’Malley, owen@cloudera.com, @owen_omalley
      • Srikanth Venkat, svenkat@cloudera.com, @srikvenk
      • March 2019
  • 2. Who Are We?
      • Owen: worked on Hadoop since Jan 2006; MapReduce, Security, Hive, and ORC; Founder & Technical Fellow
      • Srikanth: Senior Director, Product Management (Security & Governance portfolio); Apache Ranger, Apache Knox, Apache Atlas, ODPi; Security, Data Stewardship, Metadata, Governance areas
      • © Hortonworks Inc. 2011 – 2019. All Rights Reserved
  • 3. Security & Data Protection in Hadoop
  • 4. Example Data Lake Scenario (diagram)
      • Data sources: Marketing, Demographics, Electronic medical records, CRM, POS, Social, Weblogs & Feeds, Transactional, Mobile, IoT, Personal Data
      • On Premise Data Lakes: Cluster 1: Dublin; Cluster 2: San Francisco; Cluster 3: Prague
      • Cloud Data Lakes
      • Individual stores are labeled (Structured) or (Unstructured)
  • 5. What’s different about the Big Data context?
      • Breaking down silos: fantastic for analytics, but leads to increased security challenges
      • Centralized data lake with multi-tenancy requires secure (and easy) authentication and fine-grained authorization
      • Data democratization and the Data Scientist role (often a data superuser with elevated privileges)
      • Data is maintained over a long duration
      • Cloud and Hybrid architectures spanning data center and (multiple) public clouds further broaden the attack surface area and present novel authentication and authorization challenges
      • Along with adherence to security fundamentals and defense in depth, a data-centric approach to security becomes critical
  • 6. Securing your data lake (diagram)
      • Defense labels: Watch Towers, Limited Entry Points, Moat, High Hard Walls, Check Identity, Inner Walls
      • Technology labels: Kerberos; Firewall; Encryption, TLS, Key Trustee, Navigator Encrypt, Ranger KMS; LDAP/AD
      • Apache Knox: AuthN, API Gateway, Proxy, SSO
      • Apache Ranger: ABAC AuthZ, Audits, Anonymization
      • Apache Sentry: RBAC AuthZ
  • 7. Data Protection in Hadoop
      • Data protection must be applied at three different layers in Apache Hadoop
      • Storage: encrypt data while it is at rest. Transparent Data Encryption in HDFS, Navigator Key Trustee, Navigator Encrypt, Ranger KMS + HSM, Partner Products (HPE Voltage, Protegrity, Dataguise)
      • Transmission: encrypt data as it is in motion. Wire encryption (TLS, SASL, ...)
      • Upon Access: apply restrictions when accessed. Apache Ranger (Dynamic Column Masking + Row Filtering), Partner Masking + Encryption
  • 8. Encryption of Data in Hadoop
      • Volume encryption: protects data after physical theft or accidental loss of a disk volume. The entire volume is encrypted: very coarse-grained security. Does not protect against viruses or other attacks that occur while a system is running.
      • Application-level encryption: encryption within an application running on top of Hadoop. Supports a higher level of granularity and prevents "rogue admin" access. Adds a layer of complexity to the application architecture.
      • HDFS data-at-rest encryption: encrypts selected files and directories stored ("at rest") in HDFS. Uses specially designated HDFS directories known as "encryption zones." End-to-end encryption of data read from and written to HDFS. HDFS does not have access to unencrypted data or keys.
  • 9. Dynamic Row Filtering & Column Masking With Apache Ranger & Apache Hive
      • Source table ww_customers:
          Country | National ID | CC No            | DOB       | MRN        | Name          | Policy ID
          US      | 232323233   | 4539067047629850 | 9/12/1969 | 8233054331 | John Doe      | nj23j424
          US      | 333287465   | 5391304868205600 | 8/13/1979 | 3736885376 | Jane Doe      | cadsd984
          Germany | T22000129   | 4532786256545550 | 3/5/1963  | 876452830A | Ernie Schwarz | KK-2345909
      • Ranger policy enforcement: the query is rewritten based on dynamic Ranger policies, filtering rows by region and applying the relevant column masking
      • User 1: Joe (Location: US, Group: Analyst). Original query: SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers
          Country | National ID | CC No               | MRN  | Name
          US      | xxxxx3233   | 4539 xxxx xxxx xxxx | null | John Doe
          US      | xxxxx7465   | 5391 xxxx xxxx xxxx | null | Jane Doe
        Users from the US Analyst group see data for US persons, with CC and National ID (SSN) as masked values and MRN nullified
      • User 2: Ivanna (Location: EU, Group: HR). Original query: SELECT country, nationalid, name, mrn FROM ww_customers
          Country | National ID | Name          | MRN
          Germany | T22000129   | Ernie Schwarz | 876452830A
        EU HR policy admins see unmasked values but are restricted by row filtering policies to data for EU persons only
      • Groups shown: Analysts, HR, Marketing
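The effect of the row filter and column masks on slide 9 can be imitated in a few lines of Python. This is only a sketch of the outcome: Ranger enforces policies by rewriting the Hive query, and the policy dictionary, the helper names, and the `EU_COUNTRIES` set below are assumptions made up for illustration.

```python
# Illustrative only: mimics the *result* of Ranger's row filtering and
# column masking. All names here are hypothetical, not Ranger or Hive APIs.

ROWS = [
    {"country": "US", "nationalid": "232323233", "ccnumber": "4539067047629850",
     "mrn": "8233054331", "name": "John Doe"},
    {"country": "US", "nationalid": "333287465", "ccnumber": "5391304868205600",
     "mrn": "3736885376", "name": "Jane Doe"},
    {"country": "Germany", "nationalid": "T22000129", "ccnumber": "4532786256545550",
     "mrn": "876452830A", "name": "Ernie Schwarz"},
]

EU_COUNTRIES = {"Germany"}  # assumption: only the EU country present in the sample


def show_last4(value):
    """Mask everything except the last four characters."""
    return "x" * (len(value) - 4) + value[-4:]


def show_first4_card(value):
    """Keep the first four digits of a card number, mask the rest."""
    return value[:4] + " xxxx xxxx xxxx"


def nullify(value):
    """Replace the value with null."""
    return None


POLICIES = {
    "us_analyst": {
        "row_filter": lambda row: row["country"] == "US",
        "masks": {"nationalid": show_last4, "ccnumber": show_first4_card, "mrn": nullify},
    },
    "eu_hr": {
        "row_filter": lambda row: row["country"] in EU_COUNTRIES,
        "masks": {},  # HR sees unmasked values, but only for EU rows
    },
}


def apply_policy(rows, columns, policy):
    """Filter rows, then project the requested columns through any masks."""
    masks = policy["masks"]
    result = []
    for row in rows:
        if not policy["row_filter"](row):
            continue
        result.append({c: masks[c](row[c]) if c in masks else row[c] for c in columns})
    return result


analyst_view = apply_policy(
    ROWS, ["country", "nationalid", "ccnumber", "mrn", "name"], POLICIES["us_analyst"])
hr_view = apply_policy(ROWS, ["country", "nationalid", "name", "mrn"], POLICIES["eu_hr"])
```

Both users query the same table; only the policy attached to their group changes what comes back.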
  • 10. Framing the Problem
      • Related data, different security requirements
      • Authorization – who can see it
      • Audit – track who read it
      • Encrypt on disk – regulatory
      • File-level (or blob) granularity isn’t enough
      • File systems don’t understand columns
  • 11. Requirements
      • Readers should transparently decrypt data
      • If and only if the user has access to the key
      • The data must be decrypted locally
      • Columns are only decrypted as necessary
      • Master keys must be managed securely
      • Support for Key Management Server & hardware
      • Support for key rolling
  • 12. Partial Solutions
  • 13. Partial Solution – HDFS Encryption
      • Transparent HDFS Encryption
      • Encryption zones
      • HDFS directory trees
      • Unique master key for each zone
      • Client decrypts data
      • Key Management via KeyProvider API
  • 14. HDFS Encryption Limitations
      • Very coarse protection
      • Only entire directory subtrees
      • No ability to protect columns
      • A lot of users need access to keys
      • Moving data between zones is painful
      • When writing with Hive, data is moved multiple times per query
  • 15. Hive Server 2 Limitations
      • Limits access to Hive SQL
      • Only user ‘hive’ has access
      • Breaks Hadoop’s multi-paradigm data access
      • Many customers use both Hive & Spark
      • JDBC is not distributed
      • New Spark to LLAP connector addresses this
  • 16. Partial Solution – Separate tables
      • Split private information out of tables
      • Separate directories in HDFS
      • HDFS and/or HS2 authorization
      • Enables HDFS encryption
      • Limitations
      • Need to join with other tables
      • Higher operational overhead
  • 17. Partial Solution – Encryption UDF
      • Hive has user defined functions
      • aes_encrypt and aes_decrypt
      • Limitations
      • Key management is problematic
      • Encryption is not seeded
      • Size of value leaks information
  • 18. Solution
  • 19. Columnar Encryption
      • Columnar file formats (e.g. ORC)
      • Write data in columns
      • Column projection
      • Better compression
      • Encryption works really well
      • Only encrypt bytes for column
      • Can store multiple variants of data
  • 20. ORC File Format (diagram)
      • File layout: a sequence of stripes (~200MB each), followed by the File Footer and Postscript
      • Each stripe contains Index Data, Row Data, and a Stripe Footer
      • Within a stripe, data is laid out by column (Column 1 through Column 8 shown)
      • Each column is stored as streams (Stream 2.1 through Stream 2.4 shown)
  • 21. User Experience
      • Set table properties for encryption
      • orc.encrypt.pii = "ssn,email"
      • orc.encrypt.credit = "card_info"
      • Define where to get the encryption keys
      • Configuration defines the key provider via URI
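To make the table-property convention on slide 21 concrete, the sketch below groups columns by the master key named in each property. The property names are copied from the slide and may differ from the syntax of a released ORC version, so check the ORC documentation before relying on them. `columns_by_key` is a hypothetical helper, not part of ORC or Hive.

```python
# Hypothetical helper: reads "orc.encrypt.<key name>" table properties, as
# written on the slide, into a {master key name: [columns]} mapping.
PREFIX = "orc.encrypt."


def columns_by_key(table_properties):
    """Return the columns each named master key is meant to protect."""
    result = {}
    for name, value in table_properties.items():
        if name.startswith(PREFIX):
            key_name = name[len(PREFIX):]
            result[key_name] = [c.strip() for c in value.split(",") if c.strip()]
    return result


props = {
    "orc.encrypt.pii": "ssn,email",
    "orc.encrypt.credit": "card_info",
    "orc.compress": "ZLIB",  # unrelated property, ignored by the helper
}
mapping = columns_by_key(props)
```

Naming the key per use case ("pii", "credit") rather than per column is what lets one key policy cover every table that holds that kind of data.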
  • 22. Key Management
      • Create a master key for each use case
      • “pii”, “pci”, or “hipaa”
      • Each column in each file uses unique local key
      • Allows audit of which users read which files
      • Ranger policies limit access to keys
      • Who, What, When, Where
  • 23. KeyProvider API
      • Provides limited access to encryption keys
      • Encrypts or decrypts local keys
      • Users are never given master keys
      • Key versions and key rolling of master keys
      • Allows 3rd party plugins
      • Supports Cloud, Hadoop or Ranger KMS
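The split between master keys and local keys on slides 22 and 23 is a form of envelope encryption. The sketch below is a toy model under loud assumptions: `ToyKms` is invented for illustration and is not the Hadoop KeyProvider API, and the key wrapping XORs the local key with an HMAC-SHA256 pad as a standard-library stand-in for the AES a real KMS would use. It must not be used to protect real data.

```python
import hashlib
import hmac
import secrets


class ToyKms:
    """Stand-in for a key management server. Master keys never leave it."""

    def __init__(self):
        self._keys = {}  # key name -> list of master key versions

    def create_key(self, name):
        self._keys[name] = [secrets.token_bytes(32)]

    def roll_key(self, name):
        """Add a new master key version; older versions stay usable."""
        self._keys[name].append(secrets.token_bytes(32))

    def current_version(self, name):
        return len(self._keys[name]) - 1

    def _pad(self, name, version, nonce, length):
        # Toy wrapping pad (at most 32 bytes). A real KMS would use AES.
        master = self._keys[name][version]
        return hmac.new(master, nonce, hashlib.sha256).digest()[:length]

    def encrypt_local_key(self, name, local_key):
        """Wrap a local key under the current master key version."""
        version = self.current_version(name)
        nonce = secrets.token_bytes(16)
        pad = self._pad(name, version, nonce, len(local_key))
        wrapped = bytes(a ^ b for a, b in zip(local_key, pad))
        return version, nonce, wrapped

    def decrypt_local_key(self, name, version, nonce, wrapped):
        """Unwrap a local key. This is where access checks and audit would sit."""
        pad = self._pad(name, version, nonce, len(wrapped))
        return bytes(a ^ b for a, b in zip(wrapped, pad))


kms = ToyKms()
kms.create_key("pii")
local_key = secrets.token_bytes(16)            # random key for one column in one file
version, nonce, wrapped = kms.encrypt_local_key("pii", local_key)
kms.roll_key("pii")                            # new files use version 1 from now on
recovered = kms.decrypt_local_key("pii", version, nonce, wrapped)
```

Because every read has to ask the KMS to unwrap a local key, the KMS log records which user opened which file, which is the audit property the slides describe.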
  • 24. Encryption Data Flow
  • 25. Encryption Flow
      • Local key
      • Random for each encrypted column in file
      • Encrypted w/ master key by KMS
      • Encrypted local key is stored in file metadata
      • IV is generated to be unique
      • Column, kind, stripe, & counter
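A unique IV can be built by packing the four fields named on slide 25 into the 16 bytes that AES/CTR expects. The field widths below (3 bytes column, 2 bytes stream kind, 3 bytes stripe, 8 bytes counter) are an assumption for illustration; the slide does not state them, and `make_iv` is a hypothetical name.

```python
def make_iv(column_id, stream_kind, stripe_id, counter=0):
    """Pack column, stream kind, stripe, and block counter into a 16-byte IV.

    Field widths are assumed for this sketch: 3 + 2 + 3 + 8 = 16 bytes.
    """
    return (
        column_id.to_bytes(3, "big")
        + stream_kind.to_bytes(2, "big")
        + stripe_id.to_bytes(3, "big")
        + counter.to_bytes(8, "big")
    )


iv_a = make_iv(column_id=1, stream_kind=2, stripe_id=3)
iv_b = make_iv(column_id=1, stream_kind=2, stripe_id=4)  # next stripe, same column
```

Since the local key is unique per column per file and the IV is unique per stream and stripe, no two streams are encrypted with the same key and IV pair, which is the condition counter mode needs to stay secure.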
  • 26. Static Data Masking
      • What happens without key access?
      • Define static masks
      • Nullify – all values become null
      • Redact – mask values ‘Xxxxx Xxxxx!’
      • Can define ranges to unmask
      • SHA256 – replace with SHA256
      • Custom - user defined
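The three built-in masks on slide 26 are simple enough to sketch directly. The functions below are illustrative stand-ins, not ORC's implementation. Replacing digits with `9` and leaving punctuation untouched are assumptions chosen to reproduce the slide's ‘Xxxxx Xxxxx!’ example, and the `unmask` parameter is a hypothetical way to express "ranges to unmask".

```python
import hashlib


def nullify(value):
    """Nullify: every value becomes null."""
    return None


def redact(value, unmask=None):
    """Redact: replace letters and digits by class, keeping the value's shape.

    `unmask` is an optional (start, end) index range left in the clear.
    """
    start, end = unmask if unmask else (0, 0)
    out = []
    for i, ch in enumerate(value):
        if start <= i < end:
            out.append(ch)          # inside the unmasked range
        elif ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("9")         # assumption: digits map to 9
        else:
            out.append(ch)          # assumption: punctuation and spaces kept
    return "".join(out)


def sha256_mask(value):
    """SHA256: replace the value with its hex SHA-256 digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


masked_text = redact("Hello World!")
masked_card = redact("4539067047629850", unmask=(12, 16))
```

Unlike the dynamic Ranger masks shown earlier, these masks are applied once at write time, so readers without key access only ever see the masked variant stored in the file.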
  • 27. Data Masking
      • Anonymization is hard!
      • AOL search logs
      • Netflix prize datasets
      • NYC taxi dataset
      • Always evaluate security tradeoffs
      • Tokenization is a useful technique
      • Assign arbitrary replacements
  • 28. Key Disposal
      • Often need to keep data for 90 days
      • Currently the data is written twice
      • With column encryption:
      • Roll keys daily
      • Delete master key after 90 days
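The key disposal scheme on slide 28 amounts to a retention policy on key versions rather than on data. A minimal sketch, assuming one key version per calendar day and a hypothetical `DailyKeys` store (not a real KMS interface):

```python
import secrets
from datetime import date, timedelta

RETENTION_DAYS = 90


class DailyKeys:
    """Toy store holding one master key version per day."""

    def __init__(self):
        self._versions = {}  # date -> key material

    def roll(self, day):
        """Create the key version used for data written on `day`."""
        self._versions[day] = secrets.token_bytes(32)

    def purge(self, today):
        """Delete key versions older than the retention window."""
        cutoff = today - timedelta(days=RETENTION_DAYS)
        expired = [d for d in self._versions if d < cutoff]
        for d in expired:
            del self._versions[d]
        return expired

    def can_read(self, written_on):
        """Data stays readable only while its key version still exists."""
        return written_on in self._versions


keys = DailyKeys()
start = date(2019, 1, 1)
for offset in range(100):                  # roll a new key every day for 100 days
    keys.roll(start + timedelta(days=offset))

today = start + timedelta(days=99)
expired = keys.purge(today)
```

Once a key version is deleted, the columns encrypted under it become unreadable even though the files themselves are untouched, so the data does not have to be rewritten to expire it.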
  • 29. ORC Encryption Design
      • Write both variants of streams
      • Masked unencrypted
      • Unmasked encrypted
      • Encrypt both data and statistics
      • Maintain compatibility for old readers
      • Read unencrypted variant
      • Preserve ability to seek in file
  • 30. ORC Write Pipeline
      • Streams go through pipeline
      • Run length encoding
      • Compression (zlib, snappy, or lzo)
      • Encryption
      • Encryption is AES/CTR
      • Allows seek
      • No padding
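Why counter mode "allows seek" and needs "no padding" can be shown with a toy pipeline. The sketch skips run length encoding and, because the Python standard library has no AES, derives the keystream from SHA-256 over key, IV prefix, and block counter. That substitution is a stand-in for AES/CTR chosen only to keep the example self-contained; it is not what ORC does and is not suitable for real use. All names are hypothetical.

```python
import hashlib
import zlib

BLOCK = 32  # keystream bytes produced per counter value (SHA-256 digest size)


def keystream_block(key, iv_prefix, counter):
    """Stand-in for one AES/CTR keystream block."""
    return hashlib.sha256(key + iv_prefix + counter.to_bytes(8, "big")).digest()


def ctr_xor(key, iv_prefix, data, offset=0):
    """Encrypt or decrypt `data`, which begins `offset` bytes into the stream.

    XOR with the keystream is its own inverse, so one function does both.
    """
    out = bytearray(len(data))
    block_no, block = -1, b""
    for i, byte in enumerate(data):
        pos = offset + i
        if pos // BLOCK != block_no:
            block_no = pos // BLOCK
            block = keystream_block(key, iv_prefix, block_no)
        out[i] = byte ^ block[pos % BLOCK]
    return bytes(out)


key = b"k" * 16                                  # placeholder local key
iv_prefix = bytes([0, 0, 1, 0, 2, 0, 0, 3])      # column, kind, stripe (assumed layout)
plain = ",".join(str(i * i) for i in range(500)).encode("ascii")

compressed = zlib.compress(plain)                # compress first, then encrypt
encrypted = ctr_xor(key, iv_prefix, compressed)

# Seek: decrypt bytes 40..59 without touching anything before them.
middle = ctr_xor(key, iv_prefix, encrypted[40:60], offset=40)
```

Compression has to come before encryption because encrypted bytes look random and no longer compress. The ciphertext is exactly as long as the compressed input, so the stream offsets recorded in the index stay valid.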
  • 31. Conclusions
  • 32. Conclusions
      • ORC column encryption provides
      • Transparent encryption
      • Multi-paradigm column security
      • Audit logging (via KMS logging)
      • Static masking
      • Supports file merging
      • Different stripes with different local key
  • 33. Integration with Other Tools
      • Hive & Spark
      • No change other than defining table properties
      • Apache Hive’s LLAP
      • Cache and fast processing of SQL queries
      • Column encryption changes internal interfaces
      • Cache both encrypted and unencrypted variants
      • Ensure audit log reflects end-user and what they accessed
  • 34. Limitations
      • Need encryption policy for write
      • Current Atlas & Ranger tags lag data
      • Auto-discovery requires pre-access
      • Changes to masking policy
      • Need to re-write files
      • Need additional data masks
      • Credit card, addresses, etc.
      • Decrypted local keys could be saved
  • 35. Thank you!
      • Twitter: @owen_omalley
      • Email: owen@cloudera.com