Best Practices to Deploy a Governed Data Lake

© 2015 MapR Technologies 1© 2015 MapR Technologies
Deploying a Governed Data Lake

© 2015 MapR Technologies 2
Welcome
• Event will be recorded
• Ask your questions in the Q&A Panel in the lower right-hand
corner of your screen
• Tweet us @mapr during the event

Key Points
• The data lake is becoming a “real-time” shared service to provide
data to the business to support data science and big data
analytics needs
• As the data lake becomes a trusted source of data to drive big
data analytics, security and data governance have to be
addressed
• Security and data governance policies need to be implemented
in a way that still enables self-service and quick time to value vs.
creating 3-6 month delays

Deliver Data Discovery Agility with a Governed “Data Layer”
Adhere to security,
compliance and data
governance policies
Catalog data assets at scale,
with secure provisioning to
the business
Find and understand best-
suited and most trusted data

The danger of the data lake becoming a flea market
Botond Horvath / Shutterstock.com
INVENTORY
DATA
Can’t create and maintain an
inventory fast enough
Big Data Architect INVENTORY
DATA
Can’t explore everything to find
the best item
Data Engineer/Data
Scientist/Business Analyst
INVENTORY
DATA
Can’t tell what’s what and what
can be trusted
CDO/Data Steward

Imagine shopping on Amazon.com
GOVERNANCE
Inventory
Find and Understand
Provision

Governed data lake is like Amazon.com for data in Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision

Sources
RELATIONAL,
SAAS,
MAINFRAME
DOCUMENTS,
EMAILS
LOG FILES,
CLICKSTREAMS
SENSORS
BLOGS,
TWEETS,
LINK DATA
Analytics
Search
Schema-less
data exploration
BI, reporting
Ad-hoc integrated
analytics
Operational
Apps
Recommendation
Fraud Detection
Logistics
MapR-DB MapR-FS
MapR Data Platform
Distribution including
Apache Hadoop
The Governed Data Lake on Apache Hadoop
Data Inventory:
Find, understand
and govern

The Governed Data Lake
Define Ingest Inventory Explore Provision
Wrangle/Model/Vi
sualize
• Critical data elements
• Sensitive data elements
• Security and data
governance policies
• Load
• Profile
• Automatic tagging
• Discover metadata
and generate tags
• Discover data lineage
• Manage tags
• Browse/search
inventory
• Inspect data quality
• Tag and annotate
• Bookmark
• Copy
• Authorized view
Governed data lake as a shared service
Data Governance Data Discovery Agility
Data protection, authentication, authorization, auditing
Can you achieve both?

Find, understand and govern data in Hadoop

Waterline Data is like Amazon.com for data in Hadoop
GOVERNANCE
Inventory
Find and Understand
Provision

Inventory

Find and Understand

Provision
Future: Generate
Drill Views

Governance

Sources
RELATIONAL,
SAAS,
MAINFRAME
DOCUMENTS,
EMAILS
LOG FILES,
CLICKSTREAMS
SENSORS
BLOGS,
TWEETS,
LINK DATA
Analytics
Search
Schema-less
data exploration
BI, reporting
Ad-hoc integrated
analytics
Operational
Apps
Recommendation
Fraud Detection
Logistics
MapR-DB MapR-FS
MapR Data Platform
Distribution including
Apache Hadoop
The Governed Data Lake on Apache Hadoop with MapR
Data Inventory:
Find, understand
and govern

Separate Distinct Data Sets via MapR Volumes
Volumes dramatically simplify
management:
• Replication factor
• Scheduled mirroring
• Scheduled snapshots
• Data placement control
• User access and tracking
• Administrative permissions
/projects
/tahoe
/yosemite
/user
/msmith
/bjohnson

MapR Trust Model (Product Security)
Flexible
Authentication
• Wire-level authentication for all
services in the cluster
• NSA-level cryptographic algorithms
• Integration with LDAP, Active
Directory and other third party
directory services
• Kerberos or username/password
authentication
1
A
AA
DP
Granular
Authorization
• Access Control Expressions
• Protect files, tables, column families,
columns, and management objects
• Extend to role-based access control
(RBAC) with custom role functions
• Drill Views
2Robust
Auditing
• All events recorded immediately
in JSON log files
• Includes data access and
administrative actions
• Ad-hoc queries and custom
reports on audit logs via SQL and
standard BI tools
3
Ubiquitous
Data Protection
• Encryption for Data in Motion
• Within a Cluster
• Between Clusters
• Between Client and Cluster
• Encryption for Data at Rest
• LUKS
• Self-Encrypting Disk
• Partners
4

MapR Comprehensive Auditing
Serving Security Analysts…
Monitoring
Incident
Response
• Who touched customer records outside of
business hours?
• What actions did users take in the days
before leaving the company?
• What operations were performed without
following change control?
• Are users accessing sensitive files from
protected/secured source IPs?
• Why do my reports look different, despite
sourcing from same underlying data?
Security

MapR Comprehensive Auditing (cont.)
…And Data Scientists Too
• Which data is used most frequently?
Implication: High Value; Share More
Broadly
• Which data is least commonly used?
Implication: Low Value; Candidate
for Purge
• Which data should be used more?
Implication: Underutilized; Increase
Awareness
• What administrative actions are
most commonly performed?
Implication: Candidate for
automation
Predictive Analytics

MapR Audits – Key Features
Data Access
• Files
• MapR-DB Tables
Cluster Operations
• Administrative Operations
• Maprcli commands
Authentication Requests
Secure
High Performance
Flexible
• Retention Period
• Maxsize
• Coalesce Interval
JSON Format
{"timestamp":"{$date=2015-06-
01T05:24:58.231Z}","operation":"GETATTR",
"user":"root","uid":"0","ipAddress":"10.10.x.x",
"nfsServer":"10.10.x.x","srcPath":"/dbtest.0/","
srcFid":"2147.16.2","VolumeName":“mktg_file
s","volumeId":“mktg_files","status":"0"}

Access Control that Scales
PAM Authentication +
User Impersonation
Fine-grained row and
column level access control
with Drill Views – no
centralized security
repository required
Files HBase Hive
Drill
View 1
Drill
View 2
UUU
User
User

Ownership Chaining
Combine Self Service Exploration with Data Governance
Name City State Credit Card #
Dave San Jose CA 1374-7914-3865-4817
John Boulder CO 1374-9735-1794-9711
Raw File (/raw/cards.csv)
Name City State Credit Card #
Dave San Jose CA 1374-1111-1111-1111
John Boulder CO 1374-1111-1111-1111
Data Scientist (/views/V_Scientist)
Jane (Read)
John (Owner)
Name City State
Dave San Jose CA
John Boulder CO
Analyst(/views/V_Analyst)
Jack (Read)
Jane(Owner)
RAWFILEV_ScientistV_Analyst
Does Jack have access to V_Analyst? ->YES
Who is the owner of V_Analyst? ->Jane
Drill accesses V_Analyst as Jane (Impersonation hop 1)
Does Jane have access to V_Scientist ? -> YES
Who is the owner of V_Scientist? ->John
Drill accesses V_Scientist as John (Impersonation hop 2)
John(Owner)
Does John have permissions on raw file? -> YES
Who is the owner of raw file? ->John
Drill accesses source file as John (no impersonation here)
Jack queries the view V_Analyst
*Ownership chain length (# hops) is configurable
Ownership
chaining
Access
path

Find, Understand and Govern Data in Hadoop
At Scale and in Real-Time
Discover and protect
sensitive data, audit
and authorize access
to the data lake,
discover data lineage,
and provide data
stewardship
CDO/Data Steward
Automate cataloging of
data assets at scale,
with secure
provisioning to
business users
Big Data Architect
Find and understand
best-suited and most
trusted data without
having to explore
every file manually
Data Engineer/Data
Scientist/Business Analyst

Learn More
www.waterlinedata.com
• Watch the solution video
• Read analyst papers
• Download the free Waterline
Data / MapR sandbox
• Request a demo
• Download and evaluate the
product
www.mapr.com
• Get free On-Demand
Training for Hadoop
• Download the free Waterline
Data / MapR sandbox

Best Practices to Deploy a Governed Data Lake

Recommandé

Recommandé

Contenu connexe

Plus de MapR Technologies

Plus de MapR Technologies (20)

Dernier

Dernier (20)

Best Practices to Deploy a Governed Data Lake