The frustration of working in the data industry is that so much time is spent finding, understanding, cleaning and reorganising data rather than putting it to good use. The cause comes down to a gap in the capabilities of our data processing platforms.
In software engineering we teach people that data is private to an application and should only be accessed through the application interface. However, the moment we want to do any form of analysis, we rip the data out of the application, copy it around and start using it for different projects. Very quickly the original context of the data is lost and downstream users waste time reconstructing it.
ODPi Egeria is an open source project delivering embeddable metadata management libraries and interchange technology for our data platforms, so that metadata can flow with the data in a form that is accessible to tools from many vendors. This open metadata management is coupled with open governance APIs that enable business owners to set policies that are then pushed down into the data platforms, engines and tools, simplifying compliance with regulatory requirements and the protection of valuable data assets.
The technology includes a comprehensive metadata type model seeded from many popular standards and enhanced with semantics and governance concepts. The underlying metamodel is a graph designed to be distributed across multiple heterogeneous metadata servers. Metadata is then accessible through replication, event notification and federated queries, so that it is shared and linked to build a rich body of knowledge around the data.
In this presentation I will cover the basic mechanisms of Egeria and how its use across our data platforms and tools could revolutionise the data industry.
2. https://github.com/odpi/egeria
AI is having an increasing impact on every aspect of modern life
Energy & Utilities, Financial Services, Government, Manufacturing, Healthcare, Insurance, Retail, Telecommunication, High Tech, Hospital, Oil & Gas, Travel & Hotel, Transportation, Multi-channel integration, Stock Market
7. Curation
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
"I wonder what this means"
"I know"
8. Metadata should bring as much information about the data sets to Callie’s data science as is known collectively by the organization.
Employee Directory (columns: Name, Band, Job Title)
Data Set Name: Employee Directory
Description: Core attributes describing all employees of Coco Pharmaceuticals created from a daily extract from Kenexa.
Owner: Penny Payer
Status:
Last accessed: 6th May 2016
Records: 3488
Last Update: 1st May 2016
Contents: Structure … Contents … Lineage …
Column: Band
Classification Ranges:
Confidentiality: Public, Confidential, Sensitive
Confidence: Authoritative
Retention: Indefinitely
Characteristics, Lineage
Description: Position reference number for non-exempt employees. The value ranges from 01 to 06 where 01 is the most senior and 06 is the most junior.
Type: String
Classification: Public
9. Scared to share
Faith Broker, Human Resources
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 56944 045 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 43800 215 27 Code St Harlem NY 1 3
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 ##### ### 27 Code St Harlem NY 1 3
00 3809890 3 7 Callie Quartile 328080 7432 5 New York 4 27 Data Scientist 1 ##### ### 27 Code St Harlem NY 1 3
00 3809890 1 7 Tanya Tidie 209482 4051 2 New York 4 27 Data Steward 1 ##### ### 27 Code St Harlem NY 1 3
Callie Quartile, Data Scientist
Very Sensitive Data
15. Using glossary function for semantic processing
Business metadata (glossary terms): Employee, Work Location, Annual Salary, Job Title, Employee Id, Employee Name, Hourly Pay Rate, Manager, Compensation Plan, Sensitive Data
Relationships between glossary terms: HAS-A, IS-A
Structural metadata for a data store: EMPLOYEE RECORD (EMPNAME, EMPNO, JOBCODE, SALARY)
00 3809890 6 7 Lemmie Stage 818928 3082 4 New York 4 27 DataStage Expert 1 45324 300 27 Code St Harlem NY 1 3
19. Importance of the Graph Model – Using Entity Proxies
Diagram labels: Server 1, Server 2, Server 3; Database Column; Glossary Term; Meaning; Entity Proxy
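The idea behind the diagram can be sketched in a few lines of Python. This is only an illustrative model, not Egeria's API: the class names, the `to_proxy` and `resolve` helpers, the GUIDs and the choice of which server holds which element are all assumptions made for the example. It shows a "Meaning" relationship stored on one server whose two ends are lightweight proxies for entities that live on other servers.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    """Full metadata instance, stored only on its home server."""
    guid: str
    type_name: str
    home_server: str
    properties: dict


@dataclass(frozen=True)
class EntityProxy:
    """Stub with just enough detail to identify an entity held elsewhere."""
    guid: str
    type_name: str
    home_server: str


@dataclass
class Relationship:
    type_name: str
    end1: EntityProxy
    end2: EntityProxy


def to_proxy(entity):
    """Build a proxy from a full entity (hypothetical helper)."""
    return EntityProxy(entity.guid, entity.type_name, entity.home_server)


@dataclass
class MetadataServer:
    name: str
    entities: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)

    def add_entity(self, guid, type_name, properties):
        entity = Entity(guid, type_name, self.name, properties)
        self.entities[guid] = entity
        return entity

    def link(self, type_name, end1, end2):
        """Store a relationship using proxies, not copies of the entities."""
        relationship = Relationship(type_name, to_proxy(end1), to_proxy(end2))
        self.relationships.append(relationship)
        return relationship


def resolve(proxy, servers):
    """Fetch the full entity from the proxy's home server."""
    return servers[proxy.home_server].entities[proxy.guid]


# Assumed placement: column on Server 1, term on Server 3, link on Server 2.
servers = {n: MetadataServer(n) for n in ("Server 1", "Server 2", "Server 3")}
column = servers["Server 1"].add_entity("col-1", "Database Column", {"name": "SALARY"})
term = servers["Server 3"].add_entity("term-1", "Glossary Term", {"name": "Annual Salary"})
meaning = servers["Server 2"].link("Meaning", column, term)

print(resolve(meaning.end2, servers).properties["name"])  # prints "Annual Salary"
```

In this sketch Server 2 never stores the column or the term itself, yet a caller can still walk the relationship and fetch the full entity from its home server when needed.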
24. Coco Pharmaceuticals persona
Jules Keeper, CDO
Tessa Tube, Chief Researcher
Erin Overview, Information Architect
Faith Broker, Chief Privacy Officer
Bob Nitter, Integration Developer
Callie Quartile, Data Scientist
Nancy Noah, Cloud Specialist
Gary Geeke, IT Infrastructure
https://odpi.github.io/data-governance/coco-pharmaceuticals/personas/
26. Different personas need different services
Callie Quartile, Data Scientist: Find data; Understand data; Manage analytics models
Jules Keeper, Chief Data Officer: Build data strategy; Define governance program; Monitor progress
27. Different personas need different services
Tanya Tidie, Clinical Trials Administrator: Maintain accurate patient records; Catalog clinical trials data; Demonstrate good data management practices
Ivor Padlock, Chief Security Officer: Understand risks to organization; Set up protection; Monitor for suspicious activity
29. Open metadata type model summary
Glossary
Collaboration
Governance
Models and Reference Data
Metadata Discovery
Lineage
Data Assets
Base Types, Systems and Infrastructure
30. Each area caters for appropriate metadata structures
Diagram of the metadata areas, numbered 0 to 7, with these labelled contents:
Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics)
Governance Actions and Processes
Business Objects and Relationships, Taxonomies and Ontologies
Business Attributes
Organization
Teaming Metadata (people profiles, communities, projects, notebooks, …)
Models and Schemas
Physical Asset Descriptions (Data stores, APIs, models and components)
Asset Collections (Sets, Typed Sets, Type Organized Sets)
Information Views
Rights Management
Reference Data
Feedback Metadata (tags, comments, ratings, …)
Classification Schemes
Strategy
Subject Area Definition
Campaigns and Projects
Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …)
Information Process Instrumentation (design lineage)
Connectors
Basic Types, Infrastructure and Systems
Labels on the links between areas: Augmentation, Mapping, Implementation, Classification, Rollout, Instrument, Association, Access
31. Current Open Metadata Access Services (OMASs)
Project Management
Community Profile
Asset Catalog
Stewardship Action
Information View
Governance Program
Data Process
Subject Area
Connected Asset
Discovery Engine
Governance Engine
Data Protection
Software Developer
Data Platform
Asset Owner
Digital Architecture
Data Science
DevOps
Asset Consumer
Data Infrastructure
Data Privacy
Asset Lineage
32. Realizing open metadata and governance
Delivering core technology
Recruiting vendors
Assisting practitioners
Diagram labels: Vendors, Practitioners, Core Technology, Compliance Suite, Best Practices, Project Egeria, Project Data Governance
33. Help wanted
Governance practice leaders needed to build out best practices
If you buy data technology, please encourage your vendors to consume the Egeria technology.
Looking for developers: UI development, graph repository (e.g. JanusGraph/TinkerPop), Python clients
Join the ODPi to help fund our work
Tell everyone about what we do
Business metadata describes the data that the business needs, what it means and how it should be classified and protected.
Structural metadata describes how the data is actually stored and labelled in the data store.
The linkage between the business and structural metadata allows our technology to switch between these two perspectives. For example:
A request for data expressed in business terminology can be translated into a query for data from a data store.
An integration engine copying data into a sandbox can discover which fields the business classifies as sensitive and then mask these values dynamically.
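Both examples can be illustrated with a small Python sketch. It is not Egeria code: the mapping of glossary terms to the EMPNAME, EMPNO, JOBCODE and SALARY columns, the choice of "Annual Salary" as the sensitive term, the table name `EMPLOYEE_RECORD`, the function names and the sample row values are all assumptions made for illustration.

```python
# Hypothetical link between glossary terms (business metadata) and the
# columns of the EMPLOYEE RECORD data store (structural metadata).
TERM_TO_COLUMN = {
    "Employee Name": "EMPNAME",
    "Employee Id": "EMPNO",
    "Job Title": "JOBCODE",
    "Annual Salary": "SALARY",
}

# Assumed classification: terms the business has marked as sensitive data.
SENSITIVE_TERMS = {"Annual Salary"}

# Reverse lookup so a column name can be traced back to its business term.
COLUMN_TO_TERM = {column: term for term, column in TERM_TO_COLUMN.items()}


def build_query(business_terms, table="EMPLOYEE_RECORD"):
    """Translate a request in business terms into a query on the data store."""
    columns = [TERM_TO_COLUMN[term] for term in business_terms]
    return f"SELECT {', '.join(columns)} FROM {table}"


def mask_sensitive(row):
    """Replace values whose column maps to a sensitive term with '#' characters."""
    return {
        column: ("#" * len(str(value))
                 if COLUMN_TO_TERM.get(column) in SENSITIVE_TERMS
                 else value)
        for column, value in row.items()
    }


print(build_query(["Employee Name", "Job Title"]))
# prints: SELECT EMPNAME, JOBCODE FROM EMPLOYEE_RECORD

# Illustrative row; the values are placeholders, not a real record layout.
row = {"EMPNAME": "Callie Quartile", "EMPNO": "328080",
       "JOBCODE": "Data Scientist", "SALARY": "56944"}
print(mask_sensitive(row)["SALARY"])
# prints: #####
```

Neither function names a sensitive column directly. Both go through the glossary terms, so reclassifying a term changes what is masked without touching the copying logic.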