Cloudera Navigator
Outline
● Capabilities
● Architecture
● Quick Demo
● Q&A
Capabilities
● Discovery
  ○ Search through metadata to find a data set or operation of interest.
  ○ View the schema, associated metadata, etc. for a data set.
● Lineage
  ○ Given a data set, trace back to the original source.
  ○ Understand the impact of modifying a data set.
● Audit
  ○ Generate a report of access to a data set in Hadoop.
  ○ Generate an alert when a restricted data set is accessed.
Discovery & Lineage (Questions to Ask)
● Ad-hoc or only predefined?
● Granularity?
● Analysis?
Discovery & Lineage (Supported Systems)
● HDFS
● Hive
● MR1
● Oozie
● Pig
● YARN
● ...More coming...
Discovery (Metadata Search)
Discovery (View Schema)
Discovery (Augment Metadata)
Discovery (Search on associated metadata)
Sidecars (Colocation of associated metadata)
Data file: /user/root/customers/cust_demo
Sidecar file: /user/root/customers/.cust_demo.navigator
Contents of .cust_demo.navigator:
{
  "properties" : {
    "secret" : "true",
    "retention" : "small"
  },
  "tags" : ["pci"]
}
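The sidecar naming convention shown above (a dot-prefixed file with a .navigator suffix next to the data file) can be sketched in a few lines. This is a minimal illustration that only builds the path and the JSON text; the function names sidecar_path and sidecar_contents are hypothetical, and nothing here is part of the Navigator API or writes to HDFS.

```python
import json
import posixpath


def sidecar_path(data_path):
    """Return the sidecar path for a data file, following the slide's convention."""
    # HDFS paths use forward slashes, so posixpath is used regardless of the local OS.
    directory, name = posixpath.split(data_path.rstrip("/"))
    return posixpath.join(directory, "." + name + ".navigator")


def sidecar_contents(properties, tags):
    """Serialize properties and tags in the shape shown on the slide."""
    return json.dumps({"properties": properties, "tags": tags}, indent=2)


if __name__ == "__main__":
    print(sidecar_path("/user/root/customers/cust_demo"))
    print(sidecar_contents({"secret": "true", "retention": "small"}, ["pci"]))
```

Keeping the metadata in a colocated file means it travels with the data set when a directory is copied or moved as a whole.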
Lineage (Hive Query)
INSERT OVERWRITE TABLE machine_vendors
SELECT
  upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)", 1))) AS manufacturer,
  upper(trim(regexp_extract(ms.dmidecode, "System Information\n\tManufacturer: ([^\n]+)\n\tProduct Name: ([^\n]+)", 2))) AS product,
  ca.address_state,
  ca.customerKey,
  cm.clusterId,
  ms.machineName
FROM crm_accounts ca
JOIN cluster_metadata cm
  ON ca.customerKey = cm.customerKey
JOIN machine_stats ms
  ON cm.customerKey = ms.customerKey
  AND cm.clusterId = ms.clusterId
  AND cm.collectionTS = ms.collectionTS
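To make the two regexp_extract calls in the query easier to follow, here is a rough Python equivalent. It assumes the patterns contain newline and tab escapes (as dmidecode output is laid out) and uses a made-up SAMPLE string; returning an empty string when nothing matches is a choice made for this sketch, not a statement about Hive's behavior.

```python
import re

# Hypothetical dmidecode fragment; real output contains many more fields.
SAMPLE = "System Information\n\tManufacturer: Dell Inc. \n\tProduct Name: PowerEdge R720\n"

MANUFACTURER = re.compile(r"System Information\n\tManufacturer: ([^\n]+)")
PRODUCT = re.compile(
    r"System Information\n\tManufacturer: ([^\n]+)\n\tProduct Name: ([^\n]+)"
)


def extract(pattern, text, group):
    """Mimic upper(trim(regexp_extract(text, pattern, group)))."""
    match = pattern.search(text)
    if match is None:
        return ""  # sketch-only fallback
    return match.group(group).strip().upper()


if __name__ == "__main__":
    print(extract(MANUFACTURER, SAMPLE, 1))
    print(extract(PRODUCT, SAMPLE, 2))
```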
Lineage
Lineage (Path highlighted)
Lineage (Instance)
Lineage (Template)
Lineage (Pig Script)
posts = LOAD 'stackoverflow/posts/posts.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage()
AS (id:int, postTypeId:int, acceptedAnswerId:int, parentId:int, creationDate:chararray,
score:int, viewCount:int, body:chararray, ownerUserId:chararray, lastEditorUserId:int,
lastEditorDisplayName:chararray, lastEditDate:chararray, lastActivityDate:chararray, title:chararray,
tags:chararray, answerCount:int, commentCount:int, favoriteCount:int, closedDate: chararray,
communityOwnedDate:chararray);

comments = LOAD 'stackoverflow/comments/comments.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage()
AS (id:int, postId:int, score:int, text:chararray, creationDate:chararray, userDisplayName:chararray,
userId: int);

joined_post_comments = JOIN posts by id, comments by postId;

post_comments = FOREACH joined_post_comments GENERATE posts::id..posts::communityOwnedDate,
comments::postId..comments::userId;
grouped_comments = GROUP post_comments BY posts::id;
comments_per_post = FOREACH grouped_comments GENERATE group as postId, post_comments.comments::text as comment;
rmf stackoverflow/output/comments_per_post
STORE comments_per_post INTO 'stackoverflow/output/comments_per_post' USING PigStorage();
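For readers less familiar with Pig, the data flow of the script above (inner join of posts and comments, then the comment texts grouped per post) can be approximated in plain Python. This is only a sketch of the logic on in-memory dictionaries; it assumes post ids are unique, and the function name comments_per_post simply mirrors the final Pig alias.

```python
from collections import defaultdict


def comments_per_post(posts, comments):
    """Group comment texts by post id, keeping only comments whose post exists."""
    post_ids = {post["id"] for post in posts}
    grouped = defaultdict(list)
    for comment in comments:
        # Inner-join semantics: comments without a matching post are dropped,
        # and posts without comments never appear in the result.
        if comment["postId"] in post_ids:
            grouped[comment["postId"]].append(comment["text"])
    return dict(grouped)


if __name__ == "__main__":
    posts = [{"id": 1}, {"id": 2}]
    comments = [
        {"postId": 1, "text": "first"},
        {"postId": 1, "text": "second"},
        {"postId": 3, "text": "orphan"},
    ]
    print(comments_per_post(posts, comments))
```

From a lineage point of view, the output directory depends on both input CSV files, which is the kind of relationship a lineage view of this script would be expected to show.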
Lineage (Pig)
Discovery & Lineage Architecture
Model
● Generic (Element, Relations)
● Element
  ○ Unique Identity
  ○ Key-value pairs
  ○ Tags
  (Operation, Operation Execution, FSElement, Table, Column…)
Model (Contd…)
● Relation
  ○ Unique Identity
  ○ Two sets of related elements
  ○ Relationship type
  (Parent Child Relation, Data Flow Relation, Control Flow Relation, Instance Of Relation, Alias Relation, Generic Relation)
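The element/relation model above could be represented roughly as follows. This is a hedged sketch, not Navigator's actual classes: the field names sources and targets, the enum member names other than PARENT_CHILD, and the upstream helper are all assumptions made for illustration. The helper shows how "trace back to the original source" reduces to walking data-flow relations backwards.

```python
from dataclasses import dataclass, field
from enum import Enum


class RelationType(Enum):
    PARENT_CHILD = "PARENT_CHILD"
    DATA_FLOW = "DATA_FLOW"        # assumed spelling
    CONTROL_FLOW = "CONTROL_FLOW"  # assumed spelling
    INSTANCE_OF = "INSTANCE_OF"    # assumed spelling
    ALIAS = "ALIAS"                # assumed spelling
    GENERIC = "GENERIC"            # assumed spelling


@dataclass
class Element:
    identity: str
    properties: dict = field(default_factory=dict)  # key-value pairs
    tags: list = field(default_factory=list)


@dataclass
class Relation:
    identity: str
    type: RelationType
    sources: list  # identities in the first set of related elements
    targets: list  # identities in the second set of related elements


def upstream(element_id, relations):
    """Return identities of every element that feeds element_id via DATA_FLOW."""
    seen = set()
    stack = [element_id]
    while stack:
        current = stack.pop()
        for relation in relations:
            if relation.type is RelationType.DATA_FLOW and current in relation.targets:
                for source in relation.sources:
                    if source not in seen:  # guards against cycles
                        seen.add(source)
                        stack.append(source)
    return seen


if __name__ == "__main__":
    relations = [
        Relation("r1", RelationType.DATA_FLOW,
                 ["crm_accounts", "cluster_metadata", "machine_stats"],
                 ["machine_vendors"]),
    ]
    print(sorted(upstream("machine_vendors", relations)))
```

Impact analysis ("what breaks if I modify this data set") would be the same walk in the opposite direction, from sources to targets.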
Discovery & Lineage (REST API)
● Elements Resource
  ○ curl 'http://localhost:5150/api/v1/elements?query=originalName:job_&limit=100&offset=100'

[{
"identity" : "513bf7add8d5f56b7f0f34769707cb5f",
"originalName" : "job_1389320017591_0024_conf.xml",
"firstClassParentId" : null,
"name" : null,
"description" : null,
"tags" : null,
"properties" : null,
"fileSystemPath" : "/user/history/done/2014/01/31/000000/job_1389320017591_0024_conf.xml",
"category" : "FILE",
"size" : 139211,
"lastModified" : "1969-12-31T23:59:59.999Z",
"lastAccessed" : "2014-02-04T02:12:01.369Z",
"owner" : "root",
"group" : "hadoop",
"blockSize" : null,
"mimeType" : "application/octet-stream",
"replication" : null,
"deleted" : false,
"resType" : "HDFS",
"permission" : 432,
"resId" : "858e5548b4cd3457432eb491ee74729d",
"type" : "fselement"
}, ...]

  ○ curl 'http://localhost:5150/api/v1/elements/f53ae3547a90b7519b44041db1898972'
  ○ curl -X PUT -H "Content-Type: application/json" -d '{"displayName":"test","description":"describe me","tags":[]}' http://localhost:5150/api/v1/elements/e5f94cd59a8ca6df96247ce88b6c9c28
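The curl calls above can also be prepared from a script. The sketch below only constructs the search URL and the PUT request with the Python standard library and does not send anything; the base URL and paths are copied from the slide, while the helper names elements_search_url and element_update_request are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:5150/api/v1"  # host and port as shown on the slide


def elements_search_url(query, limit=100, offset=0, base=BASE):
    """Build the elements search URL; ':' is kept unescaped as in the curl example."""
    params = urlencode({"query": query, "limit": limit, "offset": offset}, safe=":")
    return f"{base}/elements?{params}"


def element_update_request(element_id, fields, base=BASE):
    """Prepare (but do not send) a PUT request that updates an element's metadata."""
    body = json.dumps(fields).encode("utf-8")
    return Request(
        f"{base}/elements/{element_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    print(elements_search_url("originalName:job_", limit=100, offset=100))
    request = element_update_request(
        "e5f94cd59a8ca6df96247ce88b6c9c28",
        {"displayName": "test", "description": "describe me", "tags": []},
    )
    print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would perform the call against a running server.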
Discovery & Lineage (REST API)
● Relations Resource
  curl 'http://localhost:7187/api/v1/relations?elementIds=83f4cdcc37c379144fef22e3dbdf7c8c&types=PARENT_CHILD&depth=2'
  [{
    "identity" : "91540192d3dd727f912b3c0bb91cdd81",
    "type" : "PARENT_CHILD",
    "parent" : [ {
      "elementId" : "83f4cdcc37c379144fef22e3dbdf7c8c", ...
    } ],
    "children" : [ {
      "elementIds" : [ "6144fabee63641275c5577697f16266a" ], ...
    } ],
    "name" : null
  }, ...]

● Interactive Resource
  curl 'http://localhost:7187/api/v1/interactive/elements?query=originalName:test&limit=2'
  {
    "offset" : 0,
    "totalMatched" : 2,
    "limit" : 1,
    "results" : [ {
      "identity" : "9b7b9d95eb06ccf0b1b0cd1a39642889",
      "category" : "DIRECTORY", ...
    } ],
    "facets" : { },
    "qtime" : 10
  }
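The offset, limit and totalMatched fields in the interactive response suggest offset-based paging. Assuming that reading is right (the slide does not spell out the semantics), a client could work out the next offset like this; next_offset is a hypothetical helper operating on the already-parsed JSON.

```python
def next_offset(page):
    """Return the offset for the following page, or None when nothing is left."""
    results = page["results"]
    if not results:
        return None  # an empty page means there is nothing more to fetch
    consumed = page["offset"] + len(results)
    return consumed if consumed < page["totalMatched"] else None


if __name__ == "__main__":
    first_page = {
        "offset": 0,
        "totalMatched": 2,
        "limit": 1,
        "results": [{"identity": "9b7b9d95eb06ccf0b1b0cd1a39642889"}],
    }
    print(next_offset(first_page))
```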
Audit (Supported Systems)
● HDFS
● HBase
● Hive
● Impala
● ...More coming...
Audit Configuration
Audit View
Audit Details
● User
  ○ Username, Impersonator, IP Address
● Operation Information
  ○ Operation Type, Session Id, Query Id, Operation Text, Status, Time
● Object Information
  ○ ServiceName, Path (differs between systems)
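As an illustration of how these audit fields support the "alert when a restricted data set is accessed" capability, the sketch below models an event with a subset of the listed fields and filters events by path prefix. The class, its field names and the prefix-matching rule are assumptions for the example, not Navigator's event schema or alerting logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    # A subset of the fields listed above, with illustrative names.
    username: str
    ip_address: str
    operation_type: str
    status: str
    time: str
    service_name: str
    path: str


def restricted_accesses(events, restricted_prefixes):
    """Return the events whose path is a restricted path or lies beneath one."""
    hits = []
    for event in events:
        for prefix in restricted_prefixes:
            root = prefix.rstrip("/")
            # Compare whole path components so "/a/b" does not match "/a/bc".
            if event.path == root or event.path.startswith(root + "/"):
                hits.append(event)
                break
    return hits


if __name__ == "__main__":
    events = [
        AuditEvent("root", "10.0.0.1", "open", "allowed", "2014-02-04T02:12:01Z",
                   "hdfs", "/user/root/customers/cust_demo"),
        AuditEvent("alice", "10.0.0.2", "open", "allowed", "2014-02-04T02:13:00Z",
                   "hdfs", "/user/root/customers_public/readme"),
    ]
    for event in restricted_accesses(events, ["/user/root/customers"]):
        print(event.username, event.path)
```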
Audit Architecture
[Architecture diagram; surviving label: Log4j Appender]
Cloudera Federal Forum 2014: Tracking Provenance in Hadoop Clusters
