Agile Data Profiling
Sean Kandel
What’s in your data?
Opening Questions
in the Data Lifecycle…
Unboxing What’s in this data?
Can I make use of it?
… Become Persistent Questions
in the Data Lifecycle
What’s in this data?
Can I make use of it?
Unboxing Transformation Analysis Visualization Productization
STRUCTURING CLEANING
ENRICHMENT DISTILLATION
“It’s easy to just think you know what you
are doing and not look at data at every
intermediary step.
An analysis has 30 different steps. It’s
tempting to just do this, then that, and then
this. You have no idea in which ways you
are wrong and what data is wrong.”
What’s in the data?
• The Expected: Models, Densities, Constraints
• The Unexpected: Residuals, Outliers, Anomalies
Average Movie Ratings
Expected
Unexpected
Overview of all variables
Show relevant perspectives
What to compute?
• Densities and descriptive statistics
• Identify anomalies and outliers
How often to compute it?
Unboxing Transformation Analysis Visualization Productization
Challenge: Agility
• Profiling throughout the lifecycle
• Particularly important as you manipulate data
Design Space and Tradeoffs
Mapping out the Design Space
How much data to examine?
How accurate are the results?
How fast can you get them?
Decide how your requirements fall on these axes
Find a strategy (if one exists) that fits the requirements
[Figure: design-space axes: Accuracy, Urgency, Data Volume]
Strategy vs Cost
• Head of file
• Random Sample
• Scan, summarize, collect samples
[Figure: one panel per strategy; axes Accuracy, Urgency, Data Volume; stage labels Good Enough, Anomalies, Big Picture, Unbox]
Far better an approximate answer
to the right question, which is often
vague, than the exact answer to
the wrong question, which can
always be made precise.
Data Analysis & Statistics, Tukey & Wilk 1966
Technical Methods
Sanity Check: Is this really expensive?
• Computers are fast
• In-memory, column stores, OLAP, …
• Still, “Big Data” can be hard
• Big is sometimes really big
• Big data can be raw: no indexes or precomputed summaries
• Agility remains critical to harness the “informed human mind”
Two Useful Techniques
Sampling
• A variety of techniques available
Sketches
• One-pass memory-efficient structures for capturing distributions
[Figure: axes Accuracy, Urgency, Data Volume]
Technique I: Sampling
Approaches to Sampling
• Scan-based access
• Head-of-file
• Bernoulli
• Reservoir
• Random I/O Sampling
• Block-level sampling
Head-of-File
• Pros:
• Very fast: small data, no disk seeks
• Absolutely required when unboxing raw data
• Nested data (JSON/XML), Text (logs, database dumps, etc.)
• Cons:
• Correlation of position and value
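A minimal sketch of head-of-file sampling, assuming the raw data arrives as an iterable of lines. The function name `head_of_file` and the in-memory log are hypothetical, used only for illustration.

```python
import io
from itertools import islice

def head_of_file(lines, n=1000):
    """Return the first n records without reading the rest of the input."""
    return list(islice(lines, n))

# Hypothetical date-ordered log standing in for a raw file on disk.
raw = io.StringIO("".join(f"2015-01-{d:02d},event{d}\n" for d in range(1, 32)))
sample = head_of_file(raw, n=5)
print(sample)
```

Because this log is ordered by date, the sample only ever sees the first five days, which is the position/value correlation listed under Cons.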
Bernoulli
• Take a full pass, flip a (weighted) coin for each record
• Pros:
• trivial to implement
• trivial to parallelize
• almost no memory required
• Cons:
• requires a full scan of the data
• output size proportional to input size, and random
from random import random
sample = [x for x in data if random() < 0.01]  # keep each record with probability 1%
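The same idea as a streaming generator, as a rough sketch. The name `bernoulli_sample` and the `seed` parameter are assumptions added here so the run is repeatable; they are not from the slides.

```python
import random

def bernoulli_sample(records, p, seed=None):
    """Yield each record independently with probability p (one pass, O(1) memory)."""
    rng = random.Random(seed)
    for rec in records:
        if rng.random() < p:
            yield rec

sample = list(bernoulli_sample(range(100_000), p=0.01, seed=42))
print(len(sample))  # close to 1,000, but the exact size is random
```

The output size is only known after the scan, which is the "proportional to input size, and random" drawback above.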
Reservoir
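• Fix a “reservoir” of k items. For each later item, with probability k/n (n = number of items seen so far, including the new one) eject a random old item for the new one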
• Pros:
• trivial to implement
• easy to parallelize
• constant memory required
• fixed-size output — need not know input size in advance
• Cons:
• Requires a full scan of the data
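from random import random, randint

res = list(data[0:k])  # initialize: first k items
counter = k  # number of items seen so far
for x in data[k:]:
    # keep the new item with probability k / (items seen, including x)
    if random() < k / float(counter + 1):
        res[randint(0, len(res) - 1)] = x  # eject a random old item
    counter += 1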
Meta-Strategy: Stratified Sampling
• Sometimes you need representative samples from each “group”
• Coverage: e.g., displaying examples for every state in a map
• Robustness: e.g., consider average income
• if you miss the rare top tax bracket, estimate is way off
Stratification: the GroupBy / Agg pattern
• Given:
• A group-partitioning key for stratification
• Sizes for each stratum
• Easy to implement: partition, and construct sample per partition
• your favorite sampling technique applies
SELECT D.group_key, reservoir(D.value)
FROM data D
GROUP BY D.group_key;
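The GroupBy / Agg pattern above can be sketched in one pass by keeping a separate reservoir per group. This is an illustrative sketch only: `stratified_reservoir`, its `key` callback and the single per-stratum size `k` are assumptions, and the slides allow a different size for each stratum.

```python
import random
from collections import defaultdict

def stratified_reservoir(rows, key, k, seed=None):
    """Keep a uniform sample of up to k rows for every group, in one pass."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)
    seen = defaultdict(int)          # rows seen so far, per group
    for row in rows:
        g = key(row)
        seen[g] += 1
        res = reservoirs[g]
        if len(res) < k:
            res.append(row)          # still filling this group's reservoir
        elif rng.random() < k / seen[g]:
            res[rng.randrange(k)] = row   # eject a random old row
    return dict(reservoirs)

# Hypothetical data: one large group and one rare group.
rows = [("CA", i) for i in range(1000)] + [("WY", i) for i in range(3)]
samples = stratified_reservoir(rows, key=lambda r: r[0], k=5, seed=1)
print({g: len(s) for g, s in samples.items()})
```

The rare group is always represented, which is the coverage and robustness argument for stratifying.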
Record Sampling
• Randomly sample records?
• r: the fraction of items sampled; p: the number of rows per block
• 20x random I/O penalty => read fewer than 5% of blocks!
• Pretty inefficient: touches an expected fraction 1-(1-r)^p of the blocks
[Figure: Record Sampling. Expected % of blocks touched vs. % items sampled, 1-(1-r)^p with p = 100]
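A quick worked calculation of the expression 1-(1-r)^p, assuming rows are sampled independently. The helper name `blocks_touched` is hypothetical.

```python
def blocks_touched(r, p):
    """Expected fraction of blocks holding at least one sampled row."""
    return 1 - (1 - r) ** p

for r in (0.001, 0.01, 0.05):
    frac = blocks_touched(r, p=100)
    print(f"sample {r:.1%} of rows -> touch {frac:.1%} of blocks")
```

With p = 100, sampling just 1% of the rows already touches roughly 63% of the blocks, far above the 5% budget implied by the 20x random I/O penalty.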
Block Sampling
• Randomly sample blocks of records from disk
• Concern: clustering bias.
• Techniques from database literature: assess bias and correct
• Beware: even block sampling needs to be well below 5%.
Sampling in Hadoop
• Larger unit of access: HDFS blocks (128MB vs. 64KB)
• HDFS buffering makes forward seeking within block cheaper
• But CPU costs may encourage sampling within the block.
• …and Hadoop makes it easy to sample across nodes
• Each worker only processes one block
• Must find record boundaries
• Tougher when dealing with quote escaping
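One common way to find record boundaries after landing at an arbitrary byte offset is to discard the partial record and read the next full one. The sketch below assumes newline-delimited records with no quoted or escaped newlines (the tougher case noted above) and a file with at least two lines. The function name is hypothetical, and this is plain Python rather than Hadoop code.

```python
import os
import random
import tempfile

def sample_lines_by_seek(path, n, seed=None):
    """Sample n lines by seeking to random offsets and realigning to the next newline."""
    rng = random.Random(seed)
    size = os.path.getsize(path)
    out = []
    with open(path, "rb") as f:
        while len(out) < n:
            f.seek(rng.randrange(size))
            f.readline()          # discard the (probably partial) record we landed in
            line = f.readline()   # the next complete record
            if line:              # empty at end of file: retry with a new offset
                out.append(line.decode("utf-8").rstrip("\r\n"))
    return out

# Hypothetical CSV written to a temporary file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.writelines(f"{i},value{i}\n" for i in range(1000))
lines = sample_lines_by_seek(tmp.name, n=5, seed=0)
os.remove(tmp.name)
print(lines)
```

This is not a uniform sample: the first line is never returned, and a line that follows a long line is more likely to be picked.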
Technique II: Sketching
Sketching
• Family of algorithms for estimating contents of a data stream
• Constant-sized memory footprint
• Computed in 1 pass over the data
• Classic Examples
• Bloom filter: existence testing
• HyperLogLog Sketches (FM): distinct values
• CountMin (CM): a surprisingly versatile sketch for frequencies
CountMin Sketch: Initialization
[Figure: Count-Min Sketch, a table of counters with d rows (one per hash function) and w hash buckets per row, all initialized to 0]
CountMin Sketch: Insertion
[Figure: Count-Min Sketch, d hash functions × w hash buckets]
• Insert(7): each hash function h1, h2, …, hd maps 7 to one bucket in its row; each of those counters is incremented to 1
• Insert(4): the counters at h1(4), h2(4), …, hd(4) are incremented; where 4 collides with 7 in a row, the shared counter becomes 2
CountMin Sketch: Query
[Figure: Count-Min Sketch, d hash functions × w hash buckets]
• Count(7)? Look up the counters at h1(7), h2(7), …, hd(7): here 1, 2, 1
• Count(7) = min of those counters = 1
CountMin Sketch: Theorem & Tuning
Cormode/Muthukrishnan, J. Algorithms 55(1) (2005).
[Figure: theorem statement (error bound) for the Count-Min Sketch with d hash functions and w hash buckets]
• The returned count is an over-estimate
w controls expected error amount
d controls probability of error
Suppose we want:
0.1% error, 99.9% probability.
w = 2000
d = 10
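A compact Count-Min Sketch, as a sketch under stated assumptions. The sizing rule w = ⌈2/ε⌉, d = ⌈log2(1/δ)⌉ is an assumption chosen because it reproduces the numbers above (ε = δ = 0.001 gives w = 2000, d = 10). The salted use of Python's built-in `hash` stands in for a proper pairwise-independent hash family.

```python
import math
import random

class CountMinSketch:
    def __init__(self, epsilon, delta, seed=0):
        self.w = math.ceil(2 / epsilon)            # hash buckets per row
        self.d = math.ceil(math.log2(1 / delta))   # number of hash functions (rows)
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(self.d)]
        self.table = [[0] * self.w for _ in range(self.d)]

    def _bucket(self, row, item):
        # Illustrative hash: salt the item per row, then reduce to a bucket index.
        return hash((self.salts[row], item)) % self.w

    def insert(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._bucket(row, item)] += count

    def count(self, item):
        # Collisions only ever add to a counter, so the minimum is the tightest estimate.
        return min(self.table[row][self._bucket(row, item)] for row in range(self.d))

sketch = CountMinSketch(epsilon=0.001, delta=0.001)
for x in [7, 4, 7, 7, 4]:
    sketch.insert(x)
print(sketch.w, sketch.d, sketch.count(7), sketch.count(4))
```

Memory is w × d counters regardless of how many items are inserted.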
CountMeanMin Sketch
[Figure: Count-Mean-Min Sketch, d hash functions × w hash buckets]
• Idea: subtract out the expected overage, i.e. the mean of the other cells
• In each row: counter minus the mean of the other cells
• Count(7) = median of these per-row values
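The Count-Mean-Min query can be sketched as follows. This is an illustrative reading of the slide (per-row counter minus the mean of the other cells in that row, then the median across rows). The class name, the sizes and the salted built-in `hash` are assumptions.

```python
import random
import statistics

class CountMeanMinSketch:
    def __init__(self, w, d, seed=0):
        self.w, self.d, self.n = w, d, 0   # n = total count inserted
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(d)]
        self.table = [[0] * w for _ in range(d)]

    def _bucket(self, row, item):
        return hash((self.salts[row], item)) % self.w

    def insert(self, item, count=1):
        self.n += count
        for row in range(self.d):
            self.table[row][self._bucket(row, item)] += count

    def count(self, item):
        estimates = []
        for row in range(self.d):
            cell = self.table[row][self._bucket(row, item)]
            noise = (self.n - cell) / (self.w - 1)   # mean of the other cells in this row
            estimates.append(cell - noise)
        return statistics.median(estimates)

cmm = CountMeanMinSketch(w=50, d=5)
for x in range(1000):
    cmm.insert(x)          # background noise: every value once
for _ in range(100):
    cmm.insert(7)          # true count of 7 is 101
estimate = cmm.count(7)
print(estimate)
```

Unlike the plain minimum, this estimate can fall below the true count, so it trades the one-sided guarantee for less bias on heavily loaded sketches.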
CountMin (and CountMeanMin) answer “point frequency queries”.
Surprisingly, we can use them to answer many more questions
• densities
• even order statistics (median, quantiles, etc.)
The Versatile CountMin Sketch
More Statistics
• Count-Range Queries
• Median
• Quantiles
• Histograms
CountMin: Point Queries
• Count(x=13)
CountMin(⌊x/2⌋): Pair Queries
• Count(x ∊ [14-15])
CountMin(⌊x/4⌋): Quartet Queries
• Count(x ∊ [16-19])
Dyadic CountMin: log2 CountMins
• One CountMin each for x, ⌊x/2⌋, ⌊x/4⌋, ⌊x/8⌋, ⌊x/16⌋
• Maintain all of these, and answer arbitrary range queries, e.g. Count(x ∊ [13-24])
[Figure: number line 0-31 with the queried range highlighted]
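A sketch of how a range query decomposes into dyadic blocks. To keep the example short and exact, each level uses a `collections.Counter` as a stand-in for a CountMin sketch; that substitution, the function names and the fixed domain 0-31 are assumptions.

```python
from collections import Counter

LEVELS = 5  # x, x/2, x/4, x/8, x/16 (block sizes 1, 2, 4, 8, 16)

def build(values):
    """One counter per level; level j counts x // 2**j."""
    return [Counter(x >> j for x in values) for j in range(LEVELS)]

def dyadic_cover(lo, hi):
    """Split the inclusive range [lo, hi] into maximal aligned blocks (level, index)."""
    blocks = []
    while lo <= hi:
        j = 0
        # Grow the block while it stays aligned at lo and inside [lo, hi].
        while j + 1 < LEVELS and lo % (1 << (j + 1)) == 0 and lo + (1 << (j + 1)) - 1 <= hi:
            j += 1
        blocks.append((j, lo >> j))
        lo += 1 << j
    return blocks

def range_count(levels, lo, hi):
    return sum(levels[j][idx] for j, idx in dyadic_cover(lo, hi))

values = list(range(32)) * 2 + [13, 20]
levels = build(values)
print(dyadic_cover(13, 24))        # point 13, pair 14-15, octet 16-23, point 24
print(range_count(levels, 13, 24))
```

Any range needs only a few lookups per level, so the number of point queries grows with the number of levels rather than with the width of the range.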
Median
Via binary search: find the smallest x whose range count Count(0..x) reaches N/2.
(Suppose we have N elements, and the real median is 14)
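The binary search can be sketched as below. `prefix_count(x)` stands for the range query Count(0..x); here it is computed exactly from a small hypothetical list, where a dyadic CountMin would supply an approximate value. The `q` parameter (an assumption) extends the median to other quantiles.

```python
def approx_quantile(prefix_count, n, q=0.5, lo=0, hi=31):
    """Smallest x in [lo, hi] with prefix_count(x) >= q * n."""
    target = q * n
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix_count(mid) >= target:
            hi = mid          # enough mass at or below mid: look left
        else:
            lo = mid + 1      # not enough mass yet: look right
    return lo

values = [3, 9, 14, 14, 14, 20, 27]   # hypothetical data; the real median is 14

def prefix_count(x):
    """Exact stand-in for the range query Count(0..x)."""
    return sum(1 for v in values if v <= x)

median = approx_quantile(prefix_count, len(values))
lower_quartile = approx_quantile(prefix_count, len(values), q=0.25)
print(median, lower_quartile)
```

Each step costs one range query, so the search needs about log2 of the domain size queries in total.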
More Statistics
• Count-Range Queries
• Median
• Quantiles: generalization of Median
• Histograms:
• fixed-width bins: range queries
• fixed-height bins: quantiles
[Figure: histogram with bins 1-10, 11-20, 21-30, 31-40]
Putting It Together
Wrangling Revisited
[Figure: strategies placed against the stages Good Enough, Anomalies, Big Picture, Unbox]
• Head-of-file
• Bernoulli
• Block
• Reservoir
• Sketching
• Stratified
Summary
• ABP: Always Be Profiling
• Trade off latency and accuracy
• Approximation methods
• Heuristics and reasonable assumptions
Acknowledgments
Adam Silberstein, Joe Hellerstein

Sean Kandel - Data profiling: Assessing the overall content and quality of a data set