The task of “data profiling”—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization. However, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used. In this talk Sean will cover technical challenges in keeping data profiling agile in the Big Data era. He will discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing and sketching data, and the pros and cons of using these various approaches.
Sean is Trifacta’s Chief Technical Officer. He completed his Ph.D. at Stanford University, where his research focused on user interfaces for database systems. At Stanford, Sean led development of new tools for data transformation and discovery, such as Data Wrangler. He previously worked as a data analyst at Citadel Investment Group.
4. … Become Persistent Questions in the Data Lifecycle
What’s in this data?
Can I make use of it?
Unboxing → Transformation → Analysis → Visualization → Productization
7. “It’s easy to just think you know what you are doing and not look at the data at every intermediary step.
An analysis has 30 different steps. It’s tempting to just do this, then that, and then this. You have no idea in which ways you are wrong and what data is wrong.”
8. What’s in the data?
• The Expected: Models, Densities, Constraints
• The Unexpected: Residuals, Outliers, Anomalies
21. Mapping out the Design Space
How much data to examine?
How accurate are the results?
How fast can you get them?
22. Mapping out the Design Space
Decide how your requirements fall on these axes
Find a strategy (if one exists) that fits the requirements
Axes: Accuracy, Urgency, Data Volume
24. Strategy vs Cost
Random Sample
[Chart: Accuracy / Urgency / Data Volume axes; use cases: Good Enough, Anomalies, Big Picture, Unbox]
25. Strategy vs Cost
Scan, summarize, collect samples
[Chart: Accuracy / Urgency / Data Volume axes; use cases: Good Enough, Anomalies, Big Picture, Unbox]
26. “Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.”
Data Analysis & Statistics, Tukey & Wilk 1966
28. Sanity Check: Is this really expensive?
• Computers are fast
• In-memory, column stores, OLAP, …
• Still, “Big Data” can be hard
• Big is sometimes really big
• Big data can be raw: no indexes or precomputed summaries
• Agility remains critical to harness the “informed human mind”
29. Two Useful Techniques
Sampling
• A variety of techniques available
Sketches
• One-pass memory-efficient structures for capturing distributions
[Chart: Accuracy / Urgency / Data Volume axes]
31. Approaches to Sampling
• Scan-based access
• Head-of-file
• Bernoulli
• Reservoir
• Random I/O Sampling
• Block-level sampling
32. Head-of-File
• Pros:
• Very fast: small data, no disk seeks
• Absolutely required when unboxing raw data
• Nested data (JSON/XML), Text (logs, database dumps, etc.)
• Cons:
• Correlation between record position and value: the head of the file may not be representative (see the sketch below)
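A minimal head-of-file sketch in Python (the path, newline-delimited format, and sample size n are illustrative assumptions, not from the talk): just read the first n records, which is cheap but inherits the position/value bias noted above.

from itertools import islice

def head_sample(path, n=1000):
    # Read only the first n lines: no seeks, tiny memory, but biased if
    # record position correlates with value (e.g., time-ordered logs).
    with open(path, "r", errors="replace") as f:
        return list(islice(f, n))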
33. Bernoulli
• Take a full pass, flip a (weighted) coin for each record
• Pros:
• trivial to implement
• trivial to parallelize
• almost no memory required
• Cons:
• requires a full scan of the data
• output size proportional to input size, and random
from random import random
sample = list(filter(lambda x: random() < 0.01, data))  # keep each record with probability 1%
34. Reservoir
• Fix a “reservoir” of k items. For each new item n, with probability k/n eject a random old item for the new one
• Pros:
• trivial to implement
• easy to parallelize
• constant memory required
• fixed-size output — need not know input size in advance
• Cons:
• Requires a full scan of the data
from random import random, randint

res = data[0:k]                         # initialize: first k items fill the reservoir
counter = k
for x in data[k:]:
    if random() < k / float(counter + 1):
        res[randint(0, len(res) - 1)] = x   # eject a random old item for the new one
    counter += 1
37. Meta-Strategy: Stratified Sampling
• Sometimes you need representative samples from each “group”
• Coverage: e.g., displaying examples for every state in a map
• Robustness: e.g., consider average income
• if you miss the rare top tax bracket, the estimate is way off
38. Stratification: the GroupBy / Agg pattern
• Given:
• A group-partitioning key for stratification
• Sizes for each stratum
• Easy to implement: partition, and construct sample per partition
• your favorite sampling technique applies
SELECT D.group_key, reservoir(D.value)
FROM data D
GROUP BY D.group_key;
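For comparison, a Python sketch of the same pattern (the key accessor and per-stratum size k are illustrative assumptions): partition by the group key and keep one reservoir per stratum.

from collections import defaultdict
from random import random, randint

def stratified_reservoir(records, key, k=100):
    reservoirs = defaultdict(list)   # stratum -> its reservoir
    seen = defaultdict(int)          # stratum -> items seen so far
    for rec in records:
        g = key(rec)
        seen[g] += 1
        if len(reservoirs[g]) < k:
            reservoirs[g].append(rec)                # reservoir not yet full
        elif random() < k / float(seen[g]):
            reservoirs[g][randint(0, k - 1)] = rec   # eject a random old item
    return reservoirs

# e.g., per-state samples for a map display:
# samples = stratified_reservoir(rows, key=lambda r: r["state"], k=50)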
40. Record Sampling
• Randomly sample records?
• Let r = fraction of items sampled; p = #rows per block
• 20x random I/O penalty => read fewer than 5% of blocks!
• Pretty inefficient: touches a 1 - (1 - r)^p fraction of blocks (see the quick check below)
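A quick back-of-the-envelope check of that formula (the values of r and p are illustrative):

r, p = 0.01, 100                 # sample 1% of rows; 100 rows per block
touched = 1 - (1 - r) ** p       # probability a given block holds at least one sampled row
print(f"{touched:.0%} of blocks touched")   # ~63%, far above the ~5% break-even point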
42. Block Sampling
• Randomly sample blocks of records from disk
• Concern: clustering bias.
• Techniques from database literature: assess bias and correct
• Beware: even block sampling needs to be well below 5%.
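A minimal block-sampling sketch (assumptions: a plain newline-delimited file, fixed-size byte blocks, and no correction for clustering bias; none of this comes from the talk):

import os
from random import sample

def sample_blocks(path, block_size=64 * 1024, fraction=0.01):
    n_blocks = max(1, os.path.getsize(path) // block_size)
    chosen = sample(range(n_blocks), max(1, int(fraction * n_blocks)))
    records = []
    with open(path, "rb") as f:
        for b in sorted(chosen):                 # sorted order keeps seeks mostly forward
            f.seek(b * block_size)
            lines = f.read(block_size).split(b"\n")
            records.extend(lines[1:-1])          # drop partial records at block edges
    return records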
43. Sampling in Hadoop
• Larger unit of access: HDFS blocks (128MB vs. 64KB)
• HDFS buffering makes forward seeking within block cheaper
• But CPU costs may encourage sampling within the block.
• …and Hadoop makes it easy to sample across nodes
• Each worker only processes one block
• Must find record boundaries
• Tougher when dealing with quote escaping
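One common way to sample within blocks in practice is a Hadoop Streaming mapper that emits a Bernoulli sample of the records it sees; a minimal sketch, assuming newline-delimited input and an illustrative 1% rate:

#!/usr/bin/env python
import sys
from random import random

SAMPLE_RATE = 0.01   # illustrative rate

# Each mapper sees roughly one HDFS block's worth of records on stdin;
# emit each record with probability SAMPLE_RATE.
for line in sys.stdin:
    if random() < SAMPLE_RATE:
        sys.stdout.write(line)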
45. Sketching
• Family of algorithms for estimating contents of a data stream
• Constant-sized memory footprint
• Computed in 1 pass over the data
• Classic Examples
• Bloom filter: existence testing
• HyperLogLog sketches (Flajolet–Martin family): distinct-value counts
• CountMin (CM): a surprisingly versatile sketch for frequencies
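To make the “constant memory, one pass” point concrete, here is a toy Bloom filter for existence testing (the hash construction and sizes are illustrative, not the talk’s):

from hashlib import md5

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m          # constant-size bit array

    def _hashes(self, item):
        for i in range(self.k):          # k pseudo-independent hash positions
            yield int(md5(f"{i}:{item}".encode()).hexdigest(), 16) % self.m

    def add(self, item):
        for h in self._hashes(item):
            self.bits[h] = True

    def might_contain(self, item):
        # Can return false positives, never false negatives
        return all(self.bits[h] for h in self._hashes(item))

bf = BloomFilter()
bf.add("user_42")
print(bf.might_contain("user_42"))   # True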
62. The Versatile CountMin Sketch
CountMin (and CountMeanMin) answer “point frequency queries”.
Surprisingly, we can use them to answer many more questions:
• densities
• even order statistics (median, quantiles, etc.)
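A compact CountMin sketch for point-frequency queries (width, depth, and hash choice are illustrative assumptions):

from hashlib import md5

class CountMin:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]   # depth x width counters

    def _hash(self, item, row):
        return int(md5(f"{row}:{item}".encode()).hexdigest(), 16) % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # Collisions only inflate counts, so the minimum over rows is the
        # least-contaminated (and still never under-) estimate.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

cm = CountMin()
for word in ["a", "b", "a", "c", "a"]:
    cm.add(word)
print(cm.estimate("a"))   # >= 3, almost surely exactly 3 here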
75. More Statistics
• Count-Range Queries
• Median
• Quantiles: generalization of Median
• Histograms
76. More Statistics
• Count-Range Queries
• Median
• Quantiles
• Histograms:
• fixed-width bins: range queries
• fixed-height bins: quantiles
[Figure: fixed-width bins 1-10, 11-20, 21-30, 31-40]
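A small sketch contrasting the two histogram flavors on an in-memory sample (bin edges, data values, and function names are illustrative): fixed-width bins are just a batch of range-count queries, while fixed-height bins put their edges at quantiles.

from bisect import bisect_right

def fixed_width_histogram(values, edges):
    # One count per range [edges[i], edges[i+1]), i.e., a set of range queries
    counts = [0] * (len(edges) - 1)
    for v in values:
        i = bisect_right(edges, v) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

def fixed_height_histogram(values, n_bins):
    # Bin edges at quantiles, so each bin holds roughly the same number of values
    s = sorted(values)
    step = len(s) / n_bins
    return [s[int(i * step)] for i in range(n_bins)] + [s[-1]]

data = [3, 7, 12, 14, 18, 22, 25, 29, 31, 38]
print(fixed_width_histogram(data, [1, 11, 21, 31, 41]))   # bins 1-10, 11-20, 21-30, 31-40
print(fixed_height_histogram(data, 4))                    # quantile-based edges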