Workshop with Joe Caserta, President of Caserta Concepts, at Data Summit 2015 in NYC.
Data science, the ability to sift through massive amounts of data to discover hidden patterns and predict future trends and actions, may be considered the "sexiest" job of the 21st century, but it requires an understanding of many elements of data analytics. This workshop introduced basic concepts, such as SQL and NoSQL, MapReduce, Hadoop, data mining, machine learning, and data visualization.
For notes and exercises from this workshop, see https://github.com/Caserta-Concepts/ds-workshop.
For more information, visit our website at www.casertaconcepts.com.
2. @joe_Caserta #DataSummit https://github.com/Caserta-Concepts/ds-workshop
Caserta Timeline
Years marked on the timeline: 1986, 1996, 2001, 2004, 2009, 2012, 2013, 2014, 2015
• Launched Big Data practice
• Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley)
• Data Analysis, Data Warehousing and Business Intelligence since 1996
• Began consulting in database programming and data modeling
• 25+ years hands-on experience building database solutions
• Founded Caserta Concepts in NYC
• Web log analytics solution published in Intelligent Enterprise
• Launched Data Science, Data Interaction and Cloud practices
• Laser focus on extending Data Analytics with Big Data solutions
• Dedicated to Data Governance Techniques on Big Data (Innovation)
• Top 20 Big Data Consulting - CIO Review
• Top 20 Most Powerful Big Data consulting firms
• Launched Big Data Warehousing (BDW) Meetup NYC: 2,000+ Members
• 2015: Awarded for getting data out of SAP for data analytics
• Established best practices for big data ecosystem implementations
About Caserta Concepts
• Technology services company with expertise in data analysis:
  • Big Data Solutions
  • Data Warehousing
  • Business Intelligence
  • Strategy and Implementation
  • Data Science & Analytics
  • Data on the Cloud
  • Data Interaction & Visualization
• Core focus in the following industries:
  • eCommerce / Retail / Marketing
  • Financial Services / Insurance
  • Healthcare / Ad Tech / Higher Ed
• Established in 2001:
  • Increased growth year-over-year
  • Industry-recognized workforce
Agenda
• Why we care about Big Data
• Challenges of working with Big Data
• Governing Big Data for Data Science
• Introducing the Data Pyramid
• Why is Data Science cool?
• What does a Data Scientist do?
• Standards for Data Science
  • Business Objective
  • Data Discovery
  • Preparation
  • Models
  • Evaluation
  • Deployment
• Q & A
Hands-on exercises and breaks
The Challenges Building a Data Lake
Volume
• Data volume is higher so the process must be more reliant on programmatic administration
• Less people/process dependence
Variety
• Wider breadth of datasets and sources in scope requires larger data governance support
• Data governance cannot start at the data warehouse
Velocity
• Data is coming in so fast, how do we monitor it?
• Real real-time analytics
Veracity
• What does “complete” mean?
• Dealing with sparse, incomplete, volatile, and highly manufactured data. How do you certify sentiment analysis?
What’s Old is New Again
Before Data Warehousing Governance
Users trying to produce reports from raw source data
No Data Conformance
No Master Data Management
No Data Quality processes
No Trust: Two analysts were almost guaranteed to come up
with two different sets of numbers!
Before Data Lake Governance
We can put “anything” in Hadoop
We can analyze anything
We’re scientists, we don’t need IT, we make the rules
Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or
data governance will create a mess
Rule #2: Information harvested from an ungoverned system will take us back to
the old days: No Trust = Not Actionable
Making it Right
The promise is an “agile” data culture where communities of users are encouraged
to explore new datasets in new ways
New tools
External data
Data blending
Decentralization
With all the V’s, data scientists, new tools, new data we must rely LESS on HUMANS
We need more systemic administration
We need systems, tools to help with big data governance
This space is EXTREMELY immature!
Steps towards Data Governance for the Data Lake
1. Establish difference between traditional data and big data governance
2. Establish basic rules for where new data governance can be applied
3. Establish processes for graduating the products of data science to
governance
4. Establish a set of tools to make governing Big Data feasible
Data Governance for the Data Lake
BDG provides vision, oversight and accountability for leveraging
corporate information assets to create competitive advantage,
and accelerate the vision of integrated delivery.
[Diagram: governance framework. Labels: Process Architecture, Communication, Organization, IFP, Governance, Administration, Compliance, Reporting, Standards, Value Proposition, Risk/Reward, Information Accountabilities, Stewardship, Architecture, Enterprise Data Council, Data Integrity, Metrics, Control Mechanisms, Principles and Standards, Information Usability, Communication, Definitions]
[Diagram: governance roles. Groups: Enterprise Data Council, Governance Committees, Data Stewards, Project Teams. Responsibilities shown: Executive Oversight; Prioritizes work; Drives change; Accountable for results; Value Creation; Acts on Requirements; Build Capabilities; Does the Work; Responsible for adherence]
Components of Data Governance
• Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
• Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: Identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
• Business Process Integration: Policies around data frequency, source availability, etc.
• Master Data Management: Ensure consistent business critical data, e.g. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
For Big Data:
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home-grown, Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home-grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are more uncommon (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, core component of business operations
Data Lake Governance Realities
Full data governance can only be applied to “Structured” data
The data must have a known and well documented schema
This can include materialized endpoints such as files or tables OR
projections such as a Hive table
Governed structured data must have:
• A known schema with Metadata
• A known and certified lineage
• A monitored, quality-tested, managed process for ingestion and
transformation
• A governed usage: data isn’t just for enterprise BI tools anymore
We talk about unstructured data in Hadoop, but more often it’s
semi-structured/structured with a definable schema.
Even in the case of unstructured data, structure must be
extracted/applied in just about every case imaginable before analysis
can be performed.
The Data Scientists Can Help!
Data Science to Big Data Warehouse mapping
Full Data Governance Requirements
Provide full process lineage
Data certification process by data stewards and business owners
Ongoing Data Quality monitoring that includes Quality Checks
Provide requirements for Data Lake
Proper metadata established:
Catalog
Data Definitions
Lineage
Quality monitoring
Know and validate data
completeness
The Big Data Analytics Pyramid
Tiers, from top to bottom:
• Big Data Warehouse: user community arbitrary queries and reporting.
  Governance: Fully Data Governed (trusted)
• Data Science Workspace: agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts.
  Governance: Metadata Catalog; ILM (who has access, how long do we “manage it”); Data Quality and Monitoring (monitoring of completeness of data)
• Data Lake – Integrated Sandbox: data is ready to be turned into information: organized, well defined, complete.
  Governance: Metadata Catalog; ILM (who has access, how long do we “manage it”); Data Quality and Monitoring (monitoring of completeness of data)
• Landing Area – Source Data in “Full Fidelity”: raw machine data collection, collect everything.
  Governance: Metadata Catalog; ILM (who has access, how long do we “manage it”)
Hadoop has different governance demands at each tier.
Only the top tier of the pyramid is fully governed.
We refer to this as the Trusted tier of the Big Data Warehouse.
What does a Data Scientist Do, Anyway?
Writes really cool and sophisticated algorithms that impact the way the
business runs. NOT!
Much of the time of a Data Scientist is spent:
• Searching for the data they need
• Making sense of the data
• Figuring out why the data looks the way it does and assessing its validity
• Cleaning up all the garbage within the data so it represents the true business
• Combining events with reference data to give them context
• Correlating event data with other events
• Finally, they write algorithms to perform mining, clustering and
predictive analytics
Are there Standards?
CRISP-DM: Cross Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
• ETL
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
1. Business Understanding
In this initial phase of the project we will need to speak to
humans.
• It would be premature to jump into the data, or begin
selection of the appropriate model(s) or algorithm
• Understand the project objective
• Review the business requirements
• The output of this phase will be conversion of business
requirements into a preliminary technical design (decision
model) and plan.
Since this is an iterative process, this phase will be revisited
throughout the entire process.
2. Data Understanding
• Data Discovery: understand where the data you
need comes from
• Data Profiling: interrogate the data at the entity
level, understand key entities and fields that are
relevant to the analysis.
• Cleansing Requirements: understand data
quality, data density, skew, etc.
• Data Munging: collocate, blend and analyze data
for early insights! Valuable information can be
gained from simple group-by and aggregate queries,
and even more with SQL Jujitsu!
Significant iteration between Business Understanding
and Data Understanding phases.
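As a minimal sketch of the kind of simple group-by, aggregate query mentioned above, the following uses Python's built-in sqlite3 module. The table name, column names and sample rows are illustrative assumptions, not workshop data.

```python
import sqlite3

# Hypothetical sample data: (user_id, item_id, rating); None marks a missing rating.
rows = [
    (1, 7, 5.0),
    (1, 100, 4.0),
    (2, 7, 3.0),
    (2, 50, None),
    (3, 7, 4.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user_id INTEGER, item_id INTEGER, rating REAL)")
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)

# Profile each item: total rows, rows with a rating (density), and average rating.
profile = conn.execute(
    """
    SELECT item_id,
           COUNT(*)      AS n_rows,
           COUNT(rating) AS n_rated,
           AVG(rating)   AS avg_rating
    FROM ratings
    GROUP BY item_id
    ORDER BY n_rows DESC, item_id
    """
).fetchall()

for item_id, n_rows, n_rated, avg_rating in profile:
    print(item_id, n_rows, n_rated, avg_rating)
```

Comparing `n_rows` with `n_rated` per group is one quick way to spot sparse or incomplete fields before any modeling starts.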
3. Data Preparation
ETL (Extract Transform Load)
90+% of a Data Scientist’s time goes into Data
Preparation!
• Select required entities/fields
• Address Data Quality issues: missing or incomplete
values, whitespace, bad data-points
• Join/Enrich disparate datasets
• Transform/Aggregate data for intended use:
• Sample
• Aggregate
• Pivot
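The preparation steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the raw rows, the 1 to 5 rating range and the `titles` lookup are hypothetical, and a real pipeline would normally run in an ETL tool or a distributed engine.

```python
from collections import defaultdict

# Hypothetical raw event rows with typical quality problems.
raw_events = [
    {"user_id": " 572 ", "item_id": "7", "rating": "5"},
    {"user_id": "572", "item_id": "100", "rating": ""},    # missing rating
    {"user_id": "573", "item_id": "7", "rating": "3"},
    {"user_id": "573", "item_id": "50", "rating": "99"},   # outside the assumed 1-5 range
]
# Hypothetical reference data used to enrich the events.
titles = {7: "Twelve Monkeys", 50: "Star Wars", 100: "Fargo"}


def clean(row):
    """Trim whitespace, cast types, and reject missing or out-of-range ratings."""
    rating_text = row["rating"].strip()
    if not rating_text:
        return None
    rating = float(rating_text)
    if not 1.0 <= rating <= 5.0:
        return None
    return {
        "user_id": int(row["user_id"].strip()),
        "item_id": int(row["item_id"].strip()),
        "rating": rating,
    }


# Select the required fields and address data quality issues.
cleaned = [c for c in (clean(r) for r in raw_events) if c is not None]

# Join/enrich each event with its title.
enriched = [dict(c, title=titles[c["item_id"]]) for c in cleaned]

# Aggregate: average rating per title.
by_title = defaultdict(list)
for event in enriched:
    by_title[event["title"]].append(event["rating"])
avg_rating = {title: sum(v) / len(v) for title, v in by_title.items()}

print(avg_rating)
```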
Data Quality and Monitoring
• BUILD a robust data quality
subsystem:
• Metadata and error event facts
• Orchestration
• Based on Data Warehouse ETL
Toolkit
• Each error instance of each data
quality check is captured
• Implemented as sub-system
after ingestion
• Each fact stores unique
identifier of the defective source
row
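A rough sketch of the error event idea described above: every failed check on every row becomes one fact that stores the identifier of the defective source row. The check names, the `row_id` field and the `batch_id` value are hypothetical, chosen only for illustration.

```python
from datetime import datetime, timezone

# Hypothetical data quality checks: name -> predicate returning True when the row passes.
CHECKS = {
    "rating_present": lambda row: row.get("rating") is not None,
    "rating_in_range": lambda row: row.get("rating") is None or 1 <= row["rating"] <= 5,
    "user_id_present": lambda row: row.get("user_id") is not None,
}


def run_checks(rows, batch_id):
    """Return one error event fact per failed check per row."""
    error_events = []
    for row in rows:
        for check_name, passes in CHECKS.items():
            if not passes(row):
                error_events.append({
                    "batch_id": batch_id,
                    "check_name": check_name,
                    "source_row_id": row["row_id"],  # identifier of the defective source row
                    "detected_at": datetime.now(timezone.utc).isoformat(),
                })
    return error_events


# Rows as they might look right after ingestion.
ingested = [
    {"row_id": "r1", "user_id": 572, "rating": 5},
    {"row_id": "r2", "user_id": None, "rating": 9},
    {"row_id": "r3", "user_id": 573, "rating": None},
]
events = run_checks(ingested, batch_id="batch-001")
for event in events:
    print(event["source_row_id"], event["check_name"])
```

Because each fact carries the check name and the source row identifier, the events can later be counted per check, per batch or per source to monitor quality over time.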
4. Modeling
Do you love algebra & stats?
• Evaluate various models/algorithms
• Classification
• Clustering
• Regression
• Many others…..
• Tune parameters
• Iterative experimentation
• Different models may require different data
preparation techniques (e.g. Sparse Vector Format)
• Additionally we may discover the need for additional
data points, or uncover additional data quality issues!
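To illustrate the sparse vector format mentioned above, here is a small sketch that keeps only the non-zero entries of a dense row and renders them in the LIBSVM-style `index:value` text layout that many ML libraries accept. The helper names and the sample row are assumptions made for this example.

```python
def to_sparse(dense):
    """Convert a dense list into (size, {index: value}), keeping only non-zero entries."""
    return len(dense), {i: v for i, v in enumerate(dense) if v != 0}


def to_libsvm_line(label, dense):
    """Render one labeled example as 'label index:value ...' with 1-based indices."""
    _, entries = to_sparse(dense)
    features = " ".join(f"{i + 1}:{v}" for i, v in sorted(entries.items()))
    return f"{label} {features}"


# A mostly empty row, such as one user's ratings across eight items.
ratings_row = [0, 0, 5.0, 0, 3.0, 0, 0, 0]
size, entries = to_sparse(ratings_row)
line = to_libsvm_line(1, ratings_row)
print(size, entries)
print(line)
```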
Machine Learning
The goal of machine learning is to get software to make decisions and
learn from data without being explicitly programmed to do so
Machine Learning algorithms are broadly divided into two groups:
• Supervised learning: inferring functions based on labeled training
data
• Unsupervised learning: finding hidden structure/patterns within
data; no labeled training data is supplied
We will review some popular, easy to understand machine
learning algorithms
Category or Values?
There are several classes of algorithms depending on whether the
prediction is a category (like cat or dog) or a value, like the value of a
home.
Classification algorithms are generally well suited for categorization, while
algorithms like Regression and Decision Trees are well suited for
predicting values.
Unsupervised K-Means
• Treats items as coordinates
• Places a number of random
“centroids” and assigns the nearest
items
• Moves the centroids around based on
average location
• Process repeats until the assignments
stop changing
Clustering of items into logical groups based on natural patterns in
data
Uses:
• Cluster Analysis
• Classification
• Content Filtering
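The steps above can be sketched in a few lines of plain Python. This is a simplified illustration, not the workshop's implementation: it assumes 2-D points, and it starts from k randomly chosen data points rather than arbitrary random coordinates, which is a common variant.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, max_iters=100, seed=0):
    """Cluster 2-D points; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen items
    assignments = None
    for _ in range(max_iters):
        # Assign every item to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # Move each centroid to the average location of its items.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, assignments


# Two visibly separate groups of hypothetical points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which items share a label.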
Collaborative Filtering
• A hybrid of Supervised and Unsupervised Learning (Model Based vs.
Memory Based)
• Leveraging collaboration between multiple agents to filter, project,
or detect patterns
• Popular in recommender systems for projecting the “taste” of
specific individuals for items on which they have not yet expressed a preference.
Item-based
• A popular and simple memory-based collaborative filtering algorithm
• Projects preference based on item similarity (based on ratings):
    for every item i that u has no preference for yet
        for every item j that u has a preference for
            compute a similarity s between i and j
            add u's preference for j, weighted by s, to a running average
    return the top items, ranked by weighted average
• First, a matrix of item-to-item similarity is calculated based on user
ratings
• Then recommendations are created by producing a weighted sum of
top items, based on the user's previously rated items
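The pseudocode above can be turned into a small runnable sketch. The ratings dictionary is invented for illustration, and cosine similarity over co-rating users is an assumed choice of similarity measure; other measures would slot in the same way.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5},
    "cat": {"A": 1, "C": 5, "D": 4},
    "dan": {"B": 4, "D": 2},
}


def item_vector(item):
    """Ratings given to an item, keyed by user."""
    return {u: r[item] for u, r in ratings.items() if item in r}


def cosine(i, j):
    """Cosine similarity between two items over the users who rated both."""
    vi, vj = item_vector(i), item_vector(j)
    common = set(vi) & set(vj)
    if not common:
        return 0.0
    dot = sum(vi[u] * vj[u] for u in common)
    norm_i = sqrt(sum(vi[u] ** 2 for u in common))
    norm_j = sqrt(sum(vj[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)


def recommend(user, top_n=3):
    """Rank unseen items by the similarity-weighted average of the user's ratings."""
    seen = ratings[user]
    all_items = {i for r in ratings.values() for i in r}
    scores = {}
    for i in all_items - set(seen):       # items u has no preference for yet
        weighted_sum = weight_total = 0.0
        for j, pref in seen.items():      # items u has a preference for
            s = cosine(i, j)
            weighted_sum += s * pref
            weight_total += s
        if weight_total > 0:
            scores[i] = weighted_sum / weight_total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


recs = recommend("bob")
print(recs)
```

In practice the item-to-item similarities would be precomputed once into a matrix rather than recomputed per request as this sketch does.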
Project Objective
• Create a functional recommendation engine to surface
relevant product recommendations to customers.
• Improve Customer Experience
• Increase Customer Retention
• Increase Customer Purchase Activity
• Accurately suggest relevant products to customers based on their peer
behavior.
Recommendations
• Your customers expect them
• Good recommendations make life easier
• Help them find information, products, and services they might not have
thought of
• What makes a good recommendation?
• Relevant but not obvious
• Sense of “surprise”
[Illustration: 23” LED TV (SOLD!!). Candidate recommendations shown: 23”, 24” and 25” LED TVs; Blu-Ray, Home Theater, HDMI Cables]
Where do we use recommendations?
• Applications can be found in a wide variety of industries:
  • Travel
  • Financial Services
  • Music/Online radio
  • TV and Video
  • Online Publications
  • Retail
  • ...and countless others
Our Example: Movies
The Goal of the Recommender
• Create a powerful, scalable recommendation engine with minimal
development
• Make recommendations to users as they are browsing movie titles -
instantaneously
• Recommendation must have context to the movie they are currently
viewing.
OOPS! – too much surprise!
How do we do it?
We leverage two algorithms:
• Content-Based Filtering – how similar is this particular movie to
other movies based on usage.
• Collaborative Filtering – predict an individual’s preference based
on their peers’ ratings. Spark MLlib implements a collaborative
filtering algorithm called Alternating Least Squares (ALS)
• Both algorithms only require a simple dataset of 3 fields:
“User ID” , “Item ID”, “Rating”
Content-Based Filtering
“People who liked this movie liked these as well”
• Content Based Filter builds a matrix of items to other items and
calculates similarity (based on user rating)
• The most similar items are then output as a list:
• Item ID, Similar Item ID, Similarity Score
• Items with the highest score are most similar
• In this example users who liked “Twelve Monkeys” (7) also like “Fargo” (100)
7 100 0.690951001800917
7 50 0.653299445638532
7 117 0.643701303640083
Collaborative Filtering
“People with similar taste to you liked these movies”
• Collaborative filtering applies weights based on “peer” user preference.
• Essentially it determines the best movie critics for you to follow
• The items with the highest recommendation score are then output as tuples
• User ID [Item ID1:Score,…., Item IDn:Score]
• Items with the highest recommendation score are the most relevant to this user
• For user “Johny Sisklebert” (572), the two most highly recommended movies are
“Seven” and “Donnie Brasco”
572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
Recommendation Store
• Serving recommendations needs to be instantaneous
• The core of this solution is two reference tables:
  • Rec_Item_Similarity (Item_ID, Similar_Item, Similarity_Score)
  • Rec_User_Item_Base (User_ID, Item_ID, Recommendation_Score)
• When called to make recommendations we query our store:
  • Rec_Item_Similarity based on the Item_ID they are viewing
  • Rec_User_Item_Base based on their User_ID
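A minimal sketch of such a store, assuming an in-memory SQLite database stands in for whatever low-latency store is actually used. The table and column names follow the slide; the key choices, the helper functions and the `limit` parameter are assumptions, and the few sample rows are taken from the example outputs shown earlier in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Rec_Item_Similarity (
        Item_ID INTEGER, Similar_Item INTEGER, Similarity_Score REAL,
        PRIMARY KEY (Item_ID, Similar_Item)
    );
    CREATE TABLE Rec_User_Item_Base (
        User_ID INTEGER, Item_ID INTEGER, Recommendation_Score REAL,
        PRIMARY KEY (User_ID, Item_ID)
    );
    """
)
conn.executemany(
    "INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)",
    [(7, 100, 0.690951001800917), (7, 50, 0.653299445638532), (7, 117, 0.643701303640083)],
)
conn.executemany(
    "INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)",
    [(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)


def similar_items(item_id, limit=10):
    """Items most similar to the one being viewed, best first."""
    return conn.execute(
        "SELECT Similar_Item, Similarity_Score FROM Rec_Item_Similarity "
        "WHERE Item_ID = ? ORDER BY Similarity_Score DESC LIMIT ?",
        (item_id, limit),
    ).fetchall()


def user_items(user_id, limit=10):
    """Items with the highest precomputed recommendation score for this user."""
    return conn.execute(
        "SELECT Item_ID, Recommendation_Score FROM Rec_User_Item_Base "
        "WHERE User_ID = ? ORDER BY Recommendation_Score DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()


print(similar_items(7))
print(user_items(572))
```

Both lookups are single-key reads against precomputed results, which is what keeps serving fast: the expensive model runs happen offline.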
Delivering Recommendations
Item-Based:
Item Similarity               Raw Score   Normalized Score
Fargo                         0.691       1.000
Star Wars                     0.653       0.946
Rock, The                     0.644       0.932
Pulp Fiction                  0.628       0.909
Return of the Jedi            0.627       0.908
Independence Day              0.618       0.894
Willy Wonka                   0.603       0.872
Mission: Impossible           0.597       0.864
Silence of the Lambs, The     0.596       0.863
Star Trek: First Contact      0.594       0.859
Raiders of the Lost Ark       0.584       0.845
Terminator, The               0.574       0.831
Blade Runner                  0.571       0.826
Usual Suspects, The           0.569       0.823
Seven (Se7en)                 0.569       0.823
Peers like these Movies:
Item-Base (Peer)              Raw Score   Normalized Score
Seven                         5.000       1.000
Donnie Brasco                 4.707       0.941
Babe                          4.688       0.938
Heat                          4.688       0.938
To Kill a Mockingbird         4.686       0.937
Jaws                          4.683       0.937
Monty Python, Holy Grail      4.670       0.934
Blade Runner                  4.670       0.934
Get Shorty                    4.655       0.931
Best Recommendations (Top 10 Recommendations):
So if Johny is viewing “12 Monkeys” we query our recommendation store
and present the results
Seven (Se7en)                 1.823
Blade Runner                  1.760
Fargo                         1.000
Star Wars                     0.946
Donnie Brasco                 0.941
Babe                          0.938
Heat                          0.938
To Kill a Mockingbird         0.937
Jaws                          0.937
Monty Python, Holy Grail      0.934
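The totals in the final list appear consistent with dividing each raw score by the best score in its own list and then summing the two normalized scores per title (for example 0.823 + 1.000 = 1.823 for Seven). A sketch of that blending, under the assumption that this is how the lists were combined; the function names are hypothetical and only a few rows from the tables above are used:

```python
def normalize(raw_scores):
    """Scale scores so the best item gets 1.0 (raw score divided by the maximum)."""
    top = max(raw_scores.values())
    return {item: score / top for item, score in raw_scores.items()}


def blend(*score_sets, top_n=10):
    """Sum normalized scores per item across all lists and rank descending."""
    combined = {}
    for scores in score_sets:
        for item, score in normalize(scores).items():
            combined[item] = combined.get(item, 0.0) + score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Raw scores copied from the two tables above (a few rows only).
item_similarity = {"Fargo": 0.691, "Star Wars": 0.653, "Blade Runner": 0.571, "Seven": 0.569}
peer_based = {"Seven": 5.000, "Donnie Brasco": 4.707, "Babe": 4.688, "Blade Runner": 4.670}

top = blend(item_similarity, peer_based)
for title, score in top:
    print(f"{title:15s} {score:.3f}")
```

Titles that appear in both lists rise to the top, which is why Seven and Blade Runner lead the combined ranking. Weighting each list before summing would be one way to balance the algorithms.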
From Good to Great Recommendations
• Note that the first 5 recommendations look pretty good
…but the 6th result would have been “Babe” the children's movie
• Tuning the algorithms might help: parameter changes, similarity measures.
• How else can we make it better?
1. Delivery filters
2. Introduce additional algorithms such as K-Means
OOPS!
Sophisticated Recommendation Model
Inputs to the Recommendation:
• Customer Behavior: What items have you bought in the past?
• Item Clustering: What did people who ordered these items also order?
• Peer Based: What are people with similar characteristics buying?
• Market/Store: What items are being promoted by the Store or Market?
• Corporate Deals/Offers: What items are we promoting at time of sale?
The solution allows balancing of algorithms to attain the most effective recommendation
Some Thoughts – Enable the Future
Data Science requires the
convergence of data quality,
advanced math, data engineering,
visualization and business smarts
Make sure your data can be trusted
and people can be held accountable
for impact caused by low data
quality.
Good data scientists are rare: It will
take a village to achieve all the tasks
required for effective data science
Get good!
Be great!
Blaze new trails!
Data Science Training: https://exploredatascience.com/
We focused our attention on building a single version of the truth
We mainly applied data governance on the EDW itself and a few primary supporting systems – like MDM.
We had a fairly restrictive set of tools for using the EDW data (enterprise BI tools), so it was easier to GOVERN how the data would be used.
Data science is not about Hadoop, but it is about modern data engineering. Think polyglot persistence – the right tool for the job.
Visualization can be Tableau, Excel, ggplot2 or d3.js. Or anything.
Supervised learning: finds patterns over time and predicts what might happen next.
Unsupervised learning: organizes, groups, classifies (clusters), categorizes data
Paco Nathan made one of these, too.
One of the most respected data scientists I know says 90% of her ML work uses regression analysis
Circuit board analogy: all of the circuit boards have their switches flipped in the same direction – and then single out the single characteristic they don’t share. This is how to isolate the true impact of that single switch on the sprawling circuit board.
May find Muslims don’t shop on Friday afternoons or females with higher education shop more in the morning than at any other time
When the outcome is a real number then it is a regression tree
K-means is unsupervised learning
K-nearest is supervised learning and needs history
Memory: Uses rating data to compute the similarity between users or items
Model: Based on training data
Challenges: sparse data affects performance of recommendation. (Performance in ML means how good it is, not how fast it is.)
Ratings can be crap, biased.
Limited history can skew recommendation, long history can mean more sales = higher score (rich get richer)