@joe_Caserta #DataSummit https://github.com/Caserta-Concepts/ds-workshop
Introduction to Data Science
(by a non-data scientist)
Joe Caserta
President
Caserta Concepts
Caserta Timeline
Launched Big Data practice
Co-author, with Ralph Kimball, The
Data Warehouse ETL Toolkit (Wiley)
Data Analysis, Data Warehousing and
Business Intelligence since 1996
Began consulting in database programming
and data modeling
25+ years hands-on experience
building database solutions
Founded Caserta Concepts in NYC
Web log analytics solution published in
Intelligent Enterprise
Launched Data Science, Data
Interaction and Cloud practices
Laser focus on extending Data
Analytics with Big Data solutions
1986
2004
1996
2009
2001
2013
2012
2014
Dedicated to Data Governance
Techniques on Big Data (Innovation)
Top 20 Big Data
Consulting - CIO Review
Top 20 Most Powerful
Big Data consulting firms
Launched Big Data Warehousing
(BDW) Meetup NYC: 2,000+ Members
2015 Awarded for getting data out
of SAP for data analytics
Established best practices for big data
ecosystem implementations
About Caserta Concepts
• Technology services company with expertise in data analysis:
  • Big Data Solutions
  • Data Warehousing
  • Business Intelligence
  • Strategy and Implementation
  • Data Science & Analytics
  • Data on the Cloud
  • Data Interaction & Visualization
• Core focus in the following industries:
  • eCommerce / Retail / Marketing
  • Financial Services / Insurance
  • Healthcare / Ad Tech / Higher Ed
• Established in 2001:
  • Increased growth year-over-year
  • Industry-recognized workforce
Agenda
• Why we care about Big Data
• Challenges of working with Big Data
• Governing Big Data for Data Science
• Introducing the Data Pyramid
• Why Data Science is cool
• What does a Data Scientist do?
• Standards for Data Science
  • Business Objective
  • Data Discovery
  • Preparation
  • Models
  • Evaluation
  • Deployment
• Q & A
Hands-on exercises and breaks
Today’s business environment requires Big Data
[Diagram labels: source systems (Enrollments, Claims, Finance, Others…); ETL into a Traditional EDW serving Traditional BI (Ad-Hoc Query, Canned Reporting); ETL into a Big Data Lake, a Horizontally Scalable Environment - Optimized for Analytics, built on the Hadoop Distributed File System (HDFS) across nodes N1 to N5 with Spark, MapReduce, Pig/Hive and NoSQL Databases, serving Big Data Analytics, Ad-Hoc/Canned Reporting and Data Science]
The Challenges of Building a Data Lake
Volume
• Data volume is higher so the process must be more reliant on programmatic administration
• Less people/process dependence
Variety
• Wider breadth of datasets and sources in scope requires larger data governance support
• Data governance cannot start at the data warehouse
Velocity
• Data is coming in so fast, how do we monitor it?
• Real real-time analytics
• What does “complete” mean?
Veracity
• Dealing with sparse, incomplete, volatile, and highly manufactured data. How do you certify sentiment analysis?
What’s Old is New Again
• Before Data Warehousing Governance
  • Users trying to produce reports from raw source data
  • No Data Conformance
  • No Master Data Management
  • No Data Quality processes
  • No Trust: Two analysts were almost guaranteed to come up with two different sets of numbers!
• Before Data Lake Governance
  • We can put “anything” in Hadoop
  • We can analyze anything
  • We’re scientists, we don’t need IT, we make the rules
• Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess
• Rule #2: Information harvested from ungoverned systems will take us back to the old days: No Trust = Not Actionable
Making it Right
• The promise is an “agile” data culture where communities of users are encouraged to explore new datasets in new ways
  • New tools
  • External data
  • Data blending
  • Decentralization
• With all the V’s, data scientists, new tools and new data, we must rely LESS on HUMANS
  • We need more systemic administration
  • We need systems and tools to help with big data governance
  • This space is EXTREMELY immature!
• Steps towards Data Governance for the Data Lake
  1. Establish difference between traditional data and big data governance
  2. Establish basic rules for where new data governance can be applied
  3. Establish processes for graduating the products of data science to governance
  4. Establish a set of tools to make governing Big Data feasible
Data Governance for the Data Lake
BDG provides vision, oversight and accountability for leveraging corporate information assets to create competitive advantage, and accelerate the vision of integrated delivery.
[Diagram labels, governance framework: Organization, Process, Architecture, Communication, IFP, Governance, Administration, Compliance, Reporting, Standards, Value Proposition, Risk/Reward, Information Accountabilities, Stewardship, Enterprise Data Council, Data Integrity, Metrics, Control Mechanisms, Principles and Standards, Information Usability]
[Diagram labels, governance roles: Enterprise Data Council, Governance Committees, Data Stewards, Project Teams. Role notes: Executive Oversight; Prioritizes work; Drives change; Accountable for results; Definitions; Acts on Requirements; Does the Work; Responsible for adherence; Value Creation; Build Capabilities]
Components of Data Governance
• Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
• Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: Identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
• Business Process Integration: Policies around data frequency, source availability, etc.
• Master Data Management: Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
For Big Data
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home grown, Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are less common (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, core component of business operations
Data Lake Governance Realities
• Full data governance can only be applied to “Structured” data
  • The data must have a known and well-documented schema
  • This can include materialized endpoints such as files or tables OR projections such as a Hive table
• Governed structured data must have:
  • A known schema with Metadata
  • A known and certified lineage
  • A monitored, quality-tested, managed process for ingestion and transformation
  • A governed usage
    • Data isn’t just for enterprise BI tools anymore
• We talk about unstructured data in Hadoop, but more often it’s semi-structured/structured with a definable schema.
• Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
The Data Scientists Can Help!
• Data Science to Big Data Warehouse mapping
• Full Data Governance Requirements
  • Provide full process lineage
  • Data certification process by data stewards and business owners
  • Ongoing Data Quality monitoring that includes Quality Checks
• Provide requirements for Data Lake
  • Proper metadata established:
    • Catalog
    • Data Definitions
    • Lineage
    • Quality monitoring
  • Know and validate data completeness
The Big Data Analytics Pyramid
Tiers, top to bottom:
• Big Data Warehouse: Fully Data Governed (trusted). User community arbitrary queries and reporting.
• Data Science Workspace: Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts.
• Data Lake – Integrated Sandbox: Data is ready to be turned into information: organized, well defined, complete.
• Landing Area – Source Data in “Full Fidelity”: Raw machine data collection, collect everything.
Governance annotations shown beside the lower tiers:
• Metadata → Catalog
• ILM → who has access, how long do we “manage it”
• Data Quality and Monitoring → Monitoring of completeness of data
Key points:
• Hadoop has different governance demands at each tier.
• Only the top tier of the pyramid is fully governed.
• We refer to this as the Trusted tier of the Big Data Warehouse.
What does a Data Scientist Do, Anyway?
• Writes really cool and sophisticated algorithms that impact the way the business runs. NOT!
• Much of the time of a Data Scientist is spent:
  • Searching for the data they need
  • Making sense of the data
  • Figuring out why the data looks the way it does and assessing its validity
  • Cleaning up all the garbage within the data so it represents true business
  • Combining events with Reference data to give it context
  • Correlating event data with other events
• Finally, they write algorithms to perform mining, clustering and predictive analytics
Why Data Science?
• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
[Chart axes: Data Analytics Sophistication, Business Value]
Source: Gartner
The Data Scientist Winning Trifecta
• Modern Data Engineering/Data Preparation
• Domain Knowledge/Business Expertise
• Advanced Mathematics/Statistics
Easier to Find Than an Awesome Data Scientist
Modern Data Engineering
Which Visualization, When?
Advanced Mathematics / Statistics
Domain and Outcome Sensibility
Are there Standards?
CRISP-DM: Cross Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
• ETL
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
1. Business Understanding
In this initial phase of the project we will need to speak to
humans.
• It would be premature to jump into the data, or begin
selection of the appropriate model(s) or algorithm(s)
• Understand the project objective
• Review the business requirements
• The output of this phase will be conversion of business
requirements into a preliminary technical design (decision
model) and plan.
Since this is an iterative process, this phase will be revisited
throughout the entire process.
2. Data Understanding
• Data Discovery → understand where the data you need comes from
• Data Profiling → interrogate the data at the entity level, understand key entities and fields that are relevant to the analysis.
• Cleansing Requirements → understand data quality, data density, skew, etc.
• Data Munging → collocate, blend and analyze data for early insights! Valuable information can be obtained from simple group-by, aggregate queries, and even more with SQL Jujitsu!
Significant iteration between Business Understanding
and Data Understanding phases.
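As a minimal sketch of the kind of simple group-by, aggregate query mentioned on this slide, the following uses Python's built-in sqlite3 module. The `claims` table, its columns and its values are made up for illustration and are not workshop data.

```python
import sqlite3

# Hypothetical sample data: (member_id, state, claim_amount); None marks a missing value.
rows = [
    (1, "NY", 120.0),
    (2, "NY", 80.0),
    (3, "NJ", None),
    (4, "NJ", 200.0),
    (5, "CT", 50.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (member_id INTEGER, state TEXT, claim_amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", rows)

# Profile each state: row count, number of missing amounts, and average amount.
# COUNT(column) and AVG(column) skip NULLs, so the difference gives the missing count.
profile = conn.execute(
    """
    SELECT state,
           COUNT(*)                       AS n_rows,
           COUNT(*) - COUNT(claim_amount) AS n_missing,
           AVG(claim_amount)              AS avg_amount
    FROM claims
    GROUP BY state
    ORDER BY state
    """
).fetchall()

for state, n_rows, n_missing, avg_amount in profile:
    print(state, n_rows, n_missing, avg_amount)
```

Even a query this small surfaces density (rows per state) and quality (missing amounts) before any modeling starts.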
Data Science Data Quality Priorities
• Be Corrective
• Be Fast
• Be Transparent
• Be Thorough
Data Science Data Quality Priorities
[Chart axes: Speed to Value (Fast to Slow), Data Quality (Raw to Refined)]
3. Data Preparation
ETL (Extract Transform Load)
90+% of a Data Scientist’s time goes into Data
Preparation!
• Select required entities/fields
• Address Data Quality issues: missing or incomplete
values, whitespace, bad data-points
• Join/Enrich disparate datasets
• Transform/Aggregate data for intended use:
• Sample
• Aggregate
• Pivot
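A small sketch of the preparation steps listed above (addressing missing values, whitespace and bad data points, then aggregating and pivoting), using only the Python standard library. The records and field names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical raw records with typical quality issues: stray whitespace,
# a missing value, and a non-numeric data point.
raw = [
    {"customer": " alice ", "month": "Jan", "amount": "10.5"},
    {"customer": "bob", "month": "Jan", "amount": ""},
    {"customer": "alice", "month": "Feb", "amount": "4.5"},
    {"customer": "bob ", "month": "Feb", "amount": "n/a"},
    {"customer": "bob", "month": "Feb", "amount": "7"},
]


def clean(record):
    """Trim whitespace and parse the amount; return None for unusable rows."""
    try:
        amount = float(record["amount"].strip())
    except ValueError:
        return None  # missing or bad data point
    return {
        "customer": record["customer"].strip(),
        "month": record["month"].strip(),
        "amount": amount,
    }


cleaned = [c for c in (clean(r) for r in raw) if c is not None]

# Aggregate and pivot: one entry per customer, one key per month.
totals = defaultdict(lambda: defaultdict(float))
for rec in cleaned:
    totals[rec["customer"]][rec["month"]] += rec["amount"]

pivot = {customer: dict(months) for customer, months in totals.items()}
print(pivot)
```

Dropping unusable rows is only one possible policy; imputing or flagging them may fit better depending on the intended use.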
Data Quality and Monitoring
• BUILD a robust data quality subsystem:
  • Metadata and error event facts
  • Orchestration
  • Based on The Data Warehouse ETL Toolkit
• Each error instance of each data quality check is captured
• Implemented as a sub-system after ingestion
• Each fact stores the unique identifier of the defective source row
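One way to picture the error event facts described here is the sketch below: every failed check on every ingested row produces one fact that carries the identifier of the defective source row. The check names, row layout and `ErrorEvent` structure are assumptions for illustration, not the design from the ETL Toolkit.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ErrorEvent:
    """One fact per failed check per source row."""
    check_name: str
    source_row_id: int
    detected_at: datetime


# Hypothetical checks: name -> predicate that returns True when the row passes.
CHECKS = {
    "weight_is_positive": lambda row: row["weight_lbs"] is not None and row["weight_lbs"] > 0,
    "color_is_present": lambda row: bool(row["color"]),
}


def run_checks(rows):
    """Run every check against every ingested row and collect the error events."""
    events = []
    for row in rows:
        for name, passes in CHECKS.items():
            if not passes(row):
                events.append(ErrorEvent(name, row["id"], datetime.now(timezone.utc)))
    return events


ingested = [
    {"id": 101, "weight_lbs": 9, "color": "Orange"},
    {"id": 102, "weight_lbs": -1, "color": ""},
    {"id": 103, "weight_lbs": None, "color": "Black"},
]

events = run_checks(ingested)
for event in events:
    print(event.check_name, event.source_row_id)
```

Because each fact keeps the source row identifier, failures can be counted per check or traced back to the offending records later.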
Data Exploration and Preparation Exercise
Give it a try!
4. Modeling
Do you love algebra & stats?
• Evaluate various models/algorithms
• Classification
• Clustering
• Regression
• Many others…..
• Tune parameters
• Iterative experimentation
• Different models may require different data
preparation techniques (e.g. Sparse Vector Format)
• Additionally we may discover the need for additional
data points, or uncover additional data quality issues!
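To illustrate what a sparse vector format can look like, here is a minimal sketch that stores only the non-zero positions of a mostly-zero feature list. The helper names `to_sparse` and `to_dense` are hypothetical, and real libraries use their own representations.

```python
def to_sparse(dense):
    """Return (size, {index: value}) keeping only the non-zero entries."""
    return len(dense), {i: v for i, v in enumerate(dense) if v != 0}


def to_dense(size, entries):
    """Rebuild the full list from the sparse form."""
    return [entries.get(i, 0) for i in range(size)]


# Example: eleven on/off feature flags, only two of them set.
dense = [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
size, entries = to_sparse(dense)
print(size, entries)
```

The saving grows with the number of features: thousands of zero columns collapse to a handful of index/value pairs.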
Machine Learning
The goal of machine learning is to get software to make decisions and
learn from data without being programmed explicitly to do so
Machine Learning algorithms are broadly broken out into two groups:
• Supervised learning → inferring functions based on labeled training
data
• Unsupervised learning → finding hidden structure/patterns within
data, no labeled training data is supplied
We will review some popular, easy to understand machine
learning algorithms
What to use when?
Supervised Learning
Training set:
Name      Weight  Color   Cat_or_Dog
Susie     9lbs    Orange  Cat
Fido      25lbs   Brown   Dog
Sparkles  6lbs    Black   Cat
Fido      9lbs    Black   Dog
To predict:
Name      Weight  Color   Cat_or_Dog
Misty     5lbs    Orange  ?
The training set is used to generate a function
...so we can predict if we have a cat or dog!
Category or Values?
There are several classes of algorithms depending on whether the
prediction is a category (like cat or dog) or a value, like the value of a
home.
Classification algorithms are generally well suited for categorization, while
algorithms like Regression and Decision Trees are well suited for
predicting values.
Regression
• Understanding the relationship between a given set of dependent
variables and independent variables
• Typically regression is used to predict the output of a dependent
variable based on variations in independent variables
• Very popular for prediction and forecasting
Linear Regression
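A minimal sketch of simple linear regression with one independent variable, fitted by ordinary least squares in plain Python. The home sizes and prices are invented numbers chosen to lie exactly on a line.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: home size (hundreds of sq ft) vs. price (thousands of dollars).
sizes = [10, 15, 20, 25]
prices = [200, 300, 400, 500]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 30 + intercept  # predict the price of a larger home
print(slope, intercept, predicted)
```

Here size is the independent variable and price the dependent one; the fitted line is then used to predict a value outside the training data.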
Decision Trees
• A method for predicting outcomes based on the features of data
• Model is represented as an easy-to-understand tree structure of if-else statements
Weight > 10lbs?
  yes: dog
  no: color = orange?
    yes: cat
    no: name = fido?
      yes: dog
      no: cat
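The tree on this slide can be written directly as if-else statements. The sketch below assumes this reading of the branches: weight over 10 lbs means dog; otherwise orange means cat; otherwise the name Fido means dog, and anything else is a cat. That reading is consistent with the cat/dog training table from the Supervised Learning slide.

```python
def cat_or_dog(name, weight_lbs, color):
    """Hand-written version of the slide's decision tree (branch order is an assumption)."""
    if weight_lbs > 10:
        return "Dog"
    if color.lower() == "orange":
        return "Cat"
    if name.lower() == "fido":
        return "Dog"
    return "Cat"


# Training rows from the Supervised Learning slide: (name, weight, color, label).
training = [
    ("Susie", 9, "Orange", "Cat"),
    ("Fido", 25, "Brown", "Dog"),
    ("Sparkles", 6, "Black", "Cat"),
    ("Fido", 9, "Black", "Dog"),
]

for name, weight, color, label in training:
    print(name, cat_or_dog(name, weight, color), label)

print("Misty", cat_or_dog("Misty", 5, "Orange"))
```

A learning algorithm would derive these splits from the data; splitting on a name, as here, fits the training set but is unlikely to generalize.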
Unsupervised K-Means
Clustering of items into logical groups based on natural patterns in data
• Treats items as coordinates
• Places a number of random “centroids” and assigns the nearest items
• Moves the centroids around based on average location
• Process repeats until the assignments stop changing
Uses:
• Cluster Analysis
• Classification
• Content Filtering
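The K-Means steps above can be sketched in a few lines of plain Python. To keep the run reproducible, this sketch takes fixed starting centroids as an argument instead of placing them randomly; the points are made up.

```python
import math


def kmeans(points, start_centroids, max_iter=100):
    """Assign points to the nearest centroid, move centroids to the mean, repeat."""
    centroids = list(start_centroids)
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Two visually obvious groups of 2-D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
assignments, centroids = kmeans(points, [(0, 0), (10, 10)])
print(assignments)
print(centroids)
```

With random starting centroids the result can vary between runs, so practical implementations usually run several restarts and keep the best clustering.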
Collaborative Filtering
• A hybrid of Supervised and Unsupervised Learning (Model Based vs.
Memory Based)
• Leveraging collaboration between multiple agents to filter, project,
or detect patterns
• Popular in recommender systems for projecting the “taste” of
specific individuals for items on which they have not yet expressed a preference.
Item-based
• A popular and simple memory-based collaborative filtering algorithm
• Projects preference based on item similarity (based on ratings):
    for every item i that u has no preference for yet
      for every item j that u has a preference for
        compute a similarity s between i and j
        add u's preference for j, weighted by s, to a running average
    return the top items, ranked by weighted average
• First a matrix of item-to-item similarity is calculated based on user ratings
• Then recommendations are created by producing a weighted sum of top items, based on the user’s previously rated items
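A runnable sketch of the item-based pseudocode above. Cosine similarity over the users who rated both items is used as the similarity measure, which is one common choice and an assumption here; the users, items and ratings are invented.

```python
import math

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5},
    "cat": {"A": 1, "C": 5, "D": 4},
    "dan": {"B": 1, "C": 4, "D": 5},
}


def item_vector(item):
    """Ratings for one item, keyed by user."""
    return {user: prefs[item] for user, prefs in ratings.items() if item in prefs}


def cosine(i, j):
    """Cosine similarity between two items (0.0 if no user rated both)."""
    vi, vj = item_vector(i), item_vector(j)
    common = set(vi) & set(vj)
    if not common:
        return 0.0
    dot = sum(vi[u] * vj[u] for u in common)
    norm_i = math.sqrt(sum(v * v for v in vi.values()))
    norm_j = math.sqrt(sum(v * v for v in vj.values()))
    return dot / (norm_i * norm_j)


def recommend(user, top_n=3):
    """Rank unrated items by the similarity-weighted average of the user's ratings."""
    prefs = ratings[user]
    all_items = {item for p in ratings.values() for item in p}
    scores = {}
    for i in all_items - set(prefs):       # items with no preference yet
        weighted, total = 0.0, 0.0
        for j, pref in prefs.items():      # items the user has rated
            s = cosine(i, j)
            weighted += s * pref
            total += s
        if total > 0:
            scores[i] = weighted / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


recs = recommend("bob")
print(recs)
```

In a production setting the item-to-item similarities would be computed once and stored, so that only the final weighted average is done at request time.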
5. Evaluation
What problem are we trying to solve again?
• Our final solution needs to be evaluated against original
Business Understanding
• Did we meet our objectives?
• Did we address all issues?
6. Deployment
Engineering Time!
• It’s time for the work products of data science to
“graduate” from “new insights” to real applications.
• Processes must be hardened, repeatable, and generally
perform well too!
• Data Governance applied
• PMML (Predictive Model Markup Language): XML-based
interchange format
My Favorite Data Science Project
• Recommendation Engines
Project Objective
• Create a functional recommendation engine to surface
relevant product recommendations to customers.
• Improve Customer Experience
• Increase Customer Retention
• Increase Customer Purchase Activity
• Accurately suggest relevant products to customers based on their peer
behavior.
Recommendations
• Your customers expect them
• Good recommendations make life easier
• Help them find information, products, and services they might not have
thought of
• What makes a good recommendation?
• Relevant but not obvious
• Sense of “surprise”
[Illustration: after a 23” LED TV is SOLD, suggesting 23”, 24” and 25” LED TVs is obvious; suggesting Blu-Ray, Home Theater and HDMI Cables is relevant]
Where do we use recommendations?
• Recommendations can be found in a wide variety of industries and applications:
• Travel
• Financial Service
• Music/Online radio
• TV and Video
• Online Publications
• Retail
..and countless others
Our Example: Movies
The Goal of the Recommender
• Create a powerful, scalable recommendation engine with minimal
development
• Make recommendations to users as they are browsing movie titles -
instantaneously
• Recommendation must have context to the movie they are currently
viewing.
OOPS! – too much surprise!
How do we do it?
We leverage two algorithms:
• Content-Based Filtering – how similar is this particular movie to
other movies based on usage.
• Collaborative Filtering – predict an individual’s preference based
on their peers’ ratings. Spark MLlib implements a collaborative
filtering algorithm called Alternating Least Squares (ALS)
• Both algorithms only require a simple dataset of 3 fields:
“User ID” , “Item ID”, “Rating”
Content-Based Filtering
“People who liked this movie liked these as well”
• Content Based Filter builds a matrix of items to other items and
calculates similarity (based on user rating)
• The most similar items are then output as a list:
  • Item ID, Similar Item ID, Similarity Score
• Items with the highest score are most similar
• In this example users who liked “Twelve Monkeys” (7) also liked “Fargo” (100)

Item ID  Similar Item ID  Similarity Score
7        100              0.690951001800917
7        50               0.653299445638532
7        117              0.643701303640083
Collaborative Filtering
“People with similar taste to you liked these movies”
• Collaborative filtering applies weights based on “peer” user preference.
• Essentially it determines the best movie critics for you to follow
• The items with the highest recommendation score are then output as tuples
• User ID [Item ID1:Score,…., Item IDn:Score]
• Items with the highest recommendation score are the most relevant to this user
• For user “Johny Sisklebert” (572), the two most highly recommended movies are
“Seven” and “Donnie Brasco”
572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
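The output rows above follow the pattern `User ID [Item ID:Score,...]`. A small sketch of turning one such row into a ranked list is shown below; the function name is hypothetical and the sample line is a shortened version of the row for user 572.

```python
def parse_recommendations(line):
    """Parse 'user_id [item:score,...]' into (user_id, [(item_id, score), ...])."""
    user_part, items_part = line.split(" ", 1)
    pairs = []
    for token in items_part.strip("[]").split(","):
        item_id, score = token.split(":")
        pairs.append((int(item_id), float(score)))
    pairs.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return int(user_part), pairs


line = "572 [11:5.0,293:4.70718,8:4.688335]"
user_id, recs = parse_recommendations(line)
print(user_id, recs[:2])
```

Parsed this way, each (user, item, score) triple maps directly onto a row of a per-user recommendation table.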
Recommendation Store
• Serving recommendations needs to be instantaneous
• The core to this solution is two reference tables:
  • Rec_Item_Similarity (Item_ID, Similar_Item, Similarity_Score)
  • Rec_User_Item_Base (User_ID, Item_ID, Recommendation_Score)
• When called to make recommendations we query our store
  • Rec_Item_Similarity based on the Item_ID they are viewing
  • Rec_User_Item_Base based on their User_ID
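A minimal sketch of the two reference tables and their lookups, using an in-memory SQLite database as a stand-in for whatever store is actually used. Table and column names follow the slide; the column types, indexes and helper functions are assumptions, and the rows are taken from the sample output shown earlier in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Rec_Item_Similarity (
        Item_ID INTEGER, Similar_Item INTEGER, Similarity_Score REAL
    );
    CREATE TABLE Rec_User_Item_Base (
        User_ID INTEGER, Item_ID INTEGER, Recommendation_Score REAL
    );
    -- Index the lookup keys so that serving stays fast.
    CREATE INDEX idx_item ON Rec_Item_Similarity (Item_ID);
    CREATE INDEX idx_user ON Rec_User_Item_Base (User_ID);
    """
)
conn.executemany(
    "INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)",
    [(7, 100, 0.690951001800917), (7, 50, 0.653299445638532), (7, 117, 0.643701303640083)],
)
conn.executemany(
    "INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)",
    [(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)


def similar_items(item_id, limit=10):
    """Items most similar to the one being viewed."""
    return conn.execute(
        "SELECT Similar_Item, Similarity_Score FROM Rec_Item_Similarity "
        "WHERE Item_ID = ? ORDER BY Similarity_Score DESC LIMIT ?",
        (item_id, limit),
    ).fetchall()


def user_items(user_id, limit=10):
    """Items most highly recommended for this user."""
    return conn.execute(
        "SELECT Item_ID, Recommendation_Score FROM Rec_User_Item_Base "
        "WHERE User_ID = ? ORDER BY Recommendation_Score DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()


print(similar_items(7))
print(user_items(572))
```

Because both tables are precomputed, serving a recommendation is just two keyed lookups, which is what makes an instantaneous response realistic.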
Delivering Recommendations
So if Johny is viewing “12 Monkeys” we query our recommendation store
and present the results
[Diagram: Item Similarity and Item-Based (Peers like these Movies) feed into Best Recommendations]

Item Similarity             Raw Score  Score
Fargo                       0.691      1.000
Star Wars                   0.653      0.946
Rock, The                   0.644      0.932
Pulp Fiction                0.628      0.909
Return of the Jedi          0.627      0.908
Independence Day            0.618      0.894
Willy Wonka                 0.603      0.872
Mission: Impossible         0.597      0.864
Silence of the Lambs, The   0.596      0.863
Star Trek: First Contact    0.594      0.859
Raiders of the Lost Ark     0.584      0.845
Terminator, The             0.574      0.831
Blade Runner                0.571      0.826
Usual Suspects, The         0.569      0.823
Seven (Se7en)               0.569      0.823

Item-Base (Peer)            Raw Score  Score
Seven                       5.000      1.000
Donnie Brasco               4.707      0.941
Babe                        4.688      0.938
Heat                        4.688      0.938
To Kill a Mockingbird       4.686      0.937
Jaws                        4.683      0.937
Monty Python, Holy Grail    4.670      0.934
Blade Runner                4.670      0.934
Get Shorty                  4.655      0.931

Top 10 Recommendations      Score
Seven (Se7en)               1.823
Blade Runner                1.760
Fargo                       1.000
Star Wars                   0.946
Donnie Brasco               0.941
Babe                        0.938
Heat                        0.938
To Kill a Mockingbird       0.937
Jaws                        0.937
Monty Python, Holy Grail    0.934
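The Top 10 figures are consistent with dividing each raw score by the highest raw score in its list and then adding the normalized scores of movies that appear in both lists (for example, Seven: 0.823 + 1.000 = 1.823). The sketch below reproduces that under this assumption, using a few of the rounded raw scores from the slide and treating “Seven” and “Seven (Se7en)” as the same title.

```python
def normalize(raw_scores):
    """Scale scores so the best item in the list gets 1.0."""
    top = max(raw_scores.values())
    return {title: score / top for title, score in raw_scores.items()}


def combine(*score_lists, top_n=10):
    """Sum normalized scores across lists and return the best titles."""
    totals = {}
    for scores in score_lists:
        for title, score in normalize(scores).items():
            totals[title] = totals.get(title, 0.0) + score
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(title, round(score, 3)) for title, score in ranked[:top_n]]


# A subset of the rounded raw scores shown on the slide.
item_similarity = {
    "Fargo": 0.691,
    "Star Wars": 0.653,
    "Blade Runner": 0.571,
    "Seven (Se7en)": 0.569,
}
peer_based = {
    "Seven (Se7en)": 5.000,
    "Donnie Brasco": 4.707,
    "Babe": 4.688,
    "Blade Runner": 4.670,
}

top = combine(item_similarity, peer_based)
for title, score in top:
    print(title, score)
```

Because the inputs here are already rounded, some single-list scores can differ from the slide in the third decimal place.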
From Good to Great Recommendations
• Note that the first 5 recommendations look pretty good
…but the 6th result would have been “Babe” the children's movie
• Tuning the algorithms might help: parameter changes, similarity measures.
• How else can we make it better?
1. Delivery filters
2. Introduce additional algorithms such as K-Means
OOPS!
Additional Algorithm – K-Means
We would use the major attributes of the Movie to create coordinate points.
• Categories
• Actors
• Director
• Synopsis Text
“These movies are similar based on their attributes”
Delivery Scoring and Filters
• One or more categories must match
• Only children's movies will be recommended for children's movies.

Movie                  Action  Adventure  Children's  Comedy  Crime  Drama  Film-Noir  Horror  Romance  Sci-Fi  Thriller
Twelve Monkeys         0       0          0           0       0      1      0          0       0        1       0
Babe                   0       0          1           1       0      1      0          0       0        0       0
Seven (Se7en)          0       0          0           0       1      1      0          0       0        0       1
Star Wars              1       1          0           0       0      0      0          0       1        1       0
Blade Runner           0       0          0           0       0      0      1          0       0        1       0
Fargo                  0       0          0           0       1      1      0          0       0        0       1
Willy Wonka            0       1          1           1       0      0      0          0       0        0       0
Monty Python           0       0          0           1       0      0      0          0       0        0       0
Jaws                   1       0          0           0       0      0      0          1       0        0       0
Heat                   1       0          0           0       1      0      0          0       0        0       1
Donnie Brasco          0       0          0           0       1      1      0          0       0        0       0
To Kill a Mockingbird  0       0          0           0       0      1      0          0       0        0       0

Apply assumptions to control the results of collaborative filtering
Similar logic could be applied to promote more favorable options
• New Releases
• Retail Case: Items that are on-sale, overstock
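A sketch of the two delivery filters, using genre sets read from the category table on this slide for a handful of movies. One assumption goes beyond the slide: the children's rule is applied in both directions, so children's titles are also withheld when the viewer is on a non-children's movie (which is what keeps “Babe” out of the “Twelve Monkeys” results).

```python
# Genres per movie, taken from the category table (subset of rows).
GENRES = {
    "Twelve Monkeys": {"Drama", "Sci-Fi"},
    "Babe": {"Children's", "Comedy", "Drama"},
    "Seven (Se7en)": {"Crime", "Drama", "Thriller"},
    "Blade Runner": {"Film-Noir", "Sci-Fi"},
    "Willy Wonka": {"Adventure", "Children's", "Comedy"},
    "Jaws": {"Action", "Horror"},
}


def passes_filters(viewing, candidate):
    """Apply the category-match rule and the children's rule."""
    current, other = GENRES[viewing], GENRES[candidate]
    shares_category = bool(current & other)  # one or more categories must match
    # Assumption: the children's flag must agree in both directions.
    same_audience = ("Children's" in current) == ("Children's" in other)
    return shares_category and same_audience


def filter_recommendations(viewing, candidates):
    """Keep only the candidates that pass both filters."""
    return [c for c in candidates if c != viewing and passes_filters(viewing, c)]


candidates = ["Seven (Se7en)", "Babe", "Blade Runner", "Willy Wonka", "Jaws"]
print(filter_recommendations("Twelve Monkeys", candidates))
print(filter_recommendations("Babe", ["Willy Wonka", "Seven (Se7en)"]))
```

Filters like these run at delivery time, after the algorithms have scored the candidates, so they can be changed without retraining anything.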
Integrating K-Means into the process
[Diagram: Collaborative Filter, Content Filter and K-Means (Similar) feed into Best Recommendations]
Movies recommended by more than 1 algorithm are the most highly rated
Sophisticated Recommendation Model
[Diagram: five inputs feed the Recommendation]
• Peer Based: What are people with similar characteristics buying?
• Item Clustering: What did people who ordered these items also order?
• Corporate Deals/Offers: What items are we promoting at time of sale?
• Customer Behavior: What items have you bought in the past?
• Market/Store: What items are being promoted by the Store or Market?
The solution allows balancing of algorithms to attain the most effective recommendation
Recommendation Algorithms Exercise
Give it a try!
Some Thoughts – Enable the Future
• Data Science requires the convergence of data quality, advanced math, data engineering, visualization and business smarts
• Make sure your data can be trusted and people can be held accountable for impact caused by low data quality.
• Good data scientists are rare: It will take a village to achieve all the tasks required for effective data science
  • Get good!
  • Be great!
  • Blaze new trails!
Data Science Training: https://exploredatascience.com/
Sentiment Analysis Exercise (time permitting)
Give it a try!
Thank You / Q&A
Joe Caserta
President, Caserta Concepts
joe@casertaconcepts.com
(914) 261-3648
@joe_Caserta

Contenu connexe

Tendances

Data science presentation
Data science presentationData science presentation
Data science presentationMSDEVMTL
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data scienceShilpaKrishna6
 
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...
What Is Data Science? Data Science Course - Data Science Tutorial For Beginne...Edureka!
 
Presentation on Big Data Analytics
Presentation on Big Data AnalyticsPresentation on Big Data Analytics
Presentation on Big Data AnalyticsS P Sajjan
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectbodaceacat
 
Training in Analytics and Data Science
Training in Analytics and Data ScienceTraining in Analytics and Data Science
Training in Analytics and Data ScienceAjay Ohri
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceSampath Kumar
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 
Session 10 handling bigger data
Session 10 handling bigger dataSession 10 handling bigger data
Session 10 handling bigger databodaceacat
 
Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Data Science Tutorial | Introduction To Data Science | Data Science Training ...
Data Science Tutorial | Introduction To Data Science | Data Science Training ...Edureka!
 
Lecture3 business intelligence
Lecture3 business intelligenceLecture3 business intelligence
Lecture3 business intelligencehktripathy
 
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...
Data Scientist vs Data Analyst vs Data Engineer - Role & Responsibility, Skil...Simplilearn
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesRukshan Batuwita
 

Introduction to Data Science

  • 1. @joe_Caserta #DataSummit https://github.com/Caserta-Concepts/ds-workshop Introduction to Data Science (by a non-data scientist). Joe Caserta, President, Caserta Concepts
  • 2. Caserta Timeline
      • Launched Big Data practice
      • Co-author, with Ralph Kimball, The Data Warehouse ETL Toolkit (Wiley)
      • Data Analysis, Data Warehousing and Business Intelligence since 1996
      • Began consulting in database programming and data modeling
      • 25+ years hands-on experience building database solutions
      • Founded Caserta Concepts in NYC
      • Web log analytics solution published in Intelligent Enterprise
      • Launched Data Science, Data Interaction and Cloud practices
      • Laser focus on extending Data Analytics with Big Data solutions
      • Dedicated to Data Governance Techniques on Big Data (Innovation)
      • Top 20 Big Data Consulting - CIO Review
      • Top 20 Most Powerful Big Data consulting firms
      • Launched Big Data Warehousing (BDW) Meetup NYC: 2,000+ Members
      • Awarded for getting data out of SAP for data analytics
      • Established best practices for big data ecosystem implementations
      Years marked on the timeline: 1986, 1996, 2001, 2004, 2009, 2012, 2013, 2014, 2015
  • 3. About Caserta Concepts
      • Technology services company with expertise in data analysis:
          • Big Data Solutions
          • Data Warehousing
          • Business Intelligence
      • Core focus in the following industries:
          • eCommerce / Retail / Marketing
          • Financial Services / Insurance
          • Healthcare / Ad Tech / Higher Ed
      • Established in 2001:
          • Increased growth year-over-year
          • Industry recognized work force
      • Strategy and Implementation
      • Data Science & Analytics
      • Data on the Cloud
      • Data Interaction & Visualization
  • 4. Agenda
      • Why we care about Big Data
      • Challenges of working with Big Data
      • Governing Big Data for Data Science
      • Introducing the Data Pyramid
      • Why Data Science is Cool?
      • What does a Data Scientist do?
      • Standards for Data Science
      • Business Objective
      • Data Discovery
      • Preparation
      • Models
      • Evaluation
      • Deployment
      • Q & A
      Hands-on Exercises And Breaks
  • 5. Today’s business environment requires Big Data
      [Architecture diagram. Labels: Enrollments; Claims; Finance; Others…; ETL; Traditional EDW; Traditional BI; Ad-Hoc/Canned Reporting; Big Data Lake; Horizontally Scalable Environment - Optimized for Analytics; Hadoop Distributed File System (HDFS) with nodes N1 to N5; Spark; MapReduce; Pig/Hive; NoSQL Databases; Big Data Analytics; Ad-Hoc Query; Canned Reporting; Data Science]
  • 6. The Challenges of Building a Data Lake
      • Data is coming in so fast, how do we monitor it?
      • Real real-time analytics
      • What does “complete” mean?
      • Dealing with sparse, incomplete, volatile, and highly manufactured data. How do you certify sentiment analysis?
      • Wider breadth of datasets and sources in scope requires larger data governance support
      • Data governance cannot start at the data warehouse
      • Data volume is higher, so the process must be more reliant on programmatic administration
      • Less people/process dependence
      Labels on the slide: Volume, Variety, Velocity, Veracity
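One way to make the question of what “complete” means concrete, and to lean on programmatic administration rather than manual review, is to profile each incoming batch automatically. The following is a minimal sketch, not something taken from the slides: the function names `completeness` and `failing_fields`, the sample claim records, and the 0.8 threshold are all hypothetical.

```python
from typing import Any, Dict, Iterable, List


def completeness(records: Iterable[Dict[str, Any]], fields: List[str]) -> Dict[str, float]:
    """Return, per field, the fraction of records holding a non-empty value."""
    counts = {f: 0 for f in fields}
    total = 0
    for rec in records:
        total += 1
        for f in fields:
            value = rec.get(f)  # a missing key counts as empty
            if value is not None and value != "":
                counts[f] += 1
    if total == 0:
        return {f: 0.0 for f in fields}
    return {f: counts[f] / total for f in fields}


def failing_fields(profile: Dict[str, float], threshold: float) -> List[str]:
    """List the fields whose completeness falls below the (assumed) threshold."""
    return sorted(f for f, frac in profile.items() if frac < threshold)


# Hypothetical batch of claim records with gaps of different kinds.
sample = [
    {"member_id": "A1", "claim_amount": 120.0, "diagnosis": "J10"},
    {"member_id": "A2", "claim_amount": None, "diagnosis": ""},
    {"member_id": "A3", "claim_amount": 75.5, "diagnosis": None},
    {"member_id": "A4", "claim_amount": 10.0},
]

profile = completeness(sample, ["member_id", "claim_amount", "diagnosis"])
print(profile)
print(failing_fields(profile, 0.8))
```

A check like this can run on every load, so that the definition of “complete” is an explicit, versioned threshold instead of an individual analyst's judgment.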
  • 7. What’s Old is New Again • Before Data Warehousing Governance: • Users trying to produce reports from raw source data • No Data Conformance • No Master Data Management • No Data Quality processes • No Trust: Two analysts were almost guaranteed to come up with two different sets of numbers! • Before Data Lake Governance: • We can put “anything” in Hadoop • We can analyze anything • We’re scientists, we don’t need IT, we make the rules • Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess • Rule #2: Information harvested from ungoverned systems will take us back to the old days: No Trust = Not Actionable
  • 8. Making it Right • The promise is an “agile” data culture where communities of users are encouraged to explore new datasets in new ways: • New tools • External data • Data blending • Decentralization • With all the V’s, data scientists, new tools, and new data, we must rely LESS on HUMANS • We need more systemic administration • We need systems and tools to help with big data governance • This space is EXTREMELY immature! • Steps towards Data Governance for the Data Lake: 1. Establish difference between traditional data and big data governance 2. Establish basic rules for where new data governance can be applied 3. Establish processes for graduating the products of data science to governance 4. Establish a set of tools to make governing Big Data feasible
  • 9. Data Governance for the Data Lake • BDG provides vision, oversight and accountability for leveraging corporate information assets to create competitive advantage, and accelerate the vision of integrated delivery. • Enterprise Data Council: Executive Oversight; Prioritizes work; Drives change; Accountable for results [Diagram labels: Organization; Process; Architecture; Communication; IFP Governance Administration; Compliance Reporting; Standards; Value Proposition; Risk/Reward; Information Accountabilities; Stewardship; Data Integrity Metrics; Control Mechanisms; Principles and Standards; Information Usability; Definitions; Value Creation; Governance Committees; Data Stewards; Project Teams; Acts on Requirements; Build Capabilities; Does the Work; Responsible for adherence]
  • 10. Components of Data Governance • Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc. • Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata • Privacy/Security: Identify and control sensitive data, regulatory compliance • Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify • Business Process Integration: Policies around data frequency, source availability, etc. • Master Data Management: Ensure consistent business critical data, i.e. Members, Providers, Agents, etc. • Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
  • 11. Components of Data Governance (Organization, Metadata, Privacy/Security, Data Quality and Monitoring, Business Process Integration, Master Data Management, Information Lifecycle Management), For Big Data: • Add Big Data to overall framework and assign responsibility • Add data scientists to the Stewardship program • Assign stewards to new data sets (Twitter, call center logs, etc.) • Graph databases are more flexible than relational • Lower latency service required • Distributed data quality and matching algorithms • Data Quality and Monitoring (probably home grown, Drools?) • Quality checks not only SQL: machine learning, Pig and MapReduce • Acting on large dataset quality checks may require distribution • Larger scale • New datatypes • Integrate with Hive Metastore, HCatalog, home grown tables • Secure and mask multiple data types (not just tabular) • Deletes are less common (unless there is a regulatory requirement) • Take advantage of compression and archiving (like AWS Glacier) • Data detection and masking on unstructured data upon ingest • Near-zero latency, DevOps, core component of business operations
  • 12. Data Lake Governance Realities • Full data governance can only be applied to “Structured” data • The data must have a known and well documented schema • This can include materialized endpoints such as files or tables OR projections such as a Hive table • Governed structured data must have: • A known schema with Metadata • A known and certified lineage • A monitored, quality-tested, managed process for ingestion and transformation • A governed usage • Data isn’t just for enterprise BI tools anymore • We talk about unstructured data in Hadoop, but more often it is semi-structured or structured with a definable schema. • Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
  • 13. The Data Scientists Can Help! • Data Science to Big Data Warehouse mapping • Full Data Governance Requirements • Provide full process lineage • Data certification process by data stewards and business owners • Ongoing Data Quality monitoring that includes Quality Checks • Provide requirements for Data Lake • Proper metadata established: • Catalog • Data Definitions • Lineage • Quality monitoring • Know and validate data completeness
  • 14. The Big Data Analytics Pyramid [Pyramid tiers, top to bottom] • Big Data Warehouse: Fully Data Governed (trusted). User community arbitrary queries and reporting • Data Science Workspace: Agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts • Data Lake – Integrated Sandbox: Data is ready to be turned into information: organized, well defined, complete • Landing Area – Source Data in “Full Fidelity”: Raw machine data collection, collect everything [Governance annotations on the lower tiers] • Metadata → Catalog • ILM → who has access, how long do we “manage it” • Data Quality and Monitoring → Monitoring of completeness of data • Hadoop has different governance demands at each tier. • Only top tier of the pyramid is fully governed. • We refer to this as the Trusted tier of the Big Data Warehouse.
  • 15. What does a Data Scientist Do, Anyway? • Searching for the data they need • Making sense of the data • Figuring out why the data looks the way it does and assessing its validity • Cleaning up all the garbage within the data so it represents true business • Combining events with Reference data to give it context • Correlating event data with other events • Finally, they write algorithms to perform mining, clustering and predictive analytics • Much of a Data Scientist’s time is spent on the steps above, NOT on writing really cool and sophisticated algorithms that impact the way the business runs.
  • 16. Why Data Science? [Chart, source: Gartner; axes: Data Analytics Sophistication and Business Value] • Descriptive Analytics: What happened? • Diagnostic Analytics: Why did it happen? • Predictive Analytics: What will happen? • Prescriptive Analytics: How can we make it happen?
  • 17. The Data Scientist Winning Trifecta • Modern Data Engineering/Data Preparation • Domain Knowledge/Business Expertise • Advanced Mathematics/Statistics
  • 23. Are there Standards? CRISP-DM: Cross Industry Standard Process for Data Mining 1. Business Understanding • Solve a single business problem 2. Data Understanding • Discovery • Data Munging • Cleansing Requirements 3. Data Preparation • ETL 4. Modeling • Evaluate various models • Iterative experimentation 5. Evaluation • Does the model achieve business objectives? 6. Deployment • PMML; application integration; data platform; Excel
  • 24. 1. Business Understanding In this initial phase of the project we will need to speak to humans. • It would be premature to jump into the data, or begin selection of the appropriate model(s) or algorithm • Understand the project objective • Review the business requirements • The output of this phase will be conversion of business requirements into a preliminary technical design (decision model) and plan. Since this is an iterative process, this phase will be revisited throughout the entire process.
  • 25. 2. Data Understanding • Data Discovery → understand where the data you need comes from • Data Profiling → interrogate the data at the entity level, understand key entities and fields that are relevant to the analysis. • Cleansing Requirements → understand data quality, data density, skew, etc. • Data Munging → collocate, blend and analyze data for early insights! Valuable information can be obtained from simple group-by and aggregate queries, and even more with SQL Jujitsu! Significant iteration between Business Understanding and Data Understanding phases.
  • 26. Data Science Data Quality Priorities • Be Corrective • Be Fast • Be Transparent • Be Thorough
  • 27. Data Science Data Quality Priorities [Chart axes: Data Quality (Raw to Refined) and Speed to Value (Fast to Slow)]
  • 28. 3. Data Preparation ETL (Extract Transform Load) 90+% of a Data Scientist’s time goes into Data Preparation! • Select required entities/fields • Address Data Quality issues: missing or incomplete values, whitespace, bad data-points • Join/Enrich disparate datasets • Transform/Aggregate data for intended use: • Sample • Aggregate • Pivot
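The preparation steps on this slide can be sketched in plain Python. The rows, the field names (`customer_id`, `amount`, `note`), the `customers` lookup and the policy of dropping rows with an unusable amount are all illustrative assumptions, not part of the workshop material.

```python
from collections import defaultdict

# Hypothetical raw rows; field names and values are assumptions for illustration.
orders = [
    {"customer_id": 1, "amount": " 20.0 ", "note": "gift"},
    {"customer_id": 2, "amount": "", "note": ""},        # missing value
    {"customer_id": 1, "amount": "5.5", "note": "promo"},
    {"customer_id": 3, "amount": "abc", "note": ""},     # bad data point
]
customers = {1: "East", 2: "West", 3: "East"}  # reference data to join against


def clean_amount(raw):
    """Trim whitespace and parse a number; return None for missing or bad values."""
    text = raw.strip()
    if not text:
        return None
    try:
        return float(text)
    except ValueError:
        return None


def prepare(rows, region_by_customer):
    """Select fields, drop unusable rows, enrich with region, aggregate by region."""
    totals = defaultdict(float)
    for row in rows:
        amount = clean_amount(row["amount"])   # address data quality issues
        if amount is None:
            continue                           # one possible policy: drop the row
        region = region_by_customer.get(row["customer_id"], "Unknown")  # join/enrich
        totals[region] += amount               # aggregate for the intended use
    return dict(totals)


region_totals = prepare(orders, customers)
```

Dropping defective rows is only one option; in a governed pipeline they would more likely be routed to a data quality subsystem rather than silently discarded.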
  • 29. Data Quality and Monitoring • BUILD a robust data quality subsystem: • Metadata and error event facts • Orchestration • Based on Data Warehouse ETL Toolkit • Each error instance of each data quality check is captured • Implemented as sub-system after ingestion • Each fact stores unique identifier of the defective source row
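A minimal sketch of the error event idea, assuming a very simple in-memory layout: every failure of every check becomes one fact that carries the identifier of the defective source row. The `ErrorEvent` fields, the check names and the sample rows are hypothetical; the slide does not prescribe a schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ErrorEvent:
    """One fact row: a single failure of a single data quality check."""
    check_name: str
    source_row_id: str      # unique identifier of the defective source row
    detected_at: datetime


def not_null(field):
    """Build a check that passes only when the field is present and non-empty."""
    return lambda row: row.get(field) not in (None, "")


def run_checks(rows, checks, id_field="id"):
    """Apply every check to every row and capture each failure as an ErrorEvent."""
    events = []
    for row in rows:
        for name, passes in checks.items():
            if not passes(row):
                events.append(
                    ErrorEvent(name, str(row[id_field]), datetime.now(timezone.utc))
                )
    return events


# Hypothetical ingested rows and checks, for illustration only.
rows = [
    {"id": "r1", "member": "A17", "amount": 10.0},
    {"id": "r2", "member": "", "amount": -3.0},
]
checks = {
    "member_not_null": not_null("member"),
    "amount_non_negative": lambda row: row["amount"] >= 0,
}
error_events = run_checks(rows, checks)
```

Because the facts are captured after ingestion rather than used to reject rows, the same events can feed monitoring reports and certification decisions later.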
  • 31. 4. Modeling Do you love algebra & stats? • Evaluate various models/algorithms • Classification • Clustering • Regression • Many others… • Tune parameters • Iterative experimentation • Different models may require different data preparation techniques (e.g. Sparse Vector Format) • Additionally we may discover the need for additional data points, or uncover additional data quality issues!
  • 32. Machine Learning The goal of machine learning is to get software to make decisions and learn from data without being programmed explicitly to do so. Machine Learning algorithms are broadly broken out into two groups: • Supervised learning → inferring functions based on labeled training data • Unsupervised learning → finding hidden structure/patterns within data, no training data is supplied. We will review some popular, easy to understand machine learning algorithms
  • 34. Supervised Learning • Training set (Name, Weight, Color, Cat_or_Dog): Susie, 9lbs, Orange, Cat; Fido, 25lbs, Brown, Dog; Sparkles, 6lbs, Black, Cat; Fido, 9lbs, Black, Dog • To predict (Name, Weight, Color, Cat_or_Dog): Misty, 5lbs, Orange, ? • The training set is used to generate a function, so we can predict if we have a cat or dog!
  • 35. Category or Values? There are several classes of algorithms depending on whether the prediction is a category (like cat or dog) or a value, like the value of a home. Classification algorithms are generally well suited for categorization, while algorithms like Regression and Decision Trees are well suited for predicting values.
  • 36. Regression • Understanding the relationship between a given set of dependent variables and independent variables • Typically regression is used to predict the output of a dependent variable based on variations in independent variables • Very popular for prediction and forecasting [Figure: Linear Regression]
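As a small illustration of the regression idea, here is ordinary least squares with a single independent variable, written without any libraries. The home sizes and prices are made-up numbers chosen only to show the mechanics.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one independent variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-style and variance-style sums; the 1/n factors cancel in the ratio.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: home size (thousands of sq ft) and price (thousands of dollars).
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [150.0, 250.0, 350.0, 450.0]

slope, intercept = fit_line(sizes, prices)
predicted_price = intercept + slope * 5.0   # predict the dependent variable for a new size
```

With real data the points will not sit exactly on a line, and the fitted slope then describes the average change in the dependent variable per unit change in the independent one.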
  • 37. Decision Trees • A method for predicting outcomes based on the features of data • Model is represented as an easy-to-understand tree structure of if-else statements [Tree diagram: Weight > 10lbs? yes → dog; no → color = orange? yes → cat; no → name = fido? yes → dog; no → cat]
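The tree on this slide translates directly into if-else statements. The branch layout below is one reading of the flattened diagram (weight first, then color, then name); it is an assumption, although it does classify all four rows of the cat/dog training set from the Supervised Learning slide correctly.

```python
def classify(name, weight_lbs, color):
    """Walk the if-else tree and return "cat" or "dog"."""
    if weight_lbs > 10:
        return "dog"
    if color.lower() == "orange":
        return "cat"
    if name.lower() == "fido":
        return "dog"
    return "cat"


# Training rows from the Supervised Learning slide: (name, weight in lbs, color, label).
training_set = [
    ("Susie", 9, "Orange", "cat"),
    ("Fido", 25, "Brown", "dog"),
    ("Sparkles", 6, "Black", "cat"),
    ("Fido", 9, "Black", "dog"),
]

# The unlabeled row from the same slide.
misty_prediction = classify("Misty", 5, "Orange")
```

In practice the tree is learned from the training data rather than written by hand; the hand-written version only shows why the resulting model is easy to read.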
  • 38. Unsupervised K-Means Clustering of items into logical groups based on natural patterns in data • Treats items as coordinates • Places a number of random “centroids” and assigns the nearest items • Moves the centroids around based on average location • Process repeats until the assignments stop changing Uses: • Cluster Analysis • Classification • Content Filtering
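The four K-Means steps on this slide can be sketched as follows. Choosing the starting centroids by sampling existing items, the `seed` parameter and the sample points are assumptions made to keep the example small and repeatable.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Average location of a non-empty list of coordinate tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed=0, max_iterations=100):
    """Return (assignments, centroids) once the assignments stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # place k random starting centroids
    assignments = None
    for _ in range(max_iterations):
        # Assign every item to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:  # nothing moved: we are done
            break
        assignments = new_assignments
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:                     # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return assignments, centroids


# Hypothetical items: two visibly separate groups of 2-D coordinates.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
assignments, centroids = k_means(points, k=2)
```

The cluster numbers themselves are arbitrary labels; only the grouping carries meaning, and different starting centroids can give different groupings on less cleanly separated data.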
  • 39. Collaborative Filtering • A hybrid of Supervised and Unsupervised Learning (Model Based vs. Memory Based) • Leveraging collaboration between multiple agents to filter, project, or detect patterns • Popular in recommender systems for projecting the “taste” of specific individuals for items on which they have not yet expressed a preference.
  • 40. Item-based • A popular and simple memory-based collaborative filtering algorithm • Projects preference based on item similarity (based on ratings):
        for every item i that u has no preference for yet
          for every item j that u has a preference for
            compute a similarity s between i and j
            add u's preference for j, weighted by s, to a running average
        return the top items, ranked by weighted average
    • First a matrix of Item to Item similarity is calculated based on user rating • Then recommendations are created by producing a weighted sum of top items, based on the user’s previously rated items
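The item-based pseudocode on this slide can be turned into a short runnable sketch. The slide does not say which similarity measure is used, so cosine similarity over the users who rated both items is an assumption here, as are the tiny `ratings` dictionary and the user names.

```python
from math import sqrt


def cosine_similarity(ratings, item_a, item_b):
    """Cosine similarity of two items over the users who rated both (0.0 if none did)."""
    common = [u for u, prefs in ratings.items() if item_a in prefs and item_b in prefs]
    if not common:
        return 0.0
    dot = sum(ratings[u][item_a] * ratings[u][item_b] for u in common)
    norm_a = sqrt(sum(ratings[u][item_a] ** 2 for u in common))
    norm_b = sqrt(sum(ratings[u][item_b] ** 2 for u in common))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=3):
    """Rank the items the user has not rated by similarity-weighted average rating."""
    prefs = ratings[user]
    all_items = {item for p in ratings.values() for item in p}
    scores = {}
    for i in all_items - set(prefs):          # items u has no preference for yet
        weighted_sum = 0.0
        similarity_sum = 0.0
        for j, pref_j in prefs.items():       # items u has a preference for
            s = cosine_similarity(ratings, i, j)
            weighted_sum += s * pref_j        # u's preference for j, weighted by s
            similarity_sum += s
        if similarity_sum > 0:
            scores[i] = weighted_sum / similarity_sum
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical ratings on a 1 to 5 scale: user -> {item: rating}.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2, "D": 5},
    "cat": {"A": 1, "C": 5, "D": 2},
    "dan": {"A": 5, "B": 3},
}
recommendations = recommend(ratings, "dan")
```

A production version would precompute the item-to-item similarity matrix once, as the slide describes, instead of recomputing similarities for every request.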
  • 41. 5. Evaluation What problem are we trying to solve again? • Our final solution needs to be evaluated against original Business Understanding • Did we meet our objectives? • Did we address all issues?
  • 42. 6. Deployment Engineering Time! • It’s time for the work products of data science to “graduate” from “new insights” to real applications. • Processes must be hardened, repeatable, and generally perform well too! • Data Governance applied • PMML (Predictive Model Markup Language): XML-based interchange format
  • 44. Project Objective • Create a functional recommendation engine to surface relevant product recommendations to customers. • Improve Customer Experience • Increase Customer Retention • Increase Customer Purchase Activity • Accurately suggest relevant products to customers based on their peer behavior.
  • 45. Recommendations • Your customers expect them • Good recommendations make life easier • Help them find information, products, and services they might not have thought of • What makes a good recommendation? • Relevant but not obvious • Sense of “surprise” [Illustration labels: 23” LED TV, 24” LED TV, 25” LED TV; 23” LED TV SOLD!!; Blu-Ray, Home Theater, HDMI Cables]
  • 46. Where do we use recommendations? • Recommendations can be found in a wide variety of industries and applications: • Travel • Financial Services • Music/Online radio • TV and Video • Online Publications • Retail • ...and countless others. Our Example: Movies
  • 47. The Goal of the Recommender • Create a powerful, scalable recommendation engine with minimal development • Make recommendations to users as they are browsing movie titles, instantaneously • Recommendation must have context to the movie they are currently viewing. [Image caption: OOPS! – too much surprise!]
  • 48. How do we do it? We leverage two algorithms: • Content-Based Filtering – how similar is this particular movie to other movies based on usage. • Collaborative Filtering – predict an individual’s preference based on their peers’ ratings. Spark MLlib implements a collaborative filtering algorithm called Alternating Least Squares (ALS) • Both algorithms only require a simple dataset of 3 fields: “User ID”, “Item ID”, “Rating”
  • 49. Content-Based Filtering “People who liked this movie liked these as well” • Content Based Filter builds a matrix of items to other items and calculates similarity (based on user rating) • The most similar items are then output as a list: • Item ID, Similar Item ID, Similarity Score • Items with the highest score are most similar • In this example users who liked “Twelve Monkeys” (7) also like “Fargo” (100) • Sample output (Item ID, Similar Item ID, Similarity Score): 7, 100, 0.690951001800917; 7, 50, 0.653299445638532; 7, 117, 0.643701303640083
  • 50. Collaborative Filtering “People with similar taste to you liked these movies” • Collaborative filtering applies weights based on “peer” user preference. • Essentially it determines the best movie critics for you to follow • The items with the highest recommendation score are then output as tuples • User ID [Item ID1:Score,…., Item IDn:Score] • Items with the highest recommendation score are the most relevant to this user • For user “Johny Sisklebert” (572), the two most highly recommended movies are “Seven” and “Donnie Brasco” • Sample output:
      572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
      573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
      574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
  • 51. Recommendation Store • Serving recommendations needs to be instantaneous • The core of this solution is two reference tables: • Rec_Item_Similarity (Item_ID, Similar_Item, Similarity_Score) • Rec_User_Item_Base (User_ID, Item_ID, Recommendation_Score) • When called to make recommendations we query our store: • Rec_Item_Similarity based on the Item_ID they are viewing • Rec_User_Item_Base based on their User_ID
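A minimal sketch of the two reference tables and their lookups, using an in-memory SQLite database purely for illustration. The slides do not name the storage engine, so SQLite, the index names and the helper functions are assumptions; the table and column names and the sample rows come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Rec_Item_Similarity (
        Item_ID INTEGER, Similar_Item INTEGER, Similarity_Score REAL);
    CREATE TABLE Rec_User_Item_Base (
        User_ID INTEGER, Item_ID INTEGER, Recommendation_Score REAL);
    -- Indexes on the lookup keys keep serving fast (index names are hypothetical).
    CREATE INDEX idx_rec_item ON Rec_Item_Similarity (Item_ID);
    CREATE INDEX idx_rec_user ON Rec_User_Item_Base (User_ID);
    """
)

# Sample rows taken from the filtering output shown on the earlier slides.
conn.executemany(
    "INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)",
    [(7, 100, 0.690951001800917), (7, 50, 0.653299445638532), (7, 117, 0.643701303640083)],
)
conn.executemany(
    "INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)",
    [(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)


def similar_items(conn, item_id, limit=10):
    """Items most similar to the one being viewed, best first."""
    return conn.execute(
        "SELECT Similar_Item, Similarity_Score FROM Rec_Item_Similarity "
        "WHERE Item_ID = ? ORDER BY Similarity_Score DESC LIMIT ?",
        (item_id, limit),
    ).fetchall()


def user_items(conn, user_id, limit=10):
    """Items recommended for this user by their peers, best first."""
    return conn.execute(
        "SELECT Item_ID, Recommendation_Score FROM Rec_User_Item_Base "
        "WHERE User_ID = ? ORDER BY Recommendation_Score DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()


by_item = similar_items(conn, 7)     # the user is viewing item 7
by_user = user_items(conn, 572)      # the user is 572
```

Precomputing both tables offline is what makes the serving path a pair of keyed lookups rather than a model evaluation.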
  • 52. Delivering Recommendations • So if Johny is viewing “12 Monkeys” we query our recommendation store and present the results
      Item-Based: Item Similarity (Item, Raw Score, Score)
        Fargo, 0.691, 1.000
        Star Wars, 0.653, 0.946
        Rock, The, 0.644, 0.932
        Pulp Fiction, 0.628, 0.909
        Return of the Jedi, 0.627, 0.908
        Independence Day, 0.618, 0.894
        Willy Wonka, 0.603, 0.872
        Mission: Impossible, 0.597, 0.864
        Silence of the Lambs, The, 0.596, 0.863
        Star Trek: First Contact, 0.594, 0.859
        Raiders of the Lost Ark, 0.584, 0.845
        Terminator, The, 0.574, 0.831
        Blade Runner, 0.571, 0.826
        Usual Suspects, The, 0.569, 0.823
        Seven (Se7en), 0.569, 0.823
      Peers like these Movies: Item-Base (Peer) (Item, Raw Score, Score)
        Seven, 5.000, 1.000
        Donnie Brasco, 4.707, 0.941
        Babe, 4.688, 0.938
        Heat, 4.688, 0.938
        To Kill a Mockingbird, 4.686, 0.937
        Jaws, 4.683, 0.937
        Monty Python, Holy Grail, 4.670, 0.934
        Blade Runner, 4.670, 0.934
        Get Shorty, 4.655, 0.931
      Best Recommendations: Top 10 Recommendations (Item, Score)
        Seven (Se7en), 1.823
        Blade Runner, 1.760
        Fargo, 1.000
        Star Wars, 0.946
        Donnie Brasco, 0.941
        Babe, 0.938
        Heat, 0.938
        To Kill a Mockingbird, 0.937
        Jaws, 0.937
        Monty Python, Holy Grail, 0.934
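The numbers on this slide are consistent with a simple blending rule: divide each raw score by the highest raw score in its list, then add the two normalized scores for movies that appear in both lists (Seven: 0.823 + 1.000 = 1.823). The sketch below reproduces that arithmetic; the rule itself is inferred from the figures rather than stated in the deck, and the function names are hypothetical.

```python
def normalize(raw_scores):
    """Scale scores so the best item in the list gets 1.0."""
    top = max(raw_scores.values())
    return {item: score / top for item, score in raw_scores.items()}


def blend(item_similarity_raw, peer_raw, top_n=10):
    """Sum the normalized scores per movie and return the best top_n, highest first."""
    combined = {}
    for scores in (normalize(item_similarity_raw), normalize(peer_raw)):
        for item, score in scores.items():
            combined[item] = combined.get(item, 0.0) + score
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Raw scores from the slide. "Seven" is keyed as "Seven (Se7en)" in both lists
# so the two entries are treated as the same movie.
item_similarity_raw = {
    "Fargo": 0.691, "Star Wars": 0.653, "Rock, The": 0.644, "Pulp Fiction": 0.628,
    "Return of the Jedi": 0.627, "Independence Day": 0.618, "Willy Wonka": 0.603,
    "Mission: Impossible": 0.597, "Silence of the Lambs, The": 0.596,
    "Star Trek: First Contact": 0.594, "Raiders of the Lost Ark": 0.584,
    "Terminator, The": 0.574, "Blade Runner": 0.571, "Usual Suspects, The": 0.569,
    "Seven (Se7en)": 0.569,
}
peer_raw = {
    "Seven (Se7en)": 5.000, "Donnie Brasco": 4.707, "Babe": 4.688, "Heat": 4.688,
    "To Kill a Mockingbird": 4.686, "Jaws": 4.683, "Monty Python, Holy Grail": 4.670,
    "Blade Runner": 4.670, "Get Shorty": 4.655,
}

top_10 = blend(item_similarity_raw, peer_raw)
```

Summing rewards movies that both algorithms agree on, which is why Seven and Blade Runner rise to the top; weighting the two lists differently would be a natural tuning knob.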
  • 53. From Good to Great Recommendations • Note that the first 5 recommendations look pretty good …but the 6th result would have been “Babe”, the children's movie [Image caption: OOPS!] • Tuning the algorithms might help: parameter changes, similarity measures. • How else can we make it better? 1. Delivery filters 2. Introduce additional algorithms such as K-Means
  • 54. Additional Algorithm – K-Means We would use the major attributes of the Movie to create coordinate points. • Categories • Actors • Director • Synopsis Text “These movies are similar based on their attributes”
  • 55. Delivery Scoring and Filters • Apply assumptions to control the results of collaborative filtering • One or more categories must match • Only children's movies will be recommended for children's movies. • Genre flags per movie, in column order (Action, Adventure, Children's, Comedy, Crime, Drama, Film-Noir, Horror, Romance, Sci-Fi, Thriller):
      Twelve Monkeys: 0 0 0 0 0 1 0 0 0 1 0
      Babe: 0 0 1 1 0 1 0 0 0 0 0
      Seven (Se7en): 0 0 0 0 1 1 0 0 0 0 1
      Star Wars: 1 1 0 0 0 0 0 0 1 1 0
      Blade Runner: 0 0 0 0 0 0 1 0 0 1 0
      Fargo: 0 0 0 0 1 1 0 0 0 0 1
      Willy Wonka: 0 1 1 1 0 0 0 0 0 0 0
      Monty Python: 0 0 0 1 0 0 0 0 0 0 0
      Jaws: 1 0 0 0 0 0 0 1 0 0 0
      Heat: 1 0 0 0 1 0 0 0 0 0 1
      Donnie Brasco: 0 0 0 0 1 1 0 0 0 0 0
      To Kill a Mockingbird: 0 0 0 0 0 1 0 0 0 0 0
    • Similar logic could be applied to promote more favorable options: • New Releases • Retail Case: Items that are on-sale, overstock
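The two delivery rules on this slide can be sketched as a filter over the genre flags. One interpretation is assumed here and should be read as such: the children's rule is applied symmetrically, so a children's movie is never offered while viewing a non-children's movie and vice versa. The function names are hypothetical; the genre flags are copied from the slide.

```python
GENRES = ["Action", "Adventure", "Children's", "Comedy", "Crime", "Drama",
          "Film-Noir", "Horror", "Romance", "Sci-Fi", "Thriller"]

# Genre flags from the slide, in the column order of GENRES.
FLAGS = {
    "Twelve Monkeys":        [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    "Babe":                  [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0],
    "Seven (Se7en)":         [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "Star Wars":             [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    "Blade Runner":          [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
    "Fargo":                 [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
    "Willy Wonka":           [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "Monty Python":          [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    "Jaws":                  [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    "Heat":                  [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    "Donnie Brasco":         [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    "To Kill a Mockingbird": [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
}


def genres_of(title):
    """Set of genre names whose flag is 1 for this title."""
    return {genre for genre, flag in zip(GENRES, FLAGS[title]) if flag}


def passes_filter(viewing, candidate):
    """Apply both delivery rules to one candidate recommendation."""
    viewing_genres = genres_of(viewing)
    candidate_genres = genres_of(candidate)
    if not viewing_genres & candidate_genres:      # one or more categories must match
        return False
    # Assumed symmetric reading of the children's rule.
    return ("Children's" in viewing_genres) == ("Children's" in candidate_genres)


def filter_recommendations(viewing, ranked_candidates):
    """Keep the ranking order, dropping the viewed title and anything that fails a rule."""
    return [c for c in ranked_candidates if c != viewing and passes_filter(viewing, c)]


ranked = ["Seven (Se7en)", "Blade Runner", "Fargo", "Star Wars", "Donnie Brasco",
          "Babe", "Heat", "To Kill a Mockingbird", "Jaws", "Monty Python"]
delivered = filter_recommendations("Twelve Monkeys", ranked)
```

Under these rules "Babe" is removed because of the children's flag, while "Heat", "Jaws" and "Monty Python" drop out because they share no genre with "Twelve Monkeys".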
  • 56. Integrating K-Means into the process [Diagram: Collaborative Filter, K-Means: Similar, and Content Filter feed into Best Recommendations] Movies recommended by more than 1 algorithm are the most highly rated
  • 57. Sophisticated Recommendation Model [Diagram: inputs Peer Based, Item Clustering, Corporate Deals/Offers, Customer Behavior and Market/Store feed into the Recommendation] • What items are we promoting at time of sale? • What items are being promoted by the Store or Market? • What are people with similar characteristics buying? • What items have you bought in the past? • What did people who ordered these items also order? The solution allows balancing of algorithms to attain the most effective recommendation
  • 59. Some Thoughts – Enable the Future • Data Science requires the convergence of data quality, advanced math, data engineering and visualization and business smarts • Make sure your data can be trusted and people can be held accountable for impact caused by low data quality. • Good data scientists are rare: It will take a village to achieve all the tasks required for effective data science • Get good! • Be great! • Blaze new trails! Data Science Training: https://exploredatascience.com/
  • 61. Thank You / Q&A Joe Caserta President, Caserta Concepts joe@casertaconcepts.com (914) 261-3648 @joe_Caserta

Editor's notes

  1. We focused our attention on building a single version of the truth. We mainly applied data governance on the EDW itself and a few primary supporting systems, like MDM. We had a fairly restrictive set of tools for using the EDW data → Enterprise BI tools → It was easier to GOVERN how the data would be used.
  2. Reports → correlations → predictions → recommendations
  3. Data science is not about Hadoop, but it is about modern data engineering. Think polyglot persistence: the right tool for the job. Visualization can be Tableau, Excel, ggplot2 or d3.js. Or anything.
  4. www.extremepresentation.com
  5. Exploration tools: Trifacta, Paxata, Python, Pig, Hive, Waterline, HCatalog, Hive Metastore, Solr
  6. Supervised learning: finds patterns over time and predicts what might happen next. Unsupervised learning: organizes, groups, classifies (clusters), categorizes data
  7. Paco Nathan made one of these, too.
  8. One of the most respected data scientists I know says 90% of her ML work uses regression analysis. Circuit board analogy: all of the circuit boards have their switches flipped in the same direction, and then you single out the single characteristic they don’t share. This is how to isolate the true impact of that single switch on the sprawling circuit board. May find Muslims don’t shop on Friday afternoons or females with higher education shop more in the morning than any other
  9. When the outcome is a real number then it is a regression tree
  10. K-means is unsupervised learning. K-nearest is supervised learning and needs history.
  11. Memory: Uses rating data to compute the similarity between users or items Model: Based on training data
  12. Challenges: sparse data affects performance of recommendation (performance in ML means how good it is, not how fast it is). Ratings can be crap, biased. Limited history can skew recommendation; long history can mean more sales = higher score (rich get richer).
  13. Cascading, Zementis: Meetup on June 3
  14. Cloudera, Talend, Datameer