FEATURE ENGINEERING
FOR DIVERSE DATA TYPES
Alice Zheng
October 10, 2016
Seattle PyLadies Meetup
1
2
MY JOURNEY SO FAR
Shortage of expertise and good tools in the market.
Applied machine learning/data science
Build ML tools
Write a book
3
MACHINE LEARNING IS USEFUL!
Model data.
Make predictions.
Build intelligent applications.
Play chess and go!
4
THE MACHINE LEARNING PIPELINE
Raw data: "It is a puppy and it is extremely cute."
Raw data → Features → Models → Predictions
Deploy in production
Models
6
A SIMPLE MODEL
X  Y  X and Y
0  0  0
0  1  0
1  0  0
1  1  1
f(x, y) = 0.5 x + 0.5 y - 1
g(x, y) = 1 if f(x, y) > 0
g(x, y) = 0 if f(x, y) <= 0
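A minimal Python sketch of the model on this slide, evaluated on the four rows of the truth table. The names `f` and `g` follow the slide; `f_shifted` and its offset of 0.75 are assumptions added for illustration. With the constants as transcribed, f(1, 1) = 0, so the threshold returns 0 for every row; an offset strictly between 0.5 and 1 is one way to make the thresholded output match "X and Y".

```python
def f(x, y):
    # Linear score exactly as written on the slide.
    return 0.5 * x + 0.5 * y - 1

def f_shifted(x, y, offset=0.75):
    # Hypothetical variant: same weights, assumed offset of 0.75.
    return 0.5 * x + 0.5 * y - offset

def g(score):
    # Threshold: 1 if the score is strictly positive, else 0.
    return 1 if score > 0 else 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
as_written = [g(f(x, y)) for x, y in inputs]       # slide constants
shifted = [g(f_shifted(x, y)) for x, y in inputs]  # assumed offset
logical_and = [x & y for x, y in inputs]           # target truth table
```

Comparing `as_written` and `shifted` against `logical_and` shows how sensitive a thresholded linear model is to its offset term.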
7
VISUALIZING A MODEL
[Figure: plot of g(x, y) over axes X and Y, with ticks at 0 and 1]
8
FROM SIMPLE TO COMPLEX
Inputs: X1, X2, X3, …, Xn
First layer: r1(X1, X2), r2(X2, X3), …, rm(X1, Xn)
Second layer: s1(r1, r2), s2(r1, r3), …, sm(rm-1, rm)
Use more complicated functions
or
Stack layers of simple functions
(e.g., deep neural nets)
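One way to see why stacking helps, sketched in Python. XOR cannot be produced by a single thresholded linear function of its two inputs, but two layers of such functions can produce it. The weights and the helper name `unit` are illustrative assumptions, not taken from the slides; `r1`, `r2`, and `s1` only borrow the slide's naming.

```python
def unit(weights, bias):
    # Build a thresholded linear function: 1 if w.x + bias > 0, else 0.
    def apply(*inputs):
        score = sum(w * v for w, v in zip(weights, inputs)) + bias
        return 1 if score > 0 else 0
    return apply

# First layer: two simple functions of the raw inputs.
r1 = unit([1, 1], -0.5)    # fires when x OR y is 1
r2 = unit([1, 1], -1.5)    # fires when x AND y are both 1

# Second layer: a simple function of the first layer's outputs.
s1 = unit([1, -1], -0.5)   # fires when r1 is 1 and r2 is 0

def xor_model(x, y):
    return s1(r1(x, y), r2(x, y))

outputs = [xor_model(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Each unit here is no more complicated than the single-layer model; only the composition changes.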
9
BETWEEN RAW DATA AND MODELS
• Mathematical models take numeric input
• Raw data are not numeric (or not the right kind of numeric)
• Featurization: the step in-between
• Feature space: multi-dimensional numeric space where modeling happens
Feature Generation
Feature: An individual measurable property
of a phenomenon being observed.
⎯ Christopher Bishop,
“Pattern Recognition and Machine Learning”
TEXT
12
TURNING TEXT INTO FEATURES
Raw text:
It is a puppy and it is extremely cute.
What are the important measures? Keywords? Verb tense? Subject, object?
Bag-of-words feature vector:
Word       Count
it         2
is         2
puppy      1
and        1
cat        0
aardvark   0
cute       1
extremely  1
…          …
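The bag-of-words vector for this sentence can be reproduced with a short Python sketch. The tokenizer (lowercase, alphabetic runs only) and the function name `bag_of_words` are assumptions made for illustration; the vocabulary lists only the words shown on the slide.

```python
import re
from collections import Counter

def bag_of_words(text, vocabulary):
    # Lowercase, split into alphabetic tokens, and count each token.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # One entry per vocabulary word; absent words get 0.
    return [counts[word] for word in vocabulary]

vocabulary = ["it", "is", "puppy", "and", "cat", "aardvark", "cute", "extremely"]
vector = bag_of_words("It is a puppy and it is extremely cute.", vocabulary)
```

Word order is discarded: any reordering of the same words maps to the same vector.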
13
VISUALIZING BAG-OF-WORDS
[Figure: "It is a puppy and it is extremely cute" plotted at puppy = 1, cute = 1 on axes "puppy" and "cute"]
14
CLASSIFYING BAG-OF-WORDS
[Figure: sentences plotted on axes "puppy", "cat", and "have", with a decision surface drawn through the space]
Sentences shown:
I have a puppy
I have a cat
I have a kitten
I have a dog and I have a pen
Feature Cleaning and Transformation
16
AUTO-GENERATED FEATURES
ARE NOISY
Rank Word Doc Count Rank Word Doc Count
1 the 1,416,058 11 was 929,703
2 and 1,381,324 12 this 844,824
3 a 1,263,126 13 but 822,313
4 i 1,230,214 14 my 786,595
5 to 1,196,238 15 that 777,045
6 it 1,027,835 16 with 775,044
7 of 1,025,638 17 on 735,419
8 for 993,430 18 they 720,994
9 is 988,547 19 you 701,015
10 in 961,518 20 have 692,749
Most popular words in Yelp reviews dataset (~ 6M reviews).
17
AUTO-GENERATED FEATURES
ARE NOISY
Rank Word Doc Count Rank Word Doc Count
357,480 cmtk8xyqg 1 357,470 attractif 1
357,479 tangified 1 357,469 chappagetti 1
357,478 laaaaaaasts 1 357,468 herdy 1
357,477 bailouts 1 357,467 csmpus 1
357,476 feautred 1 357,466 costoso 1
357,475 résine 1 357,465 freebased 1
357,474 chilyl 1 357,464 tikme 1
357,473 cariottis 1 357,463 traditionresort 1
357,472 enfeebled 1 357,462 jallisco 1
357,471 sparklely 1 357,461 zoawan 1
Least popular words in Yelp reviews dataset (~ 6M reviews).
18
FEATURE CLEANING
• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords
a b c d e f g h i
able be came definitely each far get had ie
about became can described edu few gets happens if
above because cannot despite eg fifth getting hardly ignored
according become cant did eight first given has immediately
accordingly becomes cause different either five gives have in
across becoming causes do else followed go having inasmuch
… … … … … … … … …
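A minimal sketch of stopword filtering in Python. The seven-word `STOPWORDS` set is a stand-in chosen for this example, not the list excerpted above, and `remove_stopwords` is a hypothetical helper name.

```python
# A tiny illustrative stopword set; real lists hold hundreds of entries.
STOPWORDS = {"a", "and", "i", "is", "it", "have", "the"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only tokens that are not on the blacklist.
    return [t for t in tokens if t not in stopwords]

tokens = "it is a puppy and it is extremely cute".split()
kept = remove_stopwords(tokens)
```

Because the list is fixed in advance, the filter needs no training data, but it also cannot adapt to a new corpus.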
19
FEATURE CLEANING
• Frequency-based pruning
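Frequency-based pruning can be sketched as a filter on document frequency. The thresholds `min_docs` and `max_fraction`, the function name, and the four-document toy corpus (the last sentence is made up) are all assumptions for illustration; suitable cutoffs depend on the data.

```python
from collections import Counter

def prune_by_doc_frequency(docs, min_docs=2, max_fraction=0.5):
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    limit = max_fraction * len(docs)
    # Keep words that are neither too rare nor too common.
    return {w for w, n in doc_freq.items() if min_docs <= n <= limit}

docs = [
    "i have a puppy",
    "i have a cat",
    "i have a kitten",
    "the puppy chased the cat",
]
vocabulary = prune_by_doc_frequency(docs)
```

Here "i", "have", and "a" are dropped as too common, while "kitten", "the", and "chased" are dropped as too rare.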
20
STOPWORDS VS. FREQUENCY FILTERS
Stopwords:
• No training required
• Can be exhaustive
• Inflexible
Frequency filters:
• Adapts to data
• Also deals with rare words
• Needs tuning, hard to control
Both require manual attention
21
FEATURE SCALING WITH TF-IDF
• Scaling “evens out” the features
• A soft filter
• Tf-idf = term frequency x inverse document frequency
• Tf = Number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
• Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
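The definitions above can be checked with a small Python sketch on the four example sentences from the classification slide. The whitespace tokenizer and the function names `idf` and `tfidf` are assumptions for illustration; `idf` as written assumes the word occurs in at least one document.

```python
import math
from collections import Counter

docs = [
    "i have a puppy",
    "i have a cat",
    "i have a kitten",
    "i have a dog and i have a pen",
]
tokenized = [doc.split() for doc in docs]

def idf(word):
    # log(# total docs / # docs containing the word)
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / containing)

def tfidf(tokens, word):
    # Term frequency (raw count in this document) times idf.
    return Counter(tokens)[word] * idf(word)

features = ["puppy", "cat", "have"]
vectors = [[tfidf(tokens, w) for w in features] for tokens in tokenized]
```

Since "have" appears in every document, its idf is log 1 = 0, so that feature is zeroed out no matter how often it occurs in a sentence.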
22
VISUALIZING TF-IDF
[Figure: sentences plotted in bag-of-words space on axes "puppy", "cat", and "have"]
Sentences shown:
I have a puppy
I have a cat
I have a kitten
I have a dog and I have a pen
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
23
VISUALIZING TF-IDF
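[Figure: the same sentences after tf-idf scaling, on axes "puppy", "cat", and "have"]
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
"I have a puppy" sits at log 4 on the "puppy" axis.
"I have a cat" sits at log 4 on the "cat" axis.
"I have a kitten" and "I have a dog and I have a pen" fall on the same point.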
IMAGES
25
REPRESENTING IMAGES
What are the “semantic atoms” of images?
• Semantic atom = a unit of meaning
26
COLOR HISTOGRAM
[Figure: two color histograms, each over White and Blue, with bars at 40% and 60%]
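A color histogram records what fraction of pixels has each color and ignores where those pixels are. The Python sketch below uses two made-up 10-pixel images; the assignment of 40% to white and 60% to blue, the RGB tuples, and the function name are assumptions for illustration.

```python
from collections import Counter

def color_histogram(pixels):
    # Fraction of pixels of each color, ignoring where the pixels are.
    counts = Counter(pixels)
    total = len(pixels)
    return {color: n / total for color, n in counts.items()}

WHITE, BLUE = (255, 255, 255), (0, 0, 255)

# Two hypothetical 10-pixel images with different layouts.
blocked = [WHITE] * 4 + [BLUE] * 6
striped = [BLUE, WHITE, BLUE, BLUE, WHITE, BLUE, WHITE, BLUE, WHITE, BLUE]

hist_a = color_histogram(blocked)
hist_b = color_histogram(striped)
```

The two layouts differ, yet their histograms are identical, which is why color counts alone say nothing about structure.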
27
INFORMATION ABOUT STRUCTURE
Collection of local patches encapsulates global structure
28
IMAGE GRADIENTS AND ORIENTATION HISTOGRAM
• Color changes indicate edges, patterns, or texture
• Image gradient: direction of largest change in color, starting from a pixel
[Figure: compass of gradient orientations in 45º steps: -135º, -90º, -45º, 0º, 45º, 90º, 135º, 180º]
• Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels
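A minimal Python sketch of a gradient orientation histogram over a grayscale patch. It makes several simplifying assumptions not stated on the slide: central differences for the gradient, eight 45º bins, unweighted counts (no gradient-magnitude weighting), and image coordinates in which rows increase downward. The 4x4 patch is made up.

```python
import math
from collections import Counter

def orientation_histogram(image, bin_width=45):
    # image: 2-D list of grayscale intensities (rows of equal length).
    hist = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Central differences approximate the gradient at (r, c).
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            if gx == 0 and gy == 0:
                continue  # flat region: no direction of change
            angle = math.degrees(math.atan2(gy, gx))
            # Snap to the nearest multiple of bin_width; fold -180 onto 180.
            snapped = int(round(angle / bin_width)) * bin_width
            if snapped == -180:
                snapped = 180
            hist[snapped] += 1
    return hist

# Hypothetical 4x4 patch: dark on the left, bright on the right.
patch = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
hist = orientation_histogram(patch)
```

For this patch every interior pixel sees intensity increasing to the right, so all counts land in the 0º bin.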
29
SIFT IMAGE FEATURE PIPELINE
Lowe, ICCV 1999
30
DEEP LEARNING APPROACH
• Stack multiple layers – combine local features to form global features
• Similar in spirit to SIFT/HOG
“AlexNet” – Krizhevsky et al., NIPS 2012
31
VISUALIZING ALEXNET
Weights of a trained AlexNet. Left: first layer; right: second layer.
32
FEATURIZATION CHALLENGES
It is a puppy and it is extremely cute.
                                   “Human native”    Conceptually abstract
Semantic content in data           Low               High
Difficulty of feature generation   Higher            Lower
Examples                           Image, Audio      Text
33
KEY TO FEATURE ENGINEERING
• Features sit in-between data and models
• Need to encapsulate necessary semantic information from raw data
• Distribution of data in feature space should be easily manageable by intended model
• Natural text and logs contain higher level semantic information
• Easier to featurize than images and audio
• Requires ingenuity and intuition!
@RainyData alicez@amazon.com
Amazon Ad Platform is hiring!

Microphone- characteristics,carbon microphone, dynamic microphone.pptxMicrophone- characteristics,carbon microphone, dynamic microphone.pptx
Microphone- characteristics,carbon microphone, dynamic microphone.pptx
 
Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...Radiation physics in Dental Radiology...
Radiation physics in Dental Radiology...
 
The dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptxThe dark energy paradox leads to a new structure of spacetime.pptx
The dark energy paradox leads to a new structure of spacetime.pptx
 
FREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by naFREE NURSING BUNDLE FOR NURSES.PDF by na
FREE NURSING BUNDLE FOR NURSES.PDF by na
 
Topic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptxTopic 9- General Principles of International Law.pptx
Topic 9- General Principles of International Law.pptx
 
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptxGenBio2 - Lesson 1 - Introduction to Genetics.pptx
GenBio2 - Lesson 1 - Introduction to Genetics.pptx
 
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》《Queensland毕业文凭-昆士兰大学毕业证成绩单》
《Queensland毕业文凭-昆士兰大学毕业证成绩单》
 
Citronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyayCitronella presentation SlideShare mani upadhyay
Citronella presentation SlideShare mani upadhyay
 

Feature engineering for diverse data types

  • 1. FEATURE ENGINEERING FOR DIVERSE DATA TYPES Alice Zheng October 10, 2016 Seattle PyLadies Meetup 1
  • 2. 2 MY JOURNEY SO FAR Shortage of expertise and good tools in the market. Applied machine learning/ data science Build ML tools Write a book
  • 3. 3 MACHINE LEARNING IS USEFUL! Model data. Make predictions. Build intelligent applications. Play chess and go!
  • 4. 4 THE MACHINE LEARNING PIPELINE It is a puppy and it is extremely cute. Raw data Features Models Predictions Deploy in production
  • 6. 6 A SIMPLE MODEL X Y X and Y 1 1 1 0 0 0 0 1 1 0 0 0 f(x, y) = 0.5 x + 0.5 y – 1 g(x, y) = 1 if f(x, y) > 0 0 if f(x, y) <= 0
  • 8. 8 FROM SIMPLE TO COMPLEX Xn X3 X2 X1 … r1(X1, X2) r2(X2∪X3) rm(X1, Xn) … s1(r1, r2) s2(r1, r3) sm(rm-1, rm) … Use more complicated functions or Stack layers of simple functions (e.g., deep neural nets)
  • 9. 9 BETWEEN RAW DATA AND MODELS • Mathematical models take numeric input • Raw data are not numeric (or not the right kind of numeric) • Featurization: the step in-between • Feature space: multi-dimensional numeric space where modeling happens
  • 10. Feature Generation Feature: An individual measurable property of a phenomenon being observed. ⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”
  • 11. TEXT
  • 12. 12 TURNING TEXT INTO FEATURES It is a puppy and it is extremely cute. What are the important measures? Keywords? Verb tense? Subject, object? it 2 is 2 puppy 1 and 1 cat 0 aardvark 0 cute 1 extremely 1 … … Bag of words feature vector Raw text
  • 13. 13 VISUALIZING BAG-OF-WORDS puppy cute 1 1 It is a puppy and it is extremely cute
  • 14. 14 CLASSIFYING BAG-OF-WORDS puppy cat 2 1 1 have I have a puppy I have a cat I have a kitten I have a dog and I have a pen 1 Decision surface
  • 15. Feature Cleaning and Transformation
  • 16. 16 AUTO-GENERATED FEATURES ARE NOISY Rank Word Doc Count Rank Word Doc Count 1 the 1,416,058 11 was 929,703 2 and 1,381,324 12 this 844,824 3 a 1,263,126 13 but 822,313 4 i 1,230,214 14 my 786,595 5 to 1,196,238 15 that 777,045 6 it 1,027,835 16 with 775,044 7 of 1,025,638 17 on 735,419 8 for 993,430 18 they 720,994 9 is 988,547 19 you 701,015 10 in 961,518 20 have 692,749 Most popular words in Yelp reviews dataset (~ 6M reviews).
  • 17. 17 AUTO-GENERATED FEATURES ARE NOISY Rank Word Doc Count Rank Word Doc Count 357,480 cmtk8xyqg 1 357,470 attractif 1 357,479 tangified 1 357,469 chappagetti 1 357,478 laaaaaaasts 1 357,468 herdy 1 357,477 bailouts 1 357,467 csmpus 1 357,476 feautred 1 357,466 costoso 1 357,475 résine 1 357,465 freebased 1 357,474 chilyl 1 357,464 tikme 1 357,473 cariottis 1 357,463 traditionresort 1 357,472 enfeebled 1 357,462 jallisco 1 357,471 sparklely 1 357,461 zoawan 1 Least popular words in Yelp reviews dataset (~ 6M reviews).
  • 18. 18 FEATURE CLEANING • Popular words and rare words are not helpful • Manually defined blacklist – stopwords a b c d e f g h i able be came definitely each far get had ie about became can described edu few gets happens if above because cannot despite eg fifth getting hardly ignored according become cant did eight first given has immediately accordingly becomes cause different either five gives have in across becoming causes do else followed go having inasmuch … … … … … … … … …
  • 20. 20 STOPWORDS VS. FREQUENCY FILTERS Stopwords: no training required; can be exhaustive; inflexible. Frequency filters: adapt to data; also deal with rare words; need tuning, hard to control. Both require manual attention.
  • 21. 21 FEATURE SCALING WITH TF-IDF • Scaling “evens out” the features • A soft filter • Tf-idf = term frequency x inverse document frequency • Tf = Number of times a term appears in a document • Idf = log(# total docs / # docs containing word w) • Large for uncommon words, small for popular words • Discounts popular words, highlights rare words
  • 22. 22 VISUALIZING TF-IDF puppy cat 2 1 1 have I have a puppy I have a cat I have a kitten idf(puppy) = log 4 idf(cat) = log 4 idf(have) = log 1 = 0 I have a dog and I have a pen 1
  • 23. 23 VISUALIZING TF-IDF puppy cat 1 have tfidf(puppy) = log 4 tfidf(cat) = log 4 tfidf(have) = 0 I have a dog and I have a pen, I have a kitten 1 log 4 log 4 I have a cat I have a puppy
  • 25. 25 REPRESENTING IMAGES What are the “semantic atoms” of images? • Semantic atom = a unit of meaning
  • 27. 27 INFORMATION ABOUT STRUCTURE Collection of local patches encapsulates global structure
  • 28. 28 IMAGE GRADIENTS AND ORIENTATION HISTOGRAM • Color changes indicate edges, patterns, or texture • Image gradient: direction of largest change in color, starting from a pixel • Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels (direction labels in the figure: -135º, -90º, -45º, 0º, 45º, 90º, 135º, 180º)
  • 29. 29 SIFT IMAGE FEATURE PIPELINE Lowe, ICCV 1999
  • 30. 30 DEEP LEARNING APPROACH • Stack multiple layers – combine local features to form global features • Similar in spirit to SIFT/HOG “AlexNet” – Krizhevsky et al., NIPS 2012
  • 31. 31 VISUALIZING ALEXNET Weights of a trained AlexNet. Left– first layer, right – second layer.
  • 32. 32 FEATURIZATION CHALLENGES It is a puppy and it is extremely cute. “Human native” Conceptually abstract Low Semantic content in data High Higher Difficulty of feature generation Lower Text Image Audio
  • 33. 33 KEY TO FEATURE ENGINEERING • Features sit in-between data and models • Need to encapsulate necessary semantic information from raw data • Distribution of data in feature space should be easily manageable by intended model • Natural text and logs contain higher level semantic information • Easier to featurize than images and audio • Requires ingenuity and intuition! @RainyData alicez@amazon.com Amazon Ad Platform is hiring!
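
The text slides above (bag-of-words, frequency filters, tf-idf) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the speaker's code: the function names, the whitespace tokenizer, and the filter thresholds are assumptions made for the example. The four documents and the formula idf = log(# total docs / # docs containing word w) come from the slides.

```python
import math
from collections import Counter

# The four example documents from the tf-idf slides.
docs = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def bag_of_words(text):
    """Map each word to the number of times it appears in the text."""
    return Counter(tokenize(text))

def document_frequency(corpus):
    """Count, for each word, how many documents contain it."""
    df = Counter()
    for text in corpus:
        df.update(set(tokenize(text)))
    return df

def frequency_filter(df, min_docs, max_docs):
    """Keep words whose document count lies in [min_docs, max_docs].

    The thresholds are hypothetical tuning knobs, not values from the talk.
    """
    return {word for word, n in df.items() if min_docs <= n <= max_docs}

def tfidf(text, df, n_docs):
    """tf-idf = count of the word in this text * log(n_docs / document count).

    Assumes every word in the text also occurs somewhere in the corpus.
    """
    return {
        word: tf * math.log(n_docs / df[word])
        for word, tf in bag_of_words(text).items()
    }

df = document_frequency(docs)
vectors = [tfidf(text, df, len(docs)) for text in docs]
kept = frequency_filter(df, min_docs=1, max_docs=3)

print(vectors[0])    # tf-idf weights for "I have a puppy"
print(sorted(kept))  # words that survive the frequency filter
```

Words that occur in every document ("i", "have", "a") get an idf of log 1 = 0, so tf-idf zeroes them out without any hand-written stopword list, while "puppy" keeps a weight of log 4, as on the slides. The hard frequency filter drops the same three words here, which shows why the talk calls tf-idf a soft filter.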

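The gradient orientation histogram from the image slides can be sketched as follows. This is a simplified illustration, not the SIFT or HOG pipeline: it assumes a grayscale image stored as a list of rows, uses central differences for the gradient, and skips the smoothing, cell layout, and normalization that real descriptors apply. The function name and the choice of eight 45º bins (matching the direction labels on the slide) are assumptions.

```python
import math

def orientation_histogram(image, n_bins=8):
    """Magnitude-weighted histogram of gradient directions over interior pixels.

    image: list of rows of grayscale intensities.
    Bin k is centred on -180 + k * (360 / n_bins) degrees; with 8 bins,
    bin 4 is 0 degrees and bin 0 covers both -180 and 180 degrees.
    Rows grow downward, so a positive y gradient means "brighter below".
    """
    bin_width = 360.0 / n_bins
    hist = [0.0] * n_bins
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            # Central differences approximate the change along x and y.
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue  # flat region: no direction to record
            angle = math.degrees(math.atan2(gy, gx))  # in [-180, 180]
            k = int(round((angle + 180.0) / bin_width)) % n_bins
            hist[k] += magnitude
    return hist

# A 4x4 patch whose brightness increases from left to right.
ramp = [[float(c) for c in range(4)] for _ in range(4)]
hist = orientation_histogram(ramp)
print(hist)
```

For the left-to-right ramp, every interior pixel has its gradient pointing along 0º, so all of the histogram mass lands in the bin centred on 0º. A patch containing an edge or texture would spread mass across the bins that match its dominant directions of color change.
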
Editor's Notes

  1. Features sit between raw data and model. They can make or break an application.