8. FROM SIMPLE TO COMPLEX
[Diagram: input features X1, X2, X3, …, Xn feed into simple functions r1(X1, X2), r2(X2 ∪ X3), …, rm(X1, Xn), which are in turn combined by s1(r1, r2), s2(r1, r3), …, sm(rm-1, rm)]
Use more complicated functions
or
Stack layers of simple functions
(e.g., deep neural nets)
9. BETWEEN RAW DATA AND MODELS
• Mathematical models take numeric input
• Raw data are not numeric (or not the right kind of numeric)
• Featurization: the step in-between
• Feature space: multi-dimensional numeric space where modeling happens
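A minimal sketch of that in-between step; the function names and the toy linear "model" are illustrative, not from the deck:

```python
import re

def featurize(text, vocabulary):
    """Map raw text into a point in feature space: one count per vocabulary word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(word) for word in vocabulary]

def score(features, weights):
    """A minimal linear 'model': it only ever sees numbers, never raw text."""
    return sum(f * w for f, w in zip(features, weights))

vocab = ["puppy", "cute", "boring"]
x = featurize("It is a puppy and it is extremely cute.", vocab)
print(x)                            # [1, 1, 0]
print(score(x, [0.5, 1.0, -1.0]))   # 1.5
```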
10. Feature Generation
Feature: An individual measurable property
of a phenomenon being observed.
⎯ Christopher Bishop,
“Pattern Recognition and Machine Learning”
12. TURNING TEXT INTO FEATURES
Raw text:
"It is a puppy and it is extremely cute."
What are the important measures? Keywords? Verb tense? Subject, object?
Bag-of-words feature vector:
it: 2, is: 2, puppy: 1, and: 1, cat: 0, aardvark: 0, cute: 1, extremely: 1, …
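The bag-of-words vector above can be reproduced with a few lines of plain Python (the function name here is hypothetical):

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text; word order is discarded."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {word: counts[word] for word in vocabulary}

vector = bag_of_words(
    "It is a puppy and it is extremely cute.",
    ["it", "is", "puppy", "and", "cat", "aardvark", "cute", "extremely"],
)
print(vector)
# {'it': 2, 'is': 2, 'puppy': 1, 'and': 1, 'cat': 0, 'aardvark': 0, 'cute': 1, 'extremely': 1}
```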
16. AUTO-GENERATED FEATURES ARE NOISY
Rank | Word | Doc Count | Rank | Word | Doc Count
1 | the | 1,416,058 | 11 | was | 929,703
2 | and | 1,381,324 | 12 | this | 844,824
3 | a | 1,263,126 | 13 | but | 822,313
4 | i | 1,230,214 | 14 | my | 786,595
5 | to | 1,196,238 | 15 | that | 777,045
6 | it | 1,027,835 | 16 | with | 775,044
7 | of | 1,025,638 | 17 | on | 735,419
8 | for | 993,430 | 18 | they | 720,994
9 | is | 988,547 | 19 | you | 701,015
10 | in | 961,518 | 20 | have | 692,749
Most popular words in Yelp reviews dataset (~ 6M reviews).
18. FEATURE CLEANING
• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords
a b c d e f g h i
able be came definitely each far get had ie
about became can described edu few gets happens if
above because cannot despite eg fifth getting hardly ignored
according become cant did eight first given has immediately
accordingly becomes cause different either five gives have in
across becoming causes do else followed go having inasmuch
… … … … … … … … …
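A minimal sketch of stopword filtering, using a small excerpt of the list above (real stopword lists, such as NLTK's, run to a few hundred words):

```python
# Manually defined stopword blacklist (excerpt only, for illustration).
STOPWORDS = {"a", "about", "above", "and", "be", "because", "can", "did",
             "each", "is", "it", "the", "to"}

def remove_stopwords(tokens):
    """Drop any token that appears in the blacklist."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = "it is a puppy and it is extremely cute".split()
print(remove_stopwords(tokens))  # ['puppy', 'extremely', 'cute']
```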
20. STOPWORDS VS. FREQUENCY FILTERS
• Stopwords: no training required, can be exhaustive, but inflexible
• Frequency filters: adapt to data and also deal with rare words, but need tuning and are hard to control
• Both require manual attention
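A frequency filter might be sketched like this; the thresholds `min_df` and `max_df_ratio` are illustrative and, as the slide notes, need tuning:

```python
from collections import Counter

def frequency_filter(docs, min_df=2, max_df_ratio=0.8):
    """Keep words appearing in at least min_df documents but in no more
    than max_df_ratio of all documents; drops both rare and popular words."""
    doc_counts = Counter(word for doc in docs for word in set(doc.lower().split()))
    n = len(docs)
    return {w for w, c in doc_counts.items() if min_df <= c <= max_df_ratio * n}

docs = [
    "the puppy is cute",
    "the movie is boring",
    "the puppy is tiny",
    "the aardvark is rare",
]
# 'the' and 'is' appear everywhere (too popular); most words appear once (too rare).
print(sorted(frequency_filter(docs)))  # ['puppy']
```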
21. FEATURE SCALING WITH TF-IDF
• Scaling "evens out" the features
• A soft filter
• Tf-idf = term frequency x inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
• Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
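The tf-idf formula on this slide can be written directly in Python (a sketch; production implementations such as scikit-learn's also add smoothing to avoid division by zero):

```python
import math

def tf_idf(term, doc, docs):
    """tf  = raw count of the term in one document;
       idf = log(total docs / docs containing the term), as on the slide."""
    tf = doc.split().count(term)
    df = sum(1 for d in docs if term in d.split())
    return tf * math.log(len(docs) / df)

docs = ["the puppy is cute",
        "the movie is boring",
        "the puppy is tiny"]
print(tf_idf("the", docs[0], docs))             # 0.0 -> popular word is discounted
print(round(tf_idf("cute", docs[0], docs), 3))  # log(3/1) = 1.099 -> rare word is highlighted
```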
28. IMAGE GRADIENTS AND ORIENTATION HISTOGRAM
• Color changes indicate edges, patterns, or texture
• Image gradient: direction of largest change in color, starting from a pixel
[Diagram: compass of gradient orientations in 45º steps: -135º, -90º, -45º, 0º, 45º, 90º, 135º, 180º]
• Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels
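A rough sketch of a gradient orientation histogram over a grayscale patch, using simple finite differences; this illustrates the idea only, since real HOG implementations also weight votes by gradient magnitude and normalize over blocks:

```python
import math

def orientation_histogram(patch, bins=8):
    """Histogram of gradient orientations over a patch (list of rows of
    grayscale values), binned into `bins` equal angular sectors."""
    hist = [0] * bins
    for y in range(len(patch) - 1):
        for x in range(len(patch[0]) - 1):
            gx = patch[y][x + 1] - patch[y][x]   # horizontal change
            gy = patch[y + 1][x] - patch[y][x]   # vertical change
            if gx == 0 and gy == 0:
                continue                          # flat region: no gradient
            angle = math.atan2(gy, gx)            # direction in -180º .. 180º
            bin_idx = int((angle + math.pi) / (2 * math.pi) * bins) % bins
            hist[bin_idx] += 1
    return hist

# A patch with a vertical edge: every gradient points in the same direction,
# so a single bin collects all the votes.
patch = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
print(orientation_histogram(patch))  # [0, 0, 0, 0, 2, 0, 0, 0]
```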
30. DEEP LEARNING APPROACH
• Stack multiple layers – combine local features to form global features
• Similar in spirit to SIFT/HOG
“AlexNet” – Krizhevsky et al., NIPS 2012
32. FEATURIZATION CHALLENGES
Example raw text: "It is a puppy and it is extremely cute."
[Diagram: spectrum from "human native" (image, audio) to "conceptually abstract" (text); semantic content in data runs from low to high along the spectrum, while difficulty of feature generation runs from higher to lower]
33. KEY TO FEATURE ENGINEERING
• Features sit in-between data and models
• Need to encapsulate necessary semantic information from raw data
• Distribution of data in feature space should be easily manageable by the intended model
• Natural text and logs contain higher level semantic information
• Easier to featurize than images and audio
• Requires ingenuity and intuition!
@RainyData alicez@amazon.com
Amazon Ad Platform is hiring!
Editor's Notes
Features sit between raw data and model. They can make or break an application.