8. FROM SIMPLE TO COMPLEX
[Diagram: input features X1, X2, X3, …, Xn feed into simple functions r1(X1, X2), r2(X2 ∪ X3), …, rm(X1, Xn), which are in turn combined by s1(r1, r2), s2(r1, r3), …, sm(rm-1, rm)]
Use more complicated functions
or
Stack layers of simple functions
(e.g., deep neural nets)
9. BETWEEN RAW DATA AND MODELS
• Mathematical models take numeric input
• Raw data are not numeric (or not the right kind of numeric)
• Featurization: the step in-between
• Feature space: multi-dimensional numeric space where modeling happens
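A minimal sketch of that in-between step; the function names and the toy linear "model" are illustrative, not from the deck:

```python
import re

def featurize(text, vocabulary):
    """Map raw text into a point in feature space: one count per vocabulary word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [tokens.count(word) for word in vocabulary]

def score(features, weights):
    """A minimal linear 'model': it only ever sees numbers, never raw text."""
    return sum(f * w for f, w in zip(features, weights))

vocab = ["puppy", "cute", "boring"]
x = featurize("It is a puppy and it is extremely cute.", vocab)
print(x)                            # [1, 1, 0]
print(score(x, [0.5, 1.0, -1.0]))   # 1.5
```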
10. Feature Generation
Feature: An individual measurable property
of a phenomenon being observed.
⎯ Christopher Bishop,
“Pattern Recognition and Machine Learning”
12. TURNING TEXT INTO FEATURES
Raw text:
"It is a puppy and it is extremely cute."
What are the important measures? Keywords? Verb tense? Subject, object?
Bag-of-words feature vector:
it: 2, is: 2, puppy: 1, and: 1, cat: 0, aardvark: 0, cute: 1, extremely: 1, …
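The bag-of-words vector above can be reproduced with a few lines of plain Python (the function name here is hypothetical):

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text; word order is discarded."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return {word: counts[word] for word in vocabulary}

vector = bag_of_words(
    "It is a puppy and it is extremely cute.",
    ["it", "is", "puppy", "and", "cat", "aardvark", "cute", "extremely"],
)
print(vector)
# {'it': 2, 'is': 2, 'puppy': 1, 'and': 1, 'cat': 0, 'aardvark': 0, 'cute': 1, 'extremely': 1}
```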
16. AUTO-GENERATED FEATURES ARE NOISY
Rank | Word | Doc Count | Rank | Word | Doc Count
1 | the | 1,416,058 | 11 | was | 929,703
2 | and | 1,381,324 | 12 | this | 844,824
3 | a | 1,263,126 | 13 | but | 822,313
4 | i | 1,230,214 | 14 | my | 786,595
5 | to | 1,196,238 | 15 | that | 777,045
6 | it | 1,027,835 | 16 | with | 775,044
7 | of | 1,025,638 | 17 | on | 735,419
8 | for | 993,430 | 18 | they | 720,994
9 | is | 988,547 | 19 | you | 701,015
10 | in | 961,518 | 20 | have | 692,749
Most popular words in Yelp reviews dataset (~ 6M reviews).
18. FEATURE CLEANING
• Popular words and rare words are not helpful
• Manually defined blacklist – stopwords
a b c d e f g h i
able be came definitely each far get had ie
about became can described edu few gets happens if
above because cannot despite eg fifth getting hardly ignored
according become cant did eight first given has immediately
accordingly becomes cause different either five gives have in
across becoming causes do else followed go having inasmuch
… … … … … … … … …
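A minimal sketch of stopword filtering, using a small excerpt of the list above (real stopword lists, such as NLTK's, run to a few hundred words):

```python
# Manually defined stopword blacklist (excerpt only, for illustration).
STOPWORDS = {"a", "about", "above", "and", "be", "because", "can", "did",
             "each", "is", "it", "the", "to"}

def remove_stopwords(tokens):
    """Drop any token that appears in the blacklist."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = "it is a puppy and it is extremely cute".split()
print(remove_stopwords(tokens))  # ['puppy', 'extremely', 'cute']
```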
20. STOPWORDS VS. FREQUENCY FILTERS
• Stopwords: no training required, can be exhaustive, but inflexible
• Frequency filters: adapt to data and also deal with rare words, but need tuning and are hard to control
• Both require manual attention
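A frequency filter might be sketched like this; the thresholds `min_df` and `max_df_ratio` are illustrative and, as the slide notes, need tuning:

```python
from collections import Counter

def frequency_filter(docs, min_df=2, max_df_ratio=0.8):
    """Keep words appearing in at least min_df documents but in no more
    than max_df_ratio of all documents; drops both rare and popular words."""
    doc_counts = Counter(word for doc in docs for word in set(doc.lower().split()))
    n = len(docs)
    return {w for w, c in doc_counts.items() if min_df <= c <= max_df_ratio * n}

docs = [
    "the puppy is cute",
    "the movie is boring",
    "the puppy is tiny",
    "the aardvark is rare",
]
# 'the' and 'is' appear everywhere (too popular); most words appear once (too rare).
print(sorted(frequency_filter(docs)))  # ['puppy']
```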
21. FEATURE SCALING WITH TF-IDF
• Scaling "evens out" the features
• A soft filter
• Tf-idf = term frequency x inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
• Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
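The tf-idf formula on this slide can be written directly in Python (a sketch; production implementations such as scikit-learn's also add smoothing to avoid division by zero):

```python
import math

def tf_idf(term, doc, docs):
    """tf  = raw count of the term in one document;
       idf = log(total docs / docs containing the term), as on the slide."""
    tf = doc.split().count(term)
    df = sum(1 for d in docs if term in d.split())
    return tf * math.log(len(docs) / df)

docs = ["the puppy is cute",
        "the movie is boring",
        "the puppy is tiny"]
print(tf_idf("the", docs[0], docs))             # 0.0 -> popular word is discounted
print(round(tf_idf("cute", docs[0], docs), 3))  # log(3/1) = 1.099 -> rare word is highlighted
```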
28. IMAGE GRADIENTS AND ORIENTATION HISTOGRAM
• Color changes indicate edges, patterns, or texture
• Image gradient: direction of largest change in color, starting from a pixel
[Diagram: compass of gradient orientations in 45º steps: -135º, -90º, -45º, 0º, 45º, 90º, 135º, 180º]
• Gradient orientation histogram: indicates the prominent directions of color change in a patch of pixels
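A rough sketch of a gradient orientation histogram over a grayscale patch, using simple finite differences; this illustrates the idea only, since real HOG implementations also weight votes by gradient magnitude and normalize over blocks:

```python
import math

def orientation_histogram(patch, bins=8):
    """Histogram of gradient orientations over a patch (list of rows of
    grayscale values), binned into `bins` equal angular sectors."""
    hist = [0] * bins
    for y in range(len(patch) - 1):
        for x in range(len(patch[0]) - 1):
            gx = patch[y][x + 1] - patch[y][x]   # horizontal change
            gy = patch[y + 1][x] - patch[y][x]   # vertical change
            if gx == 0 and gy == 0:
                continue                          # flat region: no gradient
            angle = math.atan2(gy, gx)            # direction in -180º .. 180º
            bin_idx = int((angle + math.pi) / (2 * math.pi) * bins) % bins
            hist[bin_idx] += 1
    return hist

# A patch with a vertical edge: every gradient points in the same direction,
# so a single bin collects all the votes.
patch = [[0, 0, 9, 9],
         [0, 0, 9, 9],
         [0, 0, 9, 9]]
print(orientation_histogram(patch))  # [0, 0, 0, 0, 2, 0, 0, 0]
```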
30. DEEP LEARNING APPROACH
• Stack multiple layers – combine local features to form global features
• Similar in spirit to SIFT/HOG
“AlexNet” – Krizhevsky et al., NIPS 2012
32. FEATURIZATION CHALLENGES
Example raw text: "It is a puppy and it is extremely cute."
[Diagram: spectrum from "human native" (image, audio) to "conceptually abstract" (text); semantic content in data runs from low to high along the spectrum, while difficulty of feature generation runs from higher to lower]
33. KEY TO FEATURE ENGINEERING
• Features sit in-between data and models
• Need to encapsulate necessary semantic information from raw data
• Distribution of data in feature space should be easily manageable by the intended model
• Natural text and logs contain higher level semantic information
• Easier to featurize than images and audio
• Requires ingenuity and intuition!
@RainyData alicez@amazon.com
Amazon Ad Platform is hiring!
Editor's Notes
Features sit between raw data and model. They can make or break an application.