Feature engineering: the underdog of machine learning. This deck surveys feature generation methods for text, images, and audio, feature cleaning and transformation methods, how well they work, and why.
1. The How and Why of Feature Engineering
Alice Zheng, Dato
March 29, 2016
Strata + Hadoop World, San Jose
2. My journey so far
Shortage of expertise and good tools in the market.
Applied machine learning / data science
Build ML tools
Write a book
3. Machine learning is great!
Model data.
Make predictions.
Build intelligent applications.
Play chess and Go!
4. The machine learning pipeline
I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …
Raw data → Features → Models → Predictions → Deploy in production
5. If machine learning were hairstyles
Models: magnificent, ornate, high-maintenance
Feature engineering: street smart, ad hoc, hacky
Images courtesy of “A visual history of ancient hairdos” and “An animated history of 20th century hairstyles.”
6. Making sense of feature engineering
• Feature generation
• Feature cleaning and transformation
• How well do they work?
• Why?
7. Feature Generation
Feature: An individual measurable property of a phenomenon being observed.
⎯ Christopher Bishop, “Pattern Recognition and Machine Learning”
8. Representing natural text
Raw Text: It is a puppy and it is extremely cute.
What’s important? Phrases? Specific words? Ordering? Subject, object, verb?
Classify: puppy or not?
Bag of Words: {“it”: 2, “is”: 2, “a”: 1, “puppy”: 1, “and”: 1, “extremely”: 1, “cute”: 1}
9. Representing natural text
Raw Text: It is a puppy and it is extremely cute.
Classify: puppy or not?
Bag of Words, as a sparse vector over the vocabulary:
it         2
they       0
I          0
am         0
how        0
puppy      1
and        1
cat        0
aardvark   0
cute       1
extremely  1
…          …
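A minimal sketch of the bag-of-words transform above; the whitespace tokenizer and the tiny fixed vocabulary are illustrative simplifications, not what the talk used.

```python
from collections import Counter

text = "It is a puppy and it is extremely cute."
bag = Counter(text.lower().rstrip(".").split())
# Counter({'it': 2, 'is': 2, 'a': 1, 'puppy': 1, 'and': 1, 'extremely': 1, 'cute': 1})

# Fix a vocabulary to get the sparse vector from the table above.
vocab = ["it", "they", "i", "am", "how", "puppy", "and", "cat", "aardvark", "cute", "extremely"]
vector = [bag.get(w, 0) for w in vocab]   # [2, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1]
```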
10. Representing images
Raw image: millions of RGB triplets, one for each pixel
Classify: person or animal?
Raw Image → Bag of Visual Words
Image source: “Recognizing and learning object categories,” Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005–2009.
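The talk doesn't spell out an implementation, but a bag of visual words is commonly built by clustering local descriptors; here is a minimal sketch with SIFT and k-means, where the codebook size (100) and `train_image_paths` are assumptions.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()

def descriptors(path):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(img, None)   # local patch descriptors
    return desc

# 1. Learn a "visual vocabulary" by clustering descriptors from many images.
all_desc = np.vstack([descriptors(p) for p in train_image_paths])  # hypothetical paths
codebook = KMeans(n_clusters=100, n_init=10).fit(all_desc)

# 2. Represent any image as a histogram over the visual words.
def bag_of_visual_words(path):
    words = codebook.predict(descriptors(path))
    return np.bincount(words, minlength=100)
```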
11. Representing images
Classify: person or animal?
Raw Image → Deep learning features
Dense vector representation:
[3.29, -15, -5.24, 48.3, 1.36, 47.1, -1.92, 36.5, 2.83, 95.4, -19, -89, 5.09, 37.8]
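The slide doesn't name the network; a minimal sketch of one common way to get such a dense vector is a pretrained ResNet-18 from torchvision (≥ 0.13) with its classifier head removed. The image file name is hypothetical.

```python
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()   # drop the classifier; keep the 512-d embedding
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("puppy.jpg").convert("RGB")                  # hypothetical image
with torch.no_grad():
    feature = model(preprocess(img).unsqueeze(0)).squeeze(0)  # dense vector, shape (512,)
```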
12. Representing audio
Raw Audio → Spectrogram features
Classify: music or voice? Type of instrument?
Time series of dense vectors:
   t=0      t=1      t=2
 6.1917  -0.3411   1.2418
 0.2205   0.0214   0.4503
 1.0423   0.2214  -1.0017
-0.2340  -0.0392  -0.2617
 0.2750   0.0226   0.1229
 0.0653   0.0428  -0.4721
 0.3169   0.0541  -0.1033
-0.2970  -0.0627   0.1960
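A minimal sketch of spectrogram features with SciPy; the window length is illustrative and "clip.wav" is a hypothetical file.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

fs, samples = wavfile.read("clip.wav")        # sampling rate and raw waveform
if samples.ndim > 1:
    samples = samples.mean(axis=1)            # mix stereo down to mono

f, t, Sxx = spectrogram(samples.astype(float), fs=fs, nperseg=1024)

# Each column is a dense feature vector for one time step,
# like the t=0, t=1, t=2 columns above.
features = np.log(Sxx + 1e-10)                # shape: (n_freq_bins, n_time_steps)
```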
13. Feature generation for audio, image, text
I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …
“Human native” → Conceptually abstract
Semantic content in data: low → high
Difficulty of feature generation: higher → lower
15. Auto-generated features are noisy
Rank  Word  Doc Count    Rank  Word  Doc Count
 1    the   1,416,058    11    was     929,703
 2    and   1,381,324    12    this    844,824
 3    a     1,263,126    13    but     822,313
 4    i     1,230,214    14    my      786,595
 5    to    1,196,238    15    that    777,045
 6    it    1,027,835    16    with    775,044
 7    of    1,025,638    17    on      735,419
 8    for     993,430    18    they    720,994
 9    is      988,547    19    you     701,015
10    in      961,518    20    have    692,749
Most popular words in the Yelp reviews dataset (~6M reviews).
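A minimal sketch of how such document counts are computed; `reviews` stands in for the Yelp review texts, which are not included here.

```python
from collections import Counter

doc_counts = Counter()
for review in reviews:                               # hypothetical iterable of review strings
    doc_counts.update(set(review.lower().split()))   # each word counted once per document

for rank, (word, count) in enumerate(doc_counts.most_common(20), start=1):
    print(rank, word, count)
```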
17. Feature cleaning
• Popular words and rare words are not helpful
• Manually defined blacklist: stopwords
a            b         c       d           e       f         g        h        i
able         be        came    definitely  each    far       get      had      ie
about        became    can     described   edu     few       gets     happens  if
above        because   cannot  despite     eg      fifth     getting  hardly   ignored
according    become    cant    did         eight   first     given    has      immediately
accordingly  becomes   cause   different   either  five      gives    have     in
across       becoming  causes  do          else    followed  go       having   inasmuch
…            …         …       …           …       …         …        …        …
19. Stopwords vs. frequency filters
Stopwords              Frequency filters
No training required   Adapts to data
Can be exhaustive      Also deals with rare words
Inflexible             Needs tuning, hard to control
Both require manual attention (sketch below).
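A minimal sketch of both filters using scikit-learn's CountVectorizer; the toy corpus and the max_df/min_df thresholds are illustrative ("needs tuning," as the slide says).

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["It is a puppy and it is extremely cute.",
        "The cat is not a puppy.",
        "Aardvarks are not cute."]                 # toy corpus

# Stopword list: fixed blacklist, no training required.
stop_vec = CountVectorizer(stop_words="english")

# Frequency filters: adapt to the corpus and also drop rare words.
freq_vec = CountVectorizer(max_df=0.9, min_df=2)   # thresholds need tuning

print(stop_vec.fit(docs).get_feature_names_out())
print(freq_vec.fit(docs).get_feature_names_out())
```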
20. Tf-idf: automatic “soft” filter
• Tf-idf = term frequency × inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
  - Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words (worked example below)
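A worked toy example of the formula above (scikit-learn's TfidfTransformer adds smoothing, so its numbers differ slightly from this bare version).

```python
import math
from collections import Counter

docs = [["it", "is", "a", "puppy"],
        ["it", "is", "a", "cat"],
        ["puppy", "cute"]]                  # toy tokenized corpus

n_docs = len(docs)
df = Counter(w for doc in docs for w in set(doc))   # docs containing each word

def tf_idf(doc):
    tf = Counter(doc)                       # times each term appears in the doc
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

print(tf_idf(docs[2]))
# {'puppy': 0.405..., 'cute': 1.098...}: the rare word is highlighted,
# and a word appearing in every doc would get idf = log(1) = 0.
```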
30. Classify reviews using logistic regression
• Classify business category of Yelp reviews
• Bag-of-words vs. L2 normalization vs. tf-idf
• Model: logistic regression (sketch below)
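A minimal sketch of the three-way comparison, assuming train/test splits of review texts and business-category labels (the variable names are hypothetical). Following the talk's "column scaling" framing, L2 normalization is applied per feature column here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, normalize

# Scale each feature column to unit L2 norm.
l2_columns = FunctionTransformer(lambda X: normalize(X, norm="l2", axis=0))

variants = {
    "bag-of-words":  make_pipeline(CountVectorizer(),
                                   LogisticRegression(max_iter=1000)),
    "l2-normalized": make_pipeline(CountVectorizer(), l2_columns,
                                   LogisticRegression(max_iter=1000)),
    "tf-idf":        make_pipeline(TfidfVectorizer(),
                                   LogisticRegression(max_iter=1000)),
}

for name, pipe in variants.items():
    pipe.fit(train_texts, train_labels)               # hypothetical Yelp data
    print(name, pipe.score(test_texts, test_labels))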
31. Observations
• L2 regularization made no difference (with proper tuning)
• L2 normalization made no difference in accuracy
• Tf-idf did better, but barely
• But both are column scaling methods! Why the difference?
41. Effect of column scaling
Scaled columns → singular values change (but zeros stay zero)
Singular vectors may also change
42. Effect of column scaling
• Changes the singular values and vectors, but not the rank of the null space or column space
• … unless the scaling factor is zero
  - Could only happen with tf-idf: a word that appears in every document gets idf = log(N/N) = 0
• L2 scaling improves the condition number, so the solver converges faster (numeric check below)
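A minimal numeric check of the claims above on toy data: column scaling changes the singular values but not the rank, unless a scale factor is zero, and normalizing column norms improves the condition number.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
scales = np.array([100.0, 0.01, 5.0, 1.0])    # per-column scaling factors
Xs = X * scales

print(np.linalg.svd(X, compute_uv=False))     # original singular values
print(np.linalg.svd(Xs, compute_uv=False))    # changed, but zeros stay zero
print(np.linalg.matrix_rank(X), np.linalg.matrix_rank(Xs))   # rank unchanged

print(np.linalg.cond(Xs))                                    # badly conditioned
print(np.linalg.cond(Xs / np.linalg.norm(Xs, axis=0)))       # L2 column scaling helps

scales[0] = 0.0                               # tf-idf can zero out a column ...
print(np.linalg.matrix_rank(X * scales))      # ... and then the rank drops
```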
43. Mystery resolved
• Tf-idf can emphasize some columns while zeroing out others: the uninformative features
• L2 normalization makes all features equal in “size”
  - Improves the condition number of the matrix
  - Solver converges faster
44. Take-away points
• Many tricks for feature generation and transformation
• Features interact with models, making their effects difficult to predict
• But so much fun to play with!
• New book coming out: “Mastering Feature Engineering”
  - More tricks, intuition, analysis
@RainyData
Editor's notes
Features sit between raw data and model. They can make or break an application.