Understanding
Feature Space in
Machine Learning
Alice Zheng, Dato
September 9, 2015
1
2
My journey so far
Applied machine learning
(Data science)
Build ML tools
Shortage of experts
and good tools.
3
Why machine learning?
Model data.
Make predictions.
Build intelligent
applications.
4
The machine learning pipeline
I fell in love the instant I laid
my eyes on that puppy. His
big eyes and playful tail, his
soft furry paws, …
Raw data → Features → Models → Predictions → Deploy in production
Feature = numeric representation of raw data
6
Representing natural text
It is a puppy and it is
extremely cute.
What’s important?
Phrases? Specific
words? Ordering?
Subject, object, verb?
Classify:
puppy or not?
Raw Text
{“it”:2,
“is”:2,
“a”:1,
“puppy”:1,
“and”:1,
“extremely”:1,
“cute”:1 }
Bag of Words
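A minimal sketch of how the counts above could be produced. The function name `bag_of_words` and the lowercase, letters-only tokenization are assumptions made for illustration; the slides do not specify a tokenizer.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, pull out runs of letters, and count each word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


counts = bag_of_words("It is a puppy and it is extremely cute.")
print(dict(counts))
```

Word order is discarded: only how often each word occurs survives, which is exactly what "bag" means here.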
7
Representing natural text
It is a puppy and it is
extremely cute.
Classify:
puppy or not?
Raw Text → Bag of Words
Word        Count
it          2
they        0
I           0
am          0
how         0
puppy       1
and         1
cat         0
aardvark    0
cute        1
extremely   1
…           …
Sparse vector
representation
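One way to turn word counts into a vector is to fix a vocabulary and give each word its own dimension. The sketch below is illustrative only: `VOCABULARY` is a hypothetical eleven-word list taken from the words shown on this slide (a real vocabulary would cover the whole corpus), and the helper names are assumptions.

```python
from collections import Counter

# Hypothetical fixed vocabulary; each position is one dimension of feature space.
VOCABULARY = ["it", "they", "i", "am", "how", "puppy", "and",
              "cat", "aardvark", "cute", "extremely"]


def to_dense_vector(counts, vocabulary):
    """One entry per vocabulary word, zero when the word is absent."""
    return [counts.get(word, 0) for word in vocabulary]


def to_sparse_vector(counts, vocabulary):
    """Store only the nonzero entries as {dimension index: count}."""
    index = {word: i for i, word in enumerate(vocabulary)}
    return {index[word]: n for word, n in counts.items() if word in index}


counts = Counter("it is a puppy and it is extremely cute".split())
dense = to_dense_vector(counts, VOCABULARY)
sparse = to_sparse_vector(counts, VOCABULARY)
print(dense)
print(sparse)
```

With a realistic vocabulary of many thousands of words, almost every entry is zero, which is why the sparse form is the practical one.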
8
Representing images
Image source: “Recognizing and learning object categories,”
Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005–2009.
Raw image:
millions of RGB triplets,
one for each pixel
Classify:
person or animal?
Raw Image → Bag of Visual Words
9
Representing images
Classify:
person or animal?
Raw Image → Deep learning features
[3.29, -15, -5.24, 48.3, 1.36, 47.1, -1.92, 36.5, 2.83, 95.4, -19, -89, 5.09, 37.8]
Dense vector representation
10
Feature space in machine learning
• Raw data → high-dimensional vectors
• Collection of data points → point cloud in feature space
• Model = geometric summary of point cloud
• Feature engineering = creating features of the appropriate
granularity for the task
Crudely speaking, mathematicians fall into two
categories: the algebraists, who find it easiest to
reduce all problems to sets of numbers and
variables, and the geometers, who understand the
world through shapes.
-- Masha Gessen, “Perfect Rigor”
12
Algebra vs. Geometry
Algebra: $a^2 + b^2 = c^2$
Geometry: (Figure: triangle with sides labeled a, b, c)
Pythagorean Theorem (Euclidean space)
13
Visualizing a sphere in 2D
$x^2 + y^2 = 1$
Pythagorean theorem: $a^2 + b^2 = c^2$
(Figure: axes x and y, each marked at 1, with a triangle labeled a, b, c)
14
Visualizing a sphere in 3D
$x^2 + y^2 + z^2 = 1$
(Figure: axes x, y, z, each marked at 1)
15
Visualizing a sphere in 4D
$x^2 + y^2 + z^2 + t^2 = 1$
(Figure: axes x, y, z, each marked at 1)
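The sphere equations on these slides follow one pattern: the squared coordinates sum to 1, whatever the number of dimensions. As a rough sketch (the function name `on_unit_sphere` and the tolerance are assumptions for illustration), the same check works in 2D, 3D, 4D, or beyond:

```python
import math


def on_unit_sphere(point, tolerance=1e-9):
    """True if the squared coordinates of the point sum to 1 (within tolerance)."""
    return math.isclose(sum(x * x for x in point), 1.0, abs_tol=tolerance)


print(on_unit_sphere((0.6, 0.8)))            # a point in 2D
print(on_unit_sphere((0.5, 0.5, 0.5, 0.5)))  # a point in 4D
print(on_unit_sphere((1.0, 1.0, 1.0)))       # a cube corner, not on the sphere
```

The code never needs to "see" the shape; the algebra carries the geometry into dimensions that cannot be drawn.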
16
Why are we looking at spheres?
(Figure: a row of shapes joined by equals signs)
Poincaré Conjecture:
All physical objects without holes
are “equivalent” to a sphere.
17
The power of higher dimensions
• A sphere in 4D can model the birth and death process of
physical objects
• Point clouds = approximate geometric shapes
• High dimensional features can model many things
Visualizing Feature Space
19
The challenge of high-dimensional geometry
• Feature space can have hundreds to millions of
dimensions
• In high dimensions, our geometric imagination is limited
- Algebra comes to our aid
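As one illustration of algebra standing in for geometric intuition, the Pythagorean formula extends to any number of dimensions as the Euclidean distance. The sketch below assumes vectors are stored sparsely as `{dimension: value}` dictionaries; that storage choice and the function name are illustrative, not something the slides prescribe.

```python
import math


def euclidean_distance(u, v):
    """Distance between two sparse vectors stored as {dimension: value} dicts."""
    dimensions = set(u) | set(v)  # only dimensions where either vector is nonzero
    return math.sqrt(sum((u.get(d, 0) - v.get(d, 0)) ** 2 for d in dimensions))


doc_a = {"puppy": 1, "cute": 1}
doc_b = {"cat": 1, "cute": 1}
print(euclidean_distance(doc_a, doc_b))
```

The feature space may have millions of dimensions, but the sum only touches the handful where the two documents differ from zero.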
20
Visualizing bag-of-words
(Figure: axes “puppy” and “cute”, each marked at 1, with the document below plotted as a point)
I have a puppy and it is extremely cute
Word        Count
it          1
they        0
I           1
am          0
how         0
puppy       1
and         1
cat         0
aardvark    0
zebra       0
cute        1
extremely   1
…           …
21
Visualizing bag-of-words
(Figure: axes “puppy”, “cute”, and “extremely”, each marked at 1, with three documents plotted as points)
I have a puppy and it is extremely cute
I have an extremely cute cat
I have a cute puppy
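To make the picture concrete, the three documents can be projected onto the three axes named on this slide. This is a sketch only: `to_point` is a hypothetical helper, and whitespace splitting is an assumed tokenizer.

```python
AXES = ("puppy", "cute", "extremely")

DOCUMENTS = [
    "I have a puppy and it is extremely cute",
    "I have an extremely cute cat",
    "I have a cute puppy",
]


def to_point(document, axes=AXES):
    """Count how often each axis word occurs; the counts are the coordinates."""
    words = document.lower().split()
    return tuple(words.count(axis) for axis in axes)


points = [to_point(d) for d in DOCUMENTS]
for document, point in zip(DOCUMENTS, points):
    print(point, document)
```

Each document lands on a different corner of the unit cube in (puppy, cute, extremely) space; a larger corpus fills this space with a point cloud.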
22
Document point cloud
word 1
word 2
23
What is a model?
• Model = mathematical “summary” of data
• What’s a summary?
- A geometric shape
24
Classification model
Feature 2
Feature 1
Decide between two classes
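One simple geometric shape for deciding between two classes is a straight line (a hyperplane in higher dimensions). The sketch below assumes a linear classifier with hand-picked, hypothetical weights; the slide does not name a specific model, and in practice the weights would be learned from data.

```python
def classify(point, weights, bias):
    """Return +1 or -1 depending on which side of the line w·x + b = 0 the point falls."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1


# Hypothetical boundary: the line where Feature 1 equals Feature 2.
WEIGHTS = (1.0, -1.0)
BIAS = 0.0

print(classify((3.0, 1.0), WEIGHTS, BIAS))
print(classify((1.0, 3.0), WEIGHTS, BIAS))
```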
25
Clustering model
Feature 2
Feature 1
Group data points tightly
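As an example of grouping points tightly, here is a minimal sketch of k-means (Lloyd's algorithm), one common clustering method; the slide itself does not name an algorithm. The starting centroids are passed in explicitly to keep the example deterministic, and the data points are made up for illustration.

```python
import math


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        assignments = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, assignments = k_means(POINTS, centroids=[(0, 0), (10, 10)])
print(centroids)
print(assignments)
```

The model here is just the list of centroids: a handful of points summarizing the whole cloud.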
26
Regression model
Target
Feature
Fit the target values
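For a single feature, fitting the target values can be sketched with ordinary least squares, which picks the line that minimizes the squared vertical distance to the points. The function name and the sample data are assumptions for illustration; the slide does not specify a fitting method.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)  # must be nonzero
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)
```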
Visualizing Feature Engineering
28
When does bag-of-words fail?
(Figure: axes “puppy”, “cat”, and “have”, with four documents plotted as points)
I have a puppy
I have a cat
I have a kitten
I have a dog and I have a pen
Task: find a surface that separates
documents about dogs vs. cats
Problem: the word “have” adds fluff
instead of information
29
Improving on bag-of-words
• Idea: “normalize” word counts so that popular words
are discounted
• Term frequency (tf) = number of times a term
appears in a document
• Inverse document frequency of a word (idf) =
$\log\left(N / \text{number of documents containing the word}\right)$
• N = total number of documents
• Tf-idf count = tf × idf
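The definitions above can be sketched in a few lines. This assumes the common form idf = log(N / number of documents containing the word) with a natural logarithm, which agrees with the values log 4 and log 1 = 0 quoted elsewhere in this deck; the function names and whitespace tokenizer are illustrative. The four documents are the example sentences from the preceding bag-of-words slide.

```python
import math

DOCUMENTS = [
    "I have a puppy",
    "I have a cat",
    "I have a kitten",
    "I have a dog and I have a pen",
]


def tokenize(document):
    return document.lower().split()


def idf(word, documents):
    """log(N / number of documents containing the word); the word must occur somewhere."""
    containing = sum(1 for d in documents if word in tokenize(d))
    return math.log(len(documents) / containing)


def tf_idf(word, document, documents):
    """Term frequency in this document, discounted by how common the word is overall."""
    tf = tokenize(document).count(word)
    return tf * idf(word, documents)


print(tf_idf("puppy", DOCUMENTS[0], DOCUMENTS))  # rare word keeps its weight
print(tf_idf("have", DOCUMENTS[3], DOCUMENTS))   # word in every document drops to zero
```

A word that appears in every document gets idf = 0, so its dimension contributes nothing no matter how often it is repeated.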
30
From BOW to tf-idf
(Figure: the bag-of-words point cloud on axes “puppy”, “cat”, and “have”)
I have a puppy
I have a cat
I have a kitten
I have a dog and I have a pen
idf(puppy) = log 4
idf(cat) = log 4
idf(have) = log 1 = 0
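These values can be checked by hand, assuming idf is defined as $\log(N / n_w)$, where $n_w$ (a symbol introduced here for illustration) is the number of documents containing word $w$. With $N = 4$ documents, "puppy" and "cat" each appear in one document and "have" appears in all four:

```latex
\begin{align*}
\mathrm{idf}(\text{puppy}) &= \log\frac{N}{n_{\text{puppy}}} = \log\frac{4}{1} = \log 4 \\
\mathrm{idf}(\text{cat})   &= \log\frac{N}{n_{\text{cat}}}   = \log\frac{4}{1} = \log 4 \\
\mathrm{idf}(\text{have})  &= \log\frac{N}{n_{\text{have}}}  = \log\frac{4}{4} = \log 1 = 0
\end{align*}
```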
31
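From BOW to tf-idf
(Figure: the tf-idf point cloud on axes “puppy”, “cat”, and “have”, with the “puppy” and “cat” axes marked at log 4 and a decision surface drawn between the documents)
tfidf(puppy) = log 4
tfidf(cat) = log 4
tfidf(have) = 0
I have a puppy
I have a cat
I have a dog and I have a pen, I have a kitten
Decision surface
Tf-idf flattens uninformative dimensions in the BOW point cloud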
32
Entry points of feature engineering
• Start from data and task
- What’s the best text representation for classification?
• Start from modeling method
- What kind of features does k-means assume?
- What does linear regression assume about the data?
33
That’s not all, folks!
• There’s a lot more to feature engineering:
- Feature normalization
- Feature transformations
- “Regularizing” models
- Learning the right features
• Dato is hiring! jobs@dato.com
alicez@dato.com @RainyData

Editor's notes

  1. Features sit between raw data and model. They can make or break an application.