Self-Organising Maps for
Customer Segmentation
Theory and worked examples using census and
customer data sets.
Talk for Dublin R Users Group
20/01/2014
Shane Lynn Ph.D. – Data Scientist
www.shanelynn.ie / @shane_a_lynn
Overview
1. Why customer segmentation?
2. What is a self-organising map?
– Algorithm
– Uses
– Advantages
3. Example using Irish Census Data
4. Example using Ta-Feng Grocery Shopping Data
Dublin R / SOMs / Shane Lynn

Customer Segmentation
• Customer segmentation is the application of clustering techniques to customer data.
• Identify cohorts of “similar” customers – common needs, priorities, characteristics.
[Figure: a homogeneous customer base split into segments such as “Agent Loyal”, “Non-traditional” and “Budget-conscious”]
Customer Segmentation
Typical Uses
• Targeted marketing
• Customer retention
• Debt recovery

Techniques
• Single discrete variable
• K-means clustering
• Hierarchical clustering
• Finite mixture modelling
• Self Organising Maps
Self-Organising Maps
A Self-Organising Map (SOM) is a form of unsupervised neural network that produces a low (typically two) dimensional representation of the input space of the set of training samples.
• First described by Teuvo Kohonen (1982) (“Kohonen Map”)
• Over 10k citations referencing SOMs – most cited Finnish scientist.
• Multi-dimensional input data is represented by a 2-D “map” of nodes
• Topological properties of the input space are maintained in the map
Self-Organising Maps
• The SOM visualisation is made up of several nodes.
• Input samples are “mapped” to the most similar node on the SOM. All attributes in the input data are used to determine similarity.
• Each node has a weight vector of the same size as the input space.
• The x and y axes of the map do not correspond to any variable and carry no meaning.

All nodes have:
• Position
• Weight vector
• Associated input samples
Self-Organising Maps
• Everyone in this room stands in Croke Park.
• Each person compares attributes – e.g. age, gender, salary, height.
• Everyone moves until they are closest to other people with the most similar attributes (using a Euclidean distance).
• If everyone holds up a card indicating their age – the result is a SOM heatmap.
[Figure: heatmap colour legend for age, from < 20 to > 60]
Self-Organising Maps
Algorithm
First choose: map size, map type.
1. Initialise all weight vectors randomly.
2. Choose a random datapoint from the training data and present it to the SOM.
3. Find the “Best Matching Unit” (BMU) in the map – the most similar node.
4. Determine the nodes within the “neighbourhood” of the BMU.
– The size of the neighbourhood decreases with each iteration.
5. Adjust weights of nodes in the BMU neighbourhood towards the chosen datapoint.
– The learning rate decreases with each iteration.
– The magnitude of the adjustment is proportional to the proximity of the node to the BMU.
6. Repeat Steps 2-5 for N iterations / convergence.
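
The steps above can be sketched in a few lines of base R. This is a minimal illustration rather than the kohonen package's implementation: the function name `train_som_sketch`, the rectangular grid, the Gaussian neighbourhood weighting and the linear decay schedules are all assumptions chosen for brevity.

```r
# Illustrative SOM training loop (a sketch, not the kohonen package's code).
train_som_sketch <- function(X, xdim = 5, ydim = 5, n_iter = 2000,
                             alpha = c(0.05, 0.01), seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)
  # "First choose": a rectangular xdim-by-ydim map (assumption for simplicity)
  grid <- as.matrix(expand.grid(x = seq_len(xdim), y = seq_len(ydim)))
  # Step 1: initialise weight vectors randomly, here from randomly drawn samples
  codes <- X[sample(nrow(X), nrow(grid), replace = TRUE), , drop = FALSE]
  radius_start <- max(xdim, ydim) / 2
  for (t in seq_len(n_iter)) {
    frac <- (t - 1) / max(n_iter - 1, 1)
    lr <- alpha[1] + frac * (alpha[2] - alpha[1])       # learning rate decays
    radius <- radius_start + frac * (1 - radius_start)  # neighbourhood shrinks to 1
    x <- X[sample(nrow(X), 1), ]                        # Step 2: random datapoint
    diffs <- sweep(codes, 2, x)                         # node weights minus datapoint
    bmu <- which.min(rowSums(diffs^2))                  # Step 3: best matching unit
    grid_d <- sqrt(rowSums(sweep(grid, 2, grid[bmu, ])^2))
    h <- ifelse(grid_d <= radius,                       # Step 4: neighbourhood
                exp(-grid_d^2 / (2 * radius^2)), 0)     # closer nodes move more
    codes <- codes - lr * h * diffs                     # Step 5: move towards x
  }                                                     # Step 6: repeat
  list(codes = codes, grid = grid)
}

X <- scale(iris[, 1:4])
fit <- train_som_sketch(X)
dim(fit$codes)  # one row per node, one column per input variable
```

The built-in `iris` data is used only so the sketch runs as-is; any scaled numeric matrix would do.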
Self-Organising Maps
Example – Color Classification

• SOM training on RGB values, e.g. (R,G,B) = (255,0,0)
• 3-D dataset -> 2-D SOM representation
• Similar colours have similar RGB values / similar position
Self-Organising Maps
• On the SOM, the actual distance between nodes is not a measure of quantitative (dis)similarity.
• Unified distance matrix (U-matrix)
– Distance between the weight vectors of neighbouring nodes.
– Used to determine clusters of nodes on the map.
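
As a rough illustration of the U-matrix idea, the sketch below computes, for each node, the mean Euclidean distance between its weight vector and those of its grid neighbours. The function name `u_matrix_sketch`, the rectangular grid with four-neighbour connectivity and the row ordering of `codes` are assumptions made for this example; the kohonen package draws its own version through the "dist.neighbours" plot type.

```r
# U-matrix sketch for a rectangular grid (assumption: 4-neighbour connectivity).
# codes: one row per node, ordered so that node (x, y) is row (y - 1) * xdim + x.
u_matrix_sketch <- function(codes, xdim, ydim) {
  u <- matrix(NA_real_, nrow = ydim, ncol = xdim)
  for (y in seq_len(ydim)) {
    for (x in seq_len(xdim)) {
      i <- (y - 1) * xdim + x
      # candidate neighbours: left, right, below, above
      nb <- rbind(c(x - 1, y), c(x + 1, y), c(x, y - 1), c(x, y + 1))
      ok <- nb[, 1] >= 1 & nb[, 1] <= xdim & nb[, 2] >= 1 & nb[, 2] <= ydim
      j <- (nb[ok, 2] - 1) * xdim + nb[ok, 1]
      # Euclidean distance from node i's weights to each neighbour's weights
      d <- sqrt(rowSums(sweep(codes[j, , drop = FALSE], 2, codes[i, ])^2))
      u[y, x] <- mean(d)
    }
  }
  u
}

# Toy 3 x 1 map with one-dimensional weights: two similar nodes and one distant node
codes <- matrix(c(0, 0, 10), ncol = 1)
u <- u_matrix_sketch(codes, xdim = 3, ydim = 1)
u  # large values mark a boundary between groups of similar nodes
```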

http://davis.wpi.edu/~matt/courses/soms/applet.html
[Figure: a SOM and its U-matrix]
Self-Organising Maps
Heatmaps
• Heatmaps are used to discover patterns between variables.
• Heatmaps colour the map by chosen variables – each node is coloured using the average value of all linked data points.
• Can use variables not in the training set.
Dummy Variables
• SOMs only operate with numeric variables.
• Categorical variables must be converted to dummy variables.

Gender    Gender_M    Gender_F
Male      1           0
Female    0           1
Female    0           1
Male      1           0
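
One way to build such dummy columns in base R is `model.matrix()`. The sketch below reproduces the table above; the data frame `customers` and the renamed column labels are illustrative assumptions.

```r
# Hypothetical input: one categorical column
customers <- data.frame(
  Gender = factor(c("Male", "Female", "Female", "Male"),
                  levels = c("Male", "Female"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ Gender - 1, data = customers)
colnames(dummies) <- c("Gender_M", "Gender_F")
dummies
```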

Self-Organising Maps
• Data collection: form data with samples relevant to the subject.
• Data scaling: center and scale data. Prevents any one variable dominating the model.
• Data “capping”: removing outliers improves heatmap scales.
• Build SOM: choose SOM parameters and run the iterative algorithm.
• Examine heatmaps
• Examine clusters
Self-Organising Maps
• “kohonen” package in R
• Well documented with examples
• Key functions:
– somgrid()
• Initialise a SOM node set
– som()
• Can change – radius, learning rate, iterations, neighbourhood type
– plot.kohonen()
• Visualisation of resulting SOM
• Supports heatmaps, node counts, properties, u-matrix etc.

Other Software
• Viscovery
• SPSS Modeler
• MATLAB (NN toolbox)
• WEKA
• Synapse
• Spice-SOM
Census Data Example
• Data from Census 2011
• Population divided into 18,488 “small areas” (SA)
• 767 variables available on 15 themes.
• Interactive viewing software online.
• Data available via csv download.
• Focusing on Dublin area for this talk.
• Aim to segment based on selected variables.
Census Data Example
data <- read.csv("../data/census-data-smallarea.csv")

• Data is formatted as person counts per SA
• Need to extract comparable statistics from count data
– Change count data to percentages / numbers etc.
marital_data <- data[, c("T1_2SGLT", "T1_2MART", "T1_2SEPT", "T1_2DIVT", "T1_2WIDT", "T1_2T")]
# divide each marital-status count by the area total (column 6) and convert to a percentage
marital_percents <- data.frame(t(apply(marital_data, 1, function(x) {x[1:5]/x[6]})) * 100)

– Change “ranked” variables to numeric values,
e.g. Education: [None, Primary, Secondary, 3rd Level, Postgrad] changed to [0,1,2,3,4].
Allows us to calculate “mean education level”.
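
A small sketch of the “mean education level” calculation, assuming one count column per education level. The data frame `edu_counts`, its column names and its values are made up for illustration and are not taken from the census file.

```r
# Hypothetical person counts per education level for three small areas
edu_counts <- data.frame(
  none        = c(10,  0,  5),
  primary     = c(20, 10,  5),
  secondary   = c(40, 30, 10),
  third_level = c(20, 40, 20),
  postgrad    = c(10, 20, 10)
)
edu_scores <- 0:4  # None = 0, Primary = 1, ..., Postgrad = 4

# weighted mean score per area: sum(score * count) / total persons
avr_education_level <- drop(as.matrix(edu_counts) %*% edu_scores) / rowSums(edu_counts)
avr_education_level
```

Treating an ordered category as equally spaced numbers is a modelling choice; it makes the variable usable in the SOM at the cost of assuming each step up is the same size.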
Census Data Example
# average household size
household_data <- data[, 331:339] # counts of households by size (1-8 persons), then the total
household_sizes <- 1:8 # sizes 1-8 persons
results$avr_household_size <- apply(household_data, 1,
  function(x) {
    # weighted mean: sum(size * count) / total
    sum(x[seq_along(household_sizes)] * household_sizes) / x[length(x)]
  })

• Calculated features for small areas:
> names(data)
 [1] "id"                   "avr_age"              "avr_household_size"   "avr_education_level"  "avr_num_cars"
 [6] "avr_health"           "rented_percent"       "unemployment_percent" "internet_percent"     "single_percent"
[11] "married_percent"      "separated_percent"    "divorced_percent"     "widow_percent"

• Filtered for Fingal, Dublin City, South Dublin, & Dun Laoghaire (4806 SAs)
Census Data Example
• SOM creation step
library(kohonen)
data_train <- data[, c(2,4,5,8)] # age, education, num cars, unemployment
# kohonen functions accept numeric matrices; scale() mean-centers and normalises all columns
data_train_matrix <- as.matrix(scale(data_train))
# create a hexagonal SOM grid with 400 nodes (10 samples per node)
som_grid <- somgrid(xdim = 20, ydim=20, topo="hexagonal")
som_model <- som(data_train_matrix,
                 grid=som_grid,
                 rlen=100,
                 alpha=c(0.05,0.01),
                 keep.data = TRUE,
                 n.hood="circular")

Params:
• rlen – times to present data
• alpha – learning rate
• keep.data – store data in model
• n.hood – neighbourhood shape
Census Data Example
• som_model object
> summary(som_model)
som map of size 20x20 with a hexagonal topology.
Training data included; dimension is 4806 by 4
Mean distance to the closest unit in the map: 0.1158323

• Contains all mapping details between samples and nodes.
• Accessible via the “$” operator
• Various plotting options to visualise map results.
Census Data Example
plot(som_model, type = "changes")
Training progress: minimising “mean distance to closest unit”.

plot(som_model, type = "counts")
Not the default colour palette (coolBlueHotRed.R):
plot(… , palette.name=coolBlueHotRed)
Census Data Example
plot(som_model, type = "dist.neighbours")

plot(som_model, type = "codes")
• Fan diagram shows the distribution of variables across the map.
• Can see patterns by examining dominant colours etc.
• This type of representation is useful for SOMs when the number of variables is less than ~5.
• Good to get a grasp of general patterns in the SOM.
Census Data Example
[Figure: fan diagram annotated with the area of highest education, the area with highest average ages, and the area of highest unemployment]
Census Data Example
# som_model$data is a matrix here, so its column labels come from colnames()
plot(som_model, type = "property", property = som_model$codes[,4],
     main=colnames(som_model$data)[4], palette.name=coolBlueHotRed)

Using the training data, the default scales are normalised – not as meaningful for users.
Census Data Example
var <- 2 # index of the variable to plot
# average the unscaled values of all samples mapped to each node
# (aggregate() skips nodes with no samples, so this assumes every node is occupied)
var_unscaled <- aggregate(as.numeric(data_train[,var]),
                          by=list(som_model$unit.classif),
                          FUN=mean, simplify=TRUE)[,2]
plot(som_model, type = "property", property=var_unscaled,
     main=names(data_train)[var], palette.name=coolBlueHotRed)
Census Data Example
• Very simple to manually identify clusters of nodes.
• By visualising heatmaps, can spot relationships between variables.
• View heatmaps of variables not used to drive the SOM generation.
plotHeatMap(som_model, data, variable=0) #interactive window for selection

Census Data Example
mydata <- som_model$codes
# within-cluster sum of squares for 1 to 15 k-means clusters
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)

• Cluster on SOM codebooks
• Hierarchical clustering
– Start with every node as a cluster
– Combine most similar nodes (once they are neighbours)
– Continue until all combined
• User decision as to how many clusters suit the application
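
A minimal sketch of hierarchical clustering on a codebook using base R's `hclust()` and `cutree()`. The `codes` matrix below is synthetic, standing in for a trained SOM's codebook, and plain `hclust()` does not enforce the “once they are neighbours” constraint described above, so treat this as a simplified stand-in rather than the method used in the talk.

```r
set.seed(1)
# Synthetic codebook: 30 nodes with 2 variables, in three well-separated groups
codes <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.1), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.1), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.1), ncol = 2)
)

# Agglomerative clustering: start with every node as its own cluster and
# repeatedly merge the most similar clusters until all are combined
tree <- hclust(dist(codes))

# The user decides how many clusters suit the application
node_cluster <- cutree(tree, k = 3)
table(node_cluster)
```

With a real model, `codes` would be the SOM codebook; depending on the kohonen version this may be stored as a matrix or as a list of matrices.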
Census Data Example
som_cluster <- clusterSOM(ClusterCodebooks=som_model$codes,
                          LengthClustering=nrow(som_model$codes),
                          Mapping=som_model$unit.classif,
                          MapRow=som_model$grid$xdim,
                          MapColumn=som_model$grid$ydim,
                          StoppingCluster=2 )[[5]]
plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
add.cluster.boundaries(som_model, som_cluster)

[Figure: SOM divided into Clusters 1 to 5; clusters are numbered from the bottom left (Cluster 1)]

Census Data Example
- High level of unemployment
- Low average education
- Lower than average car ownership
“Lower Demographics”
Census Data Example
Cluster 5 details
- High car ownership
- Well educated
- Relatively older demographic
- Larger households
“Commuter Belt”
Census Data Example
Conclusions
• Can build meaningful names / stories for clusters in the population
• Multiple iterations with different variable combinations and heatmaps
• Similar to work by commercial data providers such as “Experian”
Ta-Feng Grocery Shopping
• More complex and realistic example of customer data
• Data set contains information on
– 817,741 grocery shopping transactions.
– 32,266 customers

date        customer_id  age_group  address  product_subclass  product_id  quantity  asset  price
01/01/2001  141833       F          F        130207            4.71E+12    2         44     52
01/01/2001  1376753      E          E        110217            4.71E+12    1         150    129
01/01/2001  1603071      E          G        100201            4.71E+12    1         35     39
01/01/2001  1738667      E          F        530105            4.71E+12    1         94     119
01/01/2001  2141497      A          B        320407            4.71E+12    1         100    159
01/01/2001  1868685      J          E        110109            4.71E+12    1         144    190

• Similar steps as with census example, but with some additional requirements
Ta-Feng Grocery Shopping
• Feature generation using data.table – data.table has significant speed enhancements over a standard data frame or ddply.
> system.time(test1 <- data[, sum(price), by=customer_id])
   user  system elapsed
   0.11    0.00    0.11
> system.time(test2 <- aggregate(data$price, by=list(data$customer_id), FUN=sum))
   user  system elapsed
   1.81    0.03    1.84
> system.time(test3 <- ddply(data[,c("price", "customer_id"), with=F],
+                            .variables="customer_id",
+                            .fun=function(x){sum(x$price)}))
   user  system elapsed
   5.86    0.00    5.86
Ta-Feng Grocery Shopping
• Customer characteristics generated:
– Customer ID
– Total Spend
– Max Item
– Num Baskets
– Total Items
– Items per basket
– Mean item cost
– Profit per basket
– Total Profit
– Sales Items (%)
– Profit Margin
– Budget items (%)
– Premium items (%)
– Item rankings
– Weekend baskets (%)
– Max basket cost

Outliers here will skew all heat map colour scales.
Ta-Feng Grocery Shopping
• Solution is to cap all variables at the 2nd and 98th percentiles
• This pulls extreme outliers back into range and prevents one node discolouring heatmaps
capVector <- function(x, probs = c(0.02,0.98)){
  # cap values below the lower percentile and above the upper percentile
  ranges <- quantile(x, probs=probs, na.rm=TRUE)
  x[x < ranges[1]] <- ranges[1]
  x[x > ranges[2]] <- ranges[2]
  return(x)
}
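
A short usage sketch for the capping function, applied column by column to a data frame. The `customers` data frame and its values are invented for illustration; each column contains one extreme outlier.

```r
capVector <- function(x, probs = c(0.02, 0.98)) {
  # cap values below the lower percentile and above the upper percentile
  ranges <- quantile(x, probs = probs, na.rm = TRUE)
  x[x < ranges[1]] <- ranges[1]
  x[x > ranges[2]] <- ranges[2]
  x
}

set.seed(42)
# Hypothetical customer features with one extreme outlier in each column
customers <- data.frame(
  total_spend = c(rexp(99, rate = 1 / 50), 1e5),
  num_baskets = c(rpois(99, lambda = 5), 400)
)

# Apply the cap to every column; the number of rows is unchanged
capped <- as.data.frame(lapply(customers, capVector))
sapply(customers, max)  # before capping
sapply(capped, max)     # after capping
```

Because values are capped rather than dropped, every customer still maps to a node; only the extremes are pulled back so that the heatmap colour scale stays readable.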

Ta-Feng Grocery Shopping

• Prior to capping the variables – outliers expand the range of the variables considerably
Ta-Feng Grocery Shopping

• Histograms have more manageable ranges after capping the variables.
Ta-Feng Grocery Shopping
• Explore heatmaps and variable distributions to find patterns
Ta-Feng Grocery Shopping
“Steady shoppers”
- Many visits to store
- Profitable overall
- Smaller baskets

“Bargain Hunters”
- Almost exclusively sale items
- Few visits
- Not profitable

“The Big Baskets”
- Few visits
- Largest baskets
- Large profit per basket

“The great unwashed”
- Single / small repeat visitors
- Small baskets
- Large number of customers
Conclusions
• SOMs are a powerful tool to have in your repertoire.
• Advantages include:
– Intuitive method to develop customer segmentation profiles.
– Relatively simple algorithm, easy to explain results to non-data scientists.
– New data points can be mapped to the trained model for predictive purposes.

• Disadvantages:
– Lack of parallelisation capabilities for VERY large data sets.
– Difficult to represent very many variables in a two-dimensional plane.
– Requires clean, numeric data.

• Slides and code will be posted online shortly.
Questions

Shane Lynn
www.shanelynn.ie / @shane_a_lynn
Useful links
• kohonen (SOM) package
– http://cran.r-project.org/web/packages/kohonen/kohonen.pdf

• Census data and online viewer
– http://census.cso.ie/sapmap/
– http://www.cso.ie/en/census/census2011smallareapopulationstatisticssaps/

• Ta Feng data set download
– http://recsyswiki.com/wiki/Grocery_shopping_datasets

Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Self-Organising Maps for Customer Segmentation using R - Shane Lynn - Dublin R

  • 1. Self-Organising Maps for Customer Segmentation Theory and worked examples using census and customer data sets. Talk for Dublin R Users Group 20/01/2014 Shane Lynn Ph.D. – Data Scientist www.shanelynn.ie / @shane_a_lynn
  • 2. Overview 1. Why customer segmentation? 2. What is a self-organising map? – Algorithm – Uses – Advantages 3. Example using Irish Census Data 4. Example using Ta-Feng Grocery Shopping Data Dublin R / SOMs / Shane Lynn 2
  • 3. Customer Segmentation • Customer segmentation is the application of clustering techniques to customer data. • Identify cohorts of “similar” customers – common needs, priorities, characteristics. [Figure: a “Homogeneous Customer Base” split into segments labelled “Agent Loyal”, “Non-traditional” and “Budget-conscious”]
  • 4. Customer Segmentation Typical Uses • Targeted marketing • Customer retention • Debt recovery Techniques • Single discrete variable • K-means clustering • Hierarchical clustering • Finite mixture modelling • Self-Organising Maps
  • 5. Self-Organising Maps A Self-Organising Map (SOM) is a form of unsupervised neural network that produces a low-dimensional (typically two-dimensional) representation of the input space of the set of training samples. • First described by Teuvo Kohonen (1982) (“Kohonen Map”) • Over 10k citations referencing SOMs – Kohonen is the most cited Finnish scientist. • Multi-dimensional input data is represented by a 2-D “map” of nodes • Topological properties of the input space are maintained in the map
  • 6. Self-Organising Maps • The SOM visualisation is made up of several nodes • Input samples are “mapped” to the most similar node on the SOM. All attributes in the input data are used to determine similarity. • Each node has a weight vector of the same size as the input space • The x and y axes of the map carry no variable or meaning. All nodes have: • Position • Weight vector • Associated input samples
  • 7. Self-Organising Maps • Everyone in this room stands in Croke Park. • Each person compares attributes – e.g. age, gender, salary, height. • Everyone moves until they are closest to other people with the most similar attributes (using a Euclidean distance). • If everyone holds up a card indicating their age – the result is a SOM heatmap. [Figure: heatmap with an age legend running from < 20 to > 60]
  • 8. Self-Organising Maps Algorithm First choose: map size, map type. 1. Initialise all weight vectors randomly. 2. Choose a random datapoint from training data and present it to the SOM. 3. Find the “Best Matching Unit” (BMU) in the map – the most similar node. 4. Determine the nodes within the “neighbourhood” of the BMU. – The size of the neighbourhood decreases with each iteration. 5. Adjust weights of nodes in the BMU neighbourhood towards the chosen datapoint. – The learning rate decreases with each iteration. – The magnitude of the adjustment is proportional to the proximity of the node to the BMU. 6. Repeat Steps 2-5 for N iterations / convergence.
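The six steps above can be sketched in a few lines of base R. This is a minimal illustration and not the kohonen implementation: the function name `train_som`, the rectangular grid, the Gaussian neighbourhood and the linear decay schedules are all assumptions made for the example.

```r
# Illustrative base-R SOM trainer; names and schedules are hypothetical choices.
train_som <- function(x, xdim = 5, ydim = 5, n_iter = 2000,
                      alpha = c(0.05, 0.01), seed = 1) {
  set.seed(seed)
  x <- as.matrix(x)
  # "First choose: map size, map type" -> a rectangular xdim x ydim grid
  grid <- as.matrix(expand.grid(gx = seq_len(xdim), gy = seq_len(ydim)))
  n_nodes <- nrow(grid)
  # Step 1: random initial weights (here: randomly chosen training rows)
  codes <- x[sample(nrow(x), n_nodes, replace = TRUE), , drop = FALSE]
  sigma0 <- max(xdim, ydim) / 2
  for (iter in seq_len(n_iter)) {
    frac  <- (iter - 1) / n_iter
    lr    <- alpha[1] + (alpha[2] - alpha[1]) * frac  # learning rate decreases
    sigma <- sigma0 * (1 - frac) + 0.5                # neighbourhood shrinks
    # Step 2: present one random datapoint
    xi <- x[sample(nrow(x), 1), ]
    # Step 3: BMU = node with the smallest squared Euclidean distance
    delta <- sweep(codes, 2, xi)                      # codes minus xi, row by row
    bmu   <- which.min(rowSums(delta^2))
    # Step 4: neighbourhood weight from grid distance to the BMU
    d2 <- rowSums(sweep(grid, 2, grid[bmu, ])^2)
    h  <- exp(-d2 / (2 * sigma^2))
    # Step 5: move every node towards xi, scaled by lr and proximity h
    codes <- codes - lr * h * delta
  }                                                   # Step 6: repeat
  list(codes = codes, grid = grid)
}

# Usage on made-up RGB values, in the spirit of the colour example
set.seed(42)
rgb_data <- matrix(runif(300 * 3, 0, 255), ncol = 3,
                   dimnames = list(NULL, c("R", "G", "B")))
fit <- train_som(scale(rgb_data), xdim = 6, ydim = 6)
dim(fit$codes)  # one weight vector (row) per node
```

Because each update is a weighted average of the old weight and the presented sample, node weights always stay inside the range of the training data.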
  • 10. Self-Organising Maps Example – Colour Classification • SOM training on RGB values, e.g. (R,G,B) = (255,0,0) • 3-D dataset -> 2-D SOM representation • Similar colours have similar RGB values / similar position
  • 11. Self-Organising Maps
  • 12. Self-Organising Maps • On the SOM, actual distance between nodes is not a measure for quantitative (dis)similarity • Unified distance matrix (U-matrix) – Distance between node weights of neighbouring nodes. – Used to determine clusters of nodes on the map. [Figure panels: SOM, U-matrix; source: http://davis.wpi.edu/~matt/courses/soms/applet.html]
  • 13. Self-Organising Maps Heatmaps • Heatmaps are used to discover patterns between variables • Heatmaps colour the map by chosen variables – Each node coloured using average value of all linked data points • Can use variables not in the training set Dummy Variables • SOMs only operate with numeric variables • Categorical variables must be converted to dummy variables. Example, Gender -> (Gender_M, Gender_F): Male -> (1, 0), Female -> (0, 1), Female -> (0, 1), Male -> (1, 0)
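The Gender example above can be reproduced with base R's `model.matrix()`. This is a small sketch; the `customers` data frame is made up to match the four rows on the slide, and the column renaming is only there to mirror the slide's labels.

```r
# Hypothetical four-customer data frame matching the slide's example rows
customers <- data.frame(
  gender = factor(c("Male", "Female", "Female", "Male"),
                  levels = c("Male", "Female"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ gender - 1, data = customers)
colnames(dummies) <- c("Gender_M", "Gender_F")
dummies
```

For a factor with many levels the same call produces one column per level, which can then be bound to the other numeric training variables with `cbind()`.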
  • 14. Self-Organising Maps • Data collection – Form data with samples relevant to subject • Data scaling – Centre and scale data. Prevents any one variable dominating the model • Data “capping” – Removing outliers improves heatmap scales • Build SOM – Choose SOM parameters and run iterative algorithm • Examine heatmaps • Examine clusters
  • 15. Self-Organising Maps • “kohonen” package in R • Well documented with examples • Key functions: – somgrid() • Initialise a SOM node set – som() • Can change radius, learning rate, iterations, neighbourhood type – plot.kohonen() • Visualisation of resulting SOM • Supports heatmaps, node counts, properties, u-matrix etc. Other Software • Viscovery • SPSS Modeler • MATLAB (NN toolbox) • WEKA • Synapse • Spice-SOM
  • 16. Census Data Example • Data from Census 2011 • Population divided into 18,488 “small areas” (SA) • 767 variables available across 15 themes. • Interactive viewing software online. • Data available via csv download. • Focusing on Dublin area for this talk. • Aim to segment based on selected variables.
  • 17. Census Data Example
        data <- read.csv("../data/census-data-smallarea.csv")
    • Data is formatted as person counts per SA
    • Need to extract comparable statistics from count data
      – Change count data to percentages / numbers etc.
        marital_data <- data[, c("T1_2SGLT", "T1_2MART", "T1_2SEPT", "T1_2DIVT", "T1_2WIDT", "T1_2T")]
        marital_percents <- data.frame(t(apply(marital_data, 1, function(x) {x[1:5]/x[6]})) * 100)
      – Change “ranked” variables to numeric values, i.e. Education: [None, Primary, Secondary, 3rd Level, Postgrad] changed to [0,1,2,3,4]. Allows us to calculate “mean education level”.
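A small worked example of both transformations, using made-up counts for two small areas. The column names follow the slide; every number here is hypothetical. Dividing the data frame by the totals column is a vectorised alternative to the `apply()` call and should give the same percentages.

```r
# Hypothetical marital-status counts for two small areas (last column = total)
marital_data <- data.frame(
  T1_2SGLT = c(50, 10), T1_2MART = c(30, 70), T1_2SEPT = c(5, 5),
  T1_2DIVT = c(5, 5),   T1_2WIDT = c(10, 10), T1_2T    = c(100, 100)
)

# Each row is divided by its own total, then expressed as a percentage
marital_percents <- marital_data[, 1:5] / marital_data$T1_2T * 100

# Ranked variable: score the education levels 0..4, then take the
# count-weighted mean to get a "mean education level" for one area
education_counts <- c(None = 10, Primary = 20, Secondary = 40,
                      `3rd Level` = 20, Postgrad = 10)
mean_education <- weighted.mean(0:4, w = education_counts)
```

With these counts the mean education level works out at (0·10 + 1·20 + 2·40 + 3·20 + 4·10) / 100 = 2, i.e. "Secondary" on average.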
  • 18. Census Data Example
        # average household size
        household_data <- data[, 331:339]   # data on household sizes and totals
        mean_household_size <- c(1:8)       # sizes 1-8 persons
        results$avr_household_size <- apply(household_data, 1, function(x) {
          sum(x[1:length(mean_household_size)] * mean_household_size) / x[length(x)]
        })
    • Calculated features for small areas:
        > names(data)
         [1] "id"              "avr_age"           "avr_household_size"   "avr_education_level" "avr_num_cars"
         [6] "avr_health"      "rented_percent"    "unemployment_percent" "internet_percent"    "single_percent"
        [11] "married_percent" "separated_percent" "divorced_percent"     "widow_percent"
    • Filtered for Fingal, Dublin City, South Dublin, & Dun Laoghaire (4806 SAs)
  • 19. Census Data Example • SOM creation step
        require(kohonen)
        data_train <- data[, c(2,4,5,8)]   # age, education, num cars, unemployment
        data_train_matrix <- as.matrix(scale(data_train))
        som_grid <- somgrid(xdim = 20, ydim = 20, topo = "hexagonal")
        som_model <- som(data_train_matrix, grid = som_grid, rlen = 100,
                         alpha = c(0.05, 0.01), keep.data = TRUE, n.hood = "circular")
  • 20. Census Data Example • SOM creation step, annotated: • kohonen functions accept numeric matrices, hence as.matrix() • scale() mean-centres and normalises all columns • somgrid(xdim = 20, ydim = 20, topo = "hexagonal") creates a hexagonal SOM grid with 400 nodes (roughly 10 samples per node) Params: • rlen – number of times to present the data • alpha – learning rate • keep.data – store data in model • n.hood – neighbourhood shape
  • 21. Census Data Example • som_model object
        > summary(som_model)
        som map of size 20x20 with a hexagonal topology.
        Training data included; dimension is 4806 by 4
        Mean distance to the closest unit in the map: 0.1158323
    • Contains all mapping details between samples and nodes.
    • Accessible via “$” operator
    • Various plotting options to visualise map results.
  • 22. Census Data Example
        plot(som_model, type = "changes")
        plot(som_model, type = "counts")
  • 23. Census Data Example • plot(som_model, type = "changes") shows training progress: minimising “mean distance to closest unit” • plot(som_model, type = "counts") is drawn with a non-default colour palette (coolBlueHotRed.R): plot(… , palette.name=coolBlueHotRed)
  • 24. Census Data Example
        plot(som_model, type = "dist.neighbours")
        plot(som_model, type = "codes")
  • 25. • Fan diagram shows distribution of variables across map. • Can see patterns by examining dominant colours etc. • This type of representation is useful for SOMs when the number of variables is less than ~ 5 • Good to get a grasp of general patterns in SOM
  • 26. Census Data Example [Figure annotations: area of highest education; area with highest average ages; area of highest unemployment]
  • 27. Census Data Example
        plot(som_model, type = "property", property = som_model$codes[,4],
             main = colnames(som_model$data)[4], palette.name = coolBlueHotRed)
    • Using training data, default scales are normalised – not as meaningful for users.
  • 28. Census Data Example
        var <- 2   # define the variable to plot
        var_unscaled <- aggregate(as.numeric(data_train[,var]), by = list(som_model$unit.classif),
                                  FUN = mean, simplify = TRUE)[,2]
        plot(som_model, type = "property", property = var_unscaled,
             main = names(data_train)[var], palette.name = coolBlueHotRed)
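One caveat with the `aggregate()` approach, offered as an observation rather than something stated in the talk: it only returns a row for nodes that have at least one sample, so if any node is empty the resulting vector is shorter than the number of nodes and the heatmap colours can shift. A hedged sketch of a per-node mean that keeps empty nodes as `NA` follows; `unit_classif` and `values` are made-up stand-ins for `som_model$unit.classif` and an unscaled training column.

```r
n_nodes      <- 6                               # hypothetical map with 6 nodes
unit_classif <- c(1, 1, 2, 4, 4, 4, 6)          # node index per sample; nodes 3 and 5 are empty
values       <- c(30, 40, 25, 50, 60, 70, 35)   # unscaled variable, one value per sample

# Mean per occupied node, then slot the means into a full-length vector
means      <- tapply(values, unit_classif, mean)
node_means <- rep(NA_real_, n_nodes)
node_means[as.integer(names(means))] <- means
node_means
```

The full-length `node_means` vector has one entry per node and so lines up with the map regardless of how many nodes are empty.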
  • 29. Census Data Example • Very simple to manually identify clusters of nodes. • By visualising heatmaps, can spot relationships between variables. • View heatmaps of variables not used to drive the SOM generation.
        plotHeatMap(som_model, data, variable=0)   # interactive window for selection
  • 31. Census Data Example
        mydata <- som_model$codes
        wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
        for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
    • Cluster on SOM codebooks
    • Hierarchical clustering
      – Start with every node as a cluster
      – Combine most similar nodes (once they are neighbours)
      – Continue until all combined
    • User decision as to how many clusters suit application
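A self-contained sketch of both ideas on a made-up codebook matrix (40 hypothetical node weight vectors in two obvious groups). Note that plain `hclust()` does not enforce the "once they are neighbours" constraint from the slide; it is used here only to show the mechanics of cutting a hierarchy into a chosen number of clusters.

```r
set.seed(1)
# Hypothetical codebook: 40 nodes x 2 variables, two well-separated groups
codes <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
               matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

# Within-cluster sum of squares for k = 1..6; look for the "elbow"
wss <- (nrow(codes) - 1) * sum(apply(codes, 2, var))
for (k in 2:6) wss[k] <- sum(kmeans(codes, centers = k, nstart = 10)$withinss)

# Hierarchical clustering of the codebook vectors, cut into 2 clusters
clusters <- cutree(hclust(dist(codes)), k = 2)
table(clusters)
```

Plotting `wss` against `1:6` gives the usual elbow chart; the number of clusters remains a user decision, as the slide says.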
  • 32. Census Data Example
        som_cluster <- clusterSOM(ClusterCodebooks = som_model$codes,
                                  LengthClustering = nrow(som_model$codes),
                                  Mapping = som_model$unit.classif,
                                  MapRow = som_model$grid$xdim,
                                  MapColumn = som_model$grid$ydim,
                                  StoppingCluster = 2)[[5]]
        plot(som_model, type = "mapping", bgcol = pretty_palette[som_cluster], main = "Clusters")
        add.cluster.boundaries(som_model, som_cluster)
    [Figure: SOM coloured by cluster, Clusters 1 to 5, numbered from the bottom left]
  • 34. Census Data Example “Lower Demographics” - High level of unemployment - Low average education - Lower than average car ownership
  • 35. Census Data Example Cluster 5 details: “Commuter Belt” - High car ownership - Well educated - Relatively older demographic - Larger households
  • 36. Census Data Example
  • 37. Census Data Example Conclusions • Can build meaningful names / stories for clusters in population • Multiple iterations with different variable combinations and heatmaps • Similar to work by commercial data providers such as Experian
  • 38. Ta-Feng Grocery Shopping • More complex and realistic example of customer data • Data set contains information on – 817,741 grocery shopping transactions. – 32,266 customers
        date        customer_id  age_group  address  product_subclass  product_id  quantity  asset  price
        01/01/2001  141833       F          F        130207            4.71E+12    2         44     52
        01/01/2001  1376753      E          E        110217            4.71E+12    1         150    129
        01/01/2001  1603071      E          G        100201            4.71E+12    1         35     39
        01/01/2001  1738667      E          F        530105            4.71E+12    1         94     119
        01/01/2001  2141497      A          B        320407            4.71E+12    1         100    159
        01/01/2001  1868685      J          E        110109            4.71E+12    1         144    190
    • Similar steps as with census example, but with some additional requirements
  • 39. Ta-Feng Grocery Shopping • Feature generation using data.table – data.table has significant speed enhancements over a standard data frame or ddply.
        > system.time(test1 <- data[, sum(price), by=customer_id])
           user  system elapsed
           0.11    0.00    0.11
        > system.time(test2 <- aggregate(data$price, by=list(data$customer_id), FUN=sum))
           user  system elapsed
           1.81    0.03    1.84
        > system.time(test3 <- ddply(data[,c("price", "customer_id"), with=F],
        +                            .variables="customer_id",
        +                            .fun=function(x){sum(x$price)}))
           user  system elapsed
           5.86    0.00    5.86
  • 40. Ta-Feng Grocery Shopping • Customer characteristics generated: Customer ID, Total Spend, Max Item, Num Baskets, Total Items, Items per basket, Mean item cost, Profit per basket, Total Profit, Sales Items (%), Profit Margin, Budget items (%), Premium items (%), Item rankings, Weekend baskets (%), Max basket cost • Outliers here will skew all heat map colour scales
  • 41. Ta-Feng Grocery Shopping • Solution is to cap all variables at the 2nd and 98th percentiles • This pulls extreme outliers back to the cap values and prevents one node discolouring heatmaps
        capVector <- function(x, probs = c(0.02, 0.98)) {
          # cap values above and below the specified percentiles
          ranges <- quantile(x, probs = probs, na.rm = TRUE)
          x[x < ranges[1]] <- ranges[1]
          x[x > ranges[2]] <- ranges[2]
          return(x)
        }
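A short usage sketch of `capVector()` on a made-up spend vector with one extreme outlier. The function body is the one from the slide; the `spend` and `features` data are hypothetical.

```r
capVector <- function(x, probs = c(0.02, 0.98)) {
  # cap values above and below the specified percentiles
  ranges <- quantile(x, probs = probs, na.rm = TRUE)
  x[x < ranges[1]] <- ranges[1]
  x[x > ranges[2]] <- ranges[2]
  return(x)
}

# Hypothetical customer spend: 99 ordinary values and one extreme outlier
spend  <- c(1:99, 10000)
capped <- capVector(spend)
range(spend)    # dominated by the outlier
range(capped)   # bounded by the 2nd and 98th percentiles

# Applying the cap to every column of a (hypothetical) feature data frame
features <- data.frame(total_spend = spend, num_baskets = c(rep(1:10, 9), 1:9, 500))
features[] <- lapply(features, capVector)
```

Capping keeps every customer in the data set, which matters for segmentation: the outlying customers are still mapped to a node, they just no longer stretch the colour scale.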
  • 42. Ta-Feng Grocery Shopping • Prior to capping the variables – outliers expand the range of the variables considerably
  • 43. Ta-Feng Grocery Shopping • Histograms have more manageable ranges after capping the variables.
  • 44. Ta-Feng Grocery Shopping • Explore heatmaps and variable distributions to find patterns
  • 46. Ta-Feng Grocery Shopping “Steady shoppers” - Many visits to store - Profitable overall - Smaller baskets “Bargain Hunters” - Almost exclusively sale items - Few visits - Not profitable “The Big Baskets” - Few visits - Largest baskets - Large profit per basket “The great unwashed” - Single / small repeat visitors - Small baskets - Large number of customers
  • 47. Conclusions • SOMs are a powerful tool to have in your repertoire. • Advantages include: – Intuitive method to develop customer segmentation profiles. – Relatively simple algorithm, easy to explain results to non-data scientists – New data points can be mapped to trained model for predictive purposes. • Disadvantages: – Lack of parallelisation capabilities for VERY large data sets – Difficult to represent very many variables in two dimensional plane – Requires clean, numeric data. • Slides and code will be posted online shortly.
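Mapping a new data point to a trained model amounts to scaling it with the training centre and scale, then finding its best matching unit. The base-R sketch below illustrates that idea with a made-up two-node codebook; `map_to_bmu` is a hypothetical helper, not a kohonen function. The kohonen package has its own `map()` method for trained models, whose arguments are best checked in the documentation for the installed version.

```r
# Hypothetical training data and its scaling parameters
train <- matrix(c(20, 30, 40, 50, 1, 2, 3, 4), ncol = 2,
                dimnames = list(NULL, c("age", "cars")))
train_scaled <- scale(train)
centre <- attr(train_scaled, "scaled:center")
spread <- attr(train_scaled, "scaled:scale")

# Stand-in codebook with two "nodes" (rows 1 and 4 of the scaled data)
codes <- train_scaled[c(1, 4), , drop = FALSE]

# Scale new samples like the training data, then return the index of the
# nearest codebook vector (squared Euclidean distance) for each sample
map_to_bmu <- function(new_x, codes, centre, spread) {
  z <- scale(new_x, center = centre, scale = spread)
  apply(z, 1, function(row) which.min(colSums((t(codes) - row)^2)))
}

new_customers <- matrix(c(22, 48, 1, 4), ncol = 2,
                        dimnames = list(NULL, c("age", "cars")))
bmu <- map_to_bmu(new_customers, codes, centre, spread)
```

Reusing the training centre and scale is the important design choice: re-scaling new data on its own statistics would place it in a different space from the codebook vectors.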
  • 48. Questions? Shane Lynn, www.shanelynn.ie / @shane_a_lynn
  • 49. Useful links • kohonen package – http://cran.r-project.org/web/packages/kohonen/kohonen.pdf • Census data and online viewer – http://census.cso.ie/sapmap/ – http://www.cso.ie/en/census/census2011smallareapopulationstatisticssaps/ • Ta Feng data set download – http://recsyswiki.com/wiki/Grocery_shopping_datasets