Data Mining
Eng. Mahmoud Ouf
Lecture 01
mmouf@2017
Data Mining Definition
There are several definitions for Data Mining:
• Mining is a term characterizing the process that finds a small set of
important knowledge from a great deal of raw material.
• Knowledge mining from data
• Knowledge extraction
• Data/Pattern analysis.
Data mining is an essential step in the process of knowledge discovery
from Data (KDD).
mmouf@2017
Knowledge Discovery from Data (KDD)
The knowledge discovery (KDD) process is an iterative sequence of the
following steps:
1. Data cleaning
2. Data integration
3. Data selection
4. Data transformation
5. Data mining
6. Pattern evaluation
7. Knowledge presentation
Steps 1 through 4 are different forms of data preprocessing, where data
are prepared for mining.
mmouf@2017
What Kinds of Data Can Be Mined?
• Database Data:
searching for trends or data patterns
detecting deviations
• Data Warehouses
Although data warehouse tools help support data analysis, additional
tools for data mining are often needed for in-depth analysis
• Transactional Data
Market basket Data Analysis
mmouf@2017
What Kinds of Patterns Can Be Mined?
• Data mining functionalities include:
• Characterization and discrimination
• Mining of frequent patterns, associations, and correlations
• Classification and regression
• Clustering analysis and outlier analysis
Data mining functionality can be classified into two categories:
• Descriptive mining tasks characterize properties of the data in a target
data set.
• Predictive mining tasks perform induction on the current data in order
to make predictions.
mmouf@2017
What Kinds of Patterns Can Be Mined?
mmouf@2017
Data Mining
• Descriptive: Association Rule, Clustering, Summarization
• Predictive: Classification, Regression, Time Series
Descriptive Data mining
• This is used to generate correlations, frequencies, and cross tabulations.
• It can be used to discover regularities in the data and to uncover
patterns.
• It is also used to find subgroups in the bulk of data.
mmouf@2017
Association Rules:
What is Association rule?
An association rule is a method for discovering interesting relations
between variables in large databases. It is intended to identify strong
rules discovered in databases.
To select interesting rules, constraints on various measures of
significance are used.
The best-known constraints are minimum thresholds on support and
confidence.
mmouf@2017
Association Rules
Example:
Assume X = {Bread, Butter}
Assume Y = {Milk}
Transaction_ID Milk Bread Butter Beer Diapers
1 1 1 0 0 0
2 0 1 1 0 0
3 0 0 0 1 1
4 1 1 1 0 0
5 0 1 0 0 0
Association Rules
Support:
The support value of X with respect to T is defined as the proportion of
transactions in the database which contain the itemset X.
Supp(X) = (Number of transactions containing the itemset X = {Bread, Butter}) / (Total number of transactions)
Transactions containing the itemset X: Transaction 2 and Transaction 4
Total number of transactions = 5
Supp(X) = 2 / 5 = 0.4
This means 40% of all transactions contain the itemset X
Association Rules
Confidence:
The confidence value of a rule, X => Y, with respect to a set of
transactions T, is the proportion of the transactions that contain X
which also contain Y.
Conf(X => Y) = Supp(X ∪ Y) / Supp(X)
X ∪ Y = {Bread, Butter, Milk}
Supp(X ∪ Y) = 1 / 5 = 0.2
Conf(X => Y) = 0.2 / 0.4 = 0.5
This means 50% of the transactions containing Bread and Butter also contain
Milk.
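As an illustration, the support and confidence above could be computed with a short Python sketch like the following; the transaction list mirrors the example table and the function names are only placeholders.

# Minimal sketch: support and confidence for the example transactions.
transactions = [
    {"Milk", "Bread"},
    {"Bread", "Butter"},
    {"Beer", "Diapers"},
    {"Milk", "Bread", "Butter"},
    {"Bread"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item of the itemset.
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    # Conf(X => Y) = Supp(X ∪ Y) / Supp(X)
    return support(x | y, transactions) / support(x, transactions)

X, Y = {"Bread", "Butter"}, {"Milk"}
print(support(X, transactions))        # 0.4
print(confidence(X, Y, transactions))  # 0.5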
Association Rules:
Mining one level Association (Apriori)
Apriori is an algorithm for frequent item set mining and association rule
learning over transactional databases.
It proceeds by identifying the frequent individual items in the database
and extending them to larger and larger item sets as long as those item
sets appear sufficiently often in the database.
The frequent item sets determined by Apriori can be used to determine
association rules which highlight general trends in the database.
This has applications in domains such as market basket analysis.
mmouf@2017
Association Rules:
Mining one level Association (Apriori)
Example:
Assume the following Database transaction:
With minimum support = 0.5 (i.e. a count of 2 transactions)
mmouf@2017
Transaction Items
T1 Milk, Bread, Cookies, Juice
T2 Milk, Juice
T3 Milk, Egg
T4 Bread, Cookies, Coffee
Association Rules:
Mining one level Association (Apriori)
Solution:
Step1: Create 1st Level Item set
mmouf@2017
Item Support
Milk 3
Bread 2
Cookies 2
Juice 2
Egg 1
Coffee 1
Rejected as they are Below
the minimum support
Association Rules:
Mining one level Association (Apriori)
Solution:
Step2: Create 2nd Level Item set
mmouf@2017
Items Support
Milk, Bread 1
Milk, Cookies 1
Milk, Juice 2
Bread, Cookies 2
Bread, Juice 1
Cookies, Juice 1
Rejected as they are Below
the minimum support
Association Rules:
Mining one level Association (Apriori)
Solution:
Step3: Create 3rd Level Item set
There is no association at the 3rd level item set
mmouf@2017
Rejected as they are Below
the minimum support
Items Support
Milk, Juice, Bread 1
Milk, Juice, Cookies 1
Milk, Bread, Cookies 1
Juice, Bread, Cookies 1
Association Rules:
Mining one level Association (Apriori)
Solution:
We stop combining itemsets in one of two cases:
• All the itemsets at the last level are rejected because they are below the
minimum support
• We reach a level whose itemsets contain all the frequent items
Last Step: Association Rules
Milk=>Juice [support = 0.5, confidence = 0.67]
Juice=>Milk [support = 0.5, confidence = 1]
Bread=>Cookies [support = 0.5, confidence = 1]
Cookies=>Bread [support = 0.5, confidence = 1]
mmouf@2017
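The level-wise procedure above can be sketched in a few lines of Python; this is a simplified illustration under the stated minimum support count of 2, with candidate generation done by a plain self-join rather than the full Apriori pruning.

# Minimal Apriori-style sketch for the worked example (minimum support count = 2).
transactions = [
    {"Milk", "Bread", "Cookies", "Juice"},
    {"Milk", "Juice"},
    {"Milk", "Egg"},
    {"Bread", "Cookies", "Coffee"},
]
min_count = 2

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if count(frozenset([i])) >= min_count]
all_frequent = list(frequent)

# Higher levels: join frequent itemsets of the previous level.
k = 2
while frequent:
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    frequent = [c for c in candidates if count(c) >= min_count]
    all_frequent.extend(frequent)
    k += 1

for s in all_frequent:
    print(sorted(s), count(s))

The rules on the previous slide (e.g. Milk => Juice) can then be read off the frequent 2-itemsets by dividing the itemset support by the support of the rule's left-hand side.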
Association Rules
Mining Multilevel Associations
Usually the data are in the form of a hierarchy, so a strong association
discovered at a high level may not reflect a real, useful pattern.
We may want to drill down to find novel patterns at more detailed
levels.
On the other hand, there could be too many scattered patterns at low or
primitive abstraction levels, some of which are just trivial
specializations of patterns at higher levels.
Association rules generated from mining data at multiple abstraction
levels are called multiple-level or multilevel association rules.
Multilevel association rules can be mined efficiently using concept
hierarchies under a support-confidence framework.
Association Rules
Mining Multilevel Associations
In general, a top-down strategy is used, where counts are accumulated
for the calculation of frequent itemsets at each concept level, starting at
concept level 1 and working downward in the hierarchy toward the
more specific concept levels, until no more frequent itemsets can be
found.
For each level, any algorithm for discovering frequent itemsets may be
used, such as Apriori.
Only the descendants of the frequent items at level 1 (L[1, 1]) are
considered as candidates for the level-2 frequent 1-itemsets.
[Concept hierarchy figure: the root "food" branches into milk, bread, jam, and juice (level 1); these are refined at level 2 (e.g. milk into skim, 2%, and 4%; bread into bran and white; jam into cherry and plum; juice into apple, grape, and prune); level 3 holds the brands (e.g. Dairy Land, Foremost, Wonder), numbered 1, 2, 3, ... within each parent.]
Association Rules
Mining Multilevel Associations
Example:
Assume the following Encoded Transaction Table
The Item is defined by its hierarchy.
Example: for item 111, the first digit 1 (hundreds) represents level 1 "milk",
the second digit 1 (tens) represents level 2 "2%" and
the third digit 1 (units) represents level 3 "Dairy Land"
Minimum Support = 4 (For Level 1) minsup[1] = 4
TID Items
T1 {111, 121, 211, 221}
T2 {111, 211, 222, 323}
T3 {112, 122, 221, 411}
T4 {111, 121}
T5 {111, 122, 211, 221, 413}
T6 {211, 323, 524}
T7 {323, 411, 524, 713}
Association Rules
Mining Multilevel Associations
Level 1: (minsupp[1] = 4)
•Level 1 Frequent 1 item set L[1, 1]
ItemSet Support
{1**} 5
{2**} 5
{3**} 3
{4**} 2
{5**} 1
{7**} 1
Canceled as they are Below
the minimum support
Association Rules
Mining Multilevel Associations
Level 1: (minsupp[1] = 4)
•Level 1 Frequent 2 item set: L[1, 2].
•Produce the filtered transaction table:
Remove any infrequent itemset from each transaction
Remove any transaction that contains only infrequent itemsets
An infrequent itemset is an itemset whose support is less than minsup[1]
ItemSet Support
{1**, 2**} 4
Association Rules
Mining Multilevel Associations
Level 1: (minsupp[1] = 4)
Filtered Transactional Table
TID Items
T1 {111, 121, 211, 221}
T2 {111, 211, 222}
T3 {112, 122, 221}
T4 {111, 121}
T5 {111, 122, 211, 221}
T6 {211}
Association Rules
Mining Multilevel Associations
Level 2: (minsupp[2] = 3)
• Level 2 Frequent 1 item set L[2, 1]
ItemSet Support
{11*} 5
{12*} 4
{13*} 0
{21*} 4
{22*} 4
Association Rules
Mining Multilevel Associations
Level 2: (minsupp[2] = 3)
• Level 2 Frequent 2 item set L[2, 2]
ItemSet Support
{11*, 12*} 4
{11*, 21*} 3
{11*, 22*} 4
{12*, 21*} 2
{12*, 22*} 3
{21*, 22*} 3
Association Rules
Mining Multilevel Associations
Level 2: (minsupp[2] = 3)
• Level 2 Frequent 3 item set: L[2, 3]
ItemSet Support
{11*, 12*, 21*} 2
{11*, 12*, 22*} 3
{11*, 21*, 22*} 3
Association Rules
Mining Multilevel Associations
Level 3: (minsupp[3] = 3)
• Level 3 Frequent 1 item set L[3, 1]
ItemSet Support
111 4
112 1
121 2
122 2
211 4
221 3
222 1
Association Rules
Mining Multilevel Associations
Level 3: (minsupp[3] = 3)
• Level 3 Frequent 2 item set L[3, 2]
We will stop at this level and generate the association rules from the
frequent itemsets (a short counting sketch follows the table below):
111 => 211 {support = 0.43, confidence = 0.75}
211 => 111 {support = 0.43, confidence = 0.75}
ItemSet Support
{111, 211} 3
{111, 221} 2
{211, 221} 2
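The supports in the tables above can be checked with a small sketch that matches encoded items against level patterns such as "1**", "11*", or "111"; the helper names are hypothetical, and the counts are taken over the original 7 transactions (which is why support(111 => 211) = 3/7 ≈ 0.43).

# Sketch: support counting for prefix-encoded itemsets.
transactions = [
    {"111", "121", "211", "221"},
    {"111", "211", "222", "323"},
    {"112", "122", "221", "411"},
    {"111", "121"},
    {"111", "122", "211", "221", "413"},
    {"211", "323", "524"},
    {"323", "411", "524", "713"},
]

def matches(pattern, item):
    # "11*" matches any item code starting with "11"; "*" matches any digit.
    return all(p == "*" or p == c for p, c in zip(pattern, item))

def support_count(itemset):
    # Number of transactions containing a match for every pattern in the itemset.
    return sum(1 for t in transactions
               if all(any(matches(p, i) for i in t) for p in itemset))

print(support_count({"1**"}))                                  # 5  (L[1, 1])
print(support_count({"11*", "21*"}))                           # 3  (L[2, 2])
print(support_count({"111", "211"}))                           # 3  (L[3, 2])
print(round(support_count({"111", "211"}) / 7, 2))             # support ≈ 0.43
print(support_count({"111", "211"}) / support_count({"111"}))  # confidence 0.75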
Association Rules
Mining Multilevel Associations
Low Level Minsupp:
There are two approaches to determining the minimum support at lower
abstraction levels:
Uniform support: The same minimum support threshold is used when
mining at each abstraction level.
The method is simple in that users are required to specify only one
minimum support threshold.
If the minimum support threshold is set too high, it could miss some
meaningful associations occurring at low abstraction levels.
If the threshold is set too low, it may generate many uninteresting
associations occurring at high abstraction levels.
Association Rules
Mining Multilevel Associations
Low Level Minsupp:
There are two approaches to determining the minimum support at lower
abstraction levels:
Reduced support: Each abstraction level has its own minimum support
threshold. The deeper the abstraction level, the smaller the
corresponding threshold.
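A minimal sketch of the reduced-support idea, using the per-level thresholds from the worked example (the dictionary and function name are only illustrative):

# Reduced support: one minimum support count per abstraction level.
minsup = {1: 4, 2: 3, 3: 3}

def is_frequent(level, support_count):
    return support_count >= minsup[level]

print(is_frequent(1, 3))  # False: a count of 3 is rejected at level 1
print(is_frequent(2, 3))  # True:  the same count is accepted at level 2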
Data Mining
Eng. Mahmoud Ouf
Lecture 02
mmouf@2017
Association Rules
Mining Multidimensional Associations
The data may be in a multidimensional form or in a data warehouse, rather
than 2-dimensional or multilevel.
In multidimensional databases, we refer to each distinct predicate in a rule as
a dimension.
Ex: buys(X, “digital camera”) => buys(X, “HP Printer”)
We can refer to it as a single-dimensional or intradimensional association rule
(a single distinct predicate, buys, with multiple occurrences).
In a multidimensional data representation, in addition to keeping track of the
items purchased in sales transactions, a relational database may record other
attributes associated with the items and/or transactions, such as the item
description or the branch location of the sale.
Ex: age(X, “20..29”) Ʌoccupation(X, “Student”)=>buys(X, “laptop”)
Association Rules
Mining Multidimensional Associations
Ex: age(X, “20..29”) Ʌoccupation(X, “Student”)=>buys(X, “laptop”)
This is referred to as a multidimensional association rule; it contains three
predicates (age, occupation, and buys), each of which occurs only once
in the rule (no repeated predicates).
Multidimensional association rules with no repeated predicates are
called interdimensional association rules
Ex: age(X, “20..29”) Ʌ buys(X, “laptop”) =>buys(X, “HP Printer”)
Multidimensional association rules with repeated predicates, which
contain multiple occurrences of some predicates (here buys is repeated),
are called hybrid-dimensional association rules.
Clustering:
Clustering is the process of grouping a set of data objects into multiple
groups or clusters so that objects within a cluster have high similarity,
but are very dissimilar to objects in other clusters.
Cluster analysis or simply clustering is the process of partitioning a set
of data objects (or observations) into subsets. Each subset is a cluster,
such that objects in a cluster are similar to one another, yet dissimilar to
objects in other clusters.
Different clustering methods may generate different clustering on the
same data set.
Clustering is useful in that it can lead to the discovery of previously
unknown groups within the data.
mmouf@2017
Clustering:
Cluster analysis can be used as a standalone tool to gain insight into the
distribution of data, to observe the characteristics of each cluster, and to
focus on a particular set of clusters for further analysis.
It may serve as a preprocessing step for other algorithms, such as
characterization, attribute subset selection, and classification, which
would then operate on the detected clusters and the selected attributes or
features.
mmouf@2017
Clustering:
k-means cluster
k-means is one of the simplest unsupervised learning algorithms that
solve the well known clustering problem.
The procedure follows a simple and easy way to classify a given data set
through a certain number of clusters (assume k clusters)
The main idea is to define k centers, one for each cluster.
These centers should be placed in a cunning way, because different
locations cause different results.
So, the better choice is to place them as far away from each other as
possible.
The next step is to take each point belonging to a given data set and
associate it to the nearest center.
mmouf@2017
Clustering:
k-means cluster
When no point is pending, the first step is completed and an early grouping
is done.
At this point we need to re-calculate k new centroids as the barycenters of
the clusters resulting from the previous step.
After we have these k new centroids, a new binding has to be done
between the same data set points and the nearest new center.
A loop has been generated.
As a result of this loop we may notice that the k centers change their
location step by step until no more changes are done or in other words
centers do not move any more.
mmouf@2017
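The loop described above can be written as a short Python sketch (an illustration with hypothetical names, not a production implementation); it alternates assigning each point to the nearest center with recomputing the cluster means, until the centers stop moving.

import math

def kmeans(points, centers):
    # points and centers are lists of (x, y) tuples; len(centers) == k.
    while True:
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            d = [math.dist(p, c) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Update step: move each center to the mean (barycenter) of its cluster.
        new_centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:   # centers no longer move: stop
            return centers, clusters
        centers = new_centers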
Clustering:
k-means cluster
Example:
We have the following 5 points and we want to group them in 2 clusters:
mmouf@2017
i X Y
A 1 1
B 1 0
C 0 2
D 2 4
E 3 5
Clustering:
k-means cluster
Solution:
Choose 2 points to be the initial centers of the clusters (selected randomly):
"A" and "C"
Step1: Calculate the distance between each point and the 2 selected
points
length = √((X1 − X2)² + (Y1 − Y2)²)
mmouf@2017
i A (Cluster 1) C (Cluster 2)
A 0 1.4
B 1 2.2
C 1.4 0
D 3.2 2.8
E 4.5 4.2
Clustering:
k-means cluster
Compare the distance between each point and the 2 selected centers.
Each point will belong to the cluster which has the smallest distance to it.
Point B belongs to the cluster of point "A" (1 less than 2.2)
Point D belongs to the cluster of point "C" (2.8 less than 3.2)
Point E belongs to the cluster of point "C" (4.2 less than 4.5)
mmouf@2017
i X Y Cluster
A 1 1 1
B 1 0 1
C 0 2 2
D 2 4 2
E 3 5 2
Clustering:
k-means cluster
Calculate the mean of Cluster 1:
X = (1 + 1) / 2 = 1
Y = (1 + 0) / 2 = 0.5
Mean Cluster1 (1, 0.5)
Calculate the mean of Cluster 2:
X = (0 + 2 + 3) / 3 = 1.7
Y = (2 + 4 + 5) / 3 = 3.7
Mean Cluster2 (1.7, 3.7)
mmouf@2017
Clustering:
k-means cluster
Step2: Recalculate the distance from each point to the cluster means
Compare the distance between each point and the 2 cluster means. Each
point will belong to the cluster which has the smallest distance to it.
Point A belongs to Cluster 1 (0.5 less than 2.7)
Point B belongs to Cluster 1 (0.5 less than 3.7)
mmouf@2017
I Cluster 1 Cluster 2
A 0.5 2.7
B 0.5 3.7
C 1.8 2.4
D 3.6 0.5
E 4.9 1.9
Clustering:
k-means cluster
Point C belongs to Cluster 1 (1.8 less than 2.4)
Point D belongs to Cluster 2 (0.5 less than 3.6)
Point E belongs to Cluster 2 (1.9 less than 4.9)
mmouf@2017
i X Y Cluster
A 1 1 1
B 1 0 1
C 0 2 1
D 2 4 2
E 3 5 2
Clustering:
k-means cluster
Calculate the mean of Cluster 1: (0.7, 1)
Calculate the mean of Cluster 2: (2.5, 4.5)
Step3: Recalculate the distance from each point to the cluster means. In
this example we will find no change, so it is the final solution
mmouf@2017
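As an illustrative check, running the kmeans sketch from the earlier slide on these 5 points, starting from A and C as the initial centers, reproduces the result obtained by hand:

points = [(1, 1), (1, 0), (0, 2), (2, 4), (3, 5)]            # A, B, C, D, E
centers, clusters = kmeans(points, centers=[(1, 1), (0, 2)])  # start at A and C
print(centers)   # approximately (0.67, 1) and (2.5, 4.5)
print(clusters)  # [A, B, C] and [D, E]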
Predictive Data mining
• The purpose of a predictive mining model is mainly to predict a future
outcome rather than describe current behavior.
• The prediction output can be a numeric value or in categorical form.
The predictive models are supervised learning functions which predict the
target value.
mmouf@2017
Classification:
Among predictive data mining techniques, the classification model is
considered the best-understood of all data mining approaches.
The common characteristics of classification tasks are: supervised
learning, a categorical dependent variable, and assigning new data to one of
a set of well-defined classes.
Classification is used in customer segmentation, modeling
businesses, credit analysis, and many other applications.
In a classification technique, you typically have historical data called
labeled examples and new examples. Each labeled example consists of
multiple predictor attributes and one target attribute that is a class label.
mmouf@2017
Classification:
The goal of classification is to construct a model using the historical data
that accurately predicts the class of new examples.
A classification task begins with build data (also known as training data)
for which the target values are known.
Different classification algorithms use different techniques for finding
relations between the predictor attribute values and the target values in the
build data.
These relations are summarized in a model, which can then be applied to
new cases with unknown target values in order to predict those target
values.
mmouf@2017
Classification:
Classification is grouping the data into classes, where each class contains
similar data.
It is supervised learning; this means that a part of the available data
(whose class is known, the "labeled data") is used for training.
Then we use the second part of the data for testing the classifier.
Example:
An "unlabeled" record such as Age: 56, Income: 45K is passed to the
classifier, which assigns it to the class Budget Spender.
mmouf@2017
Training Data
Age Income Class label
27 28K Budget Spender
35 36K Big Spender
65 45K Budget Spender
Classification:
Classification Steps:
Classification passes through 2 steps:
Step1: Model Construction (Learning step, training step)
Step2: Model Usage
Before using the model, we need to test its accuracy.
We have some data acting as test data (labeled data); we pass them to the
classifier model and compare its results with the labels we already have.
The accuracy rate is the percentage of test set samples that are correctly
classified by the model.
If the accuracy rate is acceptable, then we use the model for "unlabeled
data".
mmouf@2017
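The accuracy test described above could look like the following sketch; the toy classifier and the labeled test set are purely hypothetical.

def accuracy(classifier, test_data):
    # Accuracy rate: fraction of labeled test samples classified correctly.
    correct = sum(1 for features, label in test_data
                  if classifier(features) == label)
    return correct / len(test_data)

# Hypothetical rule-based classifier and labeled test set.
toy_classifier = lambda f: "Budget Spender" if f["Income"] < 40_000 else "Big Spender"
test_data = [({"Age": 27, "Income": 28_000}, "Budget Spender"),
             ({"Age": 35, "Income": 36_000}, "Big Spender")]
print(accuracy(toy_classifier, test_data))  # 0.5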
Classification:
Decision Tree
A decision tree is a structure that includes a root node, branches, and
leaf nodes.
Each internal node denotes a test on an attribute, each branch denotes
the outcome of a test, and each leaf node holds a class label.
The topmost node in the tree is the root node.
The following decision tree is for the concept buy_computer that
indicates whether a customer at a company is likely to buy a computer
or not.
Each internal node represents a test on an attribute.
Each leaf node represents a class.
mmouf@2017
Classification:
Decision Tree
mmouf@2017
[Decision tree figure: the root node tests Age with branches youth, middle, and senior; the youth branch tests Student, the senior branch tests Credit rating, the middle branch is a Yes leaf, and the remaining leaves hold the class labels Yes/No.]
Classification:
Decision Tree
Why are decision tree classifiers so popular?
• The construction of a decision tree does not require any domain
knowledge or parameter setting
• They can handle high dimensional data
• Intuitive representation that is easily understood by humans
• Learning and classification are simple and fast
• They have a good accuracy
mmouf@2017
Classification:
Decision Tree
Example:
Draw the Decision Tree
mmouf@2017
RID Age Student Credit-rating Class:buyComputer
1 Youth Yes Fair Yes
2 Youth Yes Fair Yes
3 Youth Yes Fair No
4 Youth No Fair No
5 Middle No Excellent Yes
6 Senior Yes Fair No
7 Senior Yes Excellent Yes
Classification:
Decision Tree
Solution:
Start with the Age
mmouf@2017
Root: Age
Youth branch: RID 1 Yes, RID 2 Yes, RID 3 No, RID 4 No
Middle branch: RID 5 Yes
Senior branch: RID 6 No, RID 7 Yes
Classification:
Decision Tree
Solution:
The Middle branch has only one record; this means all middle-aged
customers will get the decision of buying a computer.
mmouf@2017
Root: Age
Youth branch: RID 1 Yes, RID 2 Yes, RID 3 No, RID 4 No
Middle branch: leaf Yes
Senior branch: RID 6 No, RID 7 Yes
Classification:
Decision Tree
Solution:
The Youth branch has part of its records buying and part not buying.
So, we need another attribute. Here, the 4 records all have the same
credit rating, so it is not discriminative, while they differ in the Student
attribute.
The same holds for the Senior branch, but here Student is not discriminative,
while the Credit-rating changes.
When the attribute Age = youth and the attribute Student = yes, there are
3 records (2 Yes and 1 No), so we would need another attribute that differs
across these 3 records; since there is none, we assign the leaf to the majority class of the records.
mmouf@2017
Classification:
Decision Tree
Solution:
mmouf@2017
[Final decision tree: root Age; the youth branch tests Student (no => No, yes => Yes), the middle branch is a Yes leaf, and the senior branch tests Credit-Rating (fair => No, excellent => Yes).]
Classification:
Decision Tree
Solution:
Usage: Find the class of the following data (a small code version of this
tree follows the table).
Start with the root (age = youth), go to the youth branch, then (student =
no), so this record will be assigned to class No.
mmouf@2017
RID age student Credit-rating Class:buyComputer
8 youth no fair ?
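For illustration only, the final tree from the previous slides can be written as a tiny function; applying it to RID 8 gives the class No, matching the walk-through above.

def buy_computer(age, student, credit_rating):
    # Hand-coded version of the decision tree built in this example.
    if age == "youth":
        return "Yes" if student == "yes" else "No"
    if age == "middle":
        return "Yes"
    # senior branch: decided by the credit rating
    return "Yes" if credit_rating == "excellent" else "No"

print(buy_computer(age="youth", student="no", credit_rating="fair"))  # No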
Classification:
Naïve Bayes
The Bayes classifier is based on the Bayes theorem for conditional
probabilities.
This theorem quantifies the conditional probability of a random variable
(class variable), given known observations about the value of another
set of random variables (feature variables).
The Bayes theorem is used widely in probability and statistics.
In a Bayesian classifier, the learning agent builds a probabilistic model
of the features and uses that model to predict the classification of a new
example.
mmouf@2017
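In symbols, the classifier picks the class Ci with the largest posterior probability; since P(X) is the same for every class, it is enough to maximize P(X|Ci) P(Ci), and the naïve independence assumption factorizes the likelihood over the attributes (standard form, stated here in LaTeX for reference):

P(C_i \mid X) = \frac{P(X \mid C_i)\, P(C_i)}{P(X)}, \qquad P(X \mid C_i) \approx \prod_{k=1}^{n} P(x_k \mid C_i)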
Classification:
Naïve Bayes
Example:
Tuple to Classify is:
X(age = youth, income = medium, student = yes, credit = fair), Maximize P(X|Ci) P(Ci)
mmouf@2017
RID age Income student Credit-rating Class: buyComputer
1 Youth High No fair No
2 Youth High No Excellent No
3 Middle High No Fair Yes
4 Senior Medium No Fair Yes
5 Senior Low Yes Fair Yes
6 Senior Low Yes Excellent No
7 Middle Low Yes Excellent Yes
8 Youth Medium No Fair No
9 Youth Low Yes Fair Yes
10 Senior Medium Yes Fair Yes
11 Youth Medium Yes Excellent Yes
12 Middle Medium No Excellent Yes
13 Middle High Yes Fair Yes
14 senior Medium no Excellent No
Classification:
Naïve Bayes
Solution:
Step 1: P(Ci)
P(buyComputer = Yes) = number of “Yes” / Total number
= 9 / 14 = 0.643
P(buyComputer = No) = number of “No” / Total number
= 5 / 14 = 0.357
Step 2: P(X|Ci)
Calculate the probability of X for each class. Here we do not compute the
whole X at once; we compute the probability of each attribute value given each class.
P(age = youth | buyComputer = yes) = 2 / 9 = 0.222
P(age = youth | buyComputer = no) = 3 / 5 = 0.600
mmouf@2017
Classification:
Naïve Bayes
Solution:
P(income=medium|buys_computer=yes) = 4 / 9 = 0.444
P(income=medium|buys_computer=no) = 2 / 5 = 0.400
P(student=yes|buys_computer=yes) = 6 / 9 = 0.667
P(student=yes|buys_computer=no) = 1 / 5 = 0.200
P(credit_rating=fair|buys_computer=yes) = 6 / 9 = 0.667
P(credit_rating=fair|buys_computer=no) = 2 / 5 = 0.400
mmouf@2017
Classification:
Naïve Bayes
Solution:
P(X | buyComputer = Yes) = P(age=youth|buys_computer=yes) *
P(income=medium|buys_computer=yes)*
P(student=yes|buys_computer=yes)*
P(credit_rating=fair|buys_computer=yes)
= 0.044
P(X | buyComputer = No) = P(age=youth|buys_computer=No) *
P(income=medium|buys_computer=No)*
P(student=yes|buys_computer=No)*
P(credit_rating=fair|buys_computer=No)
= 0.019
mmouf@2017
Classification:
Naïve Bayes
Solution:
Step 3: P(X|Ci) P(Ci)
P(X|buys_computer=yes)P(buys_computer=yes) = 0.044 * 0.643
= 0.028
P(X|buys_computer=no)P(buys_computer=no) = 0.019 * 0.357
= 0.007
The naïve Bayesian Classifier predicts buys_computer=yes for tuple
X
mmouf@2017
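The whole worked example can be reproduced with a short sketch (illustrative only; the tuple layout follows the training table on the earlier slide):

# Naive Bayes by direct counting on the 14-row training table.
data = [
    ("youth","high","no","fair","no"),         ("youth","high","no","excellent","no"),
    ("middle","high","no","fair","yes"),       ("senior","medium","no","fair","yes"),
    ("senior","low","yes","fair","yes"),       ("senior","low","yes","excellent","no"),
    ("middle","low","yes","excellent","yes"),  ("youth","medium","no","fair","no"),
    ("youth","low","yes","fair","yes"),        ("senior","medium","yes","fair","yes"),
    ("youth","medium","yes","excellent","yes"),("middle","medium","no","excellent","yes"),
    ("middle","high","yes","fair","yes"),      ("senior","medium","no","excellent","no"),
]
X = ("youth", "medium", "yes", "fair")     # age, income, student, credit-rating

def score(cls):
    rows = [r for r in data if r[-1] == cls]
    prior = len(rows) / len(data)                      # P(Ci)
    likelihood = 1.0
    for k, value in enumerate(X):                      # naive independence: product of P(xk | Ci)
        likelihood *= sum(1 for r in rows if r[k] == value) / len(rows)
    return prior * likelihood

print(round(score("yes"), 3), round(score("no"), 3))   # 0.028 0.007  => predict yes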