Base paper Title: A Comparative Analysis of Sampling Techniques for Click-Through Rate
Prediction in Native Advertising
Abstract
Native advertising is a popular form of online advertising that matches the style and
function of the native content displayed on online platforms, such as news, sports and social
websites. It captures users' attention more effectively and has gained increasing popularity on
many online platforms and among advertisers. In advertising, Click-Through Rate (CTR)
prediction is essential but challenging due to data sparsity: the non-clicks constitute most of
the data, whereas clicks form a significantly smaller portion. The performance of 19 class
imbalance approaches is compared in this study with the use of four traditional classifiers, to
determine the most effective imbalance methods for our native ads dataset. The data used is
real traffic data from Finland over the course of seven days provided by the native advertising
platform ReadPeak. The resampling methods used include seven undersampling techniques,
four oversampling techniques, four hybrid sampling techniques, and four ensemble systems.
The findings demonstrate that class imbalance learning can enhance the model’s capacity for
classification by as much as 20%. In general, oversampling is comparatively more stable, but
undersampling performed best with Random Forest. Our study also demonstrates that the
imbalance ratio plays an important role in the performance of the model and in feature
importance.
Existing System
In a class-imbalanced dataset, one class contains significantly fewer samples
than the other [3]. The difficulties of learning from such imbalanced data are inevitable.
Standard learning classifiers are biased toward the majority class due to the skewed distribution
of the training samples, making them unable to recognize unusual occurrences. It is possible to
mistake noise for rare minority samples and vice versa [24]. This kind of imbalance
is particularly prevalent in the advertising industry. The dataset contains far more
non-clicked advertisements than clicked ones, and the difference between the two is typically
significant. To solve these issues, researchers have created numerous class imbalance
techniques and performance evaluation criteria that play an important role in this paper. In this
context, we implemented several class imbalance methods on a real-world class imbalanced
dataset provided by the native advertising platform ReadPeak [1]. The dataset used contains
information about the in-screen advertisements and whether they have been clicked or not.
According to the dataset, there is an extreme imbalance between clicks and non-clicks, with
250 non-clicks for every click. Advertising is a crucial part of corporate operations.
In 2021, advertisers are estimated to have spent 118.72 billion dollars on display advertising
[52]. The cost-per-click pricing model of demand-side platforms (DSPs) makes advertisers' earnings
directly correlated to the number of clicks. Predicting performance indicators such as the
click-through rate (CTR) is crucial for DSP research [35]. The available literature presents valid
solutions to the problem at hand, but it has neglected the data imbalance specifically and the
benefits that addressing it can bring in terms of training speed and prediction accuracy.
In that spirit, the purpose of the current work is to explore the effect of sampling techniques on
prediction accuracy with the combination of a number of classical machine learning models
applied to the prediction of CTR in native advertisement. The contributions of this paper are
summarized as follows:
• The performance of 19 class imbalance methods with 4 classical classifiers is evaluated.
The best performing methods are identified based on the AUC and LogLoss metrics.
• An analysis of the unbalanced data problem was conducted using real-world data from
ReadPeak, including a complete outline of the steps for feature engineering, selection,
and data cleaning.
• Through experimentation, it was found that feature importance varies with different
imbalance ratios. The Boruta algorithm was used to compute feature importance at three
levels of sampling, highlighting the importance of balancing the data for accurate results.
• Using Random Forest and random undersampling, it is shown that as the balance between
positive and negative samples is improved, the performance also improves. However, there
is a threshold of imbalance ratio beyond which the performance improvement becomes
marginal.
• We give the community access to a real-world dataset that can help researchers study
the phenomenon and find solutions to CTR prediction in the native advertisement field.
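The imbalance ratio and the random undersampling referred to above can be illustrated with a minimal sketch. It uses only the Python standard library and synthetic labels; the function names `imbalance_ratio` and `random_undersample` and the toy data are assumptions made for illustration, not the authors' implementation.

```python
import random
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_undersample(rows, labels, target_ratio, seed=0):
    """Randomly drop non-click rows until majority/minority <= target_ratio.

    Assumes binary labels where 1 (click) is the minority class.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep_n = min(len(majority), int(target_ratio * len(minority)))
    kept = sorted(minority + rng.sample(majority, keep_n))
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Synthetic labels mirroring the reported 250:1 ratio (not the ReadPeak data).
labels = [1] * 4 + [0] * 1000
rows = [[i] for i in range(len(labels))]

print(imbalance_ratio(labels))  # 250.0
for ratio in (50, 10, 1):
    _, y = random_undersample(rows, labels, ratio)
    print(ratio, Counter(y)[0], Counter(y)[1])
```

Sweeping `target_ratio` in this way is one simple route to studying how performance changes with the imbalance ratio; only the training split should be resampled, so that the test split keeps the original class distribution.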
Drawbacks in Existing System
• Data Representation and Generalization:
Drawback: The choice of a particular sampling technique may lead to biases in the
representation of the dataset. Oversampling or undersampling may impact the generalization
of the models to real-world scenarios, especially when the dataset is not fully representative of
the target population.
• Impact on Model Complexity:
Drawback: Some sampling techniques may inadvertently impact the complexity of the
predictive models. For instance, oversampling may lead to overfitting, especially when dealing
with small datasets, while undersampling might result in loss of valuable information.
• Limited Transparency:
Drawback: Some advanced sampling techniques, especially those involving synthetic
data generation, may lack transparency in terms of how they modify the underlying data
distribution. This can make it challenging to interpret the model's behavior and
decision-making processes.
• Impact on Evaluation Metrics:
Drawback: The choice of sampling technique may influence the performance metrics
used for evaluation. For instance, oversampling may artificially boost metrics like accuracy
but may not necessarily improve the model's ability to generalize to new, unseen data.
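A small worked example shows why accuracy alone can mislead on data this skewed. The sketch below assumes synthetic labels at the 250:1 ratio reported for the dataset and a hypothetical baseline that always predicts "no click"; the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true clicks (label 1) that the model recovers."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Synthetic labels: 4 clicks and 1000 non-clicks (250:1).
y_true = [1] * 4 + [0] * 1000
always_no_click = [0] * len(y_true)  # hypothetical majority-class baseline

print(round(accuracy(y_true, always_no_click), 4))  # 0.996
print(recall(y_true, always_no_click))              # 0.0
```

The baseline scores about 99.6% accuracy while finding no clicks at all, which is why ranking and probability metrics such as AUC and LogLoss are usually preferred for this kind of comparison.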
Proposed System
• Sampling Techniques:
Implement and analyze various sampling techniques, including but not limited to
random sampling, stratified sampling, undersampling, oversampling, and hybrid methods.
Investigate advanced techniques such as SMOTE (Synthetic Minority Over-sampling
Technique) and ADASYN (Adaptive Synthetic Sampling).
• Predictive Models:
Employ state-of-the-art machine learning models for click-through rate prediction,
including logistic regression, decision trees, random forests, and gradient boosting
machines.
Utilize deep learning models, such as neural networks, to assess the impact of sampling
techniques on more complex predictive architectures.
• Evaluation Metrics:
Measure the performance of each sampling technique using standard evaluation metrics
such as precision, recall, F1 score, and area under the receiver operating characteristic curve
(AUC-ROC).
Assess the impact on model calibration and reliability.
• Comparison and Recommendations:
Present a comparative analysis of the various sampling techniques, highlighting their
strengths, weaknesses, and suitability for different scenarios.
Offer recommendations for selecting an appropriate sampling strategy based on specific
campaign characteristics and data distributions.
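The evaluation metrics listed above can be computed as in the following sketch. It is a plain-Python illustration on made-up scores, not the project's evaluation code; the function names and the 0.5 decision threshold are assumptions. LogLoss is included because it is sensitive to how well the predicted probabilities are calibrated.

```python
import math


def precision_recall_f1(y_true, probs, threshold=0.5):
    """Precision, recall and F1 for the positive (click) class."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * rec / (precision + rec) if (precision + rec) else 0.0
    return precision, rec, f1


def roc_auc(y_true, scores):
    """AUC as the probability that a random click outranks a random non-click."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


# Made-up labels and predicted click probabilities.
y_true = [0, 0, 1, 0, 1, 0]
probs = [0.1, 0.4, 0.35, 0.2, 0.8, 0.7]

print(precision_recall_f1(y_true, probs))  # (0.5, 0.5, 0.5)
print(roc_auc(y_true, probs))              # 0.75
print(log_loss(y_true, probs))
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for a demonstration but would need a rank-based formulation on traffic-scale data.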
Algorithm
• Stratified Sampling:
Algorithm: Stratified sampling based on target classes.
Description: Divides the dataset into strata (subgroups) based on target classes and
samples proportionally from each stratum, ensuring representation across classes.
• ADASYN (Adaptive Synthetic Sampling):
Algorithm: ADASYN algorithm.
Description: An adaptive extension of SMOTE that focuses on generating synthetic
instances for regions of the minority class that are harder to learn.
• Cluster-Based Sampling:
Algorithm: K-means clustering for cluster-based sampling.
Description: Groups instances into clusters and samples proportionally from each
cluster, facilitating localized sampling.
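Two of the algorithms above can be sketched in a few lines of standard-library Python. The first function performs stratified sampling by class; the second is a simplified ADASYN-style generator, not the reference algorithm: it weights each minority point by the share of majority points among its k nearest neighbours and interpolates towards a nearby minority point. All names, the toy coordinates, and the rounding of per-point counts are assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Return sorted indices that sample the same fraction from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    chosen = []
    for idx in by_class.values():
        chosen.extend(rng.sample(idx, max(1, round(fraction * len(idx)))))
    return sorted(chosen)


def adasyn_like(X, y, k=3, seed=0):
    """Simplified ADASYN-style oversampling for binary labels (1 = minority).

    Needs at least two minority points. Per-point counts are rounded, so the
    total may differ slightly from the majority/minority gap.
    """
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(y) if label == 1]
    n_new = (len(y) - len(minority)) - len(minority)
    hardness = []
    for i in minority:
        nbrs = sorted((j for j in range(len(X)) if j != i),
                      key=lambda j: math.dist(X[i], X[j]))[:k]
        hardness.append(sum(1 for j in nbrs if y[j] == 0) / k)
    total = sum(hardness)
    weights = ([h / total for h in hardness] if total
               else [1 / len(minority)] * len(minority))
    synthetic = []
    for i, w in zip(minority, weights):
        # Candidate partners: the k nearest minority neighbours of point i.
        partners = sorted((m for m in minority if m != i),
                          key=lambda m: math.dist(X[i], X[m]))[:k]
        for _ in range(round(w * n_new)):
            j = rng.choice(partners)
            gap = rng.random()
            synthetic.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
    return synthetic


# Toy 2-D data: three clicks (label 1), seven non-clicks (label 0).
X = [[0, 0], [0, 1], [5, 5], [5, 6], [6, 5], [4, 5], [6, 6], [9, 9], [9, 8], [8, 9]]
y = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

print(len(stratified_sample([1] * 10 + [0] * 90, 0.2)))  # 20
print(len(adasyn_like(X, y)))                            # 4
```

In this toy set the click at (5, 5) is surrounded by non-clicks, so it receives the largest weight and generates the most synthetic points. That is the adaptive behaviour that distinguishes ADASYN from plain SMOTE.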
Advantages
• Addressing Class Imbalance:
Advantage: Class imbalance is a common issue in CTR prediction, where the number
of non-click instances often significantly outweighs click instances. Sampling techniques,
such as oversampling or undersampling, help balance class distribution, ensuring that the
model is not biased towards the majority class.
• Enhanced Generalization to Unseen Data:
Advantage: Effective sampling techniques contribute to the generalization of
predictive models to new, unseen data. This is crucial in native advertising where the goal
is to predict CTR for new ad campaigns.
• Mitigation of Bias and Unfairness:
Advantage: Carefully chosen sampling techniques can help mitigate biases in predictive
models, ensuring fair representation and treatment of different classes. This is essential for
ethical considerations in advertising.
• Increased Interpretability:
Advantage: Certain sampling techniques, like undersampling or clustering-based
sampling, may lead to a reduction in the number of instances, making it easier to interpret
model decisions. Enhanced interpretability is valuable for stakeholders who need to
understand the factors influencing CTR predictions.
Hardware Specification
• Processor : Intel Core i3
• RAM : 4 GB
• Hard disk : 500 GB
Software Specification
• Operating System : Windows 10 / 11
• Front End : Python
• Back End : MySQL Server
• IDE Tools : PyCharm

Contenu connexe

Similaire à A Comparative Analysis of Sampling Techniques for Click-Through Rate Prediction in Native Advertising.

direct marketing in banking using data mining
direct marketing in banking using data miningdirect marketing in banking using data mining
direct marketing in banking using data miningHossein Malekinezhad
 
Analysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through ApplicationAnalysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through Applicationaciijournal
 
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATIONANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATIONaciijournal
 
Causal Attribution - Proposing a better industry standard for measuring digit...
Causal Attribution - Proposing a better industry standard for measuring digit...Causal Attribution - Proposing a better industry standard for measuring digit...
Causal Attribution - Proposing a better industry standard for measuring digit...Peter Weingard
 
Ieee transactions on 2018 knowledge and data engineering topics with abstract .
Ieee transactions on 2018 knowledge and data engineering topics with abstract .Ieee transactions on 2018 knowledge and data engineering topics with abstract .
Ieee transactions on 2018 knowledge and data engineering topics with abstract .tsysglobalsolutions
 
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRY
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRYPROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRY
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRYIJITCA Journal
 
Analysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through ApplicationAnalysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through Applicationaciijournal
 
Data Mining on Customer Churn Classification
Data Mining on Customer Churn ClassificationData Mining on Customer Churn Classification
Data Mining on Customer Churn ClassificationKaushik Rajan
 
Customer churn classification using machine learning techniques
Customer churn classification using machine learning techniquesCustomer churn classification using machine learning techniques
Customer churn classification using machine learning techniquesSindhujanDhayalan
 
Driving Results through Strategic Data Sourcing and Optimization: Life Line G...
Driving Results through Strategic Data Sourcing and Optimization: Life Line G...Driving Results through Strategic Data Sourcing and Optimization: Life Line G...
Driving Results through Strategic Data Sourcing and Optimization: Life Line G...Vivastream
 
Campaign response modeling
Campaign response modelingCampaign response modeling
Campaign response modelingEsteban Ribero
 
Top 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptxTop 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptxAnanthReddy38
 
Data Science Meetup 2017
Data Science Meetup 2017Data Science Meetup 2017
Data Science Meetup 2017Adello
 
Machine learning - session 4
Machine learning - session 4Machine learning - session 4
Machine learning - session 4Luis Borbon
 
Simply Data driven behavioural algorithms
Simply Data driven behavioural algorithmsSimply Data driven behavioural algorithms
Simply Data driven behavioural algorithmsNana Bianca
 
Dwdm chapter 5 data mining a closer look
Dwdm chapter 5  data mining a closer lookDwdm chapter 5  data mining a closer look
Dwdm chapter 5 data mining a closer lookShengyou Lin
 
Top 20 Data Science Interview Questions and Answers in 2023.pdf
Top 20 Data Science Interview Questions and Answers in 2023.pdfTop 20 Data Science Interview Questions and Answers in 2023.pdf
Top 20 Data Science Interview Questions and Answers in 2023.pdfAnanthReddy38
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptxVishalLabde
 

Similaire à A Comparative Analysis of Sampling Techniques for Click-Through Rate Prediction in Native Advertising. (20)

direct marketing in banking using data mining
direct marketing in banking using data miningdirect marketing in banking using data mining
direct marketing in banking using data mining
 
Analysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through ApplicationAnalysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through Application
 
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATIONANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
ANALYSIS OF COMMON SUPERVISED LEARNING ALGORITHMS THROUGH APPLICATION
 
Causal Attribution - Proposing a better industry standard for measuring digit...
Causal Attribution - Proposing a better industry standard for measuring digit...Causal Attribution - Proposing a better industry standard for measuring digit...
Causal Attribution - Proposing a better industry standard for measuring digit...
 
Ieee transactions on 2018 knowledge and data engineering topics with abstract .
Ieee transactions on 2018 knowledge and data engineering topics with abstract .Ieee transactions on 2018 knowledge and data engineering topics with abstract .
Ieee transactions on 2018 knowledge and data engineering topics with abstract .
 
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRY
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRYPROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRY
PROVIDING A METHOD FOR DETERMINING THE INDEX OF CUSTOMER CHURN IN INDUSTRY
 
Analysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through ApplicationAnalysis of Common Supervised Learning Algorithms Through Application
Analysis of Common Supervised Learning Algorithms Through Application
 
Data Mining on Customer Churn Classification
Data Mining on Customer Churn ClassificationData Mining on Customer Churn Classification
Data Mining on Customer Churn Classification
 
Customer churn classification using machine learning techniques
Customer churn classification using machine learning techniquesCustomer churn classification using machine learning techniques
Customer churn classification using machine learning techniques
 
Driving Results through Strategic Data Sourcing and Optimization: Life Line G...
Driving Results through Strategic Data Sourcing and Optimization: Life Line G...Driving Results through Strategic Data Sourcing and Optimization: Life Line G...
Driving Results through Strategic Data Sourcing and Optimization: Life Line G...
 
Campaign response modeling
Campaign response modelingCampaign response modeling
Campaign response modeling
 
Top 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptxTop 20 Data Science Interview Questions and Answers in 2023.pptx
Top 20 Data Science Interview Questions and Answers in 2023.pptx
 
Data Science Meetup 2017
Data Science Meetup 2017Data Science Meetup 2017
Data Science Meetup 2017
 
Manuscript dss
Manuscript dssManuscript dss
Manuscript dss
 
Machine learning - session 4
Machine learning - session 4Machine learning - session 4
Machine learning - session 4
 
Simply Data driven behavioural algorithms
Simply Data driven behavioural algorithmsSimply Data driven behavioural algorithms
Simply Data driven behavioural algorithms
 
Dwdm chapter 5 data mining a closer look
Dwdm chapter 5  data mining a closer lookDwdm chapter 5  data mining a closer look
Dwdm chapter 5 data mining a closer look
 
Media synergy studies_methodology
Media synergy studies_methodologyMedia synergy studies_methodology
Media synergy studies_methodology
 
Top 20 Data Science Interview Questions and Answers in 2023.pdf
Top 20 Data Science Interview Questions and Answers in 2023.pdfTop 20 Data Science Interview Questions and Answers in 2023.pdf
Top 20 Data Science Interview Questions and Answers in 2023.pdf
 
Presentation1.pptx
Presentation1.pptxPresentation1.pptx
Presentation1.pptx
 

Plus de Shakas Technologies

A Review on Deep-Learning-Based Cyberbullying Detection
A Review on Deep-Learning-Based Cyberbullying DetectionA Review on Deep-Learning-Based Cyberbullying Detection
A Review on Deep-Learning-Based Cyberbullying DetectionShakas Technologies
 
A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...
A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...
A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...Shakas Technologies
 
A Novel Framework for Credit Card.
A Novel Framework for Credit Card.A Novel Framework for Credit Card.
A Novel Framework for Credit Card.Shakas Technologies
 
NS2 Final Year Project Titles 2023- 2024
NS2 Final Year Project Titles 2023- 2024NS2 Final Year Project Titles 2023- 2024
NS2 Final Year Project Titles 2023- 2024Shakas Technologies
 
MATLAB Final Year IEEE Project Titles 2023-2024
MATLAB Final Year IEEE Project Titles 2023-2024MATLAB Final Year IEEE Project Titles 2023-2024
MATLAB Final Year IEEE Project Titles 2023-2024Shakas Technologies
 
Latest Python IEEE Project Titles 2023-2024
Latest Python IEEE Project Titles 2023-2024Latest Python IEEE Project Titles 2023-2024
Latest Python IEEE Project Titles 2023-2024Shakas Technologies
 
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...Shakas Technologies
 
CYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSE
CYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSECYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSE
CYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSEShakas Technologies
 
Detecting Mental Disorders in social Media through Emotional patterns-The cas...
Detecting Mental Disorders in social Media through Emotional patterns-The cas...Detecting Mental Disorders in social Media through Emotional patterns-The cas...
Detecting Mental Disorders in social Media through Emotional patterns-The cas...Shakas Technologies
 
COMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTION
COMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTIONCOMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTION
COMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTIONShakas Technologies
 
CO2 EMISSION RATING BY VEHICLES USING DATA SCIENCE
CO2 EMISSION RATING BY VEHICLES USING DATA SCIENCECO2 EMISSION RATING BY VEHICLES USING DATA SCIENCE
CO2 EMISSION RATING BY VEHICLES USING DATA SCIENCEShakas Technologies
 
Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...
Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...
Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...Shakas Technologies
 
Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...
Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...
Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...Shakas Technologies
 
Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...
Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...
Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...Shakas Technologies
 
Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...
Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...
Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...Shakas Technologies
 
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...Shakas Technologies
 
Fighting Money Laundering With Statistics and Machine Learning.docx
Fighting Money Laundering With Statistics and Machine Learning.docxFighting Money Laundering With Statistics and Machine Learning.docx
Fighting Money Laundering With Statistics and Machine Learning.docxShakas Technologies
 
Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...
Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...
Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...Shakas Technologies
 
Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...
Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...
Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...Shakas Technologies
 
Effective Software Effort Estimation Leveraging Machine Learning for Digital ...
Effective Software Effort Estimation Leveraging Machine Learning for Digital ...Effective Software Effort Estimation Leveraging Machine Learning for Digital ...
Effective Software Effort Estimation Leveraging Machine Learning for Digital ...Shakas Technologies
 

Plus de Shakas Technologies (20)

A Review on Deep-Learning-Based Cyberbullying Detection
A Review on Deep-Learning-Based Cyberbullying DetectionA Review on Deep-Learning-Based Cyberbullying Detection
A Review on Deep-Learning-Based Cyberbullying Detection
 
A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...
A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...
A Personal Privacy Data Protection Scheme for Encryption and Revocation of Hi...
 
A Novel Framework for Credit Card.
A Novel Framework for Credit Card.A Novel Framework for Credit Card.
A Novel Framework for Credit Card.
 
NS2 Final Year Project Titles 2023- 2024
NS2 Final Year Project Titles 2023- 2024NS2 Final Year Project Titles 2023- 2024
NS2 Final Year Project Titles 2023- 2024
 
MATLAB Final Year IEEE Project Titles 2023-2024
MATLAB Final Year IEEE Project Titles 2023-2024MATLAB Final Year IEEE Project Titles 2023-2024
MATLAB Final Year IEEE Project Titles 2023-2024
 
Latest Python IEEE Project Titles 2023-2024
Latest Python IEEE Project Titles 2023-2024Latest Python IEEE Project Titles 2023-2024
Latest Python IEEE Project Titles 2023-2024
 
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
EMOTION RECOGNITION BY TEXTUAL TWEETS CLASSIFICATION USING VOTING CLASSIFIER ...
 
CYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSE
CYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSECYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSE
CYBER THREAT INTELLIGENCE MINING FOR PROACTIVE CYBERSECURITY DEFENSE
 
Detecting Mental Disorders in social Media through Emotional patterns-The cas...
Detecting Mental Disorders in social Media through Emotional patterns-The cas...Detecting Mental Disorders in social Media through Emotional patterns-The cas...
Detecting Mental Disorders in social Media through Emotional patterns-The cas...
 
COMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTION
COMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTIONCOMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTION
COMMERCE FAKE PRODUCT REVIEWS MONITORING AND DETECTION
 
CO2 EMISSION RATING BY VEHICLES USING DATA SCIENCE
CO2 EMISSION RATING BY VEHICLES USING DATA SCIENCECO2 EMISSION RATING BY VEHICLES USING DATA SCIENCE
CO2 EMISSION RATING BY VEHICLES USING DATA SCIENCE
 
Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...
Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...
Toward Effective Evaluation of Cyber Defense Threat Based Adversary Emulation...
 
Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...
Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...
Optimizing Numerical Weather Prediction Model Performance Using Machine Learn...
 
Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...
Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...
Nature-Based Prediction Model of Bug Reports Based on Ensemble Machine Learni...
 
Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...
Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...
Multi-Class Stress Detection Through Heart Rate Variability A Deep Neural Net...
 
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...
Identifying Hot Topic Trends in Streaming Text Data Using News Sequential Evo...
 
Fighting Money Laundering With Statistics and Machine Learning.docx
Fighting Money Laundering With Statistics and Machine Learning.docxFighting Money Laundering With Statistics and Machine Learning.docx
Fighting Money Laundering With Statistics and Machine Learning.docx
 
Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...
Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...
Explainable Artificial Intelligence for Patient Safety A Review of Applicatio...
 
Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...
Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...
Ensemble Deep Learning-Based Prediction of Fraudulent Cryptocurrency Transact...
 
Effective Software Effort Estimation Leveraging Machine Learning for Digital ...
Effective Software Effort Estimation Leveraging Machine Learning for Digital ...Effective Software Effort Estimation Leveraging Machine Learning for Digital ...
Effective Software Effort Estimation Leveraging Machine Learning for Digital ...
 

Dernier

Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
A Comparative Analysis of Sampling Techniques for Click-Through Rate Prediction in Native Advertising.

Base paper Title: A Comparative Analysis of Sampling Techniques for Click-Through Rate Prediction in Native Advertising

Abstract
Native advertising is a popular form of online advertising whose style and function resemble the native content displayed on online platforms such as news, sports, and social websites. Native ads can better capture users' attention and have gained increasing popularity on many online platforms and among advertisers. In advertising, Click-Through Rate (CTR) prediction is essential but challenging because of data sparsity: non-clicks constitute most of the data, whereas clicks form a significantly smaller portion. This study compares the performance of 19 class imbalance approaches, combined with four traditional classifiers, to determine the most effective imbalance methods for our native ads dataset. The data is real traffic from Finland, collected over seven days and provided by the native advertising platform ReadPeak. The resampling methods include seven undersampling techniques, four oversampling techniques, four hybrid sampling techniques, and four ensemble systems. The findings demonstrate that class imbalance learning can enhance the model's capacity for classification by as much as 20%. In general, oversampling is comparatively more stable, but undersampling performed best with Random Forest. The study also demonstrates that the imbalance ratio plays an important role in model performance and in feature importance.

Existing System
In a class-imbalanced dataset, one class has significantly fewer samples than the other [3]. The difficulties of learning from such imbalanced data are unavoidable. Standard learning classifiers are biased toward the majority class because of the skewed distribution of the training samples, which makes them unable to recognize rare occurrences. Noise can be mistaken for rare minority samples and vice versa [24]. This problem is particularly prevalent in the advertising industry: a dataset contains far more non-clicked advertisements than clicked ones, and the difference between the two is typically large. To address these issues, researchers have created numerous class imbalance techniques and performance evaluation criteria, which play an important role in this paper. In this context, we implemented several class imbalance methods on a real-world class-imbalanced dataset provided by the native advertising platform ReadPeak [1]. The dataset contains information about in-screen advertisements and whether they were clicked. The imbalance between clicks and non-clicks is extreme, with 250 non-clicks for every click.

Advertising is a key part of corporate operations. In 2021, advertisers are estimated to have spent 118.72 billion dollars on display advertising [52]. Under the cost-per-click pricing model of demand-side platforms (DSPs), advertisers' earnings are directly correlated with the number of clicks, so predicting performance indicators such as the click-through rate (CTR) is crucial for DSP research [35]. The available literature presents valid solutions to the problem at hand, but the data imbalance itself, and the gains in training speed and prediction accuracy that addressing it can bring, have been neglected. In that spirit, the purpose of the current work is to explore the effect of sampling techniques on prediction accuracy, in combination with a number of classical machine learning models applied to CTR prediction in native advertising.

The contributions of this paper are summarized as follows:
• The performance of 19 class imbalance methods with 4 classical classifiers is evaluated. The best-performing methods are identified based on the AUC and LogLoss metrics.
• The imbalanced data problem is analyzed using real-world data from ReadPeak, including a complete outline of the steps for feature engineering, selection, and data cleaning.
• Experiments show that feature importance varies with the imbalance ratio. The Boruta algorithm was used to compute feature importance at three levels of sampling, highlighting the importance of balancing the data for accurate results.
• Using Random Forest with random undersampling, it is shown that performance improves as the balance between positive and negative samples improves. However, there is a threshold of imbalance ratio beyond which the performance improvement becomes marginal.
• A real-world dataset is made available to the community to help researchers study the phenomenon and find solutions to CTR prediction in native advertising.

Drawbacks in Existing System
• Data Representation and Generalization: The choice of a particular sampling technique may bias the representation of the dataset. Oversampling or undersampling may impair the generalization of the models to real-world scenarios, especially when the dataset is not fully representative of the target population.
• Impact on Model Complexity: Some sampling techniques may inadvertently affect the complexity of the predictive models. For instance, oversampling may lead to overfitting, especially on small datasets, while undersampling may discard valuable information.
• Limited Transparency: Some advanced sampling techniques, especially those involving synthetic data generation, may lack transparency about how they modify the underlying data distribution. This can make it difficult to interpret the model's behavior and decision-making processes.
• Impact on Evaluation Metrics: The choice of sampling technique may influence the performance metrics used for evaluation. For instance, oversampling may artificially boost metrics such as accuracy without improving the model's ability to generalize to new, unseen data.

Proposed System
• Sampling Techniques: Implement and analyze various sampling techniques, including but not limited to random sampling, stratified sampling, undersampling, oversampling, and hybrid methods. Investigate advanced techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and ADASYN (Adaptive Synthetic Sampling).
• Predictive Models: Employ widely used machine learning models for click-through rate prediction, including logistic regression, decision trees, random forests, and gradient boosting machines. Use deep learning models, such as neural networks, to assess the impact of sampling techniques on more complex predictive architectures.
• Evaluation Metrics: Measure the performance of each sampling technique using standard evaluation metrics such as precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC-ROC). Assess the impact on model calibration and reliability.
• Comparison and Recommendations: Present a comparative analysis of the sampling techniques, highlighting their strengths, weaknesses, and suitability for different scenarios. Offer recommendations for selecting an appropriate sampling strategy based on specific campaign characteristics and data distributions.

Algorithm
• Stratified Sampling (based on target classes): Divides the dataset into strata (subgroups) based on the target classes and samples proportionally from each stratum, ensuring that every class is represented.
• ADASYN (Adaptive Synthetic Sampling): An adaptive extension of SMOTE that focuses on generating synthetic instances in regions of the minority class that are harder to learn.
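SMOTE and ADASYN both create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates only that shared interpolation step in plain Python. It is not the full ADASYN algorithm, which additionally generates more samples around minority points that have many majority-class neighbours. The function name `smote_like_oversample`, the parameter `k`, and the toy data are assumptions made for illustration and do not come from the paper.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical feature vectors of clicked ads (the minority class).
clicks = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_clicks = smote_like_oversample(clicks, n_new=6)
print(len(new_clicks), new_clicks[0])
```

Because every synthetic point lies on a segment between two real minority points, the new samples stay inside the region already occupied by the minority class. This is also why such methods can amplify noise when the minority samples themselves are noisy.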
• Cluster-Based Sampling (K-means clustering): Groups instances into clusters and samples proportionally from each cluster, which keeps the sampling local to each cluster.

Advantages
• Addressing Class Imbalance: Class imbalance is a common issue in CTR prediction, where non-click instances often far outnumber click instances. Sampling techniques such as oversampling and undersampling help balance the class distribution so that the model is not biased toward the majority class.
• Enhanced Generalization to Unseen Data: Effective sampling techniques contribute to the generalization of predictive models to new, unseen data. This is crucial in native advertising, where the goal is to predict CTR for new ad campaigns.
• Mitigation of Bias and Unfairness: Carefully chosen sampling techniques can help mitigate biases in predictive models, ensuring fair representation and treatment of different classes. This is essential for ethical considerations in advertising.
• Increased Interpretability: Certain sampling techniques, such as undersampling or cluster-based sampling, reduce the number of instances, which can make model decisions easier to interpret. Enhanced interpretability is valuable for stakeholders who need to understand the factors influencing CTR predictions.
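The class-balancing idea above can be made concrete with random undersampling toward a chosen imbalance ratio, starting from the roughly 250:1 ratio reported for the dataset. The sketch below is a minimal plain-Python illustration, not the paper's implementation. The names `undersample_to_ratio` and `correct_probability` and the toy labels are assumptions. The second function applies a standard prior correction for negative downsampling, which is not described in the source but is relevant when calibration is assessed, because a model trained on undersampled data predicts inflated click probabilities.

```python
import random


def undersample_to_ratio(labels, ratio, seed=0):
    """Return sorted row indices that keep every click (label 1) and at most
    `ratio` non-clicks (label 0) per click, chosen uniformly at random."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), int(ratio * len(pos)))
    return sorted(pos + rng.sample(neg, n_keep))


def correct_probability(p, keep_rate):
    """Map a probability predicted on undersampled data back to the original
    scale; keep_rate is the fraction of non-clicks that was retained."""
    return p / (p + (1.0 - p) / keep_rate)


# Toy labels with a 250:1 imbalance: 4 clicks and 1000 non-clicks.
labels = [1] * 4 + [0] * 1000
kept = undersample_to_ratio(labels, ratio=10)
kept_labels = [labels[i] for i in kept]
keep_rate = kept_labels.count(0) / labels.count(0)

print(kept_labels.count(1), kept_labels.count(0))
print(correct_probability(0.5, keep_rate))
```

Varying `ratio` over a range of values and retraining the classifier each time is one way to look for the point beyond which further balancing brings only marginal improvement.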
Hardware Specification
• Processor: Intel Core i3
• RAM: 4 GB
• Hard disk: 500 GB

Software Specification
• Operating System: Windows 10/11
• Front End: Python
• Back End: MySQL Server
• IDE Tools: PyCharm
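The sampling methods are ranked by AUC and LogLoss. In the Python environment listed above, both metrics can be sketched without extra dependencies, as shown below. This is an illustrative implementation under stated assumptions, not the evaluation code used in the study. The function names and the toy scores are hypothetical, and the pairwise AUC is quadratic in the number of samples, so it suits small checks rather than full traffic logs.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under y_prob."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def roc_auc(y_true, y_prob):
    """AUC as the fraction of (click, non-click) pairs in which the click
    receives the higher score, counting ties as one half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = click) and predicted click probabilities.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
print(roc_auc(y_true, y_prob), log_loss(y_true, y_prob))
```

AUC depends only on how the predictions are ordered, so it is unaffected by the probability shift that resampling introduces. LogLoss depends on the probability values themselves, which is why the two metrics can rank sampling methods differently.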