DATA SCIENCE
PROJECT
NYC SHOOTINGS CLUSTER ANALYSIS
AGENDA:
• OBJECTIVE
• DATA PREPROCESSING
• DATA CLEANING & FORMATTING
• DROPPING COLUMNS
• FEATURE ENGINEERING
• CREATING DUMMY VARIABLES
• DATA SUMMARIZATION / DESCRIPTIVE STATISTICS
• FINDING K VALUE
• CLUSTERING MODEL DEVELOPMENT
• VISUALIZATION
• CONCLUSION
OBJECTIVE:
• THE GOAL OF THIS PROJECT IS TO DEVELOP A MACHINE LEARNING MODEL
THAT CAN CLUSTER SHOOTING INCIDENTS IN NEW YORK CITY BASED ON
RELEVANT ATTRIBUTES SUCH AS OCCURRENCE DATE AND TIME, LOCATION,
DEMOGRAPHIC INFORMATION OF PERPETRATORS AND VICTIMS, AND
JURISDICTION.
• BY IDENTIFYING CLUSTERS OF SIMILAR INCIDENTS, LAW ENFORCEMENT
AGENCIES CAN BETTER UNDERSTAND THE UNDERLYING DYNAMICS OF GUN
VIOLENCE AND TAILOR THEIR INTERVENTIONS ACCORDINGLY.
DATA PREPROCESSING
• DATA PREPROCESSING INVOLVES CLEANING, FORMATTING, AND
TRANSFORMING RAW DATA INTO A MORE SUITABLE FORMAT FOR ANALYSIS AND
MODELING.
• COMMON TASKS IN DATA PREPROCESSING INCLUDE HANDLING MISSING
VALUES, DEALING WITH OUTLIERS, SCALING FEATURES, ENCODING CATEGORICAL
VARIABLES, AND SPLITTING THE DATA INTO TRAINING AND TESTING SETS.
• THE GOAL OF DATA PREPROCESSING IS TO MAKE THE DATA READY FOR
ANALYSIS AND MODELING BY ENSURING ITS QUALITY, CONSISTENCY, AND
COMPATIBILITY WITH THE MACHINE LEARNING ALGORITHMS.
DATA CLEANING & FORMATTING
• FIRST, I CREATED A COPY OF THE GIVEN DATA SO THAT THE CLEANING
PROCESS COULD BE PERFORMED WHILE THE ORIGINAL DATA REMAINED UNALTERED.
• THEN I IMPORTED THE DATA AND CHECKED FOR THE REQUIRED COLUMNS.
• THEN I DROPPED THE COLUMNS THAT WERE NOT REQUIRED AND THE
COLUMNS THAT CONTAINED MANY BLANK VALUES.
• THEN, IN THE GIVEN DATASET, I CHANGED THE “NULL” AND “UNIDENTIFIED”
VALUES TO “UNKNOWN” FOR EASY IDENTIFICATION.
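A minimal sketch of these cleaning steps in pandas. The toy table, the column names PERP_SEX and LOCATION_DESC, and the choice of which column to drop are assumptions for illustration only; the slides do not show the actual code. Treating true missing values the same as the "NULL" placeholder is also an assumption.

```python
import pandas as pd

# Toy stand-in for the raw shootings table (values and most column names are assumed).
raw = pd.DataFrame({
    "BORO": ["BRONX", "QUEENS", "BROOKLYN", "BRONX"],
    "PERP_SEX": ["M", "NULL", None, "UNIDENTIFIED"],
    "LOCATION_DESC": [None, None, None, "GROCERY"],
})

# Work on a copy so the original data stays unaltered.
df = raw.copy()

# Drop a column that is not required or has many blanks (hypothetical choice).
df = df.drop(columns=["LOCATION_DESC"])

# Map the placeholder strings, and true missing values, to one "UNKNOWN" label.
placeholders = ["NULL", "UNIDENTIFIED"]
df["PERP_SEX"] = df["PERP_SEX"].replace(placeholders, "UNKNOWN").fillna("UNKNOWN")

print(df)
```

Because the work happens on `df`, the `raw` frame still holds the original placeholder values and the dropped column.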
DROPPING COLUMNS
• THESE WERE THE COLUMNS THAT I DROPPED.
• THESE COLUMNS WERE DROPPED BECAUSE THEY CONTAINED EITHER TOO LITTLE
DATA OR UNWANTED DATA.
DROPPING COLUMNS
• DROPPING COLUMNS REFERS TO THE PROCESS OF REMOVING CERTAIN COLUMNS OR
VARIABLES FROM A DATASET. THIS IS OFTEN DONE DURING THE DATA PREPROCESSING PHASE
WHEN SOME COLUMNS ARE DEEMED UNNECESSARY OR REDUNDANT FOR THE ANALYSIS OR
MODELING TASK AT HAND. HERE ARE SOME COMMON SCENARIOS WHERE DROPPING COLUMNS
MIGHT BE NECESSARY:
• IRRELEVANT FEATURES: SOME COLUMNS MAY NOT CONTRIBUTE RELEVANT INFORMATION TO
THE ANALYSIS OR PREDICTION TASK.
• HIGHLY CORRELATED FEATURES: IF TWO OR MORE COLUMNS ARE HIGHLY CORRELATED,
MEANING THEY CONTAIN SIMILAR INFORMATION, DROPPING ONE OF THEM CAN REDUCE
REDUNDANCY AND MULTICOLLINEARITY IN THE DATASET. THIS CAN IMPROVE THE STABILITY
AND INTERPRETABILITY OF THE MODELS.
• MISSING VALUES: IF A COLUMN HAS A HIGH PERCENTAGE OF MISSING VALUES AND IMPUTATION
ISN'T FEASIBLE OR APPROPRIATE, DROPPING THE COLUMN MIGHT BE NECESSARY TO MAINTAIN
THE INTEGRITY OF THE DATASET.
• DATA LEAKAGE: COLUMNS THAT CONTAIN INFORMATION ABOUT THE TARGET VARIABLE OR ARE
DERIVED FROM THE TARGET VARIABLE SHOULD BE REMOVED TO PREVENT DATA LEAKAGE,
WHICH COULD ARTIFICIALLY INFLATE THE MODEL'S PERFORMANCE DURING TRAINING.
• COMPUTATIONAL EFFICIENCY: LARGE DATASETS WITH A LARGE NUMBER OF COLUMNS CAN BE
COMPUTATIONALLY EXPENSIVE TO PROCESS AND TRAIN MODELS ON. DROPPING IRRELEVANT
OR REDUNDANT COLUMNS CAN HELP REDUCE THE DIMENSIONALITY OF THE DATASET AND
IMPROVE COMPUTATIONAL EFFICIENCY.
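Two of the scenarios above, missing values and highly correlated features, can be checked mechanically. The sketch below is one possible way to do it in pandas; the helper name `drop_sparse_and_correlated`, the thresholds 0.5 and 0.95, and the demo columns are all hypothetical and do not come from the slides.

```python
import numpy as np
import pandas as pd

def drop_sparse_and_correlated(df, max_missing=0.5, max_corr=0.95):
    """Drop mostly-missing columns, then one column of each highly correlated numeric pair."""
    # Columns whose share of missing values exceeds the threshold.
    sparse = df.columns[df.isna().mean() > max_missing].tolist()
    kept = df.drop(columns=sparse)

    # Absolute correlations between numeric columns, upper triangle only,
    # so each pair is inspected once and the first column of a pair is kept.
    corr = kept.select_dtypes("number").corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [col for col in upper.columns if (upper[col] > max_corr).any()]

    return kept.drop(columns=redundant), sparse, redundant

# Hypothetical demo data.
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "x_copy": [2.0, 4.0, 6.0, 8.0],          # perfectly correlated with x
    "y": [4.0, 1.0, 3.0, 2.0],
    "mostly_blank": [None, None, None, "a"],  # 75% missing
})

reduced, sparse, redundant = drop_sparse_and_correlated(demo)
print(sparse, redundant, list(reduced.columns))
```

Irrelevant features and leakage columns still need a judgment call by the analyst; no threshold identifies them automatically.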
FEATURE ENGINEERING
• FEATURE ENGINEERING FOCUSES ON CREATING NEW FEATURES OR MODIFYING EXISTING
ONES TO IMPROVE THE PERFORMANCE OF MACHINE LEARNING MODELS. THIS PROCESS
INVOLVES SELECTING, TRANSFORMING, OR COMBINING FEATURES TO EXTRACT USEFUL
INFORMATION AND REPRESENT THE DATA MORE EFFECTIVELY.
• FEATURE ENGINEERING TECHNIQUES INCLUDE CREATING POLYNOMIAL FEATURES,
BINNING, DISCRETIZATION, DIMENSIONALITY REDUCTION (E.G., PCA), FEATURE SCALING,
AND CREATING INTERACTION TERMS.
• THE GOAL OF FEATURE ENGINEERING IS TO ENHANCE THE PREDICTIVE POWER OF THE
MODEL BY PROVIDING IT WITH MORE INFORMATIVE AND DISCRIMINATIVE FEATURES,
ULTIMATELY IMPROVING ITS ACCURACY AND GENERALIZATION ABILITY.
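As an illustration of three of these techniques (deriving new features, binning, and feature scaling), here is a small sketch on made-up incident rows. The column names OCCUR_DATE and OCCUR_TIME, the date format, the bin edges, and the derived column names OCCUR_HOUR and DAY_PART are assumptions; the slides do not state which features were actually engineered.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows; only the general shape mirrors a shootings table.
events = pd.DataFrame({
    "OCCUR_DATE": ["01/15/2019", "07/04/2020", "11/30/2021"],
    "OCCUR_TIME": ["02:30:00", "14:05:00", "22:45:00"],
    "LATITUDE": [40.85, 40.70, 40.65],
    "LONGITUDE": [-73.90, -73.80, -73.95],
})

# New features derived from the date and time strings.
events["OCCUR_YEAR"] = pd.to_datetime(events["OCCUR_DATE"], format="%m/%d/%Y").dt.year
events["OCCUR_HOUR"] = pd.to_datetime(events["OCCUR_TIME"], format="%H:%M:%S").dt.hour

# Binning: group the hour into coarse parts of the day, using [start, end) intervals.
events["DAY_PART"] = pd.cut(
    events["OCCUR_HOUR"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,
)

# Feature scaling: zero mean and unit variance per column.
scaled = StandardScaler().fit_transform(events[["LATITUDE", "LONGITUDE"]])

print(events[["OCCUR_YEAR", "OCCUR_HOUR", "DAY_PART"]])
```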
CREATING DUMMY VARIABLES
• CREATING DUMMY VARIABLES REFERS TO THE PROCESS OF CONVERTING
CATEGORICAL VARIABLES INTO A SET OF BINARY VARIABLES, ALSO KNOWN
AS DUMMIES, THAT REPRESENT THE DIFFERENT CATEGORIES OR LEVELS OF
THE ORIGINAL VARIABLE.
• IN SUMMARY, CREATING DUMMY VARIABLES IS A TECHNIQUE USED TO
ENCODE CATEGORICAL VARIABLES INTO A FORMAT THAT CAN BE UTILIZED BY
MACHINE LEARNING ALGORITHMS.
CREATING DUMMY VARIABLES
• USING THE ABOVE CODE, I CREATED DUMMY VARIABLES FOR CERTAIN
COLUMNS.
• SINCE THESE COLUMNS PLAY A MAJOR ROLE IN DEVELOPING THE MODEL, THEY
SHOULD NOT BE DROPPED, BUT THEY CANNOT REMAIN IN STRING FORMAT EITHER.
• THUS, DUMMY VARIABLES ARE CREATED.
• HERE, THE VALUES IN THE COLUMN “BORO” ARE STRINGS. SINCE THEY HAVE TO BE IN
NUMERICAL FORMAT, DUMMY VARIABLES ARE CREATED.
• EACH STRING VALUE BECOMES A COLUMN, AND ITS VALUES ARE GIVEN AS 0 OR 1
BASED ON FALSE OR TRUE.
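A minimal sketch of dummy encoding for the BORO column with pandas. The sample rows are made up, and `pd.get_dummies` is one common way to do this; the slides do not show which call was actually used.

```python
import pandas as pd

# Made-up rows; BORO holds string categories that a model cannot use directly.
df = pd.DataFrame({
    "BORO": ["BRONX", "QUEENS", "BRONX", "BROOKLYN"],
    "OCCUR_YEAR": [2019, 2020, 2020, 2021],
})

# Each distinct BORO value becomes its own 0/1 column; the string column is removed.
encoded = pd.get_dummies(df, columns=["BORO"], dtype=int)

print(encoded)
```

Each row has exactly one 1 across the BORO_* columns, marking the borough that row belonged to.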
DATA SUMMARIZATION / DESCRIPTIVE STATISTICS
• DATA SUMMARIZATION IS THE PROCESS OF CONDENSING AND PRESENTING KEY
CHARACTERISTICS OR INSIGHTS FROM A DATASET. IT INVOLVES VARIOUS TECHNIQUES FOR
SUMMARIZING AND ANALYZING DATA TO GAIN A BETTER UNDERSTANDING OF ITS STRUCTURE,
PATTERNS, AND RELATIONSHIPS.
• THE df.describe() METHOD IS COMMONLY USED IN PYTHON WITH THE PANDAS LIBRARY TO
GENERATE DESCRIPTIVE STATISTICS OF A DATAFRAME. IT PROVIDES SUMMARY STATISTICS FOR
NUMERICAL COLUMNS IN THE DATAFRAME SUCH AS COUNT, MEAN, STANDARD DEVIATION, MINIMUM,
MAXIMUM, AND QUARTILE VALUES.
• USING df.describe() IS A QUICK WAY TO GET AN OVERVIEW OF THE DISTRIBUTION AND CENTRAL
TENDENCY OF NUMERICAL DATA IN A DATAFRAME. IT HELPS IN UNDERSTANDING THE RANGE OF
VALUES, PRESENCE OF OUTLIERS, AND OVERALL SHAPE OF THE DATA.
DATA SUMMARIZATION / DESCRIPTIVE STATISTICS
HERE'S WHAT EACH STATISTIC REPRESENTS:
• COUNT: NUMBER OF NON-NULL VALUES IN EACH COLUMN.
• MEAN: AVERAGE VALUE OF EACH COLUMN.
• STD: STANDARD DEVIATION, A MEASURE OF THE DISPERSION OF VALUES AROUND THE MEAN.
• MIN: MINIMUM VALUE IN EACH COLUMN.
• 25%: FIRST QUARTILE, OR 25TH PERCENTILE.
• 50%: MEDIAN, OR 50TH PERCENTILE.
• 75%: THIRD QUARTILE, OR 75TH PERCENTILE.
• MAX: MAXIMUM VALUE IN EACH COLUMN.
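The statistics listed above are the rows of the table that `describe()` returns. A small sketch on made-up numbers (the values are illustrative, not taken from the project data):

```python
import pandas as pd

# Made-up numeric columns standing in for the project DataFrame.
df = pd.DataFrame({
    "OCCUR_YEAR": [2019, 2020, 2020, 2021],
    "LATITUDE": [40.85, 40.70, 40.65, 40.80],
})

# One row per statistic, one column per numeric column of df.
summary = df.describe()

print(summary)
```

Note that the `std` row reported by pandas is the sample standard deviation (it divides by n - 1).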
FINDING K VALUE
• IMPORT LIBRARIES: "from sklearn.cluster import KMeans". THIS LINE IMPORTS THE KMEANS
CLUSTERING ALGORITHM FROM THE SCIKIT-LEARN LIBRARY, WHICH IS A WIDELY USED MACHINE
LEARNING LIBRARY IN PYTHON.
• INITIALIZE AN EMPTY LIST: "wcss = []". THIS LINE INITIALIZES AN EMPTY LIST CALLED wcss. IT WILL BE
USED TO STORE THE WITHIN-CLUSTER SUM OF SQUARES (WCSS) FOR DIFFERENT VALUES OF K.
• LOOP OVER K VALUES: THIS LOOP ITERATES OVER A RANGE OF VALUES FOR K FROM 1 TO 10.
• INSTANTIATE KMEANS MODEL: "kmeans = KMeans(n_clusters=k, init="k-means++")". INSIDE THE
LOOP, A KMEANS MODEL IS INSTANTIATED WITH n_clusters=k, WHERE k IS THE CURRENT VALUE OF K.
• THE init ARGUMENT SPECIFIES THE INITIALIZATION METHOD FOR CENTROIDS, WHICH IS "k-means++". THIS
INITIALIZATION METHOD HELPS IN CHOOSING INITIAL CLUSTER CENTROIDS IN A WAY THAT SPEEDS UP
CONVERGENCE.
FINDING K VALUE
• FIT KMEANS MODEL: THE KMEANS MODEL IS FITTED TO THE DATA USING THE fit METHOD. THE DATA
USED FOR CLUSTERING IS OBTAINED FROM THE DATAFRAME df BY EXCLUDING THE FIRST COLUMN.
THIS ASSUMES THAT THE FIRST COLUMN CONTAINS LABELS OR IDENTIFIERS AND THE REMAINING
COLUMNS ARE FEATURES USED FOR CLUSTERING.
• COMPUTE WCSS: THE WITHIN-CLUSTER SUM OF SQUARES (WCSS) IS COMPUTED. WCSS REPRESENTS
THE SUM OF SQUARED DISTANCES OF SAMPLES TO THEIR CLOSEST CLUSTER CENTER.
• AFTER COMPUTING WCSS FOR ALL VALUES OF K, A LINE PLOT IS CREATED.
• THE X-AXIS REPRESENTS THE VALUES OF K (FROM 1 TO 10), AND THE Y-AXIS REPRESENTS THE
CORRESPONDING WCSS VALUES. THE PLOT VISUALIZES THE RELATIONSHIP BETWEEN K VALUES AND
WCSS.
• FINALLY, THE PLOT IS DISPLAYED.
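The steps described on these slides can be sketched as follows. The synthetic data from `make_blobs`, the column names, the "ID" first column, and the `n_init` and `random_state` arguments are assumptions added so the example runs on its own and gives repeatable results; only the import, the `wcss` list, the loop over k from 1 to 10, and `init="k-means++"` come from the slide text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the prepared DataFrame: an identifier column first, features after.
features, _ = make_blobs(n_samples=200, centers=3, n_features=3, random_state=0)
df = pd.DataFrame(features, columns=["f1", "f2", "f3"])
df.insert(0, "ID", range(len(df)))

X = df.iloc[:, 1:]  # exclude the first column, as the slide describes

wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares

# Line plot of WCSS against k; the "elbow" suggests a reasonable number of clusters.
fig, ax = plt.subplots()
ax.plot(range(1, 11), wcss, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("WCSS")
ax.set_title("Elbow method")
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and calling `plt.show()` displays the plot. The value of k is usually read off where the curve stops falling steeply.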
FINDING K VALUE
1. KMEANS CLUSTERING:
• START BY INITIALIZING A KMEANS CLUSTERING MODEL WITH 2 CLUSTERS.
• THEN THE fit_predict METHOD IS USED TO BOTH FIT THE MODEL TO THE DATA AND PREDICT THE
CLUSTER LABELS FOR EACH DATA POINT. THE CLUSTER LABELS ARE ASSIGNED TO THE DATAFRAME df
AS A NEW COLUMN NAMED "LABEL".
2. 3D SCATTER PLOT VISUALIZATION:
• FOR THE 3D SCATTER PLOT VISUALIZATION, CREATE A NEW FIGURE WITH A SPECIFIED SIZE FOR THE PLOT.
• THEN CREATE A 3D SUBPLOT WITHIN THE FIGURE.
• SCATTER PLOTS ARE CREATED FOR EACH CLUSTER LABEL.
CLUSTERING MODEL DEVELOPMENT
• THEN CREATE A SCATTER PLOT FOR DATA POINTS BELONGING TO CLUSTER LABEL 0. THE X, Y, AND Z
COORDINATES ARE SPECIFIED AS OCCUR_YEAR, LONGITUDE, AND LATITUDE, RESPECTIVELY.
• DATA POINTS BELONGING TO THIS CLUSTER ARE PLOTTED IN BLUE. SIMILAR SCATTER PLOTS ARE
CREATED FOR OTHER CLUSTER LABELS (E.G., CLUSTER 1) WITH DIFFERENT COLORS (E.G., RED).
• THEN ADJUST THE VIEW ANGLE OF THE 3D PLOT.
• THEN DISPLAY A LEGEND SHOWING THE CLUSTER LABELS.
• THUS, THE 3D SCATTER PLOT IS SUCCESSFULLY COMPLETED.
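A sketch of the clustering and 3D plotting steps described above. The random data, the figure size, the view angles, and the `n_init` and `random_state` arguments are assumptions for illustration; the two clusters, the `fit_predict` call, the "LABEL" column, the OCCUR_YEAR / LONGITUDE / LATITUDE axes, and the blue and red colors follow the slide text.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Random stand-in data roughly in the range of New York City coordinates.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "OCCUR_YEAR": rng.integers(2006, 2023, size=100),
    "LONGITUDE": rng.uniform(-74.05, -73.70, size=100),
    "LATITUDE": rng.uniform(40.55, 40.90, size=100),
})

# Fit a 2-cluster model and store each row's cluster label in a new column.
features = df[["OCCUR_YEAR", "LONGITUDE", "LATITUDE"]]
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
df["LABEL"] = kmeans.fit_predict(features)

# One 3D scatter per cluster label, each in its own color.
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection="3d")
for label, color in [(0, "blue"), (1, "red")]:
    part = df[df["LABEL"] == label]
    ax.scatter(part["OCCUR_YEAR"], part["LONGITUDE"], part["LATITUDE"],
               c=color, label=f"Cluster {label}")

ax.set_xlabel("OCCUR_YEAR")
ax.set_ylabel("LONGITUDE")
ax.set_zlabel("LATITUDE")
ax.view_init(elev=30, azim=45)  # adjust the viewing angle
ax.legend()
```

Because k-means uses Euclidean distance, a column with a much larger numeric range (here the year) tends to dominate the clusters unless the features are scaled first.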
VISUALIZATION
• DATA EXPLORATION AND PREPROCESSING INVOLVE UTILIZING VISUALIZATION TECHNIQUES SUCH AS
HISTOGRAMS, SCATTER PLOTS, BOX PLOTS, AND HEATMAPS. THESE VISUALIZATIONS AID IN
UNDERSTANDING THE DISTRIBUTION, RELATIONSHIPS, AND POTENTIAL OUTLIERS IN THE DATA. THEY
ARE CRUCIAL FOR MAKING DECISIONS ABOUT PREPROCESSING STEPS SUCH AS FEATURE SCALING,
OUTLIER REMOVAL, AND FEATURE ENGINEERING.
• IN THIS PROJECT, I VISUALIZED THE GIVEN DATA IN POWER BI. THE LINK TO MY POWER BI
REPRESENTATION IS GIVEN BELOW.
• LINK:
https://app.powerbi.com/view?r=eyJrIjoiYzU2MGMzMGEtZGYwZS00MDY2LWI0YTItOTI4MGY2ZGNhNWI0IiwidCI6IjUzODhhOWI3LWUzOWQtNDZhMS1hZDQ5LTRiMjMwMjg5MzYzYiJ9
CONCLUSION
• THUS, A CLUSTERING MODEL WAS DEVELOPED, AND THESE STEPS PLAYED A SIGNIFICANT
ROLE IN DEVELOPING IT:
• DATA PREPROCESSING
• FEATURE ENGINEERING
• CLUSTERING MODEL DEVELOPMENT
• VISUALIZATION

Unveiling the Patterns: A Cluster Analysis of NYC Shootings

  • 2. • OBJECTIVE • DATAPREPROCESSING • DATACLEANING &FORMATTING • DROPPING COLUMNS • FEATUREENGINEERING • CREATINGDUMMY VARIABLES • DATASUMMARIZATION/ DESCRIPTIVESTATISTICS • FINDINGKVALUE • CLUSTERINGMODEL DEVELOPMENT • VISUALIZATION • CONCLUSION AGENDA:
  • 4. DATA PREPROCESSING • DATAPREPROCESSINGINVOLVESCLEANING,FORMATTING,AND TRANSFORMINGRAWDATAINTOA MORESUITABLEFORMATFORANALYSISAND MODELING. • COMMONTASKSINDATAPREPROCESSINGINCLUDEHANDLINGMISSING VALUES,DEALINGWITHOUTLIERS,SCALINGFEATURES,ENCODINGCATEGORICAL VARIABLES,ANDSPLITTINGTHEDATAINTOTRAININGANDTESTINGSETS. • THEGOALOFDATAPREPROCESSINGISTOMAKETHEDATAREADYFOR ANALYSISANDMODELING BYENSURINGITSQUALITY,CONSISTENCY,AND COMPATIBILITYWITHTHEMACHINELEARNINGALGORITHMS.
  • 5. DATA CLEANING & FORMATTING • FIRSTICREATEDACOPY OFTHEGIVENDATAINORDER TOPERFORMTHE CLEANING PROCESSBYHAVING THE ORGINALDATAUNALTERED • THENIIMPORTEDTHE DATAANDCHECKED FORTHEREQUIREDCOLUMNS • THEN IDROPPEDTHE COLUMNS THATARENOTREQUIREDANDTHE COLUMNS THATCONTAINEDMOREBLANKVALUES • THENINTHEGIVEN DATASETICHANGEDTHE “NULL” AND“UNIDENTIFIED” VALUESINTO“UNKNOWN”FOREASYIDENTIFICATION
  • 6. DROPPING COLUMNS • THESE WERE THE COLUMNS THAT I DROPPED. • THESE COLUMNS WERE DROPPED BECAUSE THEY CONTAINED EITHER TOO LITTLE DATA OR UNWANTED DATA.
  • 7. DROPPING COLUMNS • DROPPING COLUMNS REFERS TO THE PROCESS OF REMOVING CERTAIN COLUMNS OR VARIABLES FROM A DATASET. THIS IS OFTEN DONE DURING THE DATA PREPROCESSING PHASE WHEN SOME COLUMNS ARE DEEMED UNNECESSARY OR REDUNDANT FOR THE ANALYSIS OR MODELING TASK AT HAND. HERE ARE SOME COMMON SCENARIOS WHERE DROPPING COLUMNS MIGHT BE NECESSARY: • IRRELEVANT FEATURES: SOME COLUMNS MAY NOT CONTRIBUTE RELEVANT INFORMATION TO THE ANALYSIS OR PREDICTION TASK. • HIGHLY CORRELATED FEATURES: IF TWO OR MORE COLUMNS ARE HIGHLY CORRELATED, MEANING THEY CONTAIN SIMILAR INFORMATION, DROPPING ONE OF THEM CAN REDUCE REDUNDANCY AND MULTICOLLINEARITY IN THE DATASET. THIS CAN IMPROVE THE STABILITY AND INTERPRETABILITY OF THE MODELS.
  • 8. DROPPING COLUMNS (CONTINUED) • MISSING VALUES: IF A COLUMN HAS A HIGH PERCENTAGE OF MISSING VALUES AND IMPUTATION ISN'T FEASIBLE OR APPROPRIATE, DROPPING THE COLUMN MIGHT BE NECESSARY TO MAINTAIN THE INTEGRITY OF THE DATASET. • DATA LEAKAGE: COLUMNS THAT CONTAIN INFORMATION ABOUT THE TARGET VARIABLE OR ARE DERIVED FROM THE TARGET VARIABLE SHOULD BE REMOVED TO PREVENT DATA LEAKAGE, WHICH COULD ARTIFICIALLY INFLATE THE MODEL'S PERFORMANCE DURING TRAINING. • COMPUTATIONAL EFFICIENCY: LARGE DATASETS WITH A LARGE NUMBER OF COLUMNS CAN BE COMPUTATIONALLY EXPENSIVE TO PROCESS AND TRAIN MODELS ON. DROPPING IRRELEVANT OR REDUNDANT COLUMNS CAN HELP REDUCE THE DIMENSIONALITY OF THE DATASET AND IMPROVE COMPUTATIONAL EFFICIENCY.
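Two of the scenarios above, columns with many missing values and highly correlated columns, lend themselves to a small sketch. The helper names `drop_sparse_columns` and `drop_correlated_columns`, the thresholds, and the toy data are all hypothetical; the slides do not say which rules, if any, were applied programmatically.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing=0.5):
    """Keep only columns whose share of missing values is at most max_missing."""
    missing_share = df.isna().mean()
    return df.loc[:, missing_share <= max_missing]

def drop_correlated_columns(df, threshold=0.95):
    """Drop the later column of each numeric pair whose |correlation| exceeds threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy data: "b" duplicates the information in "a"; "d" is mostly missing.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 1, 3, 2],
    "d": [np.nan, np.nan, np.nan, 1.0],
})

reduced = drop_correlated_columns(drop_sparse_columns(toy))
print(list(reduced.columns))
```

The thresholds are judgment calls: a stricter missing-value cutoff discards more columns, and a lower correlation cutoff removes more near-duplicates at the risk of losing useful signal.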
  • 9. FEATURE ENGINEERING • FEATURE ENGINEERING FOCUSES ON CREATING NEW FEATURES OR MODIFYING EXISTING ONES TO IMPROVE THE PERFORMANCE OF MACHINE LEARNING MODELS. THIS PROCESS INVOLVES SELECTING, TRANSFORMING, OR COMBINING FEATURES TO EXTRACT USEFUL INFORMATION AND REPRESENT THE DATA MORE EFFECTIVELY. • FEATURE ENGINEERING TECHNIQUES INCLUDE CREATING POLYNOMIAL FEATURES, BINNING, DISCRETIZATION, DIMENSIONALITY REDUCTION (E.G., PCA), FEATURE SCALING, AND CREATING INTERACTION TERMS. • THE GOAL OF FEATURE ENGINEERING IS TO ENHANCE THE PREDICTIVE POWER OF THE MODEL BY PROVIDING IT WITH MORE INFORMATIVE AND DISCRIMINATIVE FEATURES, ULTIMATELY IMPROVING ITS ACCURACY AND GENERALIZATION ABILITY.
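As one possible illustration of the techniques listed above, the sketch below derives date and time features and bins the hour of day. The column names OCCUR_DATE and OCCUR_TIME, their string formats, and the bin labels are assumptions; the slides do not show which features were actually engineered, only that OCCUR_YEAR is used later for plotting.

```python
import pandas as pd

# Hypothetical rows; the date and time formats are assumed for the example.
df = pd.DataFrame({
    "OCCUR_DATE": ["01/15/2019", "07/04/2021"],
    "OCCUR_TIME": ["23:10:00", "02:45:00"],
})

# Transform the raw strings into numeric features.
dates = pd.to_datetime(df["OCCUR_DATE"], format="%m/%d/%Y")
df["OCCUR_YEAR"] = dates.dt.year
df["OCCUR_MONTH"] = dates.dt.month
df["OCCUR_HOUR"] = pd.to_datetime(df["OCCUR_TIME"], format="%H:%M:%S").dt.hour

# Binning: group the hour into four left-closed intervals [0, 6), [6, 12), ...
df["TIME_OF_DAY"] = pd.cut(
    df["OCCUR_HOUR"],
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=["NIGHT", "MORNING", "AFTERNOON", "EVENING"],
)

print(df)
```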
  • 10. CREATING DUMMY VARIABLES • CREATING DUMMY VARIABLES REFERS TO THE PROCESS OF CONVERTING CATEGORICAL VARIABLES INTO A SET OF BINARY VARIABLES, ALSO KNOWN AS DUMMIES, THAT REPRESENT THE DIFFERENT CATEGORIES OR LEVELS OF THE ORIGINAL VARIABLE. • IN SUMMARY, CREATING DUMMY VARIABLES IS A TECHNIQUE USED TO ENCODE CATEGORICAL VARIABLES INTO A FORMAT THAT CAN BE UTILIZED BY MACHINE LEARNING ALGORITHMS.
  • 11. CREATING DUMMY VARIABLES • BY USING THE ABOVE CODE I'VE CREATED DUMMY VARIABLES FOR CERTAIN COLUMNS. • SINCE THESE COLUMNS PLAY A MAJOR ROLE IN DEVELOPING A MODEL, THEY SHOULD NOT BE DROPPED, BUT THEY CANNOT REMAIN IN STRING FORMAT EITHER. • THUS DUMMY VARIABLES ARE CREATED.
  • 12. CREATING DUMMY VARIABLES • HERE, IN THE COLUMN “BORO”, THE VALUES ARE STRINGS. SINCE THEY HAVE TO BE IN NUMERICAL FORMAT, DUMMY VARIABLES ARE CREATED. • EACH STRING VALUE BECOMES A COLUMN, AND ITS VALUES ARE GIVEN AS 0 OR 1 DEPENDING ON WHETHER THE ROW BELONGS TO THAT CATEGORY (TRUE OR FALSE).
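The code shown on the original slide is not part of this transcript. A minimal sketch of the same idea, assuming pandas and a small made-up sample with a BORO column, could look like this:

```python
import pandas as pd

# Hypothetical sample; only the "BORO" column name comes from the slides.
df = pd.DataFrame({
    "BORO": ["BRONX", "QUEENS", "BRONX"],
    "OCCUR_YEAR": [2019, 2020, 2021],
})

# Each distinct BORO value becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["BORO"], dtype=int)

print(encoded)
```

Passing `dtype=int` yields 0 and 1 rather than the boolean False and True that recent pandas versions produce by default.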
  • 13. DATA SUMMARIZATION / DESCRIPTIVE STATISTICS • DATA SUMMARIZATION IS THE PROCESS OF CONDENSING AND PRESENTING KEY CHARACTERISTICS OR INSIGHTS FROM A DATASET. IT INVOLVES VARIOUS TECHNIQUES FOR SUMMARIZING AND ANALYZING DATA TO GAIN A BETTER UNDERSTANDING OF ITS STRUCTURE, PATTERNS, AND RELATIONSHIPS. • THE df.describe() METHOD OF THE PANDAS LIBRARY IS COMMONLY USED IN PYTHON TO GENERATE DESCRIPTIVE STATISTICS OF A DATAFRAME. IT PROVIDES SUMMARY STATISTICS FOR THE NUMERICAL COLUMNS IN THE DATAFRAME, SUCH AS COUNT, MEAN, STANDARD DEVIATION, MINIMUM, MAXIMUM, AND QUARTILE VALUES. • USING df.describe() IS A QUICK WAY TO GET AN OVERVIEW OF THE DISTRIBUTION AND CENTRAL TENDENCY OF NUMERICAL DATA IN A DATAFRAME. IT HELPS IN UNDERSTANDING THE RANGE OF VALUES, THE PRESENCE OF OUTLIERS, AND THE OVERALL SHAPE OF THE DATA.
  • 14. DATA SUMMARIZATION / DESCRIPTIVE STATISTICS HERE'S WHAT EACH STATISTIC REPRESENTS: • COUNT: NUMBER OF NON-NULL VALUES IN EACH COLUMN. • MEAN: AVERAGE VALUE OF EACH COLUMN. • STD: STANDARD DEVIATION, A MEASURE OF THE DISPERSION OF VALUES AROUND THE MEAN. • MIN: MINIMUM VALUE IN EACH COLUMN. • 25%: FIRST QUARTILE, OR 25TH PERCENTILE. • 50%: MEDIAN, OR 50TH PERCENTILE. • 75%: THIRD QUARTILE, OR 75TH PERCENTILE. • MAX: MAXIMUM VALUE IN EACH COLUMN.
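A short sketch of `describe()` on a made-up DataFrame shows the statistics listed above. The column names and values are hypothetical and serve only to demonstrate the output layout.

```python
import pandas as pd

# Hypothetical numeric sample for illustration.
df = pd.DataFrame({
    "OCCUR_YEAR": [2018, 2019, 2020, 2021],
    "PRECINCT": [40, 75, 75, 113],
})

# One row per statistic (count, mean, std, min, 25%, 50%, 75%, max),
# one column per numerical column of df.
summary = df.describe()

print(summary)
print(summary.loc["mean", "OCCUR_YEAR"])  # average year in the sample
```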
  • 15. FINDING K VALUE • IMPORT LIBRARIES: "from sklearn.cluster import KMeans". THIS LINE IMPORTS THE KMEANS CLUSTERING ALGORITHM FROM THE SCIKIT-LEARN LIBRARY, WHICH IS A WIDELY USED MACHINE LEARNING LIBRARY IN PYTHON. • INITIALIZE AN EMPTY LIST: "wcss = []". THIS LINE INITIALIZES AN EMPTY LIST CALLED wcss. IT WILL BE USED TO STORE THE WITHIN-CLUSTER SUM OF SQUARES (WCSS) FOR DIFFERENT VALUES OF K. • LOOP OVER K VALUES: THIS LOOP ITERATES OVER A RANGE OF VALUES FOR K FROM 1 TO 10. • INSTANTIATE KMEANS MODEL: "kmeans = KMeans(n_clusters=k, init="k-means++")". INSIDE THE LOOP, A KMEANS MODEL IS INSTANTIATED WITH n_clusters=k, WHERE k IS THE CURRENT VALUE OF K. • THE INITIALIZATION METHOD FOR THE CENTROIDS IS THEN SPECIFIED AS "k-means++". THIS INITIALIZATION METHOD HELPS IN CHOOSING INITIAL CLUSTER CENTROIDS IN A WAY THAT SPEEDS UP CONVERGENCE.
  • 16. FINDING K VALUE • FIT KMEANS MODEL: THE KMEANS MODEL IS FITTED TO THE DATA USING THE fit METHOD. THE DATA USED FOR CLUSTERING IS OBTAINED FROM THE DATAFRAME df BY EXCLUDING THE FIRST COLUMN. THIS ASSUMES THAT THE FIRST COLUMN CONTAINS LABELS OR IDENTIFIERS AND THE REMAINING COLUMNS ARE FEATURES USED FOR CLUSTERING. • COMPUTE WCSS: THE WITHIN-CLUSTER SUM OF SQUARES (WCSS) IS COMPUTED. WCSS REPRESENTS THE SUM OF SQUARED DISTANCES OF SAMPLES TO THEIR CLOSEST CLUSTER CENTER. • AFTER COMPUTING WCSS FOR ALL VALUES OF K, A LINE PLOT IS CREATED. • THE X-AXIS REPRESENTS THE VALUES OF K (FROM 1 TO 10), AND THE Y-AXIS REPRESENTS THE CORRESPONDING WCSS VALUES. THE PLOT VISUALIZES THE RELATIONSHIP BETWEEN K VALUES AND WCSS. • FINALLY, THE PLOT IS DISPLAYED.
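The steps described on the two slides above can be put together roughly as follows. This is a sketch under stated assumptions rather than the project's code: the DataFrame is synthetic, the column names are invented, and `n_init` and `random_state` are added only to make the run reproducible. Reading the WCSS from `inertia_` is standard scikit-learn usage, but the slides do not say which attribute was used.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the prepared data: an ID column followed by features.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
df = pd.DataFrame({"ID": np.arange(len(points)),
                   "F1": points[:, 0],
                   "F2": points[:, 1]})

wcss = []
for k in range(1, 11):  # k = 1 .. 10
    kmeans = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    kmeans.fit(df.iloc[:, 1:])    # exclude the first (identifier) column
    wcss.append(kmeans.inertia_)  # within-cluster sum of squares for this k

fig, ax = plt.subplots()
ax.plot(range(1, 11), wcss, marker="o")
ax.set_xlabel("Number of clusters (k)")
ax.set_ylabel("WCSS")
ax.set_title("Elbow method")
# In a notebook or interactive session, call plt.show() here to display the plot.
```

The usual reading of such a plot is to pick the k at the "elbow", the point after which adding clusters reduces WCSS only marginally.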
  • 18. CLUSTERING MODEL DEVELOPMENT 1. KMEANS CLUSTERING: • START BY INITIALIZING A KMEANS CLUSTERING MODEL WITH 2 CLUSTERS. • THEN THE fit_predict METHOD IS USED TO BOTH FIT THE MODEL TO THE DATA AND PREDICT THE CLUSTER LABELS FOR EACH DATA POINT. THE CLUSTER LABELS ARE ASSIGNED TO THE DATAFRAME df AS A NEW COLUMN NAMED "LABEL". 2. 3D SCATTER PLOT VISUALIZATION: • FOR THE 3D SCATTER PLOT VISUALIZATION, CREATE A NEW FIGURE WITH A SPECIFIED SIZE FOR THE PLOT. • THEN CREATE A 3D SUBPLOT WITHIN THE FIGURE. • SCATTER PLOTS ARE CREATED FOR EACH CLUSTER LABEL.
  • 19. CLUSTERING MODEL DEVELOPMENT • THEN CREATE A SCATTER PLOT FOR DATA POINTS BELONGING TO CLUSTER LABEL 0. THE X, Y, AND Z COORDINATES ARE SPECIFIED AS OCCUR_YEAR, LONGITUDE, AND LATITUDE, RESPECTIVELY. • DATA POINTS BELONGING TO THIS CLUSTER ARE PLOTTED IN BLUE. SIMILAR SCATTER PLOTS ARE CREATED FOR OTHER CLUSTER LABELS (E.G., CLUSTER 1) WITH DIFFERENT COLORS (E.G., RED). • THEN ADJUST THE VIEW ANGLE OF THE 3D PLOT. • THEN DISPLAY A LEGEND SHOWING THE CLUSTER LABELS. • THUS THE 3D SCATTER PLOTTING IS SUCCESSFULLY COMPLETED.
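The clustering and plotting steps above can be sketched as below. The data is randomly generated, and the column names INCIDENT_KEY, Longitude, and Latitude, the figure size, and the view angles are assumptions made for the example; `n_init` and `random_state` are added for reproducibility only.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the prepared data (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "INCIDENT_KEY": np.arange(100),
    "OCCUR_YEAR": rng.integers(2006, 2023, size=100),
    "Longitude": rng.uniform(-74.05, -73.75, size=100),
    "Latitude": rng.uniform(40.55, 40.90, size=100),
})

# 1. KMeans clustering with 2 clusters; labels are stored in a new "LABEL" column.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
df["LABEL"] = kmeans.fit_predict(df.iloc[:, 1:])  # skip the identifier column

# 2. 3D scatter plot, one scatter call per cluster label.
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection="3d")
for label, color in [(0, "blue"), (1, "red")]:
    cluster = df[df["LABEL"] == label]
    ax.scatter(cluster["OCCUR_YEAR"], cluster["Longitude"], cluster["Latitude"],
               c=color, label=f"Cluster {label}")
ax.set_xlabel("OCCUR_YEAR")
ax.set_ylabel("Longitude")
ax.set_zlabel("Latitude")
ax.view_init(elev=30, azim=185)  # adjust the view angle
ax.legend()
# In a notebook or interactive session, call plt.show() here to display the plot.
```

In this sketch the features are not scaled, so OCCUR_YEAR, which spans a much larger numeric range than the coordinates, dominates the distance computation. The slides do not state whether the original project scaled its features before clustering.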
  • 22. VISUALIZATION • DATA EXPLORATION AND PREPROCESSING INVOLVE UTILIZING VISUALIZATION TECHNIQUES SUCH AS HISTOGRAMS, SCATTER PLOTS, BOX PLOTS, AND HEATMAPS. THESE VISUALIZATIONS AID IN UNDERSTANDING THE DISTRIBUTION, RELATIONSHIPS, AND POTENTIAL OUTLIERS IN THE DATA. THEY ARE CRUCIAL FOR MAKING DECISIONS ABOUT PREPROCESSING STEPS SUCH AS FEATURE SCALING, OUTLIER REMOVAL, AND FEATURE ENGINEERING. • IN THIS PROJECT I HAVE VISUALIZED THE GIVEN DATA IN POWER BI. THE LINK TO MY POWER BI REPRESENTATION IS GIVEN BELOW. • LINK: https://app.powerbi.com/view?r=eyJrIjoiYzU2MGMzMGEtZGYwZS00MDY2LWI0YTItOTI4MGY2ZGNhNWI0IiwidCI6IjUzODhhOWI3LWUzOWQtNDZhMS1hZDQ5LTRiMjMwMjg5MzYzYiJ9
  • 24. CONCLUSION • THUS A CLUSTERING MODEL WAS DEVELOPED, AND THE FOLLOWING STEPS PLAYED A SIGNIFICANT ROLE IN DEVELOPING IT: • DATA PREPROCESSING • FEATURE ENGINEERING • CLUSTERING MODEL DEVELOPMENT • VISUALIZATION