Email Sherlock:
Using Machine Learning to Extract Information from Large Email Datasets
Jay Gondin
Investigations and Emails
● Bear Stearns v. Lehman Brothers
● Enron
● Hillary Clinton
Data: Hillary’s Emails
● 30,320 emails in dataset
● 60,000 meaningful words
● Unique acronyms
○ Ex. Hillary Clinton = Rodham, HRC, Madam Secretary
○ Ex. Obama = President, Administration, Barack
○ Ex. White House = WH
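One way to handle these aliases before vectorizing is to map each one to a single canonical name. The slides do not say whether this was done, so the sketch below is only an illustration: the `ALIASES` table reuses the examples above, and `normalize_aliases` is a hypothetical helper. The matching is naive and case-sensitive (for instance, "Barack Obama" would become "Obama Obama").

```python
import re

# Alias table built from the examples on this slide.
ALIASES = {
    "Hillary Clinton": ["Rodham", "HRC", "Madam Secretary"],
    "Obama": ["President", "Administration", "Barack"],
    "White House": ["WH"],
}


def normalize_aliases(text, aliases=ALIASES):
    """Replace every known alias with its canonical name (hypothetical helper)."""
    for canonical, names in aliases.items():
        # Try longer aliases first so multi-word names win over shorter ones.
        ordered = sorted(names, key=len, reverse=True)
        pattern = r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b"
        text = re.sub(pattern, canonical, text)
    return text


normalized = normalize_aliases("HRC met WH staff")
```

Here `normalized` holds the text with "HRC" and "WH" expanded, so both spellings contribute to the same terms during vectorization.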
Email Pros and Cons
Pros
● Emails may contain information crucial to solving an investigation.
● Unique acronyms may help vectorize emails.
● Emails within a particular dataset have relatively few authors.
Cons
● Duplicated text is common.
● Most emails contain no information that is important or relevant to an investigation.
● Unique acronyms may make searches more difficult.
● Clusters of emails tend to overlap.
Unsupervised Model
TF-IDF - vectorizer
LSA - dimensionality reduction
DBSCAN (density-based spatial clustering of applications with noise) - clustering
Pipeline: Raw Data → SQLite → Machine Learning → Analyzed Clusters
Key Info:
- Orphan emails (those assigned to no cluster) tend to be less important and/or were anonymized.
- Dense clusters may contain more information.
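The TF-IDF, LSA, and DBSCAN steps can be sketched with scikit-learn as follows. This is a minimal sketch, not the author's implementation: the toy emails, the function name `cluster_emails`, and the parameter values (`n_components`, `eps`, `min_samples`) are assumptions chosen for a tiny example, and the SQLite storage step is omitted.

```python
from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer


def cluster_emails(texts, n_components=2, eps=0.3, min_samples=2):
    """Return one DBSCAN label per email; -1 marks an orphan (noise point)."""
    lsa = make_pipeline(
        TfidfVectorizer(stop_words="english"),  # text -> sparse TF-IDF matrix
        TruncatedSVD(n_components=n_components, random_state=0),  # LSA step
        Normalizer(copy=False),  # unit-length rows for cosine distance
    )
    reduced = lsa.fit_transform(texts)
    dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine")
    return dbscan.fit_predict(reduced)


# Toy stand-ins for real emails; the first two are exact duplicates.
emails = [
    "Benghazi consulate attack security briefing Tripoli",
    "Benghazi consulate attack security briefing Tripoli",
    "Benghazi consulate security update from Tripoli staff",
    "Lunch schedule for Tuesday meeting",
    "Tuesday lunch meeting schedule moved",
    "Happy birthday wishes to your family",
]
labels = cluster_emails(emails)
```

Emails that share a label belong to the same cluster, and a label of -1 corresponds to an orphan. Duplicated text lands at the same point in the reduced space, so duplicates always end up in the same cluster.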
Semi-Unsupervised Model & Query Expansion
Search Term: Benghazi
→ Neural Network (word2vec)
→ Expanded Search Terms: Tripoli, Stevens, Libyans, Consulate
→ Results (cluster)
Flask WebApp & SQLite
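The query-expansion step amounts to a nearest-neighbour lookup in embedding space. The sketch below illustrates the idea under stated assumptions: `TOY_VECTORS` holds made-up 3-dimensional vectors standing in for trained word2vec embeddings, and `expand_query` is a hypothetical helper. With a trained gensim model, `model.wv.most_similar` plays the same role.

```python
import math

# Made-up vectors standing in for trained word2vec embeddings (assumption).
TOY_VECTORS = {
    "benghazi": (0.9, 0.1, 0.0),
    "tripoli": (0.8, 0.2, 0.1),
    "consulate": (0.7, 0.3, 0.0),
    "lunch": (0.0, 0.1, 0.9),
    "schedule": (0.1, 0.0, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def expand_query(term, vectors, top_n=2):
    """Return the top_n words closest to the search term (hypothetical helper)."""
    key = term.lower()
    query = vectors[key]
    scored = [
        (cosine(query, vec), word)
        for word, vec in vectors.items()
        if word != key
    ]
    return [word for _, word in sorted(scored, reverse=True)[:top_n]]


expanded = expand_query("Benghazi", TOY_VECTORS)
```

The expanded terms can then be searched alongside the original term, so emails that never mention "Benghazi" but discuss related names and places are still retrieved.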
Finding Connections:
Benghazi Libyans
● Clusters are based on meaning.
Sentiment Analysis
● High polarity may indicate sensitive information.
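A minimal sketch of polarity-based flagging follows. The slides do not name the sentiment tool used, so this example swaps in a tiny hand-made lexicon: `LEXICON`, `polarity`, `flag_high_polarity`, and the 0.7 threshold are all assumptions for illustration. A real system would use a trained sentiment model or a library that returns a polarity score.

```python
import re

# Tiny hand-made lexicon (assumption); scores range from -1.0 to 1.0.
LEXICON = {
    "great": 1.0,
    "good": 0.5,
    "bad": -0.5,
    "terrible": -1.0,
    "attack": -1.0,
}


def polarity(text):
    """Average lexicon score of the known words in text; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def flag_high_polarity(emails, threshold=0.7):
    """Keep emails whose absolute polarity meets the threshold."""
    return [e for e in emails if abs(polarity(e)) >= threshold]


sample = [
    "Terrible attack on the consulate",
    "The meeting went good",
    "Lunch at noon",
]
flagged = flag_high_polarity(sample)
```

Using the absolute value means strongly negative and strongly positive emails are both surfaced for review.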
Future developments
● Generalize to other datasets
● Adapt the algorithm to prevent fraud
● Develop graphical visualization
● Record user activity to improve the software
Jay Gondin
Master's in Mathematics
Experienced Economic Analyst
gondin@gmail.com
github.com/jgondin
linkedin.com/in/gondin
