Text mining of Beauty Blogs:
What do women talk about?
Артем Просветов
Data Scientist, CleverDATA
[Figure: breakdown of raw pages by category: empty, not English, techcrunch.com, photo/video pages, correct English page]
cleverdata.ru | info@cleverdata.ru
Raw blog data
Raw data: 98,496 pages stored as ~1,000,000 files.
Ready for analysis: 58,719 English pages (59.6%).
The remaining 40.4% were excluded: empty pages and pages with errors,
non-English pages (23,461), photo/video pages without text (2,315),
and articles from techcrunch.com (3,402).
From ~60k pages → ~2,000 authors.
Pages → Authors
Mean blog post size (in words)
One can distinguish 2 populations of bloggers:
• 'Twitter-style' authors with short posts (~20%)
• full-length bloggers averaging 200-500 words per post (~80%)
APIs and services used:
- Sentity (https://sentity.io/)
- Twinword (https://www.twinword.com/)
- Textualinsights (http://www.textualinsights.com/)
- VivekN (https://github.com/vivekn/sentiment-web)
Sentiment analysis
Sentiment analysis
• The resulting sentiment rate is based on 4 independent rating systems.
• The majority of the blogs have a positive emotion rate.
• The mean sentiment rate is 0.72 ("positive warm").
• All these results are intuitively consistent and in good agreement with manual tests.
We used several traffic ranking systems:
Estimation of blog efficiency
• Alexa Rank, which audits and publishes the frequency of visits to
various websites.
• Yandex Thematic Citation Index (TIC), which determines the
"credibility" of Internet resources based on a qualitative assessment
of links from other sites.
• Google PageRank, which counts the number and quality of links to a
blog to produce a rough estimate of how important the website is.
Content relevance rate is based on fuzzy string matching:
- Every company product name was string-matched against all blog pages.
- String matching is based on the Levenshtein metric.
- Pages with a 90% matching rate were marked.
- Tests with direct brand-name matching showed about 90-100% accuracy
for each product name, depending on the words in the title.
- The resulting relevance rate for each author is the sum of all marks on
his or her pages.
Relevance Rate
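The marking and summing steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses the standard-library `difflib.SequenceMatcher` as a stand-in for a Levenshtein-based score, and the function names, the sliding word window, and the "at least 90" threshold are all assumptions made for the example.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings on a 0-100 scale (stand-in for a Levenshtein-based score)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_window_similarity(name, text):
    """Best score between the product name and any window of the same word count in the text."""
    name_words = name.lower().split()
    words = text.lower().split()
    n = len(name_words)
    if n == 0 or len(words) < n:
        return similarity(name, text)
    target = " ".join(name_words)
    return max(
        similarity(target, " ".join(words[i:i + n]))
        for i in range(len(words) - n + 1)
    )


def relevance_by_author(pages, product_names, threshold=90.0):
    """Sum one mark per (page, product) pair whose score reaches the threshold.

    `pages` is a list of (author, text) tuples; the threshold semantics are assumed.
    """
    marks = defaultdict(int)
    for author, text in pages:
        for name in product_names:
            if best_window_similarity(name, text) >= threshold:
                marks[author] += 1
    return dict(marks)
```

With this sketch a misspelled mention such as "BlaBlaBla Bdy Oil" still counts as a mark for "BlaBlaBla Body Oil", while unrelated pages contribute nothing.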
Levenshtein distance is a string metric for measuring the difference between
two sequences.
Informally, the Levenshtein distance between two words is the minimum
number of single-character edits (i.e. insertions, deletions or substitutions)
required to change one word into the other.
Example: 'beer' and 'bread' are 3 edits apart; expressed as a normalized similarity score this is 44/100.
Levenshtein distance
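A short sketch of the definition above: the classic dynamic-programming edit distance, plus a normalized similarity score. The helper names are illustrative. Reading the slide's 44/100 as a similarity ratio of the kind `difflib.SequenceMatcher` returns is an assumption; the slides do not say which function produced it.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep on a match)
            ))
        prev = curr
    return prev[-1]


def similarity_percent(a: str, b: str) -> int:
    """Normalized similarity on a 0-100 scale (assumed scoring, see lead-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())
```

For 'beer' and 'bread', `levenshtein` gives 3 and `similarity_percent` gives 44.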
The most active authors
write within a narrow
sentiment-rate range:
0.74 ± 0.03
[Figure: Sentiments vs Blog size. Axes: sentiment rate, blog size (pages)]
The most discussed blogs
belong to authors with
mid-size blogs.
[Figure: Discussion vs Blog size. Axes: log(blog size), mean discussion]
Again, 2 kinds of bloggers:
- 'twitter style' authors
with short posts
- full-length bloggers
[Figure: Words vs Pages. Axes: log(mean words per page), log(blog size)]
If you want to spark a big
discussion, you should
praise something.
All highly discussed
authors have a positive
sentiment rate (>= 0.4).
[Figure: Discussion vs Sentiments. Axes: sentiment rate, mean discussion]
We use the Klout service to rank authors
according to online social influence.
Klout measures the size of a user's
social media network and correlates the
content created to measure how other
users interact with that content.
- The median Klout score is 40.1
Using the Klout score for bloggers
One can distinguish a population
of beginner bloggers with a low
Klout score who tend to
amplify sentiment.
[Figure: Sentiments vs Klout score. Axes: sentiment rate, Klout score]
The final author rating is based on:
• Number of blog pages
• Mean discussion size
• Alexa Rank + Yandex TIC + Google PageRank
• Relevance rate
• Sentiment rate
• Klout score
• Sentiment analysis: 4 independent sentiment rating systems are combined; the resulting sentiment rate is fully consistent with tests
• Blog efficiency rating: Alexa Rank, Yandex Thematic Citation Index, Google PageRank
• Blog relevance rating: based on fuzzy string matching; blogs are rated according to mentions of company products in the text
• List of the most PR-effective authors
• Pragmatic statistical information
• Key recommendations for bloggers
Make your data clever
Top-rated authors
Name | URL | Sentiment | Pages | Mean comments
Hayley Carr | http://www.londonbeautyqueen.com | 0.71 | 229 | 10.9
Luzanne | http://pinkpeonies.co.za | 0.77 | 66 | 68.3
Allison | http://www.neversaydiebeauty.com | 0.70 | 182 | 42.9
Mica Kelly, Beth, Jessica Diner | http://blog.birchbox.co.uk | 0.74 | 196 | 0.26
Poonam | http://beautyandmakeupmatters.com | 0.78 | 142 | 4.3
Silvie | http://mysillylittlegang.com | 0.74 | 571 | 0.64
Testing the result
Hayley Carr (Top Rated Author):
“BlaBlaBla is definitely a brand to be reckoned with... All of the
BlaBlaBla products have multiple purposes, as well as smelling
and feeling fabulous; the packaging is clean and fresh whilst
still looking great in your bathroom, as well as having unique
application methods that only aid the product performance...
It's definitely worth checking out this growing brand, before it
starts taking over the world.”
Authors ←→ Products
In order to associate a blogger
with a product we must:
• Find products for promotion
• Find main topics of each blogger
• Match topics of each blogger with product names
• Find best combinations of blogger and product
Finding the most promising
products for promotion
Let's build a document-term
matrix, where each row is a
document, each column is a
term, and a colored cell
indicates that the term appears in
the document at least once.
We can use the TF-IDF method
to fill the document-term matrix.
Finding topics:
the document-term matrix
Finding topics: TF-IDF
• Term frequency TF(t,d) is the number of times that term t
occurs in document d.
• The inverse document frequency (IDF) is a measure of how
much information the word provides, that is, whether the
term is common or rare across all documents.
• Term frequency-inverse document frequency (TF-IDF) is a
numerical statistic that is intended to reflect how important
a word is to a document in a collection or corpus.
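The definitions above can be turned into a small document-term matrix builder. This is a minimal sketch under stated assumptions: whitespace tokenization, raw counts for TF, and the common textbook form IDF(t) = log(N / df(t)); the slides do not state which IDF variant or library was used, and the function name is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, matrix) where matrix[d][t] = TF(t, d) * IDF(t).

    Rows are documents and columns are terms, in sorted vocabulary order.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in tokenized for term in set(doc))
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    matrix = []
    for doc in tokenized:
        tf = Counter(doc)  # raw number of occurrences of each term
        matrix.append([tf[term] * idf[term] for term in vocab])
    return vocab, matrix
```

A term that occurs in every document gets IDF 0 under this variant, so it drops out of the matrix, which matches the idea that common words carry little information.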
• NMF (non-negative matrix factorization) is a
variant of matrix factorization where we start
with a document-term matrix D, factorize it
as D ≈ W·T, and constrain the elements of
W and T to be non-negative.
• This lets us interpret each row of the
T matrix as a topic.
Topic extraction: NMF
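To make the factorization concrete, here is a small pure-Python sketch using the well-known multiplicative update rules for the squared-error objective. It is an illustration only: the slides do not say which NMF solver was used, and the function names, iteration count, and random initialization are assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(d, w, t):
    """Frobenius norm of D - W*T."""
    approx = matmul(w, t)
    return sum(
        (x - y) ** 2 for rd, ra in zip(d, approx) for x, y in zip(rd, ra)
    ) ** 0.5


def nmf(d, k, iterations=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix D (docs x terms) into W (docs x k) and T (k x terms)."""
    rng = random.Random(seed)
    n, m = len(d), len(d[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    t = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # T <- T * (W^T D) / (W^T W T); entries stay non-negative.
        wt = transpose(w)
        num = matmul(wt, d)
        den = matmul(matmul(wt, w), t)
        t = [[t[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (D T^T) / (W T T^T)
        tt = transpose(t)
        num = matmul(d, tt)
        den = matmul(w, matmul(t, tt))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, t
```

Each row of the returned `t` can then be read as a topic: the columns with the largest weights in that row are the topic's most characteristic terms.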
• For each author we build a document-term matrix.
• For each document-term matrix we perform matrix
factorization and find the main topics.
• For each product we match the product name with
the author's main topics and find the rate of intensity.
• If an author has the exact product name in one of
his or her titles, we set the rate of intensity to 0 (the
author has already reviewed the product).
Topic extraction
Thus for each author-product pair we find a rate of intensity, and we can
visualize it as a heatmap where products are sorted by mean rate of
intensity and authors are sorted by author rating.
Note: the top-rated authors show high intensity in the matrix.
The intensity matrix
Next we extract the strongest peaks from the product-author intensity matrix.
After each peak extraction the column containing the peak is dropped, so each
author gets only one product.
We need to build recommendations for only 4 products, and we can select the 40
best-rated authors for this task.
The intensity matrix
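The peak extraction described above can be sketched as a greedy loop. This is an illustration under assumptions: rows are products and columns are authors, the function name is hypothetical, and the `one_author_per_product` flag (which also drops the product row after a peak) is an assumption added for the example; the slides only state that the author column is dropped. Pairs with intensity 0 are skipped, on the assumption that 0 marks an already reviewed product.

```python
def extract_peaks(intensity, products, authors, one_author_per_product=True):
    """Greedily pick the highest remaining cell of a product x author matrix.

    Returns a list of (product, author, intensity) tuples, strongest first.
    """
    remaining_rows = set(range(len(products)))
    remaining_cols = set(range(len(authors)))
    pairs = []
    while remaining_rows and remaining_cols:
        value, r, c = max(
            (intensity[r][c], r, c)
            for r in remaining_rows
            for c in remaining_cols
        )
        if value <= 0:
            break  # nothing left worth recommending
        pairs.append((products[r], authors[c], value))
        remaining_cols.discard(c)  # each author gets only one product
        if one_author_per_product:
            remaining_rows.discard(r)  # assumed: one author per product
    return pairs
```

With 4 products and 40 candidate authors, the default setting yields at most 4 product-author pairs, one per product.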
In order to associate a blogger
with a product we must:
• Find products for promotion
• Find main topics of each blogger
• Match topics of each blogger with product names
• Find best combinations of blogger and product
• Profit!
The resulting associations
Product | Author | URL
BlaBlaBla Body Oil | Allison | http://www.neversaydiebeauty.com
BlaBlaBla Wrinkle Repair | Cindy Batchelor | http://mystylespot.net
BlaBlaBla Face Serum | Marie Papachatzis | http://iamthemakeupjunkie.blogspot.ru
BlaBlaBla Face Oil | Emily - Style Lobster | http://stylelobster.com
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 

Dernier (20)

Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
MK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docxMK KOMUNIKASI DATA (TI)komdat komdat.docx
MK KOMUNIKASI DATA (TI)komdat komdat.docx
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptxPKS-TGC-1084-630 - Stage 1 Proposal.pptx
PKS-TGC-1084-630 - Stage 1 Proposal.pptx
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 

Text mining of Beauty Blogs: What do women talk about? (Artem Prosvetov, Data Scientist, CleverDATA)

  • 1. Text mining of Beauty Blogs: What do women talk about? Artem Prosvetov, Data Scientist, CleverDATA
  • 2. Raw blog data. Raw data: 98,496 pages stored as ~1,000,000 files. Ready for analysis: 58,719 English pages (59.6%). The other 40.4%: empty pages and pages with errors, non-English pages (23,461), photo/video pages without text (2,315), articles from techcrunch.com (3,402). [Pie chart legend: empty; not English; techcrunch.com; photo/video pages; correct English page.] cleverdata.ru | info@cleverdata.ru
  • 3. Pages → Authors. The ~60k pages were grouped into ~2,000 authors.
  • 4. Mean blog post size (in words). One can distinguish 2 populations of bloggers: 'twitter style' authors with short posts (~20%), and full-length bloggers with a mean of 200-500 words per post (~80%).
  • 5. Sentiment analysis. APIs and services used: Sentity (https://sentity.io/), Twinword (https://www.twinword.com/), Textualinsights (http://www.textualinsights.com/), VivekN (https://github.com/vivekn/sentiment-web).
  • 6. Sentiment analysis. The resulting sentiment rate is based on 4 independent rating systems. The majority of the blogs have a positive emotion rate. The mean sentiment rate is 0.72 («positive warm»). All these results are intuitively consistent and agree well with manual tests.
  • 7. Estimation of blog efficiency. We used several traffic rank systems: Alexa Rank, which audits and publishes the frequency of visits to web sites; Yandex Thematic Citation Index (TIC), which determines the “credibility” of Internet resources based on a qualitative assessment of the links pointing to them from other sites; Google PageRank, which counts the number and quality of links to a blog to produce a rough estimate of how important the website is.
  • 8. Relevance Rate. The content relevance rate is based on fuzzy string matching. Every company product name was string-matched against all blog pages. String matching is based on the Levenshtein metric. Pages with a matching rate of 90% or more were marked. Tests with direct brand-name matching showed about 90-100% accuracy for each product name, depending on the words in the title. The resulting relevance rate for each author is the sum of the marks over all of his or her pages.
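The marking procedure on this slide can be sketched roughly as follows. This is only an illustration under stated assumptions: the slide does not say how a product name is compared with a whole page, so the sliding word window is a guess, and `difflib.SequenceMatcher` from the standard library stands in for the Levenshtein-based score. The names `best_match`, `relevance_by_author`, the sample pages and the product name are all hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def best_match(product: str, text: str) -> float:
    """Best similarity (0..1) between the product name and any window of
    consecutive words in the text that has as many words as the name."""
    target = product.lower()
    words = text.lower().split()
    size = len(target.split())
    windows = [" ".join(words[i:i + size])
               for i in range(max(len(words) - size + 1, 1))]
    return max(SequenceMatcher(None, target, w).ratio() for w in windows)


def relevance_by_author(pages, products, threshold=0.9):
    """Give a page one mark per product matched at or above the threshold,
    then sum the marks per author."""
    marks = Counter()
    for page in pages:
        for product in products:
            if best_match(product, page["text"]) >= threshold:
                marks[page["author"]] += 1
    return marks


# Hypothetical sample data, not taken from the talk.
pages = [
    {"author": "anna", "text": "I tried the Glow Face Serum last week"},
    {"author": "anna", "text": "My review of the glow face serun (typo intended)"},
    {"author": "bea", "text": "Notes on a weekend trip to the coast"},
]
products = ["Glow Face Serum"]
marks = relevance_by_author(pages, products)
```

Here both of anna's pages are marked (the second one despite the misspelling), while bea's page is not, so `marks["anna"]` is 2 and `marks["bea"]` is 0.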
  • 9. Levenshtein distance. Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other. Example: the Levenshtein-based similarity score between 'beer' and 'bread' is 44/100 (the raw edit distance is 3).
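A minimal sketch of the metric described on this slide, using the standard dynamic-programming recurrence. The 0-100 score is an assumption: the slide does not say how 44/100 was obtained, but one common normalization (a substitution counted as a deletion plus an insertion, scaled by the combined length of the two strings) reproduces that figure. The function names are illustrative.

```python
def edit_distance(a: str, b: str, substitution_cost: int = 1) -> int:
    """Minimum total cost of insertions, deletions and substitutions
    needed to turn a into b (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,       # delete ca
                current[j - 1] + 1,    # insert cb
                previous[j - 1] + (0 if ca == cb else substitution_cost),
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> int:
    """Similarity on a 0-100 scale. Assumed normalization: a substitution
    costs 2, and the distance is scaled by len(a) + len(b)."""
    total = len(a) + len(b)
    if total == 0:
        return 100
    return round(100 * (total - edit_distance(a, b, substitution_cost=2)) / total)


distance = edit_distance("beer", "bread")  # 3
score = similarity("beer", "bread")        # 44
```

With unit costs, 'beer' becomes 'bread' in three edits, so the plain distance is 3; the normalized score is 100 * (9 - 5) / 9, which rounds to 44.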
  • 10. Sentiments vs Blog size. The most active authors write within a narrow sentiment range: 0.74 ± 0.03. [Plot axes: sentiment rate; blog size (pages).]
  • 11. Discussion vs Blog size. The most discussed blogs belong to mid-size authors. [Plot axes: log(blog size); mean discussion.]
  • 12. Words vs Pages. Again, 2 kinds of bloggers: 'twitter style' authors with short posts, and full-length bloggers. [Plot axes: log(mean words per page); log(blog size).]
  • 13. Discussion vs Sentiments. If you want to start a big discussion, you should praise something. All highly discussed authors have a positive sentiment rate (>= 0.4). [Plot axes: sentiment rate; mean discussion.]
  • 14. Using the Klout score for bloggers. We use the Klout service to rank authors according to online social influence. Klout measures the size of a user's social media network and correlates the content created to measure how other users interact with that content. The median Klout score is 40.1.
  • 15. Sentiments vs Klout score. One can distinguish a population of beginner bloggers with a low Klout score who tend to amplify sentiments. [Plot axes: sentiment rate; Klout score.]
  • 16. The final author rating is based on: number of blog pages; mean discussion size; Alexa Rank + Yandex TIC + Google PageRank; relevance rate; sentiment rate; Klout score.
  • 17. Make your data clever. Sentiment analysis: 4 independent sentiment rating systems are combined; the resulting sentiment rate is fully consistent with tests. Blog efficiency rating: Alexa Rank, Yandex Thematic Citation Index, Google PageRank. Blog relevance rating: based on fuzzy string matching; blogs are rated according to mentions of company products in the text. Results: a list of the most PR-effective authors, pragmatic statistical information, key recommendations for bloggers.
  • 18. TOP Rated Authors
      Name | Url | Sentiment | Pages | Mean Comments
      Hayley Carr | http://www.londonbeautyqueen.com | 0.71 | 229 | 10.9
      Luzanne | http://pinkpeonies.co.za | 0.77 | 66 | 68.3
      Allison | http://www.neversaydiebeauty.com | 0.70 | 182 | 42.9
      Mica Kelly, Beth, Jessica Diner | http://blog.birchbox.co.uk | 0.74 | 196 | 0.26
      Poonam | http://beautyandmakeupmatters.com | 0.78 | 142 | 4.3
      Silvie | http://mysillylittlegang.com | 0.74 | 571 | 0.64
  • 19. Testing the result. Hayley Carr (Top Rated Author): “BlaBlaBla is definitely a brand to be reckoned with... All of the BlaBlaBla products have multiple purposes, as well as smelling and feeling fabulous; the packaging is clean and fresh whilst still looking great in your bathroom, as well as having unique application methods that only aid the product performance... It's definitely worth checking out this growing brand, before it starts taking over the world.”
  • 21. In order to associate a blogger with a product we must: find products for promotion; find the main topics of each blogger; match the topics of each blogger with product names; find the best combinations of blogger and product.
  • 22. Finding the most promising products for promotion
  • 24. Finding topics: the document-term matrix. Let's build a document-term matrix, where each row is a document, each column is a term, and the color intensity indicates that a term appears in a document at least once. We can use the TF-IDF method to get the document-term matrix.
  • 25. Finding topics: TF-IDF. Term frequency TF(t,d) is the number of times that term t occurs in document d. The inverse document frequency (IDF) measures how much information the word provides, that is, whether the term is common or rare across all documents. Term frequency-inverse document frequency (TF-IDF) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
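A small sketch of how such a TF-IDF document-term matrix could be computed. The slide gives no formula for IDF, so the classic form log(N / df) is assumed here, with raw counts for TF and whitespace tokenization; real libraries often add smoothing and normalization. The function name and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, matrix) with one row per document and one
    column per term; each cell is TF(t, d) * IDF(t)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed IDF form: log(N / df), so a term found everywhere scores 0.
    idf = {term: math.log(n_docs / df[term]) for term in vocabulary}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] * idf[term] for term in vocabulary])
    return vocabulary, matrix


# Hypothetical sample documents, not taken from the talk.
docs = [
    "red lipstick review",
    "matte lipstick swatches",
    "face serum review",
]
vocabulary, matrix = tf_idf(docs)
```

In the first document, "red" occurs in only one of the three documents and therefore gets a higher weight than "lipstick", which occurs in two.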
  • 26. Topic extraction: NMF. NMF (non-negative matrix factorization) is a variant of matrix factorization where we start with the document-term matrix D, factorize it into two matrices W and T, and constrain the elements of W and T to be non-negative. This lets us interpret each row of the T matrix as a topic.
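To make the factorization concrete, here is a dependency-free sketch that approximates D by the product W T using the well-known multiplicative update rules for the squared-error objective. The talk does not say which NMF algorithm or library was used, so this choice is an assumption, as are the iteration count, the random initialization, and the toy matrix.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(d, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix d (docs x terms) into
    w (docs x topics) and t (topics x terms), both non-negative."""
    rng = random.Random(seed)
    n_docs, n_terms = len(d), len(d[0])
    w = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    t = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # T <- T * (W^T D) / (W^T W T), element-wise
        wt = transpose(w)
        num, den = matmul(wt, d), matmul(matmul(wt, w), t)
        t = [[t[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(n_terms)] for i in range(n_topics)]
        # W <- W * (D T^T) / (W T T^T), element-wise
        tt = transpose(t)
        num, den = matmul(d, tt), matmul(w, matmul(t, tt))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps)
              for j in range(n_topics)] for i in range(n_docs)]
    return w, t


# Toy document-term matrix (hypothetical): two documents about lipstick,
# two about skin care. Columns: lipstick, matte, serum, skin.
doc_term = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
]
w, t = nmf(doc_term, n_topics=2)
```

Each row of `t` holds one weight per term, so the highest-weighted terms in a row can be read as that topic's keywords, and each row of `w` says how strongly a document expresses each topic.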
  • 28. Topic extraction. For each author we build a document-term matrix. For each document-term matrix we perform matrix factorization and find the main topics. For each product we match the product name with the main topics of the author and find the rate of intensity. If an author has the exact product name in one of his or her titles, we set the rate of intensity to 0 (the author has already reviewed the product).
  • 29. The intensity matrix. Thus for each author-product pair we find the rate of intensity, and we can visualize it as a heatmap where products are sorted by mean rate of intensity and authors are sorted by author rating. Note: the top-rated authors show high intensity in the matrix.
  • 31. The intensity matrix. Next we extract the strongest peaks from the product-author intensity matrix. After each peak extraction the column containing the peak is dropped, so each author gets only one product. We need to build recommendations for only 4 products, and we can select the 40 best-rated authors for this task.
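The peak extraction described on this slide can be sketched as a greedy loop. The slide only states that the author's column is dropped after each peak; also dropping the product's row, so that each product ends up with a single author, is an assumption made here and exposed as the `drop_product` flag. The function name, the product and author labels, and the intensity values are hypothetical.

```python
def extract_peaks(intensity, products, authors, drop_product=True):
    """Greedily pick the strongest (product, author) pairs.

    intensity[i][j] is the rate of intensity of products[i] for authors[j].
    """
    free_products = set(range(len(products)))
    free_authors = set(range(len(authors)))
    pairs = []
    while free_products and free_authors:
        # The current peak: the largest remaining cell of the matrix.
        i, j = max(
            ((i, j) for i in sorted(free_products) for j in sorted(free_authors)),
            key=lambda ij: intensity[ij[0]][ij[1]],
        )
        pairs.append((products[i], authors[j], intensity[i][j]))
        free_authors.remove(j)       # from the slide: drop the author's column
        if drop_product:
            free_products.remove(i)  # assumption: one author per product
    return pairs


# Hypothetical intensity matrix: rows are products, columns are authors.
products = ["Body Oil", "Face Serum"]
authors = ["author_a", "author_b", "author_c"]
intensity = [
    [0.2, 0.9, 0.4],
    [0.7, 0.8, 0.1],
]
pairs = extract_peaks(intensity, products, authors)
```

With these numbers the first peak is (Body Oil, author_b, 0.9); once that column and row are removed, the remaining peak is (Face Serum, author_a, 0.7). A greedy pass like this is simple but not guaranteed to maximize the total intensity over all pairs.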
  • 32. In order to associate a blogger with a product we must: find products for promotion; find the main topics of each blogger; match the topics of each blogger with product names; find the best combinations of blogger and product; profit!
  • 33. The resulting associations: BlaBlaBla Body Oil: Allison (http://www.neversaydiebeauty.com); BlaBlaBla Wrinkle Repair: Cindy Batchelor (http://mystylespot.net); BlaBlaBla Face Serum: Marie Papachatzis (http://iamthemakeupjunkie.blogspot.ru); BlaBlaBla Face Oil: Emily - Style Lobster (http://stylelobster.com).