Lecture given at the University of Catania on December 2nd, 2014.
The lecture starts from Big Data definitions, continues with real-life examples of successful Big Data projects, goes a little deeper into Sentiment Analysis, and concludes with a brief overview of Big Data tools and Big Data with Microsoft.
Summary:
1. What is Big Data? (includes the 5Vs of Big Data)
2. Big Data Examples (includes 6 Real Life Examples and comments on Privacy concerns)
3. How to Tackle a Big Data Problem (my 4 Universal Steps to follow)
4. Sentiment Analysis (what is sentiment analysis? Why do we care? A Technique and a plan)
5. Big Data tools (Hadoop, Hadoop Ecosystem, Hive, Pig, Sqoop, Oozie; Azure HDInsight, Excel Power Query, Power Pivot, Power View, Power Map)
2. Agenda
Part I
✤ What is Big Data?
✤ Big Data Examples
✤ How to Tackle a Big Data Problem
Part II
✤ Sentiment Analysis
✤ Big Data tools
3. How relevant is it?
Big Data
Social Media
Digital Marketing
Machine Learning
Computer Vision
Who’s more relevant to the people?
Let’s ask Google!
4. How relevant is it?
[Figure: Google Trends search interest from 2007 to end 2014 for Big Data, Social Media, Digital Marketing, Machine Learning and Computer Vision]
5. Big Data Market
What is Big Data? How relevant is it?
In 2012 it was $28B; for 2013, $37B was expected
Scattered across a number of IT landscapes. 45% for new
social network analysis and content analytics tools[1]
Jobs to support Big Data
4.4 million IT jobs globally by 2015, 1.9M in the US[1]
By 2018, the US alone could face a shortage of 200k people
with deep analytical skills as well as 1.5M managers and
analysts[2]
6. Definition
Big Data according to Oxford Dictionary[3]:
big data n. Computing (also with capital initials) data of a very
large size, typically to the extent that its manipulation and
management present significant logistical challenges; (also) the
branch of computing involving such data.
Big Data according to Gartner[4]:
Big data is high-volume, high-velocity and high-variety
information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision making.
This is where the 3 Vs come from:
Volume Velocity Variety
7. VOLUME
About: Amount of data. Unit: bytes
What is Big Data? Definition
Information about the general population, education, health,
medicine, travel, geographic locations, shopping, financial
transactions, jobs, scientific experiments, emails, sensors,
texts, photos, videos, activity on social networks …
2.5 Exabytes of data are created each day worldwide[5]
Facebook (2012): 200 PB of data each year
In 3 years CERN collected 75 PB of data (with LHC)
Most US companies have 100 TB[5]
1 ZB = 1000^2 PB = 1000^3 TB = 1000^4 GB
How much is Big Data? > 5 TB (as of 2014)
8. VELOCITY
About: moving data. Unit: bytes per second
What is Big Data? Definition
This really has two interpretations:
Data Generation Rate or Data Processing Rate
Every minute (2014)[6]:
200M emails
4M Google searches
277k more tweets
216k pictures on Instagram
What’s the limit to be considered big data?
As of 2014
Generation: time to reach 5TB < Project Life Time
Processing: > 1 MB/s = 5TB/2mo
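The processing threshold above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 TB = 10^6 MB) and taking "2mo" as 60 days; the function names are illustrative and not part of the lecture.

```python
# Check the rule of thumb: 1 MB/s sustained for two months is about 5 TB.
# Assumptions: decimal units (1 TB = 10**6 MB) and "two months" = 60 days.
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 10**6


def terabytes_at_rate(mb_per_second: float, days: float) -> float:
    """Volume in TB accumulated at a constant rate over the given number of days."""
    return mb_per_second * SECONDS_PER_DAY * days / MB_PER_TB


def days_to_reach(volume_tb: float, mb_per_second: float) -> float:
    """Days needed to accumulate volume_tb at a constant rate."""
    return volume_tb * MB_PER_TB / (mb_per_second * SECONDS_PER_DAY)


two_months_tb = terabytes_at_rate(1.0, 60)
print(f"1 MB/s for 60 days = {two_months_tb:.2f} TB")
print(f"5 TB at 1 MB/s takes {days_to_reach(5, 1.0):.1f} days")
```

The same helper answers the generation-side question: if the time to reach 5 TB is shorter than the project lifetime, the project qualifies under the 2014 rule of thumb.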
9. VARIETY
About: Form of the data.
3 Types: structured, semi-structured, unstructured
What is Big Data? Definition
1. Structured = Data in a fixed field within a record
(spreadsheets, Relational Database)
2. Semi-Structured = XML, JSON, CSV (Text with columns,
with a separator)
3. Unstructured = Data stored without any model, or that
does not have any organisation
All of them can be Big Data
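The three types can be illustrated on the same piece of information. This is only a sketch: the table, the JSON document and the field names are hypothetical, chosen to show how much structure each form gives you before any analysis starts.

```python
import json
import sqlite3

# Structured: fixed fields within a record (here an in-memory relational table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase (customer TEXT, product TEXT, quantity INTEGER)")
conn.execute("INSERT INTO purchase VALUES (?, ?, ?)", ("Mike87", "iPhone", 1))
structured_row = conn.execute(
    "SELECT customer, product, quantity FROM purchase"
).fetchone()
conn.close()

# Semi-structured: self-describing text; fields may vary from record to record.
semi_structured = json.loads(
    '{"customer": "Mike87", "product": "iPhone", "tags": ["gift"]}'
)

# Unstructured: free text with no data model; every field has to be extracted.
unstructured = "I bought an iPhone a few days ago. It is such a nice phone."
mentions_product = "iphone" in unstructured.lower()

print(structured_row, semi_structured["tags"], mentions_product)
```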
10. What is Big Data? Definition
This concludes the classical 3 Vs of Big Data.
To better describe Big Data we can add a couple more Vs.
VERACITY
Lack of accuracy
Data itself is often imprecise or incomplete (typos, empty
fields, errors, source changes, …)
The time of small and tidy samples is over
11. VALUE
About the actionable insights one can get
What is Big Data? Definition
People do not need data, they need insights which are hidden
in the data: Value is a concentrated data-juice.
Obtaining correct, but irrelevant, information is a waste of
time, effort and resources.
Close interactions between an analytics team and business
managers can help you address the right questions.
12. “Datafication” is the movement behind Big Data[7]
What is Big Data? Implications
Big Data implicitly requires 3 paradigm shifts:
1. from “some” to “all”
2. from “clean” to “messy”
3. from “causation” to “correlation”
16. General Application Fields
Not only business: Big Data has implications far beyond
marketing and consumer goods
It will profoundly change how governments work and alter
the nature of politics and our daily life too (smart cities).
When it comes to generating economic growth, providing
public services, or fighting wars, those who can harness big
data effectively will have a significant edge over others.
17. Forbes thinks that it will influence us in 5 ways[8]:
1. how we spend
2. how we vote
3. how we study
4. how we stay healthy
5. how we keep/lose privacy
Big Data Examples - General Application Fields
18. 1. Fire-prevention @ New York City[7]
Big Data Examples - Real Life Applications
19. Problem
Imbalance between needs and resources
Too many complaints (25,000 per year), too few inspectors
(200).
You want your inspectors to tackle the most relevant cases
only/first.
How to prioritise the complaints?
20. 1. Fire-prevention @ New York City[7]
Solution
a. Database with information about buildings (crime rates,
ambulance visits, utility usage, missed payments, …)
b. Compare database to records of building fires, looking for
correlations
c. Estimate the probability of fire for each complaint
Big Data Examples - Real Life Applications
Result
The efficiency of the inspectors rose from 13% to 70%
Among the predictors of a fire were:
the type of building and the year it was built
permits for exterior brickwork correlated with lower risks
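Steps b and c above can be sketched as follows. This is a deliberately simple illustration and not the model used in New York: the historical records are hypothetical, only two predictors are used, and the probability of fire is estimated as a plain frequency per group.

```python
from collections import defaultdict

# Hypothetical historical records:
# (building_type, had_exterior_brickwork_permit, had_fire)
HISTORY = [
    ("tenement", False, True),
    ("tenement", False, True),
    ("tenement", True, False),
    ("tenement", False, False),
    ("office", True, False),
    ("office", False, False),
    ("office", False, True),
    ("office", True, False),
]


def fire_rates(history):
    """Empirical fire frequency for each (building_type, permit) combination."""
    fires = defaultdict(int)
    totals = defaultdict(int)
    for building_type, permit, had_fire in history:
        key = (building_type, permit)
        totals[key] += 1
        fires[key] += int(had_fire)
    return {key: fires[key] / totals[key] for key in totals}


def prioritise(complaints, rates, default=0.0):
    """Sort complaints so that the highest estimated fire risk comes first."""
    return sorted(
        complaints,
        key=lambda c: rates.get((c["type"], c["permit"]), default),
        reverse=True,
    )


RATES = fire_rates(HISTORY)
COMPLAINTS = [
    {"id": 1, "type": "office", "permit": True},
    {"id": 2, "type": "tenement", "permit": False},
    {"id": 3, "type": "office", "permit": False},
]
ranked = prioritise(COMPLAINTS, RATES)
print([c["id"] for c in ranked])
```

Inspectors would then work down the ranked list, which is how a limited number of inspections gets spent on the riskiest buildings first.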
21. 2. Improve Formula 1 car performance[9]
Big Data Examples - Real Life Applications
22. 2. Improve Formula 1 car performance[9]
Big Data Examples - Real Life Applications
Why is this Big Data?
Volume = average 10+ TB of data at each GP per team
Velocity = teams take decisions in <~ 30 seconds
Main goals
1. get real time alarms on brakes, tires, fuel and other factors
that affect car performance during a race
2. find ways to improve car performance in the long term
23. 2. Improve Formula 1 car performance[9]
a. Collect data: 130-160 sensors on a car during race, plus
weather conditions, track conditions …
b. Compare data with records of success/failures
c. Look for correlations to get (1) real-time alarms and (2)
long term insights
Big Data Examples - Real Life Applications
$1B: cost of shaving 0.1 s off a single lap
$60M: money spent by a team on a supercomputer
24. 3. Predict Flu Outbreak in Real-Time
Big Data Examples - Real Life Applications
25. 3. Predict Flu Outbreak in Real-Time
Flu can spread very fast with catastrophic consequences;
traditional methods can be too slow.
Each day, millions of users around the world search for health
information online. As you might expect, there are more
flu-related searches during flu season.
Of course, not every person who searches for "flu" is actually
sick, but a pattern emerges when all the flu-related search
queries are added together.
Big Data Examples - Real Life Applications
26. 3. Predict Flu Outbreak in Real-Time
a. Collect data: keywords searched on the web; data collected by
national medical authorities (US Centers for Disease Control
and Prevention - CDC)
b. Compare the trends of search queries (top 50M) with the
records in real data
c. Find the keywords that correlate with the actual trends, to
make predictions based on current searches.
Big Data Examples - Real Life Applications
There are 45 keywords that correlate well with the historical data
The predictions from this system can improve the CDC data by up
to 50% [Royal Society Open Science, 2014]
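Steps b and c can be sketched with a plain correlation filter. This is a minimal illustration, not the method used by Google: the weekly figures are hypothetical, the 0.9 threshold is an arbitrary assumption, and Pearson correlation stands in for whatever fit the real system used.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no trend information
    return cov / sqrt(var_x * var_y)


def select_keywords(search_counts, official_cases, threshold=0.9):
    """Keep the keywords whose weekly search volume tracks the official counts."""
    return sorted(
        keyword
        for keyword, counts in search_counts.items()
        if pearson(counts, official_cases) >= threshold
    )


# Hypothetical weekly figures, for illustration only.
OFFICIAL_CASES = [10, 20, 40, 80, 40, 20]
SEARCH_COUNTS = {
    "flu symptoms": [100, 210, 390, 820, 410, 190],
    "fever remedy": [50, 90, 210, 400, 190, 110],
    "football scores": [300, 310, 290, 305, 295, 300],
}
selected = select_keywords(SEARCH_COUNTS, OFFICIAL_CASES)
print(selected)
```

The selected keywords can then be monitored in current searches, where the official figures are not yet available.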
27. 3. Predict Flu Outbreak in Real-Time
Big Data Examples - Real Life Applications
Orange: US real data
Blue: predictions based on keywords
28. 3. Predict Flu Outbreak in Real-Time
Big Data Examples - Real Life Applications
Google Flu Trends (GFT) project: www.google.org/flutrends/
Published in Nature in 2009[10]
An example of both the power and the failure of Big Data.
29. 4. Reduce injuries in sports[11]
Big Data Examples - Real Life Applications
30. 4. Reduce injuries in sports[11]
Big Data Examples - Real Life Applications
Injuries are probably the largest market inefficiency in pro
sports
In 2013, Major League Baseball teams spent $665
million on the salaries of injured players and replacements
Goal
anticipate when an athlete will get hurt before it actually
happens, so as to avoid it
31. 4. Reduce injuries in sports[11]
a. Collect data: data about how players actually move
(accelerations, elevations, jumping ranges, …) and at what
intensity.
b. Compare with records of injuries; let doctors analyse the
data
c. Predict the chance of an injury and intervene before it
happens, during both workouts and matches
Big Data Examples - Real Life Applications
Catapult, founded in 2006, has increased its sales by ~70% for six
consecutive years and is on track to gross $20 million in 2013.
32. 5. Running massive multiplayer games
Big Data Examples - Real Life Applications
33. “Infinity Challenge”, a massive 5-week online battle.
Two needs: handle a massive amount of data in almost real time
to update leaderboards and detect cheaters.
Big Data Examples - Real Life Applications
The development team was taking these insights and
updating the game almost weekly, using direct player
feedback to tweak the game.
Behind the scenes there was the Microsoft Big Data cloud
platform - HDInsight on Azure.
34. 6. Transparency of Governments
Improving politics for all
Big Data Examples - Real Life Applications
35. 6. Transparency of Governments
Improving politics for all
In 2009 the US government started www.data.gov
Today there are 133k datasets in different fields:
Agriculture, Climate, Education, Energy, Finance, Geospatial,
Global Development, Health, Jobs & Skills, Public Safety,
Science & Research, Weather
Big Data Examples - Real Life Applications
Many countries have followed
including Italy (from 2011):
~9k datasets from 80 public administrations (PA)
Code4italy @Montecitorio
36. The Dark Side
There is one massive downside to this: Privacy concerns
Do we really want all our data to be logged and stored?
Data that can say where we are every day, which products we
buy, which movies we watch, how fast (or slow) we drive our
car, where we park it, which roads we usually take, where we
go with our bike, how much exercise we do (or don’t), what
we eat, how much we spend, which drugs we take, …
Security issues: track my position, steal my identity
Not all applications are customer-centric: insurance
companies (use data to increase costs)
37. Governments need to protect citizens against unhealthy
market dominance: data antitrust
Also, they need to better regulate the ways companies ask for
and get the data (just asking for permission with Terms of Use
is not enough!)
Big Data Examples - The Dark Side
At present the control of information is being taken away
from citizens
The danger is that individuals will not be able to control the
ways they are monitored or what happens to the information
39. Preliminary Steps
First things first: check if it really is a Big Data problem
From the examples we have seen that the 3 common steps are:
1. collect data
2. find correlations (compare with historical records)
3. make predictions
Do not follow these steps!
They are the phases of executing a Big Data project, relevant
only once everything is in place.
40. Preliminary steps:
1. Goals and timescale
what you want to achieve and by when
2. Data
which data you have or need to get
3. Team
which skills you need (can change with data)
4. Silo breaking
connections you need to create (CRM, IT, marketing)
5. Budget
how much money you can invest overall (business stakeholders)
How to Tackle a Big Data Problem - Preliminary Steps
41. How to Tackle a Big Data Problem - Four Universal Steps
1. Collect & store data (source, privacy, real-time)
2. Clean data (NA, errors)
3. Analyse data (correlations)
4. Visualise data (KPI)
It is very unlikely that you will get everything right (or everything
you need) at the first attempt. Be prepared to iterate.
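The four universal steps can be sketched as a chain of small functions. This is a toy skeleton under stated assumptions: the records are hypothetical, the analysis is a simple mean rather than a correlation study, and a text line stands in for a real dashboard.

```python
def collect():
    """Step 1: collect and store raw records (hypothetical in-memory source)."""
    return [
        {"day": 1, "value": "10"},
        {"day": 2, "value": None},    # missing value (NA)
        {"day": 3, "value": "oops"},  # recording error
        {"day": 4, "value": "14"},
    ]


def clean(records):
    """Step 2: drop records with missing or unparseable values."""
    cleaned = []
    for record in records:
        try:
            cleaned.append({"day": record["day"], "value": float(record["value"])})
        except (TypeError, ValueError):
            continue
    return cleaned


def analyse(records):
    """Step 3: reduce the clean data to a KPI (here, the mean value)."""
    values = [r["value"] for r in records]
    return sum(values) / len(values) if values else None


def visualise(kpi):
    """Step 4: present the KPI (a text line stands in for a dashboard)."""
    return "mean value: n/a" if kpi is None else f"mean value: {kpi:.1f}"


report = visualise(analyse(clean(collect())))
print(report)
```

Keeping each step as its own function makes iteration cheap: a better cleaning rule or a different KPI replaces one function without touching the others.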
42. Agenda
Part I
✤ What is Big Data?
✤ Big Data Examples
✤ How to Tackle a Big Data Problem
Part II
✤ Sentiment Analysis
✤ Big Data tools
44. What is Sentiment Analysis?
Sentiment Analysis according to Oxford[14]:
The process of computationally identifying and categorising
opinions expressed in a piece of text, especially in order to
determine whether the writer’s attitude towards a particular
topic, product, etc. is positive, negative, or neutral.
45. Operative definition in steps:
Trying to understand what people think about a subject,
from what they write,
automatically,
producing a measure of what they think.
Sentiment Analysis - What is Sentiment Analysis?
46. The challenge:
Sentiment Analysis - What is Sentiment Analysis?
Hundreds (if not more) of scientific papers have been
published on this topic.
None of the problems is solved, yet applications are flourishing
(plenty of space for new ideas)
What humans readily grasp from context is very difficult for
computers to detect.
Abbreviations, bad spelling and grammar, sarcasm, irony,
slang, idiom and personality
47. Show me the data!
Where is the sentiment expressed?
Activity on social networks
Surveys
CRM notes
Reviews (movies, restaurants, events,…)
Blogs
News
Sentiment Analysis - What is Sentiment Analysis?
48. Why is it important?
Today people are different, they are:
1. more digital/technological
2. more connected
3. less loyal to brands
Communication is bidirectional and people’s reach is large
The People, not the Companies, have the power …
… and they are not afraid to use it.
49. Sentiment Analysis - Why is it important?
Nestlé censors a Greenpeace video criticising the company
Domino’s Pizza employees post a video showing health code
violations
United Airlines breaks a guitar and does not reimburse the owner
50. Some reasons to do sentiment analysis:
Gather feedback from customers (automatic, reliable)
• Gives you the chance to react in real time
Sentiment as a proxy for sales: opinions have a lot of influence
• To make predictions
Gather information from/about competitors (so start
“listening”!)
• Find ways to get new customers
Sentiment Analysis - Why is it important?
51. Sentiment Analysis - Techniques[13]
One technique consists (mainly) in looking for:
Lexical choice, Negator, Intensifier, Modal operators
Here is an (old) opinion:
I bought an iPhone a few days ago. It is such a nice phone. The
touch screen is really cool. The voice quality is clear too. It is
much better than my old Blackberry, which was a terrible
phone and so difficult to type with its tiny keys. However, my
mother was mad with me as I did not tell her before I bought
the phone. She also thought the phone was too expensive.
52. Sentiment Analysis - Techniques
Lexical choice (words):
positive: nice, boost, benefit, brave
negative: terrible, conspire, catastrophe, cowardly
Negator: can flip the valence,
not, never
Intensifier: gives the strength of the sentiment,
really, very, most
Modal operators: distinguish hypothetical from real situations
and weaken intensity,
might, could, should
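The four ingredients above can be combined into a very small lexicon-based scorer. This is a sketch, not the technique from [13] in full: the word lists are just the examples from the slide, and the numeric weights (x2 for intensifiers, x0.5 for modal operators) are assumptions made for illustration.

```python
import re

# Tiny illustrative lexicons taken from the slide; real ones are far larger.
POSITIVE = {"nice", "boost", "benefit", "brave"}
NEGATIVE = {"terrible", "conspire", "catastrophe", "cowardly"}
NEGATORS = {"not", "never"}
INTENSIFIERS = {"really", "very", "most"}
MODALS = {"might", "could", "should"}


def score_sentence(sentence: str) -> float:
    """Sum word valences, adjusted by the modifiers seen since the last scored word."""
    score = 0.0
    negate, weight = False, 1.0
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in NEGATORS:
            negate = not negate          # a negator flips the valence
        elif word in INTENSIFIERS:
            weight *= 2.0                # assumed factor: doubles the strength
        elif word in MODALS:
            weight *= 0.5                # assumed factor: halves the strength
        elif word in POSITIVE or word in NEGATIVE:
            valence = 1.0 if word in POSITIVE else -1.0
            score += (-valence if negate else valence) * weight
            negate, weight = False, 1.0  # modifiers apply to one sentiment word
    return score


print(score_sentence("It is such a nice phone"))
print(score_sentence("which was a really terrible phone"))
```

A scorer this simple misses sarcasm, irony and context, which is exactly the challenge described earlier in the lecture.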
53. A text can contain multiple sentiments, which are usually
connected to each other, maybe through a comparison (as for
products)
Analyse the whole text and each sentence
Sentiment Analysis - Techniques
55. Every opinion is a quintuple:
entity, feature, sentiment value, holder, time
Mike87 on 23-06-2009 “I bought an iPhone a few days ago. It is
such a nice phone. The touch screen is really cool. The voice
quality is clear too. It is much better than my old Blackberry,
which was a terrible phone and so difficult to type with its tiny
keys. However, my mother was mad with me as I did not tell her
before I bought the phone. She also thought the phone was too
expensive”
Sentiment Analysis - Techniques
(iPhone, GENERAL , +, Mike87, 23-06-2009)
(iPhone, touch_screen, +, Mike87, 23-06-2009)
…
We are turning unstructured data into structured data
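The quintuple maps naturally onto a small record type. This is a minimal sketch: the class and field names are illustrative, "0" for a neutral sentiment is an assumption, and only the two quintuples listed on the slide are encoded.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Opinion:
    entity: str
    feature: str    # "GENERAL" when the opinion is about the entity as a whole
    sentiment: str  # "+", "-" or (assumed here) "0" for neutral
    holder: str
    time: date


posted = date(2009, 6, 23)
opinions = [
    Opinion("iPhone", "GENERAL", "+", "Mike87", posted),
    Opinion("iPhone", "touch_screen", "+", "Mike87", posted),
]

# Once structured, opinions can be filtered and counted like any other table.
positive_features = [o.feature for o in opinions if o.sentiment == "+"]
print(positive_features)
```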
56. An Operative Plan
Preliminary:
What’s your goal?
e.g. Reaction to my new product launch (1 month tail)
How can you obtain it?
e.g. Twitter, Facebook and related-field blogs (want to use
Google Alerts?)
How can I measure it? Which KPI? Which test?
e.g. KPI: # of mentions/comments/posts, % of positive over
total; choose threshold values for the goal to be met (for each
KPI)
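The two example KPIs and their thresholds can be computed in a few lines. This is a sketch under stated assumptions: the sentiment labels are taken as already extracted, and the sample labels and threshold values are hypothetical.

```python
def kpis(labels):
    """Return (number of mentions, share of positive mentions) for a list of labels."""
    mentions = len(labels)
    positive_share = labels.count("positive") / mentions if mentions else 0.0
    return mentions, positive_share


def goal_met(labels, min_mentions, min_positive_share):
    """The goal is met only if every KPI reaches its threshold."""
    mentions, positive_share = kpis(labels)
    return mentions >= min_mentions and positive_share >= min_positive_share


# Hypothetical labelled mentions and thresholds.
sample = ["positive", "positive", "negative", "neutral", "positive"]
sample_kpis = kpis(sample)
print(sample_kpis, goal_met(sample, min_mentions=4, min_positive_share=0.5))
```

Fixing the thresholds before looking at the data keeps the test honest: the numbers decide whether the goal was met, not the other way round.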
57. Universal step 1: Collect and Store The Data
Identify the data
tweets that mention the product (or the company?),
comments to your Facebook page posts, select the specific
blogs to follow
Set up a system that can get the data
create/buy some tool to get the data automatically and
programmatically
Store the data
somewhere useful for the project and for your company
(you don’t want to create new silos!)
Sentiment Analysis - An Operative Plan
58. Universal step 2: Clean The Data
Act on the data
deal with writer mistakes: replace, modify text
deal with program errors: remove records
Sentiment Analysis - An Operative Plan
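The two cleaning actions of step 2 can be sketched as follows. The replacement table, the record layout and the sample records are hypothetical; a real project would build its own rules from the data it actually collects.

```python
import re

# Hypothetical substitutions for common abbreviations and misspellings.
REPLACEMENTS = {"gr8": "great", "luv": "love", "teh": "the"}


def clean_text(text: str) -> str:
    """Writer mistakes: lower-case, drop URLs, tidy spaces, expand known abbreviations."""
    text = re.sub(r"https?://\S+", "", text.lower())
    words = [REPLACEMENTS.get(word, word) for word in text.split()]
    return " ".join(words)


def clean_records(records):
    """Program errors: drop records with no usable text, then clean the rest."""
    cleaned = []
    for record in records:
        text = record.get("text")
        if not isinstance(text, str) or not text.strip():
            continue
        cleaned.append({**record, "text": clean_text(text)})
    return cleaned


raw = [
    {"id": 1, "text": "Luv teh new phone!  http://example.com/x"},
    {"id": 2, "text": None},
    {"id": 3, "text": "   "},
]
tidy = clean_records(raw)
print(tidy)
```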
Universal step 3: Analyse The Data
Analyse the data, extract the sentiment
Build the KPI
59. Universal step 4: Visualise The Data
Learn from the numbers: you need to come up with a
story
e.g. Reaction was massive on Twitter and Facebook (2x
threshold), initially very positive (1.5x), then reduced but
still good (1.3x); for blog posts the positive test was just
passed (1x)
Visualise the story:
create a dashboard to follow the evolution in real time
create a static infographic to describe what happened
Sentiment Analysis - An Operative Plan
61. What is Hadoop?
Apache Hadoop is an open-source software framework
for distributed storage and distributed processing of Big Data
on clusters of commodity hardware
Created in 2005 by Doug Cutting and Mike Cafarella
Named after a toy elephant belonging to Cutting’s son.
Originally developed to support the Nutch search engine
project
62. The base Apache Hadoop framework is composed of the
following modules:
1. Hadoop Common – libraries and utilities for other modules
2. Hadoop Distributed File System (HDFS) – a distributed
file-system that splits files into large blocks and distributes
them among the machines
3. Hadoop MapReduce – a programming model for large
scale data processing. MapReduce ships code (.jar files) to
the nodes that have the required data, and the nodes then
process the data in parallel.
4. Hadoop YARN - resource-management platform
Big Data Tools - What is Hadoop?
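The MapReduce programming model can be illustrated with the classic word count. This sketch runs in a single Python process and does not use any Hadoop API: the function names are illustrative, and the shuffle step, which the framework performs between the two phases on a real cluster, is written out by hand.

```python
from collections import defaultdict


def map_phase(document: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # On a cluster each split would be mapped on the node that stores it;
    # here the splits are processed one after another in a single process.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data tools", "big data examples"])
print(counts)
```

Because each map call sees only its own split and each reduce call sees only one key, both phases can be spread over many machines without changing the code.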
63. The Hadoop Ecosystem
Since 2012, "Hadoop" often refers not to just the base modules
but rather to the Hadoop Ecosystem, which includes all of the
additional packages that can be installed on top of or
alongside Hadoop.
64. Let us meet some of the “Hadoop tools”:
Hive
Pig
Sqoop
Oozie
Big Data Tools - The Hadoop Ecosystem
65. Both Hive and Pig let you run MapReduce jobs using simple
query languages
Big Data Tools - The Hadoop Ecosystem
Hive
provides a SQL-like interface to the data and lets you impose a
schema on it; it is best suited for structured and
semi-structured data
Pig
translates the Pig Latin language so that scripts can run on
Hadoop. Best suited for data flow jobs, for semi-structured
and unstructured data
66. Sqoop
tool designed for efficiently transferring bulk data
between Apache Hadoop and structured data stores.
Big Data Tools - The Hadoop Ecosystem
Oozie
workflow scheduler system to manage Apache Hadoop jobs.
Oozie is integrated with the rest of the Hadoop Ecosystem
supporting several types of Hadoop jobs out of the box
(including Pig, Hive and Sqoop) as well as system specific jobs
(such as Java programs and shell scripts).
68. Big Data with Microsoft
Hadoop can be deployed on premises as well as in the cloud.
The cloud allows organisations to deploy Hadoop without
acquiring hardware or needing specific setup expertise.
Vendors who currently have an offer for the cloud include
Microsoft, Amazon and Google.
Let us focus on Microsoft
The key product is: HDInsight for Microsoft Azure
69. Big Data Tools - Big Data with Microsoft
Azure is Microsoft’s cloud platform, which offers several services
Azure HDInsight
deploys and provisions Apache Hadoop clusters in the cloud;
it is compatible with: Ambari, Avro, HBase, HDFS, Hive,
Mahout, MapReduce and YARN, Oozie, Pig, Sqoop, Storm,
Zookeeper.
Azure PowerShell
A scripting environment to control and automate the
deployment and management of your workloads in Azure
70. Big Data Tools - Big Data with Microsoft
Windows Azure Blob Storage (WASB)
Blob Storage is a general-purpose Hadoop-compatible Azure
storage solution that integrates with HDInsight.
Store data in Azure (blob) instead of in the cluster (HDFS)
(Positive) Consequences:
Data are still there after you finish MapReduce jobs and
shut the cluster down
Easier to share data with other applications
71. Big Data Tools - Big Data with Microsoft
Windows Azure Blob Storage WASB
72. Big Data Tools - Big Data with Microsoft
Excel on steroids,
thanks to some powerful add-ins
Power Query
simplifies data discovery and access.
You can connect to data across a wide variety of sources,
including relational databases, Web and Hadoop
You can combine and refine the data
You can save queries and refresh the data
73. Big Data Tools - Big Data with Microsoft
Power Pivot
allows non-specialised users to do some Business Intelligence
on different data sources and create interactive reports,
sharable as web applications
Power View
is a very interactive data exploration, visualisation and
presentation tool
Power Map
is a data visualisation tool that lets you plot geographic and
temporal data on a 3D map, show it over time, and create
visual tours
75. References
Big Data & Digital Marketing
Most of the original material
has been posted on:
Editor's Notes
300 BC: Library of Alexandria
Today: 320 Libraries of Alexandria per person
What about our knowledge level?
1. how we spend: real-time & targeted
2. vote: micro-targeting & n=all approach
3. study: reduce drop-out in schools[9]
4. stay healthy: wearable monitors, 24h a day. Analyse the data to get suggestions for a better lifestyle[10]
5. keep/lose privacy: Big Data also attracts criminal hackers and identity thieves
Imbalance between needs and resources:
Too many complaints (25,000 per year), too few inspectors (200).
You want your inspectors to tackle the most relevant cases only/first.
Volume = 240+ TB of data at each GP per team.
Velocity = take decisions in <~ 30 seconds.
Main goal: to get real time alarms on brakes, tires, fuel and other factors that affect car performance during a race.
Flu can spread very fast with catastrophic consequences; traditional methods can be too slow.
Each day, millions of users around the world search for health information online. As you might expect, there are more flu-related searches during flu season.
Of course, not every person who searches for "flu" is actually sick, but a pattern emerges when all the flu-related search queries are added together.
Predictions based only on GFT can be very inaccurate;
if used as a complementary tool, it is incredibly powerful.
Injuries = largest market inefficiency in pro sports
In 2013, Major League Baseball teams spent $665 million on the salaries of injured players and replacements.
Goal: anticipate when an athlete will get hurt before it actually happens.
Big Data can help in running massive multiplayer games with success
and also in tailoring games to the player
Open data: certain data should be freely available to everyone to use and republish as they wish.
Companies such as Google, Amazon, and Facebook are amassing vast amounts of information on everyone and everything.
Sentiment Analysis according to New York Times[15]:
translating the vagaries of human emotion into hard data, mining the web for feelings, not facts
Sentiment is categorised as negative, neutral or positive.
Several works try to distinguish more states.