1. Dr. Bill Howe - Director of Research,
Scalable Data Analytics
2.
What is data science?
◦ A set of theories and principles for performing
data-related tasks, such as
◦ Data collection
◦ Data cleaning
◦ Data integration
◦ Data modeling
◦ Data visualization
3.
Data science is different from
◦ Business intelligence
◦ Statistics
◦ Database management
◦ Visualization
◦ Machine Learning
4.
DBA - unstructured data
Statistician - data that doesn’t fit into memory
Software engineer - statistical models and how to
communicate results
Business analyst - algorithms and tradeoffs at scale
5.
Three common skills of a data scientist
◦ Statistics
traditional analysis
◦ Data Munging
parsing, scraping, and formatting data
◦ Visualization
graphs, tools, etc.
6.
Three types of tasks:
◦ Preparing to run a model
◦ Running the model
◦ Communicating the results
7. ◦ Preparing to run a model
Gathering
Cleaning
Integrating
Restructuring
Transforming
Loading
Filtering
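The preparation steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the sample data, the column names, and the function names (gather, clean, transform, filter_rows) are hypothetical and not from the slides.

```python
import csv
import io

# Hypothetical raw export; in practice this would be gathered from files or an API.
RAW = """name,state,age
alice, wa ,34
bob,CA,
carol,wa,29
"""

def gather(text):
    """Gathering: read raw rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))

def clean(rows):
    """Cleaning: trim stray whitespace from every field."""
    return [{k: v.strip() for k, v in row.items()} for row in rows]

def transform(rows):
    """Restructuring/transforming: normalize case and convert types."""
    out = []
    for row in rows:
        age = int(row["age"]) if row["age"] else None  # missing age becomes None
        out.append({"name": row["name"].title(),
                    "state": row["state"].upper(),
                    "age": age})
    return out

def filter_rows(rows):
    """Filtering: keep only rows with a usable age."""
    return [r for r in rows if r["age"] is not None]

prepared = filter_rows(transform(clean(gather(RAW))))
for row in prepared:
    print(row)
```

Keeping each step as its own function makes it easy to inspect the data between stages, which is where most preparation problems show up.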
8. ◦ Running the model
Choosing appropriate machine learning
algorithms for regression, classification,
clustering, and recommendation
Validating the model
Improving the model
◦ Communicating the results
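As a rough sketch of running, validating, and reporting on a model, the example below fits a least-squares line on part of a data set and checks it against held-out points. The data values, the 6/2 split, and the constant-mean baseline are all assumptions made for illustration; the slides do not prescribe a specific algorithm.

```python
import statistics

# Hypothetical data: (x, y) pairs that roughly follow y = 2x + 1.
data = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0),
        (5, 10.8), (6, 13.1), (7, 15.0), (8, 17.2)]

# Hold out the last two points for validation.
train, held_out = data[:6], data[6:]

def fit_line(points):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def mean_abs_error(model, points):
    """Validation metric: average absolute prediction error."""
    slope, intercept = model
    return statistics.mean([abs(y - (slope * x + intercept)) for x, y in points])

model = fit_line(train)
# Baseline for comparison: always predict the mean of the training targets.
baseline = (0.0, statistics.mean([y for _, y in train]))

model_error = mean_abs_error(model, held_out)
baseline_error = mean_abs_error(baseline, held_out)

# Communicating the results: a short, plain summary.
print(f"slope={model[0]:.2f} intercept={model[1]:.2f}")
print(f"held-out error: model={model_error:.2f}, baseline={baseline_error:.2f}")
```

Comparing against a simple baseline is one way to judge whether a change to the model is an actual improvement.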
10.
Scale – Cloud for Big Data
Big data can be measured by the 3 V’s
◦ Volume – number of rows (size)
◦ Variety – number of columns OR sources (text,
images, audio, video)
◦ Velocity – number of rows OR bytes per unit time
(processing time)
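To make the three measures concrete, here is a small sketch that computes them for a toy batch of records. The records and the 2-second arrival window are hypothetical values chosen only for illustration.

```python
# Hypothetical batch of records assumed to have arrived during a 2-second window.
records = [
    {"user": "a", "text": "hello", "lang": "en"},
    {"user": "b", "text": "hola", "lang": "es"},
    {"user": "c", "text": "bonjour", "lang": "fr"},
    {"user": "d", "text": "hi", "lang": "en"},
]
window_seconds = 2.0

volume = len(records)                                # number of rows
variety = len({key for r in records for key in r})   # number of distinct columns
velocity = volume / window_seconds                   # rows per unit time

print(volume, variety, velocity)
```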
11.
“data exhaust” from customers
new and pervasive sensors
the ability to “keep everything”
13.
Twitter Sentiment Analysis
◦ Extract the tweets from the Twitter API
◦ Calculate the sentiment score for each tweet
◦ Calculate the sentiment score for terms in tweets
◦ Calculate the frequency of terms in tweets
◦ Identify the happiest state
◦ Identify the top ten hashtags
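The scoring steps above could look roughly like the sketch below. It makes several assumptions that are not in the slides: tweets are already in memory as dictionaries with "text" and "state" fields (no API call is shown), WORD_SCORES is a tiny made-up lexicon, a tweet's score is the sum of its known word scores, and an unknown term's score is the average score of the tweets it appears in.

```python
import re
from collections import Counter, defaultdict

# Hypothetical word scores; a real run would load a published lexicon instead.
WORD_SCORES = {"love": 3, "great": 3, "happy": 3, "bad": -3, "hate": -3}

# Hypothetical tweets, standing in for records pulled from the Twitter API.
TWEETS = [
    {"text": "I love this great city #seattle", "state": "WA"},
    {"text": "Traffic is bad, I hate it #commute #seattle", "state": "CA"},
    {"text": "So happy today #sunshine", "state": "WA"},
]

def terms(text):
    """Lowercase word tokens, excluding hashtags."""
    return re.findall(r"(?<![#\w])[a-z']+", text.lower())

def hashtags(text):
    """Lowercase hashtag names without the leading '#'."""
    return re.findall(r"#(\w+)", text.lower())

def tweet_score(text):
    """Tweet sentiment: sum of the scores of its known terms."""
    return sum(WORD_SCORES.get(t, 0) for t in terms(text))

# Sentiment for terms missing from the lexicon: average score of their tweets.
scores_by_term = defaultdict(list)
for tw in TWEETS:
    for t in set(terms(tw["text"])):
        if t not in WORD_SCORES:
            scores_by_term[t].append(tweet_score(tw["text"]))
term_sentiment = {t: sum(v) / len(v) for t, v in scores_by_term.items()}

# Term frequency: occurrences of a term divided by all term occurrences.
counts = Counter(t for tw in TWEETS for t in terms(tw["text"]))
total = sum(counts.values())
term_frequency = {t: c / total for t, c in counts.items()}

# Happiest state: highest average tweet score.
scores_by_state = defaultdict(list)
for tw in TWEETS:
    scores_by_state[tw["state"]].append(tweet_score(tw["text"]))
happiest_state = max(scores_by_state,
                     key=lambda s: sum(scores_by_state[s]) / len(scores_by_state[s]))

# Top ten hashtags by count.
top_hashtags = Counter(h for tw in TWEETS for h in hashtags(tw["text"])).most_common(10)

print(happiest_state, top_hashtags)
```

With real data, the same structure applies once the tweets are loaded; only the lexicon and the way the state is derived from each tweet would need to change.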