ABSTRACT: Recent advances in technology let us collect vast amounts of data at an unprecedented rate, which makes the skills needed to mine insights from that data increasingly valuable. So what does it take for a developer to enter the world of data science?
Join me on a journey into the world of big data and machine learning, where we will explore what the work actually looks like, identify which skills matter most, and design a roadmap for how you too can join this exciting and profitable industry.
5. @GAINESK
@ITCAMPRO #ITCAMP17 Community Conference for IT Professionals
• National gap for analytical expertise at 140k+ by 2017. –McKinsey 2011
• Shortage of 100k Data Scientists by 2020. –Gartner 2012
• 90% of clients need expertise, 40% cite lack of talent. –Accenture 2014
• Survey finds 83% of data scientists see shortage. –CrowdFlower 2016
• “I keep saying that the sexy job in the next 10 years will be statisticians. And I’m not kidding.” –Google’s Chief Economist
• Data Scientist the #1 job in America for 2016 AND 2017! –Glassdoor
The Demand
9. @GAINESK
A data scientist is a job title for an employee or business intelligence (BI) consultant who excels at analyzing data, particularly large amounts of data, to help a business gain a competitive edge. –WhatIs.com
The Definition
12. @GAINESK
• Classification – Is this A or B?
• Anomaly Detection – Is this weird?
• Regression – How much -or- how many?
• Clustering – How is this organized?
• Reinforcement Learning – What should I do next?
The Five Questions
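Of the five questions, regression ("How much -or- how many?") is the easiest to show in a few lines. The sketch below fits a straight line by ordinary least squares using only the Python standard library. The data set and the names `fit_line`, `hours`, and `tickets` are made up for illustration and are not from the talk.

```python
# Minimal sketch: ordinary least squares with one input variable.

def fit_line(xs, ys):
    """Return (slope, intercept) of the line that minimizes squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: hours of peak load vs. support tickets opened.
hours = [1, 2, 3, 4, 5]
tickets = [3, 5, 7, 9, 11]  # constructed so that tickets = 2 * hours + 1

slope, intercept = fit_line(hours, tickets)
predicted = slope * 6 + intercept  # "how many" tickets at 6 hours?
print(slope, intercept, predicted)
```

In practice a library such as scikit-learn would handle this, but the arithmetic is the same idea.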
13. @GAINESK
• Educate the business
• Look for problems to solve
• Research new techniques
• Collate data for analysis (ETL)*
• Implement algorithms
• Design big data-capable architecture
• Present insights
The Job
14. @GAINESK
• Big Data
• Fast Data
• Dark Data
• Unstructured Data
• Data Mining
• Data Visualization
• Predictive Analytics
• [Deep] Neural Network
The Buzzwords
19. @GAINESK
Big Data
Volume
• Records
• Transactions
• Tables & Files
Variety
• Structured
• Unstructured
• Semi-structured
Velocity
• Real Time
• Near Time
• Batch
• Streams
20. @GAINESK
The Velocity
Twitter
• 6,000 tweets per second
• 500 million tweets/day
Facebook
• 300 million photos/day
NY Stock Exchange
• captures 1TB of trade information each session
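As a quick sanity check, the two Twitter figures above can be compared with simple arithmetic. The sketch assumes the per-day figure is spread evenly over 24 hours, which real traffic is not.

```python
# Compare the per-second and per-day tweet figures from the slide.
SECONDS_PER_DAY = 24 * 60 * 60           # 86,400 seconds

tweets_per_day = 500_000_000
avg_per_second = tweets_per_day / SECONDS_PER_DAY

# Going the other way: what daily volume does 6,000 tweets/second imply?
implied_per_day = 6_000 * SECONDS_PER_DAY

print(round(avg_per_second))   # 5787
print(implied_per_day)         # 518400000
```

Both figures describe the same order of magnitude: roughly 5,800 tweets per second on average.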
23. @GAINESK
The Skills
[Venn diagram: three overlapping skill areas]
Subject Matter Expertise
• Values
• Goals
• Constraints
Statistics
• Choose Procedures
• Diagnose Problems
• Develop Procedures
Hacking Expertise
• Technical Skills
• Creativity
Overlap regions: Machine Learning, Traditional Research, Traditional Software
Center: Data Science
24. @GAINESK
The Skills
[Venn diagram: four overlapping skill areas]
Areas: Subject Matter Expertise, Hacking Expertise, Social Sciences, Statistics
Overlap regions: Machine Learning, Traditional Software, Traditional Research, Holistic Research, Holistic Software, Socially Unaware, Domain Unaware
Center: Data Science
29. @GAINESK
• SPSS
• Matlab
• Julia
• Kafka/Storm
• R
• Python
• Java/Scala
• Stata
• SAS
The Languages
http://www.kdnuggets.com/2015/05/r-vs-python-data-science.html
30. @GAINESK
The Languages – SAS, Python or R?
http://www.burtchworks.com/2016/07/13/sas-r-python-survey-2016-tool-analytics-pros-prefer/
36. @GAINESK
• R Statistical Programming Language
• Based on the S programming language
• R Development Environment
• Statistical and Visual Analysis
• Cross-Platform
• Free Open Source
• Active User Community
• Over 9,000 Extension Packages
The R
37. @GAINESK
• Created in 1991 to emphasize productivity and code readability
• Easier learning curve than R
• Free Open Source
• Active User Community
The Python
38. @GAINESK
• Hadoop Distributed File System (HDFS)
• MapReduce vs. YARN
• Pig
• Hive
• HBase
• Storm
• Spark
• etc.
The Hadoop Collective
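To make the MapReduce idea concrete, here is a single-process word count that mimics the map, shuffle, and reduce phases. It is only an illustration of the programming model in plain Python, not Hadoop's API; the function names and input lines are assumptions made for this sketch.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)

# Hypothetical input; on a real cluster each line could live on a different node.
lines = ["big data fast data", "dark data"]

pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

On a cluster the map and reduce calls run in parallel across machines and the framework performs the shuffle; the per-record logic stays this small.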
39. @GAINESK
Sample the Data
• Random
• Stratified
Reconcile Missing Data
• Discard
• Infer
Normalize Numeric Values
• Standard Unit of Measure
• Subtract Average (Mean = 0)
• Divide by Standard Deviation
Reduce Dimensionality
• Irrelevant Input Variables
• Redundant Input Variables
Add Derivative Values
• Generalize Attributes
• Discretize Attributes to Categories
• Binarize Categorical Attributes
Design Training Data
• Select
• Combine
• Aggregate
Power and Log Transformation
• Approximate Normal Distribution
The Wrangling
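Three of the wrangling steps above can be sketched with the Python standard library: inferring missing values, normalizing numeric values (subtract the mean, divide by the standard deviation), and binarizing a categorical attribute. The helper names and sample values are hypothetical, and mean imputation is only one of several ways to "infer" a missing value.

```python
import statistics

def infer_missing(values):
    """Replace None with the mean of the known values (one simple strategy)."""
    known = [v for v in values if v is not None]
    mean = statistics.fmean(known)
    return [mean if v is None else v for v in values]

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def binarize(categories):
    """One 0/1 column per category level, with levels in sorted order."""
    levels = sorted(set(categories))
    return [[1 if c == level else 0 for level in levels] for c in categories]

# Hypothetical column with one missing value.
ages = [20, None, 40, 30]
filled = infer_missing(ages)      # the None becomes the mean of 20, 40, 30
scaled = standardize(filled)      # result has mean 0 and standard deviation 1

colors = ["red", "blue", "red"]
one_hot = binarize(colors)        # columns are ordered: blue, red
print(filled, scaled, one_hot)
```

Libraries such as pandas and scikit-learn provide hardened versions of each step; the point here is only what each bullet means numerically.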
40. @GAINESK
• basic statistics (e.g., p-value)
• statistical modeling
• statistical tests
• experiment design
• distributions
• maximum likelihood estimators
• probability theory
• linear algebra
• multivariable calculus
The Math
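As one small example of the statistics on this list, the sketch below computes an exact one-sided p-value for a coin-flip experiment: how likely is it to see at least 60 heads in 100 flips if the coin is fair? The scenario and the function name `binomial_p_value` are assumptions chosen for illustration.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """One-sided p-value: P(X >= successes) when X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical experiment: 60 heads observed in 100 flips of a supposedly fair coin.
p_value = binomial_p_value(60, 100)
print(f"{p_value:.4f}")
```

A p-value below a pre-chosen threshold (0.05 is a common convention) would be taken as evidence against the fair-coin hypothesis.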
49. @GAINESK
1. Fundamentals
2. Statistics
3. Programming
4. ML
5. Text Mining
6. Visualization
7. Big Data
8. Data Munging
9. Toolbox
The Path
50. @GAINESK
1. Matrices & Linear Algebra
2. Hash Functions, Binary Tree
3. Relational Algebra, DB Basics
4. Inner, Outer, Cross, Theta Join
5. CAP Theorem
6. Tabular Data
7. Data Frames & Series
8. Sharding
9. OLAP
The Fundamentals
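Two items from this list, hash functions and joins, fit together naturally: a hash join builds a dict (hash table) on one table and probes it with the other. The sketch below shows an inner join and a left outer join over lists of dicts. The tables and function names are made up for illustration.

```python
def build_index(rows, key):
    """Group rows by their key value using a dict (a hash table)."""
    index = {}
    for row in rows:
        index.setdefault(row[key], []).append(row)
    return index

def inner_join(left, right, key):
    """Keep only left rows that have at least one match on the right."""
    index = build_index(right, key)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

def left_outer_join(left, right, key):
    """Keep every left row; unmatched rows get None for right-only columns."""
    index = build_index(right, key)
    right_only = {c for r in right for c in r if c != key}
    joined = []
    for l in left:
        matches = index.get(l[key], [])
        if matches:
            joined.extend({**l, **r} for r in matches)
        else:
            joined.append({**l, **{c: None for c in right_only}})
    return joined

# Hypothetical tables. Note: unlike SQL NULL, a None key here would match
# another None key if one existed on the right side.
employees = [
    {"id": 1, "name": "Ana", "dept_id": 10},
    {"id": 2, "name": "Bo", "dept_id": 20},
    {"id": 3, "name": "Cy", "dept_id": None},
]
departments = [
    {"dept_id": 10, "dept": "Analytics"},
    {"dept_id": 30, "dept": "Sales"},
]

inner_rows = inner_join(employees, departments, "dept_id")
left_rows = left_outer_join(employees, departments, "dept_id")
print(inner_rows)
print(left_rows)
```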
51. @GAINESK
10. Multidimensional Data Model
11. ETL
12. Reporting vs BI vs Analytics
13. JSON & XML
14. NoSQL
15. Regex
16. Vendor Landscape
17. Environment Setup
The Fundamentals
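JSON and regex often show up together when pulling fields out of semi-structured records. The sketch below parses a JSON record and extracts hashtags and mentions with regular expressions. The record itself is invented for this example.

```python
import json
import re

# Hypothetical semi-structured record, e.g. one line from a social media feed.
raw = '{"user": "gainesk", "message": "Slides from #ITCAMP17 are up, thanks @ITCAMPRO!"}'

record = json.loads(raw)  # JSON text -> Python dict

# \w+ matches a run of letters, digits, or underscores after the marker.
hashtags = re.findall(r"#(\w+)", record["message"])
mentions = re.findall(r"@(\w+)", record["message"])

print(record["user"], hashtags, mentions)
```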
52. @GAINESK
1. Pick a Dataset
2. Descriptive Statistics
3. Exploratory Data Analysis
4. Histograms
5. Percentiles and Outliers
6. Probability Theory
7. Bayes' Theorem
8. Random Variables
9. Cumulative Distribution Function (CDF)
The Statistics
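Bayes' theorem from the list above is worth one worked example, because its results are often counterintuitive. The sketch assumes a hypothetical screening test: 1% of people have the condition, the test detects 99% of true cases, and it gives a false positive 5% of the time. All three numbers are invented for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_positive = prior * sensitivity
    false_positive = (1 - prior) * false_positive_rate
    return true_positive / (true_positive + false_positive)

# Hypothetical test characteristics (not from the talk).
p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(round(p, 3))   # 0.167
```

Even with a 99% sensitive test, a positive result here means only about a 1-in-6 chance of having the condition, because the condition is rare to begin with.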
53. @GAINESK
The Statistics
10. Continuous Distributions
11. Skewness
12. ANOVA
13. Probability Density Function (PDF)
14. Central Limit Theorem
15. Monte Carlo Method
16. Hypothesis Testing
17. p-Value
…
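The Central Limit Theorem and the Monte Carlo method can be illustrated together: simulate many sample means of a uniform random variable and check that they cluster around the true mean with the predicted spread. The seed, sample size, and number of trials below are arbitrary choices for this sketch.

```python
import random
import statistics

rng = random.Random(17)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of n draws from Uniform(0, 1)."""
    return statistics.fmean([rng.random() for _ in range(n)])

# Monte Carlo: repeat the experiment many times.
means = [sample_mean(30) for _ in range(2000)]

observed_mean = statistics.fmean(means)
observed_sd = statistics.stdev(means)

# Theory: Uniform(0, 1) has mean 0.5 and variance 1/12, so the mean of
# 30 draws should have standard deviation sqrt(1/12) / sqrt(30).
expected_sd = (1 / 12) ** 0.5 / 30 ** 0.5

print(observed_mean, observed_sd, expected_sd)
```

A histogram of `means` would look approximately bell-shaped even though the underlying distribution is flat.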
54. @GAINESK
1. Python Basics
2. Working in Excel
3. R Setup / RStudio
4. R Basics
5. Expressions
6. Variables
7. Vectors
8. Matrices
9. Arrays
The Programming
55. @GAINESK
10. Factors
11. Lists
12. Data Frames
13. Reading CSV Data
14. Reading Raw Data
15. Subsetting Data
16. Manipulate Data Frames
17. Functions
18. Factor Analysis
19. Install Packages
The Programming
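Most of the topics in this track are R-specific, but "Reading CSV Data" and "Subsetting Data" look much the same in Python. The sketch below uses only the standard library; the CSV content and column names are invented, and in practice a data frame library such as pandas would be the closer analogue to R's data frames.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for a file on disk.
csv_text = """name,language,years
Ana,R,5
Bo,Python,2
Cy,SAS,8
"""

# Read each row as a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV values arrive as strings, so convert numeric columns explicitly.
for row in rows:
    row["years"] = int(row["years"])

# Subset: names of people with at least five years of experience.
experienced = [row["name"] for row in rows if row["years"] >= 5]
print(experienced)
```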
57. @GAINESK
• Coursera - www.coursera.org
• edX - www.edx.org
• Udacity - www.udacity.com
• Kaggle - www.kaggle.com
• YouTube - projects.iq.harvard.edu/stat110/youtube
• Boot Camps
The Training
58. @GAINESK
Q & A
Slides at DotNetDude.net