This year I have been kindly invited to give a keynote talk at Big Data Spain, which will be held in Madrid on 17-18 November. This time, rather than diving into a specific technology or tool, I will reflect on the state of data analytics, and on how cloud, technology and data science are brewing a possible recipe for analytics at scale, moving towards AI, prescriptive analytics and cognitive processing.
Although it might be a bit ahead of the current state of development in analytical solutions and databases, I am starting to see clear early signals that something amazing is hatching in the realm of data processing, and I would like to share some of these facts and elements with the audience of Big Data Spain. I would like to stay grounded in current technology developments, but also let the imagination soar by showing that in data analytics today the whole is much more than the sum of its parts.
Are we reaching a Data Science Singularity? - How Cognitive Computing is emerging from Machine Learning Algorithms, Big Data Tools, and Cloud Services
Prescriptive analytics is the ultimate analytical step, which goes beyond predictions into the realm of goal-oriented recommendations. As such, we could consider prescriptive analytics a particular sort of cognitive computing. In 2016, how far are we actually from cognitive computing? In this talk, I will describe the latest advances in machine learning algorithms, big data tools and cloud engineering practices.
These are the ingredients which are blended together to brew modern AI, prescriptive analytics and cognitive processing solutions. As data and algorithms are made available on large cloud computing clusters, higher-level, cognitive-like services will solve real-world, complex and often ambiguous cases.
Finally, I will touch on the topic of meta-data science and how automated data science could (re)define the role of the data scientist in the coming years.
Applications
http://www.wsj.com/articles/googles-self-driving-car-program-odometer-reaches-2-million-miles-1475683321
http://www.nature.com/articles/srep26286
Why is AI so difficult?
http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html
http://www.forbes.com/sites/gilpress/2016/10/31/12-observations-about-artificial-intelligence-from-the-oreilly-ai-conference/
http://www.tor.com/2011/06/21/norvig-vs-chomsky-and-the-fight-for-the-future-of-ai/
https://www.safaribooksonline.com/library/view/oreilly-ai-conference/9781491973912/video260721.html
Videos on AI
Yann LeCun: https://youtu.be/_1Cyyt-4-n8
Andrej Karpathy: https://youtu.be/u6aEYuemt0M
Nando de Freitas: https://youtu.be/bEUX_56Lojc
Richard Socher: https://youtu.be/oGk1v1jQITw
For more info, see:
https://www.linkedin.com/pulse/data-science-singularity-natalino-busa
6.
Natalino Busa - @natbusa
What about (data) science?
- technologies and tools are driving innovation in data analytics -
7.
Man - Machine
as integrated cognitive systems
8.
Learning: The Scientific Method
Ørsted's "First Introduction to General Physics" (1811)
https://en.m.wikipedia.org/wiki/History_of_scientific_method
Hans Christian Ørsted: observation, hypothesis, deduction, experiment, synthesis
Icons made by Gregor Cresnar from www.flaticon.com, licensed under CC 3.0 BY
9.
Innovation in Data Analytics
Cloud, Community, AI & ML
11.
“We live in an age of open source datacenters, so we can stack all these things together and we have open source from the ground to ceiling.”
Sam Ramji, CEO of Cloud Foundry
https://www.youtube.com/watch?v=7oCSFcUW-Qk
12.
Analytics in the cloud
Bare Metal: Physical Machines
IAAS: Virtual Resources
CAAS: Containers
dPAAS: Datastores, Data Engines
iPAAS: Tools Integration, Flows & Processes
DAAAS: Data Analytics as a Service
13.
DAAAS: AI and ML APIs
Cloud Computing for Deep Neural Networks
> Models, Compute (Train, Score), and Data
AI and ML models for:
● Speech (audio)
● Language (text)
● Vision (images/video)
● Data (classification, regression, clustering, anomaly detection)
14.
Ephemeral Computing Clusters on a Cloud
Cluster lifecycle along the timeline: create, load data, compute, store, destroy
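The lifecycle on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a real cloud client: `InMemoryCluster`, `ephemeral_cluster` and `run_job` are hypothetical names, and the "cluster" only tracks state in local memory. The point it tries to show is that the cluster is destroyed as soon as the result has been stored, even if the job fails.

```python
from contextlib import contextmanager


class InMemoryCluster:
    """Hypothetical stand-in for a cloud cluster; it only tracks state locally."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = []

    def load(self, records):
        self.data = list(records)

    def compute(self):
        # Toy job: sum the loaded records.
        return sum(self.data)

    def destroy(self):
        self.data = []
        self.alive = False


@contextmanager
def ephemeral_cluster(name, log):
    """Create a cluster, hand it to the caller, and always destroy it afterwards."""
    cluster = InMemoryCluster(name)
    log.append("create")
    try:
        yield cluster
    finally:
        # Runs on success and on failure, so no cluster is left behind.
        cluster.destroy()
        log.append("destroy")


def run_job(records, store):
    """Run one create / load / compute / store / destroy cycle."""
    log = []
    with ephemeral_cluster("exploration", log) as cluster:
        cluster.load(records)
        log.append("load")
        result = cluster.compute()
        log.append("compute")
        # The result must outlive the cluster, so it goes to external storage.
        store["result"] = result
        log.append("store")
    return log, cluster.alive
```

Calling `run_job([1, 2, 3], {})` walks through the five steps in order; only what was written to `store` survives the cluster.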
15.
dPaaS: Analytical clusters
Ephemeral: short-lived; data exploration; isolated, personal; simple access management
vs
Permanent: long-lived; production / operations; coordinated; complex access management
16.
GPUs and Distributed Computing
GPU support is coming in Kubernetes, Mesos, Spark
https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
http://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
Figure: scaling out vs. scaling up (labels: CPU, R, Python, Spark, TensorFrames)
25.
Ask me Anything
Dynamic Memory Networks for Natural Language Processing
Caiming Xiong, Stephen Merity, Richard Socher
https://arxiv.org/pdf/1603.01417v1.pdf
https://youtu.be/oGk1v1jQITw
26.
Ask me Anything
http://www.socher.org/index.php/DeepLearningTutorial/DeepLearningTutorial
Dynamic Memory Networks for Natural Language Processing
https://arxiv.org/pdf/1603.01417v1.pdf
http://www.socher.org/
Local context
Wider context
NLP, Attention Masks
Semantic Embeddings from Text, Images
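As a rough illustration of what an attention mask does over embedded facts, here is a generic dot-product attention sketch in plain Python. It is not the gating mechanism of the Dynamic Memory Network paper linked above; the function names and the tiny two-dimensional "embeddings" are assumptions made for the example.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, facts, mask=None):
    """Return (weights, context) for a query over a list of fact vectors.

    mask is an optional list of 0/1 flags; a 0 excludes that fact.
    At least one fact must stay unmasked.
    """
    scores = [dot(query, f) for f in facts]
    if mask is not None:
        scores = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    weights = softmax(scores)
    dim = len(query)
    # The context vector is the attention-weighted sum of the facts.
    context = [sum(w * f[i] for w, f in zip(weights, facts)) for i in range(dim)]
    return weights, context
```

With `mask=[1, 1, 0]` the third fact gets weight zero and the remaining weights are renormalised, which is one simple way to restrict attention to a local context rather than the wider one.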
28.
Network Intrusion Detection
http://billsdata.net/?p=105
It contains 130 million flow records involving 12,027 distinct computers over 36 days (not the full 58 days claimed for the entire data release).
Each record consists of: time (to nearest second), duration, source and destination computer ids, source and destination ports, protocol, number of packets and number of bytes.
Techniques: TDA, Dimensionality Reduction
https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
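To make the record layout concrete, here is a small Python sketch that parses flow records and aggregates them into one feature vector per source computer, the kind of table one might then feed to a dimensionality reduction or TDA step. The comma-separated column order, the field names and the three sample rows are assumptions made for illustration; they are not taken from the data set.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    time: int          # seconds since the start of the capture
    duration: int
    src_computer: str
    src_port: str
    dst_computer: str
    dst_port: str
    protocol: str
    packets: int
    bytes_: int


def parse_flow(line):
    # Assumed column order: time, duration, source computer, source port,
    # destination computer, destination port, protocol, packets, bytes.
    t, dur, src, sport, dst, dport, proto, pkts, nbytes = line.strip().split(",")
    return FlowRecord(int(t), int(dur), src, sport, dst, dport, proto,
                      int(pkts), int(nbytes))


def per_source_features(records):
    """Aggregate flows into (flow count, total bytes, distinct peers) per source."""
    acc = defaultdict(lambda: {"flows": 0, "bytes": 0, "peers": set()})
    for r in records:
        a = acc[r.src_computer]
        a["flows"] += 1
        a["bytes"] += r.bytes_
        a["peers"].add(r.dst_computer)
    return {src: (a["flows"], a["bytes"], len(a["peers"]))
            for src, a in acc.items()}


# Invented sample rows, only to exercise the code.
SAMPLE = [
    "1,0,C1,N10,C2,N20,6,5,400",
    "2,3,C1,N11,C3,80,6,10,1200",
    "2,1,C2,N12,C3,443,17,1,60",
]
features = per_source_features(parse_flow(line) for line in SAMPLE)
```

Each computer ends up as a point in a small feature space; unusual points in that space are candidates for closer inspection.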
29.
Approaching (Almost) Any Machine Learning Problem
- Abhishek Thakur, Kaggle Grandmaster -
raw data (tables, files) -> data munging -> useful data -> feature engineering -> tabular data ready for ML (data, labels)
30.
AutoML challenge
- based on scikit-learn
- 15 classifiers
- 14 feature preprocessing methods
- 4 data preprocessing methods
- 110 hyperparameters
- Supervised classification challenge: 100 different datasets
https://arxiv.org/abs/1611.03824v1
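The numbers on this slide describe a search space: 15 x 14 x 4 = 840 pipeline structures before any of the 110 hyperparameters are tuned. The sketch below shows the idea of searching such a space automatically. Random search is used here only as the simplest possible stand-in and is not the method of the systems referenced on the slide; the component names and `toy_score` are hypothetical placeholders for real estimators and for a cross-validated score.

```python
import random

# Hypothetical placeholder names; only the counts (15, 14, 4) come from the slide.
CLASSIFIERS = [f"classifier_{i}" for i in range(15)]
FEATURE_PREPROCESSORS = [f"feature_prep_{i}" for i in range(14)]
DATA_PREPROCESSORS = [f"data_prep_{i}" for i in range(4)]


def n_structures():
    """Number of distinct pipeline structures, ignoring hyperparameters."""
    return len(CLASSIFIERS) * len(FEATURE_PREPROCESSORS) * len(DATA_PREPROCESSORS)


def sample_config(rng):
    return (rng.choice(CLASSIFIERS),
            rng.choice(FEATURE_PREPROCESSORS),
            rng.choice(DATA_PREPROCESSORS))


def toy_score(config):
    # Stand-in for a validation score; a real search would train
    # and evaluate the configured pipeline here.
    clf, feat, data = (int(name.split("_")[-1]) for name in config)
    return ((clf * 7 + feat * 3 + data) % 100) / 100.0


def random_search(n_trials, seed=0):
    """Try n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, -1.0
    for _ in range(n_trials):
        config = sample_config(rng)
        score = toy_score(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

`random_search(50)` returns the best of 50 sampled pipelines. More sophisticated searches replace the random sampling with a model that proposes promising configurations based on earlier trials.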
32.
Human cognitive biases:
Too much information
Not enough meaning
What should we remember?
Need to act fast
https://en.wikipedia.org/wiki/List_of_cognitive_biases
33.
Man vs Machine cognitive limits
Machine: model generation, explanation, unsupervised, planning
Man: too much information, not enough meaning, need to act quickly, memory limits
34.
“Theorems often tell us complex truths about the simple things, but only rarely tell us simple truths about the complex ones.”
Marvin Minsky, K-Lines: A Theory of Memory (1980)
35.
Data Science: wear the AI/ML Lenses
We are entering a new era of intelligent machines
Boost our understanding of data
Focus on higher level analyses
36.
Intelligent Data Systems:
Long live the “database”
Wikipedia: A database is an organized collection of data.
Diagram labels: DATA, New-SQL, ML, AI, SQL, Python - Scala - R, NLP, UX, Speech, COG
37.
The Database is never going to be the same.
39.
Credits
Cover: courtesy of Big Data Spain - https://www.bigdataspain.org/
Pictures:
https://commons.wikimedia.org/wiki/File:PurportedUFO2.jpg
https://commons.wikimedia.org/wiki/File:Amazing_Stories_October_1957.jpg
https://commons.wikimedia.org/wiki/File:DJI_Phantom_2_Vision%2B_V3_hovering_over_Weissfluhjoch_(cropped).jpg
https://commons.wikimedia.org/wiki/File:Leonard_Nimoy_as_Spock_1967.jpg
https://en.wikipedia.org/wiki/File:STUltimate_Cp.jpg
https://github.com/tensorflow/models/blob/master/syntaxnet/ff_nn_schematic.png
http://billsdata.net/wordpress/wp-content/uploads/2015/11/wikimap2.jpg
http://billsdata.net/wordpress/wp-content/uploads/2015/11/netflow.png
https://commons.wikimedia.org/wiki/File:Girls_learning_sign_language.jpg
https://arxiv.org/pdf/1603.01417v1.pdf
http://www.socher.org/uploads/Main/RichardSocher.jpg
https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
https://commons.wikimedia.org/wiki/File:Cognitive_Bias_Codex_-_180%2B_biases,_designed_by_John_Manoogian_III_(jm3).jpg
Visualizations:
https://github.com/caffeinalab/siriwavejs
https://gist.github.com/AnanthaRajuC/91beee3eb04d11cb3af5
https://dribbble.com/shots/1714369-Cortana-Animation
Icons:
Icons made by Gregor Cresnar from www.flaticon.com, licensed under CC 3.0 BY
42.
AI & ML: curated list of links
NLP
https://github.com/tensorflow/models/tree/master/syntaxnet
https://arxiv.org/abs/1405.4053v2
https://arxiv.org/abs/1603.06042
https://arxiv.org/abs/1301.3781v3
Video, Images, Hybrid Deep Learning Networks
https://arxiv.org/abs/1611.01599v1
https://arxiv.org/abs/1603.01417v1
Topological Data Analysis (TDA), Dim Reduction:
https://en.wikipedia.org/wiki/Topological_data_analysis
https://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction
Meta Learning:
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf
https://arxiv.org/abs/1611.03824v1
43.
Curated list of links
Cognitive sciences:
https://en.wikipedia.org/wiki/History_of_scientific_method
https://en.wikipedia.org/wiki/List_of_cognitive_biases
Cloud:
The Making of a Cloud Native Application Platform - Sam Ramji https://www.youtube.com/watch?v=7oCSFcUW-Qk
https://en.wikipedia.org/wiki/Ephemerality
http://conferences.oreilly.com/oscon/oscon2011/public/schedule/detail/19812
GPU and distributed Computing:
https://www.oreilly.com/learning/accelerating-spark-workloads-using-gpus
http://www.slideshare.net/databricks/tensorframes-google-tensorflow-on-apache-spark
Collaborative coding and research:
https://github.com/tensorflow/models
https://github.com/jupyter
http://www.arxiv-sanity.com/