2. Definition of Big Data
"Big data is a broad term for data sets so large or complex that traditional
data processing applications are inadequate.“
"Big data is an evolving term that describes any voluminous amount of
structured, semi-structured and unstructured data that has the potential to
be mined for information.“
Data is growing far faster than computation speeds
A single machine can no longer process or even store all this data!
The Big Data problem
3. Where does Big Data come from?
Online recorded content:
Clicks
Ad views
Server requests
.. everything that happens online can potentially be recorded
User-generated content (Facebook, Twitter, Instagram, etc.)
Smartphone users reach for their phones about 150 times a day (2013)
Health and scientific computing
The Large Hadron Collider produces roughly twice as much data per year as Twitter
Internet of Things (IoT)
smart thermostat systems
automobiles with built-in sensors
all kinds of “smart” devices of various sizes
5. Example scales of Big Data
EIR communication logs: 1.4 TB / day
Facebook logs: 60 TB / day
Google total web index: ~10+ PB (10,000 TB)
Facebook total data: 300 PB with an incoming rate of 600 TB / day (2014)
..as a reminder..
time to read 1 TB from disk: ~3 hours (at 100 MB/s)
reading the Google web index serially from a single disk would take ~3.4 years
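Back-of-the-envelope: 1 TB / (100 MB/s) = 10,000 s ≈ 2.8 hours; 10 PB = 10,000 TB, so 10,000 × ~3 hours ≈ 30,000 hours ≈ 3.4 years.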
8. Startup example
Let’s design a simple web tracker from scratch
Register and count each page view for a number of clients
“Keep simple things simple”
Version 1.0:
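A rough sketch of what Version 1.0 might look like (the table, schema and JDBC URL below are made up): every page view issues one synchronous UPDATE against a single relational database.

```scala
import java.sql.DriverManager

object TrackerV1 {
  // Hypothetical connection string; any single relational DB behaves the same
  val url = "jdbc:postgresql://db-host/tracker"

  // Called on every page view: one synchronous write per request
  def trackPageView(clientId: String, page: String): Unit = {
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.prepareStatement(
        "UPDATE page_views SET views = views + 1 WHERE client_id = ? AND page = ?")
      stmt.setString(1, clientId)
      stmt.setString(2, page)
      stmt.executeUpdate()
    } finally conn.close()
  }
}
```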
Problem?
Huge number of page views => massive DB load on concurrent updates => DB timeouts => FAIL
9. Version 2.0
Why write each count?!
Let’s introduce a queue and buffer updates
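A sketch of that idea, assuming a simple in-process queue and a periodic background flush (all names here are hypothetical): page views are enqueued instantly, and a timer drains the queue into one batched update per (client, page) pair.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}
import scala.collection.mutable

object TrackerV2 {
  private val queue = new ConcurrentLinkedQueue[(String, String)]()

  // The request path only enqueues; no DB call per page view
  def trackPageView(clientId: String, page: String): Unit =
    queue.add((clientId, page))

  // Every 5 seconds, drain the queue and collapse it into one
  // "views = views + delta" update per (client, page) pair
  Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
    new Runnable { def run(): Unit = flush() }, 5, 5, TimeUnit.SECONDS)

  private def flush(): Unit = {
    val batch = mutable.Map.empty[(String, String), Long].withDefaultValue(0L)
    var view = queue.poll()
    while (view != null) {
      batch(view) += 1
      view = queue.poll()
    }
    for (((clientId, page), delta) <- batch) writeDelta(clientId, page, delta)
  }

  private def writeDelta(clientId: String, page: String, delta: Long): Unit = {
    // same JDBC UPDATE as Version 1.0, but with "views = views + ?" bound to delta
  }
}
```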
Problem?
# of page views and # of clients keep increasing => DB overload => FAIL
10. Version 3.0
The bottleneck is the write-heavy DB
Let’s shard the database!
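A minimal sketch of the routing logic, assuming hash-based sharding over a fixed list of databases (the URLs are made up):

```scala
object TrackerV3 {
  // Hypothetical shard list; real deployments keep this in configuration
  val shards = Vector(
    "jdbc:postgresql://db0/tracker",
    "jdbc:postgresql://db1/tracker",
    "jdbc:postgresql://db2/tracker")

  // Hash the client id so all writes for one client hit the same shard.
  // The catch: adding a shard changes the modulus, so existing rows must
  // be migrated, which is exactly the re-sharding pain described below.
  def shardFor(clientId: String): String =
    shards(Math.floorMod(clientId.hashCode, shards.size))
}
```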
Problems?!
Have to keep adding new servers and re-sharding existing databases
Re-sharding online is tricky (maybe introduce pending queues?)
A single bug in the code can corrupt a huge data set collected over years
Maintenance nightmare
11. Is there a way out?
We need new tools which handle:
automatic sharding and re-sharding
automatic replication and rebalancing
fault tolerance
effortless horizontal scaling
But we need to adapt ourselves as well. We need:
a new definition of “data” (data ≠ information)
new architectures (Lambda Architecture)
immutable data (for scaling and fault tolerance; see the sketch below)
functional programming concepts
No, writing 25-year-old structured code in this year’s favorite language
won’t cut it anymore
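A minimal sketch of the immutable-data idea in plain Scala: keep an append-only log of raw page-view events, and treat the counts as a pure function of that log rather than as mutable state.

```scala
// One raw, immutable fact: this client viewed this page at this time.
// Facts are only ever appended, never updated in place.
final case class PageView(clientId: String, page: String, timestamp: Long)

object ImmutableTracker {
  // Counts are not stored state but a pure function of the event log.
  // A bug here corrupts nothing: fix the function and recompute from the log.
  def counts(log: Seq[PageView]): Map[(String, String), Long] =
    log.groupBy(v => (v.clientId, v.page))
       .map { case (key, views) => key -> views.size.toLong }
}
```

This is the core idea behind the Lambda Architecture’s batch layer: the master dataset is immutable, and derived views are recomputed from it.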
12. Big Data tooling
Apache Hadoop Distributed File System (HDFS)
Distributed, scalable, portable filesystem written in Java
Open-source, 10 years old (!) project
Handles files in the gigabyte-to-terabyte range
Manages automatic replication and rebalancing of data
Facebook had 21 PB of storage on HDFS in 2010
Yahoo had a cluster of 10 000 Hadoop nodes in 2008
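A quick sketch of what using HDFS from code can look like, via Hadoop’s standard FileSystem API (the paths are made up; dfs.replication is the setting for how many copies of each block the cluster keeps):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Keep three copies of every block across the cluster
    conf.set("dfs.replication", "3")
    val fs = FileSystem.get(conf)
    // Copy a local log file into the distributed filesystem;
    // replication and block placement happen automatically
    fs.copyFromLocalFile(new Path("/tmp/views.log"), new Path("/logs/views.log"))
    fs.close()
  }
}
```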
Apache Spark
Next-generation data processing engine written in Scala
Open source, 5 years old project
Up to 100 times faster than Hadoop MapReduce
Uses functional programming techniques to process data
Can scale down to run inside an IDE!
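A minimal Spark sketch of the running page-view example, using the RDD API in local mode so the same code really does run inside an IDE (the sample data is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PageViewCounts {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all cores: an IDE-sized cluster
    val conf = new SparkConf().setAppName("PageViewCounts").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical sample log: one (clientId, page) pair per page view
    val views = sc.parallelize(Seq(
      ("client-1", "/home"), ("client-2", "/pricing"), ("client-1", "/home")))

    // Functional style on immutable data: map each view to a count of 1,
    // then reduce by key; no shared mutable state anywhere
    val counts = views.map { case (client, page) => ((client, page), 1L) }
                      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```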