2. Definition of Big Data
"Big data is a broad term for data sets so large or complex that traditional
data processing applications are inadequate.“
"Big data is an evolving term that describes any voluminous amount of
structured, semi-structured and unstructured data that has the potential to
be mined for information.“
Data is growing far faster than computation speeds
A single machine can no longer process or even store all this data!
The Big Data problem
3. Where does Big Data come from?
Online recorded content:
Clicks
Ad views
Server requests
.. everything that happens online can potentially be recorded
User-generated content (Facebook, Twitter, Instagram, etc.)
Smartphone users reach for their phones about 150 times a day (2013)
Health and scientific computing
The Large Hadron Collider produces roughly twice as much data per year as Twitter
Internet of Things (IoT)
smart thermostat systems
automobiles with built-in sensors
all kinds of “smart” devices of various sizes
5. Example scales of Big Data
EIR communication logs: 1.4 TB / day
Facebook logs: 60 TB / day
Google total web index: ~10+ PB (10,000 TB)
Facebook total data: 300 PB with an incoming rate of 600 TB / day (2014)
..as a reminder..
time to read 1 TB from disk: ~3 hours (at 100 MB/s)
reading the Google web index serially from a single disk would take ~3.4 years
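Back-of-the-envelope: 1 TB / (100 MB/s) = 10,000 s ≈ 2.8 hours; 10 PB = 10,000 TB, so 10,000 × ~3 hours ≈ 30,000 hours ≈ 3.4 years.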
8. Startup example
Let’s design a simple web tracker from scratch
Register and count each page view for a number of clients
“Keep simple things simple”
Version 1.0:
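A rough sketch of what Version 1.0 might look like (the table, schema and JDBC URL below are made up): every page view issues one synchronous UPDATE against a single relational database.

```scala
import java.sql.DriverManager

object TrackerV1 {
  // Hypothetical connection string; any single relational DB behaves the same
  val url = "jdbc:postgresql://db-host/tracker"

  // Called on every page view: one synchronous write per request
  def trackPageView(clientId: String, page: String): Unit = {
    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.prepareStatement(
        "UPDATE page_views SET views = views + 1 WHERE client_id = ? AND page = ?")
      stmt.setString(1, clientId)
      stmt.setString(2, page)
      stmt.executeUpdate()
    } finally conn.close()
  }
}
```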
Problem?
Huge number of page views => massive DB load on concurrent updates => DB timeouts => FAIL
9. Version 2.0
Why write each count?!
Let’s introduce a queue and buffer updates
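A sketch of that idea, assuming a simple in-process queue and a periodic background flush (all names here are hypothetical): page views are enqueued instantly, and a timer drains the queue into one batched update per (client, page) pair.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}
import scala.collection.mutable

object TrackerV2 {
  private val queue = new ConcurrentLinkedQueue[(String, String)]()

  // The request path only enqueues; no DB call per page view
  def trackPageView(clientId: String, page: String): Unit =
    queue.add((clientId, page))

  // Every 5 seconds, drain the queue and collapse it into one
  // "views = views + delta" update per (client, page) pair
  Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(
    new Runnable { def run(): Unit = flush() }, 5, 5, TimeUnit.SECONDS)

  private def flush(): Unit = {
    val batch = mutable.Map.empty[(String, String), Long].withDefaultValue(0L)
    var view = queue.poll()
    while (view != null) {
      batch(view) += 1
      view = queue.poll()
    }
    for (((clientId, page), delta) <- batch) writeDelta(clientId, page, delta)
  }

  private def writeDelta(clientId: String, page: String, delta: Long): Unit = {
    // same JDBC UPDATE as Version 1.0, but with "views = views + ?" bound to delta
  }
}
```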
Problem?
# of page views and # of clients keep increasing => DB overload => FAIL
10. Version 3.0
The bottleneck is the write-heavy DB
Let’s shard the database!
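A minimal sketch of the routing logic, assuming hash-based sharding over a fixed list of databases (the URLs are made up):

```scala
object TrackerV3 {
  // Hypothetical shard list; real deployments keep this in configuration
  val shards = Vector(
    "jdbc:postgresql://db0/tracker",
    "jdbc:postgresql://db1/tracker",
    "jdbc:postgresql://db2/tracker")

  // Hash the client id so all writes for one client hit the same shard.
  // The catch: adding a shard changes the modulus, so existing rows must
  // be migrated, which is exactly the re-sharding pain described below.
  def shardFor(clientId: String): String =
    shards(Math.floorMod(clientId.hashCode, shards.size))
}
```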
Problems?!
Have to keep adding new servers and re-sharding existing databases
Re-sharding online is tricky (maybe introduce pending queues?)
A single bug in the code can corrupt a huge data set collected over years
Maintenance nightmare
11. Is there a way out?
We need new tools which handle:
automatic sharding and re-sharding
automatic replication and rebalancing
fault tolerance
effortless horizontal scaling
But we need to adapt ourselves as well. We need:
a new definition of “data” (data ≠ information)
new architectures (Lambda Architecture)
immutable data (for scaling and fault tolerance; see the sketch below)
functional programming concepts
No, writing 25-year-old structured code in this year’s favorite language
won’t cut it anymore
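A minimal sketch of the immutable-data idea in plain Scala: keep an append-only log of raw page-view events, and treat the counts as a pure function of that log rather than as mutable state.

```scala
// One raw, immutable fact: this client viewed this page at this time.
// Facts are only ever appended, never updated in place.
final case class PageView(clientId: String, page: String, timestamp: Long)

object ImmutableTracker {
  // Counts are not stored state but a pure function of the event log.
  // A bug here corrupts nothing: fix the function and recompute from the log.
  def counts(log: Seq[PageView]): Map[(String, String), Long] =
    log.groupBy(v => (v.clientId, v.page))
       .map { case (key, views) => key -> views.size.toLong }
}
```

This is the core idea behind the Lambda Architecture’s batch layer: the master dataset is immutable, and derived views are recomputed from it.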
12. Big Data tooling
Apache Hadoop Distributed File System (HDFS)
Distributed, scalable, portable filesystem written in Java
Open-source, 10 years old (!) project
Handles files in the gigabyte-to-terabyte range
Manages automatic replication and rebalancing of data
Facebook had 21 PB of storage on HDFS in 2010
Yahoo had a cluster of 10 000 Hadoop nodes in 2008
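A quick sketch of what using HDFS from code can look like, via Hadoop’s standard FileSystem API (the paths are made up; dfs.replication is the setting for how many copies of each block the cluster keeps):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsExample {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Keep three copies of every block across the cluster
    conf.set("dfs.replication", "3")
    val fs = FileSystem.get(conf)
    // Copy a local log file into the distributed filesystem;
    // replication and block placement happen automatically
    fs.copyFromLocalFile(new Path("/tmp/views.log"), new Path("/logs/views.log"))
    fs.close()
  }
}
```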
Apache Spark
Next-generation data processing engine written in Scala
Open source, 5 years old project
Up to 100 times faster than Hadoop MapReduce
Uses functional programming techniques to process data
Can scale down to run inside an IDE!
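A minimal Spark sketch of the running page-view example, using the RDD API in local mode so the same code really does run inside an IDE (the sample data is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PageViewCounts {
  def main(args: Array[String]): Unit = {
    // "local[*]" runs Spark inside this JVM using all cores: an IDE-sized cluster
    val conf = new SparkConf().setAppName("PageViewCounts").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical sample log: one (clientId, page) pair per page view
    val views = sc.parallelize(Seq(
      ("client-1", "/home"), ("client-2", "/pricing"), ("client-1", "/home")))

    // Functional style on immutable data: map each view to a count of 1,
    // then reduce by key; no shared mutable state anywhere
    val counts = views.map { case (client, page) => ((client, page), 1L) }
                      .reduceByKey(_ + _)

    counts.collect().foreach(println)
    sc.stop()
  }
}
```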