A Gentle Introduction to Big Data

Presenter: Mehmet Ali Akyol
April 03, 2018

Big data happens when the data you
have to process is bigger than what
you can process in the given time

What is Big Data?
Field dedicated to the analysis, processing, and storage of large collections of data that frequently originate from disparate sources
Required when traditional data analysis, processing, and storage technologies and techniques are insufficient
Addresses distinct requirements, such as combining multiple unrelated datasets, processing large amounts of unstructured data, and harvesting hidden information in a time-sensitive manner

Characteristics of Big Data
● Volume
● Variety
● Velocity
● Veracity
● Value

Volume
About the scale of data
Terabytes, petabytes, exabytes, zettabytes…
An Airbus aircraft generates 640 TB of data per flight
Self-driving cars will generate 2 PB of data every year
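
To put these scales in perspective, the following sketch converts between units and checks the opening definition (data is "big" when it cannot be processed in the given time). It assumes decimal (SI) units, and the 1 GB/s throughput and one-day deadline are made-up numbers for illustration; the function names are hypothetical.

```python
# Decimal (SI) byte units. This is an assumption: binary units (TiB, PiB) differ slightly.
UNITS = {"GB": 10**9, "TB": 10**12, "PB": 10**15, "EB": 10**18, "ZB": 10**21}


def to_bytes(amount, unit):
    """Convert an amount in the given unit to bytes."""
    return amount * UNITS[unit]


def convert(amount, src, dst):
    """Convert an amount from one unit to another."""
    return to_bytes(amount, src) / UNITS[dst]


def can_process_in_time(data_bytes, throughput_bytes_per_s, deadline_s):
    """Return True if the data can be processed before the deadline."""
    return data_bytes <= throughput_bytes_per_s * deadline_s


# 2 PB per year expressed as an average daily rate in TB.
per_day_tb = convert(2, "PB", "TB") / 365
print(round(per_day_tb, 2))  # 5.48

# Hypothetical system: 1 GB/s throughput, one-day deadline, 640 TB of data.
fits = can_process_in_time(to_bytes(640, "TB"), to_bytes(1, "GB"), 24 * 3600)
print(fits)  # False
```

Under these assumed numbers, a single machine reading at 1 GB/s would need about a week for 640 TB, which is why such workloads are spread across clusters.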

Variety
Various data formats and types
Structured, unstructured, and semi-structured
Text files, media files such as sound and video
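
The difference between the three categories can be sketched with Python's standard library. The sample data below is invented for illustration; the point is only how much structure each format gives you before any processing.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
csv_text = "id,name,age\n1,Ada,36\n2,Alan,41\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing, but fields may vary between records.
json_text = '[{"id": 1, "tags": ["a", "b"]}, {"id": 2, "location": {"city": "Ankara"}}]'
records = json.loads(json_text)

# Unstructured: free text with no schema, so structure has to be derived.
text = "Big data is big. Data is everywhere."
word_counts = {}
for word in text.lower().replace(".", "").split():
    word_counts[word] = word_counts.get(word, 0) + 1

print(rows[0]["name"])         # Ada
print(sorted(records[1]))      # ['id', 'location']
print(word_counts["data"])     # 2
```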

Velocity
The speed at which data is produced
The delay before the data must be consumed
Streaming data: social media posts (tweets, Facebook posts), IoT (Internet of Things)

Veracity
Quality of data
Meaningful results
Understandability
Importance of data source

Value
Usefulness of data
Amount of knowledge that can be extracted from data
Making informed decisions

Types of Data
● Structured data
● Unstructured data
● Semi-structured data
● Metadata

Metadata
Data about data
Details of the dataset such as source, date, and type
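
As a minimal sketch, metadata can be modeled as a small record kept alongside the data it describes. The class name, field names, and values here are hypothetical and chosen only to mirror the source, date, and type attributes above.

```python
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class DatasetMetadata:
    """Data about a dataset, stored separately from the dataset itself."""
    source: str
    created: date
    data_type: str
    size_bytes: int


# The dataset itself (invented sample content).
data = b"id,temp\n1,21.5\n2,22.0\n"

# The metadata describing it (all values are illustrative).
meta = DatasetMetadata(
    source="sensor-export",
    created=date(2018, 1, 1),
    data_type="text/csv",
    size_bytes=len(data),
)

print(asdict(meta)["data_type"])  # text/csv
```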

Data Processing in Big Data
Batch Processing
- Stored data is processed at certain time intervals
- Example: Hadoop
Stream Processing
- Continuous streams of data are processed in real time
- Example: Apache Storm
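
The contrast between the two styles can be sketched in plain Python. This is not how Hadoop or Storm are programmed; it only illustrates the idea, and the function names and sample readings are made up.

```python
def batch_average(stored_values):
    """Batch: all data is already stored, so process it in one pass."""
    return sum(stored_values) / len(stored_values)


def stream_average(values):
    """Stream: update the result as each value arrives, keeping only a running state."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


readings = [10, 20, 30, 40]

print(batch_average(readings))                # 25.0
print(list(stream_average(iter(readings))))   # [10.0, 15.0, 20.0, 25.0]
```

The batch version needs the full dataset before it can answer, while the stream version has an up-to-date answer after every record and never stores the records themselves.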

Big Data Processing Architectures
Processing architectures that take advantage of both batch and stream processing
- Lambda Architecture
  - Apache Spark
- Kappa Architecture
  - Apache Flink
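
The Lambda idea of combining a batch layer with a speed layer can be sketched as a toy event counter. This is an in-memory illustration under simplifying assumptions, not the API of Spark or any other framework; the class and method names are hypothetical.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda architecture: batch view plus speed view, merged at query time."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # recomputed periodically from the full log
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        """New events go to the log and are counted immediately by the speed layer."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from scratch, then reset the speed layer."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        """Serving layer: merge the batch result with the recent increments."""
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
for event in ["click", "click", "view"]:
    counter.ingest(event)
counter.run_batch()
counter.ingest("click")

print(counter.query("click"))  # 3
```

A Kappa-style design would drop the separate batch layer and treat everything as a stream, reprocessing by replaying the log when needed.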

Big Data Tools
Hadoop: A framework that allows for the distributed storage and processing of large datasets across clusters of computers
Hive: A data warehouse infrastructure that provides data summarization and querying
Flink: A stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications
Spark: A fast and general engine for large-scale data processing for both batch and streaming data

Big Data Tools
Storm: A distributed real-time computation system for unbounded data processing
Beam: A framework for batch and streaming data processing jobs that run on supported execution engines such as Flink and Spark
Zeppelin: A web-based notebook that enables data-driven, interactive data analytics and data visualization
Kafka: A distributed message queue for building real-time data pipelines and streaming apps

Takeaways
Big Data is a research field that deals with processing large amounts of data that traditional data processing techniques cannot handle in a timely and efficient manner
5 Vs of Big Data: Volume, Velocity, Variety, Veracity, and Value
Data Types: Structured, Unstructured, Semi-structured, and Metadata
Big Data Processing Methods: Batch and Stream Processing
There are dozens of open-source big data tools available
