3. Big data happens when the data you have to process is bigger than what you can process in the given time
4. What is Big Data?
Field dedicated to the analysis, processing, and storage of large collections of
data that frequently originate from disparate sources
Required when traditional data analysis, processing and storage technologies
and techniques are insufficient
Addresses distinct requirements, such as the combining of multiple unrelated
datasets, processing of large amounts of unstructured data and harvesting of
hidden information in a time-sensitive manner
6. Volume
About the scale of data
Terabytes, petabytes, exabytes, zettabytes…
An Airbus aircraft generates 640 TB of data in a single flight
Self-driving cars will generate 2 PB of data every year
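To make the scale of these units concrete, here is a minimal sketch that converts between them. It assumes decimal (SI) units, where each step is a factor of 1000; the helper names `to_bytes` and `humanize` and the ten-flight total are illustrative only and not part of the slides.

```python
# Decimal (SI) byte units; each step up is a factor of 1000 (an assumption,
# binary units such as TiB use 1024 instead).
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value, unit):
    """Convert a value expressed in the given unit to bytes."""
    return value * 1000 ** UNITS.index(unit)

def humanize(num_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(num_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

per_flight = to_bytes(640, "TB")   # the per-flight figure quoted above
print(per_flight)                  # 640000000000000
print(humanize(per_flight * 10))   # 6.4 PB (ten hypothetical flights)
```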
7. Variety
Various data formats and types
Structured, unstructured, and semi-structured
Text files, media files such as sound and video
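The difference between the three categories can be sketched with Python's standard library. The sample data below is made up for illustration: CSV stands in for structured data, JSON for semi-structured data, and a free-text sentence for unstructured data.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields.
csv_text = "id,name\n1,Ada\n2,Linus\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: self-describing, but fields can vary from record to record.
json_text = '[{"id": 1, "tags": ["a"]}, {"id": 2, "location": {"city": "Oslo"}}]'
records = json.loads(json_text)

# Unstructured: free text with no schema; any structure has to be inferred.
text = "Flight delayed again, third time this week"
words = text.lower().replace(",", "").split()

print(rows[0]["name"])            # Ada
print(sorted(records[1].keys()))  # ['id', 'location']
print(len(words))                 # 7
```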
8. Velocity
The speed at which data is produced
The delay before the data must be consumed
Streaming data: social media posts (tweets, Facebook posts), IoT (Internet of Things)
16. Data Processing in Big Data
Batch Processing
- Stored data is processed at certain time intervals
- Hadoop
Stream Processing
- Processing continuous streams of data in real time
- Apache Storm
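The contrast between the two processing models can be sketched in plain Python. This is a toy illustration, not how Hadoop or Storm are actually programmed; the event list and the function names `batch_count` and `stream_count` are hypothetical.

```python
from collections import Counter

# Made-up event data standing in for a stored dataset or a live feed.
events = ["click", "view", "click", "buy", "view", "click"]

def batch_count(stored_events):
    """Batch: process the whole stored dataset in one scheduled run."""
    return Counter(stored_events)

def stream_count(event_source):
    """Stream: update the result as each event arrives, yielding running counts."""
    counts = Counter()
    for event in event_source:
        counts[event] += 1
        yield event, counts[event]

batch_result = batch_count(events)
stream_updates = list(stream_count(iter(events)))

print(batch_result["click"])  # 3
print(stream_updates[-1])     # ('click', 3)
```

The batch function only produces an answer once all the data has been collected, while the streaming generator emits an updated count after every event, which is what keeps the delay before consumption low.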
18. Big Data Tools
Hadoop: A framework that allows for the distributed storage & processing of large datasets across clusters of computers
Hive: A data warehouse infrastructure that provides data summarization and querying
Flink: Stream processing framework for distributed, high-performing, always-available, and accurate data streaming applications
Spark: A fast and general engine for large-scale data processing for both batch and streaming data
19. Big Data Tools
Storm: A distributed realtime computation system for unbounded data processing
Beam: A framework for batch and streaming data processing jobs that run on any execution engine such as Flink and Spark
Zeppelin: Web-based notebook that enables data-driven, interactive data analytics, and data visualization
Kafka: A distributed message queue for building real-time data pipelines and streaming apps
20. Takeaways
Big Data is a research field that deals with processing large amounts of data
that traditional data processing techniques cannot handle in a timely and
efficient manner
5 Vs of Big Data: Volume, Velocity, Variety, Veracity, and Value
Data Types: Structured, Unstructured, Semi-structured, and Metadata
Big Data Processing Methods: Batch and Stream Processing
Dozens of open-source big data tools are available