SlideShare a Scribd company logo
1 of 14
Download to read offline
Introduction to Big Data
Definition of Big Data
 "Big data is a broad term for data sets so large or complex that traditional
data processing applications are inadequate.“
 "Big data is an evolving term that describes any voluminous amount of
structured, semi-structured and unstructured data that has the potential to
be mined for information.“
 Data growing way faster than computation speeds
 A single machine can no longer process or even store all this data!
The Big Data problem
Where does Big Data come from?
 Online recorded content:
 Clicks
 Ad views
 Server requests
 .. everything what happens online can potentially be recorded
 User generated content (Facebook, Twitter, Instagram, etc)
 Smartphone users reach to their phone 150 times a day (2013)
 Health and scientific computing
 Large Hadron Collider produces about double amount of data than Twitter every year
 Internet of Things (IoT)
 smart thermostat systems
 automobiles with built-in sensors
 all kind of “smart” devices of various sizes
Example scales of Big Data
 EIR communication logs: 1.4 TB / day
 Facebook logs: 60 TB / day
 Google total web index: ~10+ PB (10000TB)
 Facebook total data: 300 PB with an incoming rate of 600 TB / day (2014)
 ..as a reminder..
 time to read 1TB from disk: 3 hours (100MB/s)
 Google web index could be read from disk serialized in ~3.4 years
How do we program this thing?
6
OK but I don’t work at Google yet ...
Startup example
 Let’s design a simple web tracker from scratch
 Register and count each page view for a number of clients
 “Keep simple things simple”
 Version 1.0:
 Problem?
 Huge number of page views => massive DB load on concurrent updates => DB
timeouts => FAIL
Version 2.0
 Why write each count?!
 Let’s introduce a queue and buffer updates
 Problem?
 # of page views and # of clients keep increasing => DB overload => FAIL
Version 3.0
 The bottleneck is the write-heavy DB
 Let’s shard the database!
 Problems?!
 Have to keep adding new servers and re-sharding existing databases
 Re-sharding online is tricky (maybe introduce pending queues?)
 A single code failure corrupts a huge set of data collected over years
 Maintenance nightmare
Is there a way out?
 We need new tools which handle:
 automatic sharding and re-sharding
 automatic replication and rebalancing
 fault tolerance
 effortless horizontal scaling
 But we need to adapt ourselves as well. We need:
 a new definition of “data” (data ≠ information)
 new architectures (Lambda Architecture)
 immutable data (for scaling and fault tolerance)
 functional programming concepts
 No, writing 25 years old structural code in this year’s favorite language
won’t cut it anymore
Big Data tooling
 Apache Hadoop distributed filesystem (HDFS)
 Distributed, scalable, portable filesytem written in Java
 Open source, 10 years old (!) project
 Handles files in the gigabytes-terabytes range
 Manages automatic replication and rebalancing of data
 Facebook had 21 PB of storage on HDFS in 2010
 Yahoo had a cluster of 10 000 Hadoop nodes in 2008
 Apache Spark
 Next generation data processing engine written in Scala
 Open source, 5 years old project
 Up to 100 times faster than Hadoop MapReduce
 Uses functional programming techniques to process data
 Can scale down to get run in an IDE!
Apache Spark by a glance
13
The good news
 The right tools are available and open-source
 The knowledge is available and mostly free
 It’s all ready to get learned!

More Related Content

What's hot

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataVipin Batra
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - IntroductionAlex Meadows
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big DataMatthew Dennis
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013boorad
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Simplilearn
 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache HadoopSuman Saurabh
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataHaluan Irsad
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewSivashankar Ganapathy
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyNishant Gandhi
 
Lesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptxLesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptxPankajkumar496281
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big DataLewis Crawford
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAmpoolIO
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabatinabati
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...BigMine
 

What's hot (20)

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big Data Analytics - Introduction
Big Data Analytics - IntroductionBig Data Analytics - Introduction
Big Data Analytics - Introduction
 
The Future Of Big Data
The Future Of Big DataThe Future Of Big Data
The Future Of Big Data
 
A Big Data Concept
A Big Data ConceptA Big Data Concept
A Big Data Concept
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners...
 
Bigdata
Bigdata Bigdata
Bigdata
 
Big data analytics with Apache Hadoop
Big data analytics with Apache  HadoopBig data analytics with Apache  Hadoop
Big data analytics with Apache Hadoop
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
 
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of TechnologyGuest Lecture: Introduction to Big Data at Indian Institute of Technology
Guest Lecture: Introduction to Big Data at Indian Institute of Technology
 
Lesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptxLesson 1 introduction to_big_data_and_hadoop.pptx
Lesson 1 introduction to_big_data_and_hadoop.pptx
 
Motivation for big data
Motivation for big dataMotivation for big data
Motivation for big data
 
Big Data
Big DataBig Data
Big Data
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Big data analytics, survey r.nabati
Big data analytics, survey r.nabatiBig data analytics, survey r.nabati
Big data analytics, survey r.nabati
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
 

Viewers also liked

Big data and its applications
Big data and its applicationsBig data and its applications
Big data and its applicationsali easazadeh
 
Virtualization, the cloud enabler
Virtualization, the cloud enablerVirtualization, the cloud enabler
Virtualization, the cloud enablerPraveen Hanchinal
 
Big Data
Big DataBig Data
Big DataNGDATA
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBernard Marr
 
Big Data and Social Media
Big Data and Social MediaBig Data and Social Media
Big Data and Social MediaAmy Shuen
 
#PolíticosViolentos, un análisis de la agresión en el discurso de Cristina Ki...
#PolíticosViolentos, un análisis de la agresión en el discurso de Cristina Ki...#PolíticosViolentos, un análisis de la agresión en el discurso de Cristina Ki...
#PolíticosViolentos, un análisis de la agresión en el discurso de Cristina Ki...Santiago Castelo
 
Big Data from Social Media and Crowdsourcing in Emergencies
Big Data from Social Media and Crowdsourcing in EmergenciesBig Data from Social Media and Crowdsourcing in Emergencies
Big Data from Social Media and Crowdsourcing in EmergenciesThomas Dybro Lundorf
 
Klarity - Asia digital analytic summit
Klarity -  Asia digital analytic summitKlarity -  Asia digital analytic summit
Klarity - Asia digital analytic summitNDN Group
 
Introduction to Social Media
Introduction to Social MediaIntroduction to Social Media
Introduction to Social MediaGerald Hensel
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
Product Placement: The Present & The Future
Product Placement: The Present & The FutureProduct Placement: The Present & The Future
Product Placement: The Present & The Futureitandlaw
 
Big Data Social Media & Smart Apps
Big Data Social Media & Smart AppsBig Data Social Media & Smart Apps
Big Data Social Media & Smart AppsGiacomo Nasilli
 

Viewers also liked (20)

Big data and its applications
Big data and its applicationsBig data and its applications
Big data and its applications
 
What is big data?
What is big data?What is big data?
What is big data?
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Virtualization, the cloud enabler
Virtualization, the cloud enablerVirtualization, the cloud enabler
Virtualization, the cloud enabler
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data
Big DataBig Data
Big Data
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 
Big Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning GuruBig Data Hadoop Tutorial by Easylearning Guru
Big Data Hadoop Tutorial by Easylearning Guru
 
Big Data and Social Media
Big Data and Social MediaBig Data and Social Media
Big Data and Social Media
 
#PolíticosViolentos, un análisis de la agresión en el discurso de Cristina Ki...
#PolíticosViolentos, un análisis de la agresión en el discurso de Cristina Ki...#PolíticosViolentos, un análisis de la agresión en el discurso de Cristina Ki...
#PolíticosViolentos, un análisis de la agresión en el discurso de Cristina Ki...
 
Social media & big data
Social media & big dataSocial media & big data
Social media & big data
 
Big Data from Social Media and Crowdsourcing in Emergencies
Big Data from Social Media and Crowdsourcing in EmergenciesBig Data from Social Media and Crowdsourcing in Emergencies
Big Data from Social Media and Crowdsourcing in Emergencies
 
Klarity - Asia digital analytic summit
Klarity -  Asia digital analytic summitKlarity -  Asia digital analytic summit
Klarity - Asia digital analytic summit
 
Introduction to Social Media
Introduction to Social MediaIntroduction to Social Media
Introduction to Social Media
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Product Placement: The Present & The Future
Product Placement: The Present & The FutureProduct Placement: The Present & The Future
Product Placement: The Present & The Future
 
Big Data Social Media & Smart Apps
Big Data Social Media & Smart AppsBig Data Social Media & Smart Apps
Big Data Social Media & Smart Apps
 

Similar to Introduction to Big Data

Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampSpotle.ai
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook AhmedDoukh
 
The Internet as a Single Database
The Internet as a Single DatabaseThe Internet as a Single Database
The Internet as a Single DatabaseDatafiniti
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talksyhadoop
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Imam Raza
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformGeekNightHyderabad
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015Christopher Curtin
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Eli White
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big DataArjen de Vries
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Sciencesarith divakar
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Hadoop Online training by Keylabs
Hadoop Online training by KeylabsHadoop Online training by Keylabs
Hadoop Online training by KeylabsSiva Sankar
 
Big data and hadoop introduction
Big data and hadoop introductionBig data and hadoop introduction
Big data and hadoop introductionAjay Mittal
 

Similar to Introduction to Big Data (20)

00 hadoop welcome_transcript
00 hadoop welcome_transcript00 hadoop welcome_transcript
00 hadoop welcome_transcript
 
Bigdata and Hadoop Bootcamp
Bigdata and Hadoop BootcampBigdata and Hadoop Bootcamp
Bigdata and Hadoop Bootcamp
 
Data infrastructure at Facebook
Data infrastructure at Facebook Data infrastructure at Facebook
Data infrastructure at Facebook
 
Big Data And Hadoop
Big Data And HadoopBig Data And Hadoop
Big Data And Hadoop
 
The Internet as a Single Database
The Internet as a Single DatabaseThe Internet as a Single Database
The Internet as a Single Database
 
Hadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University TalksHadoop at Yahoo! -- University Talks
Hadoop at Yahoo! -- University Talks
 
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
Big Data with hadoop, Spark and BigQuery (Google cloud next Extended 2017 Kar...
 
Big Data - Need of Converged Data Platform
Big Data - Need of Converged Data PlatformBig Data - Need of Converged Data Platform
Big Data - Need of Converged Data Platform
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011Big data and APIs for PHP developers - SXSW 2011
Big data and APIs for PHP developers - SXSW 2011
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
1. what is hadoop part 1
1. what is hadoop   part 11. what is hadoop   part 1
1. what is hadoop part 1
 
The Future of Data Science
The Future of Data ScienceThe Future of Data Science
The Future of Data Science
 
Final deck
Final deckFinal deck
Final deck
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Hadoop Online training by Keylabs
Hadoop Online training by KeylabsHadoop Online training by Keylabs
Hadoop Online training by Keylabs
 
Big data and hadoop introduction
Big data and hadoop introductionBig data and hadoop introduction
Big data and hadoop introduction
 
Hadoop basics
Hadoop basicsHadoop basics
Hadoop basics
 

Recently uploaded

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 

Recently uploaded (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 

Introduction to Big Data

  • 2. Definition of Big Data  "Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate.“  "Big data is an evolving term that describes any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information.“  Data growing way faster than computation speeds  A single machine can no longer process or even store all this data! The Big Data problem
  • 3. Where does Big Data come from?  Online recorded content:  Clicks  Ad views  Server requests  .. everything what happens online can potentially be recorded  User generated content (Facebook, Twitter, Instagram, etc)  Smartphone users reach to their phone 150 times a day (2013)  Health and scientific computing  Large Hadron Collider produces about double amount of data than Twitter every year  Internet of Things (IoT)  smart thermostat systems  automobiles with built-in sensors  all kind of “smart” devices of various sizes
  • 4.
  • 5. Example scales of Big Data  EIR communication logs: 1.4 TB / day  Facebook logs: 60 TB / day  Google total web index: ~10+ PB (10000TB)  Facebook total data: 300 PB with an incoming rate of 600 TB / day (2014)  ..as a reminder..  time to read 1TB from disk: 3 hours (100MB/s)  Google web index could be read from disk serialized in ~3.4 years
  • 6. How do we program this thing? 6
  • 7. OK but I don’t work at Google yet ...
  • 8. Startup example  Let’s design a simple web tracker from scratch  Register and count each page view for a number of clients  “Keep simple things simple”  Version 1.0:  Problem?  Huge number of page views => massive DB load on concurrent updates => DB timeouts => FAIL
  • 9. Version 2.0  Why write each count?!  Let’s introduce a queue and buffer updates  Problem?  # of page views and # of clients keep increasing => DB overload => FAIL
  • 10. Version 3.0  The bottleneck is the write-heavy DB  Let’s shard the database!  Problems?!  Have to keep adding new servers and re-sharding existing databases  Re-sharding online is tricky (maybe introduce pending queues?)  A single code failure corrupts a huge set of data collected over years  Maintenance nightmare
  • 11. Is there a way out?  We need new tools which handle:  automatic sharding and re-sharding  automatic replication and rebalancing  fault tolerance  effortless horizontal scaling  But we need to adapt ourselves as well. We need:  a new definition of “data” (data ≠ information)  new architectures (Lambda Architecture)  immutable data (for scaling and fault tolerance)  functional programming concepts  No, writing 25 years old structural code in this year’s favorite language won’t cut it anymore
  • 12. Big Data tooling  Apache Hadoop distributed filesystem (HDFS)  Distributed, scalable, portable filesytem written in Java  Open source, 10 years old (!) project  Handles files in the gigabytes-terabytes range  Manages automatic replication and rebalancing of data  Facebook had 21 PB of storage on HDFS in 2010  Yahoo had a cluster of 10 000 Hadoop nodes in 2008  Apache Spark  Next generation data processing engine written in Scala  Open source, 5 years old project  Up to 100 times faster than Hadoop MapReduce  Uses functional programming techniques to process data  Can scale down to get run in an IDE!
  • 13. Apache Spark by a glance 13
  • 14. The good news  The right tools are available and open-source  The knowledge is available and mostly free  It’s all ready to get learned!