The theory of the hardware failure effect: GoogleFS
(a whitepaper review)



         Dan Şerban
And so it begins...
:: The Google File System was created by man at
   the dawn of the 21st century
:: In July 2010, it had not yet become self-aware

… but seriously now …

:: GoogleFS is not really a file system
:: GoogleFS is a storage framework for doing
   a very specific type of parallel computing
   (MapReduce) on massive amounts of data

http://labs.google.com/papers/gfs.html
Overview of MapReduce
1) Mappers read (key1, valueA) pairs
   from GoogleFS input files named key1.txt
   (large streaming reads)
2) Mappers derive key2 and valueB from valueA
3) Mappers record-append (key2_key1, valueB) pairs
   to temporary GoogleFS files named key2_key1.txt
4) Reducers read (key2_key1, valueB) pairs
   from those temporary key2_key1.txt files
   (large streaming reads)
5) Reducers aggregate by key2:
      valueC = aggregateFunction(valueB)
6) Reducers record-append (key2, valueC) pairs
   to permanent GoogleFS files named key2.bigtbl
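
A minimal, runnable sketch of the mapper/reducer flow above, in Python. gfs_record_append() is a hypothetical stand-in for the GoogleFS record-append call (simulated here with local files); the file-naming convention follows steps 1-6.

import os

def gfs_record_append(path, record):
    # Hypothetical stand-in for GoogleFS atomic record append:
    # append one record to the end of a file.
    with open(path, "a") as f:
        f.write(record + "\n")

def mapper(key1, derive):
    # Steps 1-3: stream (key1, valueA) pairs from key1.txt, derive
    # (key2, valueB), record-append to temporary key2_key1.txt files.
    with open(f"{key1}.txt") as f:                 # large streaming read
        for value_a in f:
            key2, value_b = derive(value_a.rstrip("\n"))
            gfs_record_append(f"{key2}_{key1}.txt", value_b)

def reducer(key2, key1_list, aggregate_function):
    # Steps 4-6: stream the temporary key2_key1.txt files, aggregate
    # by key2, record-append (key2, valueC) to the permanent key2.bigtbl.
    values_b = []
    for key1 in key1_list:
        path = f"{key2}_{key1}.txt"
        if os.path.exists(path):
            with open(path) as f:                  # large streaming read
                values_b.extend(line.rstrip("\n") for line in f)
    value_c = aggregate_function(values_b)
    gfs_record_append(f"{key2}.bigtbl", str(value_c))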
The philosophy behind GoogleFS
The problem: Web Indexing

The solution
:: Find the sweet spot in price / performance
   for hardware components
:: Build a large scale network of machines
   from those components
:: Make the assumption of frequent hardware
   failures the single most important design
   principle for building the file system
:: Apply highly sophisticated engineering to make
   the system reliable in software
Assumptions and Design Aspects
:: The quality and quantity of components
   together make frequent hardware failures
   the norm rather than the exception
   - Do not completely trust the machines
   - Do not completely trust the disks
:: GoogleFS must constantly monitor itself
   and detect, tolerate, and recover promptly
   from component failures on a routine basis
:: Streaming data access to large data sets
   is the focus of GoogleFS optimization
   (disks are increasingly becoming seek-limited)
:: High sustained throughput is more important
   than low latency
Assumptions and Design Aspects
:: Relaxed consistency model
   (slightly adjusted and extended file system API)
:: GoogleFS must efficiently implement
   well-defined semantics for multiple clients
   concurrently appending to the same file
   with minimal synchronization overhead
:: Prefer moving computation to where the data is
   vs. moving data over the network
   to be processed elsewhere
:: Application designers don't have to deal with
   component failures or where data is located
What GoogleFS exploits
:: Google controls both layers of the architectural stack:
   the file system and the MapReduce applications
:: Applications only use a few specific access patterns:
   - Large, sustained streaming reads
     (optimized for throughput)
   - Occasional small, but very seek-intensive
     random reads (supported, not optimized for within
     GoogleFS, but rather via app-level read barriers)
   - Large, sustained sequential appends
     by a single client (optimized for throughput)
   - Simultaneous atomic record appends
     by a large number of clients
     (optimized for heavy-duty concurrency
     by relaxing the consistency model)
Who is who?
One master, multiple chunkservers, multiple clients
Who is who?
:: A GoogleFS cluster consists of a single master
   and multiple chunkservers
   - The cluster is accessed by multiple clients
:: Clients, chunkservers and the master
   are implemented as userland processes
   running on commodity Linux machines
:: Files are divided into fixed-size chunks
   - 64 MB in GoogleFS v1
   - 1 MB in GoogleFS v2 / Caffeine
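
To make the fixed-size chunking concrete, here is a tiny sketch (assuming the 64 MB chunk size of GoogleFS v1; the function name is made up) of how a byte offset in a file maps to a chunk index plus an offset inside that chunk, which is the translation the client library performs before asking the master for chunk metadata.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB in GoogleFS v1 (1 MB in v2 / Caffeine)

def chunk_index_and_offset(file_offset):
    # Translate a byte offset within a file into (chunk index, offset
    # within that chunk); the master is then asked for the chunk handle
    # and replica locations of that chunk index.
    return divmod(file_offset, CHUNK_SIZE)

# Example: byte 200,000,000 of a file lands in chunk index 2.
index, offset_in_chunk = chunk_index_and_offset(200_000_000)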
Who is who - The Master
:: The master stores three major types of data:
   - The file and chunk namespaces
   - The file-to-chunk mappings
   - The locations of each chunk's replicas
:: All metadata is kept in the master's memory
:: The master is not involved in any of the data transfers
   - It's not on the critical path for moving data
:: It's also responsible for making decisions about
   chunk allocation, placement and replication
:: The master periodically communicates with each
   chunkserver in HeartBeat messages to give it
   instructions and collect its state
:: The master delegates low-level work via chunk leases
:: It was the single point of failure in GoogleFS v1
:: GoogleFS v2 / Caffeine now supports multiple masters
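
A rough sketch of the three in-memory metadata tables listed above; the class and field names are hypothetical, only the shape of the data follows the slide.

from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # File and chunk namespaces: path -> file attributes (sketched as a dict).
    namespace: dict = field(default_factory=dict)
    # File-to-chunk mappings: path -> ordered list of 64-bit chunk handles.
    file_to_chunks: dict = field(default_factory=dict)
    # Replica locations: chunk handle -> set of chunkserver addresses
    # (rebuilt from chunkserver reports rather than persisted).
    chunk_locations: dict = field(default_factory=dict)

    def chunk_handle_at(self, path, chunk_index):
        # Metadata lookup done on behalf of clients:
        # (file name, chunk index) -> chunk handle.
        return self.file_to_chunks[path][chunk_index]

    def replicas_of(self, handle):
        return self.chunk_locations.get(handle, set())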
Who is who - Chunkservers
:: Chunkservers are servers that store chunks of data
   on their local Linux file systems
:: They have no knowledge of the GoogleFS metadata
   - They only know about their own chunks
:: Each chunk is identified by an immutable
   and globally unique 64-bit chunk handle
   assigned by the master at the time of chunk creation
:: Chunks become read-only once they have been
   completely written (file writes are append-only)
:: For reliability, each chunk is replicated on multiple
   chunkservers (the default is 3 replicas)
:: Records appended to a chunk are checksummed
   on the way in and verified before they are delivered
   to clients
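
A minimal sketch of the last bullet: checksum each record on the way in, verify before delivering it to a client. The record framing used here (length + CRC32 + payload) is purely illustrative, not the actual on-disk format.

import struct
import zlib

def append_record(chunk_file, payload):
    # Checksum the record on the way in and store it next to the data.
    # Illustrative framing: [4-byte length][4-byte CRC32][payload]
    chunk_file.write(struct.pack("!II", len(payload), zlib.crc32(payload)))
    chunk_file.write(payload)

def read_record(chunk_file):
    # Verify the checksum before the record is delivered to the client.
    header = chunk_file.read(8)
    if len(header) < 8:
        return None                                   # end of chunk
    length, crc = struct.unpack("!II", header)
    payload = chunk_file.read(length)
    if zlib.crc32(payload) != crc:
        # Signal the mismatch so the master can re-replicate the chunk.
        raise IOError("checksum mismatch on chunk read")
    return payload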
Who is who - Clients
:: Clients are Linux machines running distributed
   applications which are driving various kinds
   of MapReduce workloads
:: GoogleFS client code linked into each application
   implements the file system API and communicates
   with the master and chunkservers to read or write
   data on behalf of the application
:: Clients interact with the master for metadata
   operations, but all data-bearing communication
   goes directly to the chunkservers
:: Clients cache metadata (for a limited period of time)
   but they never cache chunk data
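
A sketch of the read path implied above (all names are hypothetical; master_rpc and chunkserver_rpc stand in for the real RPC layer): the client contacts the master only for metadata, caches that metadata for a limited time, and streams the chunk data directly from a chunkserver without caching it.

import time

CHUNK_SIZE = 64 * 1024 * 1024
METADATA_TTL = 60.0                   # cache metadata for a limited time only

_metadata_cache = {}                  # (path, chunk index) -> (expiry, handle, replicas)

def read(path, offset, length, master_rpc, chunkserver_rpc):
    # Translate the file offset into a chunk index and in-chunk offset.
    chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)
    key = (path, chunk_index)
    cached = _metadata_cache.get(key)
    if cached is None or cached[0] < time.time():
        # Metadata operation: the only step that talks to the master.
        handle, replicas = master_rpc.get_chunk(path, chunk_index)
        _metadata_cache[key] = (time.time() + METADATA_TTL, handle, replicas)
    _, handle, replicas = _metadata_cache[key]
    # Data-bearing communication goes directly to a chunkserver replica;
    # chunk data itself is never cached on the client.
    return chunkserver_rpc.read(replicas[0], handle, chunk_offset, length)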
Appending
Regular append
(optimized for throughput from single client)


:: Client specifies:
   - file name
   - offset
   - data
:: Client sends data to chunkservers
   in pipeline fashion
:: Client sends "append-commit" to the lease holder (LH)

:: The LH gets ACK back from all chunkservers involved
:: The LH sends ACK back to the client
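
A self-contained, in-memory sketch of this regular-append flow; every class and method name below is hypothetical, and the pipelined data push is reduced to a simple loop.

class Chunkserver:
    def __init__(self, name):
        self.name, self.staged, self.chunks = name, {}, {}

    def push_data(self, data_id, data):
        # Data arrives over the pipeline and is staged until committed.
        self.staged[data_id] = data

    def commit(self, file_name, offset, data_id):
        self.chunks.setdefault(file_name, {})[offset] = self.staged.pop(data_id)
        return self.name                       # ACK back to the lease holder

class LeaseHolder:
    def __init__(self, replicas):
        self.replicas = replicas               # the LH is one of these replicas

    def append_commit(self, file_name, offset, data_id):
        # The LH applies the mutation on every replica, collects the ACKs,
        # and only then ACKs the client.
        return [cs.commit(file_name, offset, data_id) for cs in self.replicas]

def regular_append(file_name, offset, data, replicas, lease_holder):
    data_id = object()                         # identifies the pushed data
    for cs in replicas:                        # pipelined push in real GoogleFS
        cs.push_data(data_id, data)
    return lease_holder.append_commit(file_name, offset, data_id)

replicas = [Chunkserver(f"cs{i}") for i in range(3)]
print(regular_append("key2.bigtbl", 0, b"record-1", replicas, LeaseHolder(replicas)))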
Appending

Atomic record append
(optimized for heavy-duty concurrency)
:: Client specifies:
   - file name
   - data
:: Client sends data to chunkservers
   in pipeline fashion
:: Client sends "append-commit" to the lease holder
:: The LH serializes commits
:: The LH gets ACK back from all chunkservers involved
:: The LH sends ACK back to the client
:: The LH converts some of the requests into pad requests
   (when a record would not fit into the current chunk)
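
A sketch of what the lease holder adds in the atomic record append case (names and the in-call chunk switch are illustrative simplifications): commits are serialized, the LH rather than the client chooses the offset, and a record that would not fit becomes a pad request.

import threading

CHUNK_SIZE = 64 * 1024 * 1024

class RecordAppendLeaseHolder:
    def __init__(self):
        self.offset = 0                 # next free offset in the current chunk
        self.lock = threading.Lock()    # the LH serializes commits
        self.log = []                   # (kind, offset, length) mutations applied

    def record_append(self, record):
        with self.lock:                 # one concurrent client commit at a time
            if self.offset + len(record) > CHUNK_SIZE:
                # Convert the request into a pad request that fills the rest
                # of the chunk, then continue on a fresh chunk (in GoogleFS
                # the client is told to retry on the new chunk).
                self.log.append(("pad", self.offset, CHUNK_SIZE - self.offset))
                self.offset = 0
            chosen = self.offset        # the LH, not the client, picks the offset
            self.log.append(("append", chosen, len(record)))
            self.offset += len(record)
            return chosen               # the same offset is applied on every replica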
Chunkserver goes offline
The master initiates and coordinates
intelligently prioritized re-replication of chunks

:: First priority is to bring the cluster to a basic
   level of redundancy where it can safely tolerate
   another node failure (2x replication)
:: Second priority is to minimize impact
   on application progress (3x replication for live chunks)
:: Third priority is to restore full overall redundancy
   (3x replication for all chunks, live and read-only)

The master also performs re-replication of a chunk if a
chunkserver signals a checksum mismatch when
delivering data to clients
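
The priority order above, expressed as a sort key (the chunk records and the replication goal of 3 are assumptions made for the sake of the sketch):

TARGET_REPLICAS = 3   # default replication goal

def re_replication_priority(chunk):
    # chunk is a dict like {"handle": ..., "replicas": 1, "live": True};
    # lower tuples sort first, i.e. are re-replicated sooner.
    if chunk["replicas"] < 2:
        tier = 0    # first: reach 2x so one more node failure is tolerable
    elif chunk["live"] and chunk["replicas"] < TARGET_REPLICAS:
        tier = 1    # second: live chunks, to avoid blocking application progress
    else:
        tier = 2    # third: restore 3x for everything else (read-only chunks)
    return (tier, chunk["replicas"])

chunks = [
    {"handle": 0xA1, "replicas": 2, "live": True},
    {"handle": 0xB2, "replicas": 1, "live": False},
    {"handle": 0xC3, "replicas": 2, "live": False},
]
for chunk in sorted(chunks, key=re_replication_priority):
    print(f"re-replicate chunk {chunk['handle']:#x}")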
Chunkserver rejoins the cluster
The master:

:: asks the chunkserver about each chunk replica
   under its control
:: performs a version number cross-check
   with its own metadata
:: instructs the chunkserver to delete
   stale chunk replicas (those that were live at the time
   the chunkserver went offline)

After successfully joining the cluster, the chunkserver
periodically polls the master about each chunk replica's
status
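
A sketch of the version-number cross-check (the data structures are hypothetical): any replica whose version number is behind the master's, or whose chunk the master no longer knows about, is stale and gets deleted.

def reconcile_rejoining_chunkserver(master_versions, reported_replicas):
    # master_versions:   {chunk handle: version number held by the master}
    # reported_replicas: {chunk handle: version number reported by the chunkserver}
    # Returns the chunk handles the chunkserver is instructed to delete.
    stale = set()
    for handle, version in reported_replicas.items():
        if handle not in master_versions:
            stale.add(handle)                 # chunk was deleted while offline
        elif version < master_versions[handle]:
            stale.add(handle)                 # missed mutations: stale replica
    return stale

# Example: chunk 0x2 was appended to while this chunkserver was offline.
print(reconcile_rejoining_chunkserver({0x1: 4, 0x2: 7}, {0x1: 4, 0x2: 6, 0x9: 1}))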
Replica management
:: Biased towards topological spreading
:: Creation and re-replication
   - Focused on spreading the write load
   - Density of recent chunk creations used as a proxy
      for the write load
:: Rebalancing
   - Gently moving chunks around
      to balance disk fullness
   - Low priority, fixes imbalances caused
      by new chunkservers joining the cluster
      for the first time
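
One way to read the placement policy above as code (a rough sketch; the fields and weights are invented): prefer chunkservers on racks not already used for this chunk, with lower disk fullness, and without a recent burst of chunk creations.

def placement_score(candidate, used_racks):
    # candidate: {"rack": str, "disk_fullness": 0..1, "recent_creations": int}
    # Lower score = better placement target.
    score = 0.0
    if candidate["rack"] in used_racks:
        score += 10.0                        # bias towards topological spreading
    score += candidate["disk_fullness"]      # balance disk fullness
    # Recent creation density is a proxy for imminent write load.
    score += 0.1 * candidate["recent_creations"]
    return score

def choose_replicas(candidates, n=3):
    # Greedily pick the n best-scoring chunkservers, updating the rack set
    # after each pick so the next pick prefers a different rack.
    chosen, used_racks, pool = [], set(), list(candidates)
    for _ in range(min(n, len(pool))):
        best = min(pool, key=lambda c: placement_score(c, used_racks))
        chosen.append(best)
        used_racks.add(best["rack"])
        pool = [c for c in pool if c is not best]
    return chosen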
Questions / Feedback
