SlideShare une entreprise Scribd logo
1  sur  25
Télécharger pour lire hors ligne
Outline
                Introduction
                      Design
             Implementation
                     Results
                 Conclusions




Bigtable: A Distributed Storage System for
             Structured Data

                 Alvanos Michalis


                    April 6, 2009




            Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                         Introduction
                               Design
                      Implementation
                              Results
                          Conclusions

1   Introduction
       Motivation
2   Design
       Data model
3   Implementation
       Building blocks
       Tablets
       Compactions
       Refinements
4   Results
       Hardware Environment
       Performance Evaluation
5   Conclusions
       Real applications
       Lessons
       End
                     Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction
                                 Design
                                          Motivation
                        Implementation
                                Results
                            Conclusions


Google!

     Lots of Different kinds of data!
          Crawling system URLs, contents, links, anchors, page-rank etc
          Per-user data: preferences, recent queries/ search history
          Geographic data, images etc ...
     Many incoming requests
     No commercial system is big enough
          Scale is too large for commercial databases
          May not run on their commodity hardware
          No dependence on other vendors
          Optimizations
          Better Price/Performance
          Building internally means the system can be applied across
          many projects for low incremental cost

                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction
                                 Design
                                          Motivation
                        Implementation
                                Results
                            Conclusions


Google goals



     Fault-tolerant, persistent
     Scalable
          1000s of servers
          Millions of reads/writes, efficient scans
     Self-managing
     Simple!




                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                            Introduction
                                  Design
                                           Data model
                         Implementation
                                 Results
                             Conclusions


Bigtable




  Definition
  A Bigtable is a sparse, distributed, persistent multidimensional
  sorted map.

  The map is indexed by a row key, column key, and a timestamp;
  each value in the map is an uninterpreted array of bytes.

  (row:string, column:string, time:int64) -> string
                        Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                            Introduction
                                  Design
                                           Data model
                         Implementation
                                 Results
                             Conclusions


Rows




       The row keys in a table are arbitrary strings
       Every read or write of data under a single row key is atomic
       maintains data in lexicographic order by row key


                        Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                         Introduction
                               Design
                                        Data model
                      Implementation
                              Results
                          Conclusions


Column Families




     Grouped into sets called column families
     All data stored in a column family is usually of the same type
     A column family must be created before data can be stored
     under any column key in that family
     A column key is named using the following syntax:
     family:qualifier
                     Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design
                                         Data model
                       Implementation
                               Results
                           Conclusions


Timestamps


     Each cell in a Bigtable can contain multiple versions of the
     same data; these versions are indexed by timestamp (64-bit
     integers).
     Applications that need to avoid collisions must generate
     unique timestamps themselves.
     To make the management of versioned data less onerous, they
     support two per-column-family settings that tell Bigtable to
     garbage-collect cell versions automatically.




                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction   Building blocks
                                 Design   Tablets
                        Implementation    Compactions
                                Results   Refinements
                            Conclusions


Infrastructure


      Google WorkQueue (scheduler)
      GFS: large-scale distributed file system
          Master: responsible for metadata
          Chunk servers: responsible for r/w large chunks of data
          Chunks replicated on 3 machines; master responsible
      Chubby: lock/file/name service
          Coarse-grained locks; can store small amount of data in a lock
          5 replicas; need a majority vote to be active




                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction   Building blocks
                                Design   Tablets
                       Implementation    Compactions
                               Results   Refinements
                           Conclusions


SSTable




     Lives in GFS
     Immutable, sorted file of key-value pairs
     Chunks of data plus an index
     Index is of block ranges, not values

                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                         Introduction   Building blocks
                               Design   Tablets
                      Implementation    Compactions
                              Results   Refinements
                          Conclusions


Tablet Design




     Large tables broken into tablets at row boundaries
         Tablets hold contiguous rows
         Approx 100 200 MB of data per tablet
     Approx 100 tablets per machine
         Fast recovery
         Load-balancing
     Built out of multiple SSTables
                     Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction   Building blocks
                                 Design   Tablets
                        Implementation    Compactions
                                Results   Refinements
                            Conclusions


Tablet Location




      Like a B+-tree, but fixed at 3 levels
      How can we avoid creating a bottleneck at the root?
          Aggressively cache tablet locations
          Lookup starts from leaf (bet on it being correct); reverse on
          miss
                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction   Building blocks
                                 Design   Tablets
                        Implementation    Compactions
                                Results   Refinements
                            Conclusions


Tablet Assignment
     Each tablet is assigned to one tablet server at a time. The
     master keeps track of the set of live tablet servers, and the
     current assignment of tablets to tablet servers.
     Bigtable uses Chubby to keep track of tablet servers. When a
     tablet server starts, it creates, and acquires an exclusive lock
     on, a uniquely-named file in a specific Chubby directory.
     Tablet server stops serving its tablets if loses its exclusive lock
     The master is responsible for detecting when a tablet server is
     no longer serving its tablets, and for reassigning those tablets
     as soon as possible.
     When a master is started by the cluster management system,
     it needs to discover the current tablet assignments before it
     can change them.
                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction   Building blocks
                                Design   Tablets
                       Implementation    Compactions
                               Results   Refinements
                           Conclusions


Serving a Tablet




      Updates are logged
      Each SSTable corresponds to a batch of updates or a
      snapshot of the tablet taken at some earlier time
      Memtable (sorted by key) caches recent updates
      Reads consult both memtable and SSTables
                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction   Building blocks
                                 Design   Tablets
                        Implementation    Compactions
                                Results   Refinements
                            Conclusions


Compactions

  As write operations execute, the size of the memtable increases.
      Minor compaction convert the memtable into an SSTable
           Reduce memory usage
           Reduce log traffic on restart
      Merging compaction
           Periodically executed in the background
           Reduce number of SSTables
           Good place to apply policy keep only N versions
      Major compaction
           Merging compaction that results in only one SSTable
           No deletion records, only live data
           Reclaim resources.

                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction   Building blocks
                                Design   Tablets
                       Implementation    Compactions
                               Results   Refinements
                           Conclusions


Refinements (1/2)


     Group column families together into an SSTable. Segregating
     column families that are not typically accessed together into
     separate locality groups enables more efficient reads.
     Can compress locality groups, using Bentley and McIlroy’s
     scheme and a fast compression algorithm that looks for
     repetitions.
     Bloom Filters on locality groups allows to ask whether an
     SSTable might contain any data for a specified row/column
     pair. Drastically reduces the number of disk seeks required -
     for non-existent rows or columns do not need to touch disk.


                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction   Building blocks
                                Design   Tablets
                       Implementation    Compactions
                               Results   Refinements
                           Conclusions


Refinements (2/2)


     Caching for read performance ( two levels of caching)
         Scan Cache: higher-level cache that caches the key-value pairs
         returned by the SSTable interface to the tablet server code.
         Block Cache: lower-level cache that caches SSTables blocks
         that were read from GFS.
     Commit-log implementation
     Speeding up tablet recovery (log entries)
     Exploiting immutability



                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Hardware Environment


     Tablet servers were configured to use 1 GB of memory and to
     write to a GFS cell consisting of 1786 machines with two 400
     GB IDE hard drives each.
     Each machine had two dual-core Opteron 2 GHz chips
     Enough physical memory to hold the working set of all
     running processes
     Single gigabit Ethernet link
     Two-level tree-shaped switched network with 100-200 Gbps
     aggregate bandwidth at the root.



                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Results Per Tablet Server




  Number of 1000-byte values read/written per second.


                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Results Aggregate Rate




  Number of 1000-byte values read/written per second.

                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Single tablet-server performance


      The tablet server executes 1200 reads per second ( 75
      MB/s), enough to saturate the tablet server CPUs because of
      overheads in networking stack
      Random and sequential writes perform better than random
      reads (commit log and uses group commit)
      No significant difference between random writes and
      sequential writes (same commit log)
      Sequential reads perform better than random reads (block
      cache)



                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                          Introduction
                                Design   Hardware Environment
                       Implementation    Performance Evaluation
                               Results
                           Conclusions


Scaling


      Aggregate throughput increases dramatically performance of
      random reads from memory increases
      However, performance does not increase linearly
      Drop in per-server throughput
          Imbalance in load: Re-balancing is throttled to reduce the
          number of tablet movement and the load generated by
          benchmarks shifts around as the benchmark progresses
          The random read benchmark: transfer one 64KB block over
          the network for every 1000-byte read and saturates shared 1
          Gigabit links



                      Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                            Introduction
                                           Real applications
                                  Design
                                           Lessons
                         Implementation
                                           End
                                 Results
                             Conclusions


Timestamps




     Google Analytics
     Google Earth
     Personalized Search

                        Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                           Introduction
                                          Real applications
                                 Design
                                          Lessons
                        Implementation
                                          End
                                Results
                            Conclusions


Lessons learned
      Large distributed systems are vulnerable to many types of
      failures, not just the standard network partitions and fail-stop
      failures
          Memory and network corruption
          Large clock skew
          Extended and asymmetric network partitions
          Bugs in other systems (Chubby !)
          ...
      Delay adding new features until it is clear how the new
      features will be used
      A practical lesson: the importance of proper system-level
      monitoring
      Keep It Simple!
                       Alvanos Michalis   Bigtable: A Distributed Storage System for Structured Data
Outline
                  Introduction
                                 Real applications
                        Design
                                 Lessons
               Implementation
                                 End
                       Results
                   Conclusions


END!




 QUESTIONS ?




           Alvanos Michalis      Bigtable: A Distributed Storage System for Structured Data

Contenu connexe

Tendances

Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
Chandler Huang
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
Dvir Volk
 
5 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/25 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/2
Fabio Fumarola
 

Tendances (20)

Introduction to Storm
Introduction to Storm Introduction to Storm
Introduction to Storm
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Big table
Big tableBig table
Big table
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Learning from google megastore (Part-1)
Learning from google megastore (Part-1)Learning from google megastore (Part-1)
Learning from google megastore (Part-1)
 
Designing data intensive applications
Designing data intensive applicationsDesigning data intensive applications
Designing data intensive applications
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)How to size up an Apache Cassandra cluster (Training)
How to size up an Apache Cassandra cluster (Training)
 
BigTable And Hbase
BigTable And HbaseBigTable And Hbase
BigTable And Hbase
 
5 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/25 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/2
 
Introduction To HBase
Introduction To HBaseIntroduction To HBase
Introduction To HBase
 
Google file system GFS
Google file system GFSGoogle file system GFS
Google file system GFS
 
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su...
 

En vedette

Bigtable
BigtableBigtable
Bigtable
nextlib
 
Innovation Case Study BitTorrent
Innovation Case Study BitTorrentInnovation Case Study BitTorrent
Innovation Case Study BitTorrent
Petter S. Rønning
 
Privacy preserving public auditing for secure cloud storage
Privacy preserving public auditing for secure cloud storagePrivacy preserving public auditing for secure cloud storage
Privacy preserving public auditing for secure cloud storage
Mustaq Syed
 

En vedette (20)

google Bigtable
google Bigtablegoogle Bigtable
google Bigtable
 
Bigtable
BigtableBigtable
Bigtable
 
Summary of "Google's Big Table" at nosql summer reading in Tokyo
Summary of "Google's Big Table" at nosql summer reading in TokyoSummary of "Google's Big Table" at nosql summer reading in Tokyo
Summary of "Google's Big Table" at nosql summer reading in Tokyo
 
Dynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and ComparisonDynamo and BigTable - Review and Comparison
Dynamo and BigTable - Review and Comparison
 
Google's BigTable
Google's BigTableGoogle's BigTable
Google's BigTable
 
Big table
Big tableBig table
Big table
 
Bigtable
BigtableBigtable
Bigtable
 
Google Bigtable paper presentation
Google Bigtable paper presentationGoogle Bigtable paper presentation
Google Bigtable paper presentation
 
Cloud-native Apps - Architektur, Implementierung, Demo
Cloud-native Apps - Architektur, Implementierung, DemoCloud-native Apps - Architektur, Implementierung, Demo
Cloud-native Apps - Architektur, Implementierung, Demo
 
ROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEY
ROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEYROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEY
ROCKING YOUR SEAT AT THE BIG TABLE - ROB BAILEY
 
I've (probably) been using Google App Engine for a week longer than you have
I've (probably) been using Google App Engine for a week longer than you haveI've (probably) been using Google App Engine for a week longer than you have
I've (probably) been using Google App Engine for a week longer than you have
 
Big table
Big tableBig table
Big table
 
Innovation Case Study BitTorrent
Innovation Case Study BitTorrentInnovation Case Study BitTorrent
Innovation Case Study BitTorrent
 
Privacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storagePrivacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storage
 
Big table
Big tableBig table
Big table
 
Privacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storagePrivacy preserving public auditing for regenerating-code-based cloud storage
Privacy preserving public auditing for regenerating-code-based cloud storage
 
Bigtable
BigtableBigtable
Bigtable
 
Privacy preserving public auditing for secure cloud storage
Privacy preserving public auditing for secure cloud storagePrivacy preserving public auditing for secure cloud storage
Privacy preserving public auditing for secure cloud storage
 
Privacy Preserving Public Auditing for Data Storage Security in Cloud
Privacy Preserving Public Auditing for Data Storage Security in Cloud Privacy Preserving Public Auditing for Data Storage Security in Cloud
Privacy Preserving Public Auditing for Data Storage Security in Cloud
 
Privacy Preserving Public Auditing for Data Storage Security in Cloud.ppt
Privacy Preserving Public Auditing for Data Storage Security in Cloud.pptPrivacy Preserving Public Auditing for Data Storage Security in Cloud.ppt
Privacy Preserving Public Auditing for Data Storage Security in Cloud.ppt
 

Similaire à Bigtable: A Distributed Storage System for Structured Data

Bigtable
BigtableBigtable
Bigtable
ptdorf
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
temp2004it
 

Similaire à Bigtable: A Distributed Storage System for Structured Data (20)

Scaling Up vs. Scaling-out
Scaling Up vs. Scaling-outScaling Up vs. Scaling-out
Scaling Up vs. Scaling-out
 
Snowflake Cloning.pdf
Snowflake Cloning.pdfSnowflake Cloning.pdf
Snowflake Cloning.pdf
 
Bigtable
BigtableBigtable
Bigtable
 
VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...
VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...
VectorDB Schema Design 101 - Considerations for Building a Scalable and Perfo...
 
NOSQL
NOSQLNOSQL
NOSQL
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San FranciscoMySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
MySQL 8 Tips and Tricks from Symfony USA 2018, San Francisco
 
gfs-sosp2003
gfs-sosp2003gfs-sosp2003
gfs-sosp2003
 
gfs-sosp2003
gfs-sosp2003gfs-sosp2003
gfs-sosp2003
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
Bigtable osdi06
Bigtable osdi06Bigtable osdi06
Bigtable osdi06
 
MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018MySQL 8 Server Optimization Swanseacon 2018
MySQL 8 Server Optimization Swanseacon 2018
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
 
Gfs论文
Gfs论文Gfs论文
Gfs论文
 
The google file system
The google file systemThe google file system
The google file system
 
Data Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future OutlookData Warehousing Trends, Best Practices, and Future Outlook
Data Warehousing Trends, Best Practices, and Future Outlook
 
Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...
Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...
Cloud Architecture Patterns for Mere Mortals - Bill Wilder - Vermont Code Cam...
 
Sybase IQ Big Data
Sybase IQ Big DataSybase IQ Big Data
Sybase IQ Big Data
 
Sybase IQ ve Big Data
Sybase IQ ve Big DataSybase IQ ve Big Data
Sybase IQ ve Big Data
 

Plus de elliando dias

Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScript
elliando dias
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de container
elliando dias
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agility
elliando dias
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Libraries
elliando dias
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!
elliando dias
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Web
elliando dias
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduino
elliando dias
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorcery
elliando dias
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Design
elliando dias
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
elliando dias
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
elliando dias
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Study
elliando dias
 

Plus de elliando dias (20)

Clojurescript slides
Clojurescript slidesClojurescript slides
Clojurescript slides
 
Why you should be excited about ClojureScript
Why you should be excited about ClojureScriptWhy you should be excited about ClojureScript
Why you should be excited about ClojureScript
 
Functional Programming with Immutable Data Structures
Functional Programming with Immutable Data StructuresFunctional Programming with Immutable Data Structures
Functional Programming with Immutable Data Structures
 
Nomenclatura e peças de container
Nomenclatura  e peças de containerNomenclatura  e peças de container
Nomenclatura e peças de container
 
Geometria Projetiva
Geometria ProjetivaGeometria Projetiva
Geometria Projetiva
 
Polyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better AgilityPolyglot and Poly-paradigm Programming for Better Agility
Polyglot and Poly-paradigm Programming for Better Agility
 
Javascript Libraries
Javascript LibrariesJavascript Libraries
Javascript Libraries
 
How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!How to Make an Eight Bit Computer and Save the World!
How to Make an Eight Bit Computer and Save the World!
 
Ragel talk
Ragel talkRagel talk
Ragel talk
 
A Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the WebA Practical Guide to Connecting Hardware to the Web
A Practical Guide to Connecting Hardware to the Web
 
Introdução ao Arduino
Introdução ao ArduinoIntrodução ao Arduino
Introdução ao Arduino
 
Minicurso arduino
Minicurso arduinoMinicurso arduino
Minicurso arduino
 
Incanter Data Sorcery
Incanter Data SorceryIncanter Data Sorcery
Incanter Data Sorcery
 
Rango
RangoRango
Rango
 
Fab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine DesignFab.in.a.box - Fab Academy: Machine Design
Fab.in.a.box - Fab Academy: Machine Design
 
The Digital Revolution: Machines that makes
The Digital Revolution: Machines that makesThe Digital Revolution: Machines that makes
The Digital Revolution: Machines that makes
 
Hadoop + Clojure
Hadoop + ClojureHadoop + Clojure
Hadoop + Clojure
 
Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.Hadoop - Simple. Scalable.
Hadoop - Simple. Scalable.
 
Hadoop and Hive Development at Facebook
Hadoop and Hive Development at FacebookHadoop and Hive Development at Facebook
Hadoop and Hive Development at Facebook
 
Multi-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case StudyMulti-core Parallelization in Clojure - a Case Study
Multi-core Parallelization in Clojure - a Case Study
 

Dernier

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 

Dernier (20)

Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Bigtable: A Distributed Storage System for Structured Data

  • 1. Outline Introduction Design Implementation Results Conclusions Bigtable: A Distributed Storage System for Structured Data Alvanos Michalis April 6, 2009 Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 2. Outline Introduction Design Implementation Results Conclusions 1 Introduction Motivation 2 Design Data model 3 Implementation Building blocks Tablets Compactions Refinements 4 Results Hardware Environment Performance Evaluation 5 Conclusions Real applications Lessons End Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 3. Outline Introduction Design Motivation Implementation Results Conclusions Google! Lots of Different kinds of data! Crawling system URLs, contents, links, anchors, page-rank etc Per-user data: preferences, recent queries/ search history Geographic data, images etc ... Many incoming requests No commercial system is big enough Scale is too large for commercial databases May not run on their commodity hardware No dependence on other vendors Optimizations Better Price/Performance Building internally means the system can be applied across many projects for low incremental cost Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 4. Outline Introduction Design Motivation Implementation Results Conclusions Google goals Fault-tolerant, persistent Scalable 1000s of servers Millions of reads/writes, efficient scans Self-managing Simple! Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 5. Outline Introduction Design Data model Implementation Results Conclusions Bigtable Definition A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes. (row:string, column:string, time:int64) -> string Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 6. Outline Introduction Design Data model Implementation Results Conclusions Rows The row keys in a table are arbitrary strings Every read or write of data under a single row key is atomic maintains data in lexicographic order by row key Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 7. Outline Introduction Design Data model Implementation Results Conclusions Column Families Grouped into sets called column families All data stored in a column family is usually of the same type A column family must be created before data can be stored under any column key in that family A column key is named using the following syntax: family:qualifier Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 8. Outline Introduction Design Data model Implementation Results Conclusions Timestamps Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp (64-bit integers). Applications that need to avoid collisions must generate unique timestamps themselves. To make the management of versioned data less onerous, they support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 9. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Infrastructure Google WorkQueue (scheduler) GFS: large-scale distributed file system Master: responsible for metadata Chunk servers: responsible for r/w large chunks of data Chunks replicated on 3 machines; master responsible Chubby: lock/file/name service Coarse-grained locks; can store small amount of data in a lock 5 replicas; need a majority vote to be active Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 10. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions SSTable Lives in GFS Immutable, sorted file of key-value pairs Chunks of data plus an index Index is of block ranges, not values Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 11. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Tablet Design Large tables broken into tablets at row boundaries Tablets hold contiguous rows Approx 100 200 MB of data per tablet Approx 100 tablets per machine Fast recovery Load-balancing Built out of multiple SSTables Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 12. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Tablet Location Like a B+-tree, but fixed at 3 levels How can we avoid creating a bottleneck at the root? Aggressively cache tablet locations Lookup starts from leaf (bet on it being correct); reverse on miss Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 13. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Tablet Assignment Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers, and the current assignment of tablets to tablet servers. Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. Tablet server stops serving its tablets if loses its exclusive lock The master is responsible for detecting when a tablet server is no longer serving its tablets, and for reassigning those tablets as soon as possible. When a master is started by the cluster management system, it needs to discover the current tablet assignments before it can change them. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 14. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Serving a Tablet Updates are logged Each SSTable corresponds to a batch of updates or a snapshot of the tablet taken at some earlier time Memtable (sorted by key) caches recent updates Reads consult both memtable and SSTables Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 15. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Compactions As write operations execute, the size of the memtable increases. Minor compaction convert the memtable into an SSTable Reduce memory usage Reduce log traffic on restart Merging compaction Periodically executed in the background Reduce number of SSTables Good place to apply policy keep only N versions Major compaction Merging compaction that results in only one SSTable No deletion records, only live data Reclaim resources. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 16. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Refinements (1/2) Group column families together into an SSTable. Segregating column families that are not typically accessed together into separate locality groups enables more efficient reads. Can compress locality groups, using Bentley and McIlroy’s scheme and a fast compression algorithm that looks for repetitions. Bloom Filters on locality groups allows to ask whether an SSTable might contain any data for a specified row/column pair. Drastically reduces the number of disk seeks required - for non-existent rows or columns do not need to touch disk. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 17. Outline Introduction Building blocks Design Tablets Implementation Compactions Results Refinements Conclusions Refinements (2/2) Caching for read performance ( two levels of caching) Scan Cache: higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code. Block Cache: lower-level cache that caches SSTables blocks that were read from GFS. Commit-log implementation Speeding up tablet recovery (log entries) Exploiting immutability Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 18. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Hardware Environment Tablet servers were configured to use 1 GB of memory and to write to a GFS cell consisting of 1786 machines with two 400 GB IDE hard drives each. Each machine had two dual-core Opteron 2 GHz chips Enough physical memory to hold the working set of all running processes Single gigabit Ethernet link Two-level tree-shaped switched network with 100-200 Gbps aggregate bandwidth at the root. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 19. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Results Per Tablet Server Number of 1000-byte values read/written per second. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 20. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Results Aggregate Rate Number of 1000-byte values read/written per second. Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 21. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Single tablet-server performance The tablet server executes 1200 reads per second ( 75 MB/s), enough to saturate the tablet server CPUs because of overheads in networking stack Random and sequential writes perform better than random reads (commit log and uses group commit) No significant difference between random writes and sequential writes (same commit log) Sequential reads perform better than random reads (block cache) Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 22. Outline Introduction Design Hardware Environment Implementation Performance Evaluation Results Conclusions Scaling Aggregate throughput increases dramatically performance of random reads from memory increases However, performance does not increase linearly Drop in per-server throughput Imbalance in load: Re-balancing is throttled to reduce the number of tablet movement and the load generated by benchmarks shifts around as the benchmark progresses The random read benchmark: transfer one 64KB block over the network for every 1000-byte read and saturates shared 1 Gigabit links Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 23. Outline Introduction Real applications Design Lessons Implementation End Results Conclusions Timestamps Google Analytics Google Earth Personalized Search Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 24. Outline Introduction Real applications Design Lessons Implementation End Results Conclusions Lessons learned Large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures Memory and network corruption Large clock skew Extended and asymmetric network partitions Bugs in other systems (Chubby !) ... Delay adding new features until it is clear how the new features will be used A practical lesson: the importance of proper system-level monitoring Keep It Simple! Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data
  • 25. Outline Introduction Real applications Design Lessons Implementation End Results Conclusions END! QUESTIONS ? Alvanos Michalis Bigtable: A Distributed Storage System for Structured Data