Voldemort on Solid State Drives

Vinoth Chandar, Lei Gao, Cuong Tran
LinkedIn Corporation, Mountain View, CA

Abstract
Voldemort is LinkedIn’s open-source implementation of Amazon Dynamo, providing fast, scalable,
fault-tolerant access to key-value data. Voldemort is widely used by applications at LinkedIn that
demand high IOPS. Solid State Drives (SSDs) are becoming an attractive option for speeding up data
access. In this paper, we describe our experiences with garbage collection (GC) issues on Voldemort
server nodes after migrating to SSD. Based on these experiences, we provide an intuition for caching
strategies with SSD storage.

1. Introduction
Voldemort [1] is a distributed key-value storage system based on Amazon Dynamo. It has a very simple
get(k), put(k,v), delete(k) interface that allows for pluggable serialization, routing and storage
engines. Voldemort serves a substantial amount of site traffic at LinkedIn for applications such as
‘Skills’, ‘People You May Know’, ‘Company Follow’ and ‘LinkedIn Share’, serving an average of 100K
operations/sec over roughly 80TB of data. It also has wide adoption in companies such as Gilt Group,
EHarmony, Nokia, Jive Software, WealthFront and Mendeley.
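
To make the shape of the get/put/delete interface concrete, the following is a minimal sketch of a pluggable storage engine. The names StorageEngine, InMemoryEngine and KeyValueSketch are hypothetical and chosen for illustration only; they are not Voldemort's actual API, and the in-memory map merely stands in for a real engine.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical interface for illustration; not Voldemort's actual API.
interface StorageEngine<K, V> {
    Optional<V> get(K key);
    void put(K key, V value);
    boolean delete(K key);
}

// A trivial in-memory engine standing in for a real, disk-backed one.
class InMemoryEngine<K, V> implements StorageEngine<K, V> {
    private final Map<K, V> data = new ConcurrentHashMap<>();

    @Override
    public Optional<V> get(K key) {
        return Optional.ofNullable(data.get(key));
    }

    @Override
    public void put(K key, V value) {
        data.put(key, value);
    }

    @Override
    public boolean delete(K key) {
        // True only if a value was actually removed.
        return data.remove(key) != null;
    }
}

class KeyValueSketch {
    public static void main(String[] args) {
        // Callers depend only on the interface, so engines can be swapped.
        StorageEngine<String, String> store = new InMemoryEngine<>();
        store.put("member:42", "skills=java,gc");
        System.out.println(store.get("member:42").orElse("<missing>"));
        System.out.println(store.delete("member:42"));
        System.out.println(store.get("member:42").isPresent());
    }
}
```

Because callers see only the interface, a different engine can be substituted without touching client code, which is the property the pluggable design relies on.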

Because of the simple key-value access pattern, the performance of a single Voldemort server node is
typically bound by IOPS, with plenty of CPU cycles to spare. Hence, Voldemort clusters at LinkedIn
were migrated to SSD to increase single server node capacity. The migration has proven fruitful,
although it unearthed a set of interesting GC issues, which led us to rethink our caching strategy
with SSD. The rest of the paper is organized as follows. Section 2 describes the software stack for a
single Voldemort server. Section 3 describes the impact of SSD migration on single server
performance, details ways to mitigate Java GC issues, and explores leveraging SSD to alleviate
caching problems. Section 4 concludes.

2. Single Server Stack
The server uses an embedded, log-structured, Java-based storage engine, Oracle BerkeleyDB JE (BDB)
[2]. BDB keeps an LRU cache on the JVM heap and relies on Java garbage collection to manage its
memory. Loosely, the cache is a collection of references to index and data objects. Cache eviction
happens simply by releasing the references for garbage collection. A single cluster serves a large
number of applications, so objects of different sizes share the same BDB cache. The server also has
a background thread that enforces the data retention policy by periodically deleting stale entries.
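
As a rough illustration of an on-heap LRU cache whose eviction consists of dropping a reference, here is a minimal sketch built on java.util.LinkedHashMap. The class names LruCache and LruCacheSketch are hypothetical, and BDB's real cache is considerably more involved; the sketch only shows why evicted objects become work for the garbage collector.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache for illustration; not BDB's actual cache.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder=true: iteration runs from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Returning true drops the map's reference to the eldest entry.
        // The object is reclaimed later by the garbage collector, not here.
        return size() > maxEntries;
    }
}

class LruCacheSketch {
    public static void main(String[] args) {
        // Values of different sizes share one cache, as in a multitenant server.
        LruCache<String, byte[]> cache = new LruCache<>(2);
        cache.put("a", new byte[1024]);
        cache.put("b", new byte[4096]);
        cache.get("a");                 // "a" becomes most recently used
        cache.put("c", new byte[256]);  // evicts "b", the least recently used
        System.out.println(cache.keySet());
    }
}
```

Note that eviction frees nothing by itself: an entry that has already been promoted to the old generation stays there until an old-generation collection runs.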

3. SSD Performance Implications
After migrating to SSD, the average latency greatly improved, from 20ms to 2ms. Speed of cluster
expansion and data restoration improved 10x. However, with plenty of IOPS at hand, allocation rates
went up, causing very frequent GC pauses and moving the bottleneck from IO to garbage collection. The
95th and 99th percentile latencies shot up from 30ms to 130ms and from 240ms to 380ms respectively,
due to a host of garbage collection issues, detailed below.
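
The gap between average and tail latency can be illustrated with a small sketch that computes the mean and nearest-rank percentiles over a set of samples. The class LatencyStats and the sample values are invented for illustration and are not measurements from our clusters; they only show how a handful of requests caught behind a long pause dominate the 99th percentile while most requests stay fast.

```java
import java.util.Arrays;

class LatencyStats {
    // Arithmetic mean of the samples, in milliseconds.
    static double mean(double[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(Double.NaN);
    }

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Synthetic data: 98 fast requests and 2 stuck behind a long pause.
        double[] samples = new double[100];
        Arrays.fill(samples, 0, 98, 2.0);
        Arrays.fill(samples, 98, 100, 380.0);
        System.out.println("mean = " + mean(samples) + " ms");
        System.out.println("p95  = " + percentile(samples, 95) + " ms");
        System.out.println("p99  = " + percentile(samples, 99) + " ms");
    }
}
```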

3.1 Need for End-to-End Correlation
By developing tools to correlate Linux paging statistics from SAR with pauses from GC, we discovered
that Linux was stealing pages from the JVM heap, resulting in 4-second minor pauses. Subsequent
promotions into the old generation incur page scans, causing the big pauses with a high system time
component. Hence, it is imperative to mlock() the server heap to prevent it from being swapped out.
Also, we experienced higher system time in lab experiments, since not all of the virtual address space of
the JVM heap had been mapped to physical pages. Thus, using the AlwaysPreTouch JVM option is
imperative for any ‘Big Data’ benchmarking tool, to reproduce the same memory conditions as in the
real world. This exercise stressed the importance of developing performance tools that can identify
interesting patterns by correlating performance data across the entire stack.
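
The kind of correlation described above can be sketched as a join between two time series. The following is a minimal illustration, assuming GC pauses and paging samples have already been parsed into simple records; the types GcPause and PagingSample, the swap-out rate field, and the window and threshold values are all hypothetical and do not describe our actual tooling.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical event shapes; real GC logs and SAR output would be parsed into these.
final class GcPause {
    final long startMs;
    final long durationMs;

    GcPause(long startMs, long durationMs) {
        this.startMs = startMs;
        this.durationMs = durationMs;
    }
}

final class PagingSample {
    final long timestampMs;
    final double swapOutPerSec;  // assumed metric: a swap-out rate reported by SAR

    PagingSample(long timestampMs, double swapOutPerSec) {
        this.timestampMs = timestampMs;
        this.swapOutPerSec = swapOutPerSec;
    }
}

class GcPagingCorrelation {
    // Returns the pauses that have at least one paging sample above the
    // threshold within windowMs of the pause interval.
    static List<GcPause> pausesNearPaging(List<GcPause> pauses,
                                          List<PagingSample> samples,
                                          long windowMs,
                                          double threshold) {
        List<GcPause> matched = new ArrayList<>();
        for (GcPause pause : pauses) {
            long from = pause.startMs - windowMs;
            long to = pause.startMs + pause.durationMs + windowMs;
            for (PagingSample s : samples) {
                boolean nearPause = s.timestampMs >= from && s.timestampMs <= to;
                if (nearPause && s.swapOutPerSec > threshold) {
                    matched.add(pause);
                    break;  // one matching sample is enough
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<GcPause> pauses = new ArrayList<>();
        pauses.add(new GcPause(10_000, 50));     // ordinary minor pause
        pauses.add(new GcPause(60_000, 4_000));  // long pause
        List<PagingSample> samples = new ArrayList<>();
        samples.add(new PagingSample(10_000, 0.0));
        samples.add(new PagingSample(61_000, 850.0));
        for (GcPause p : pausesNearPaging(pauses, samples, 1_000, 100.0)) {
            System.out.println("pause at " + p.startMs + " ms lasting "
                    + p.durationMs + " ms coincides with paging");
        }
    }
}
```

The window parameter matters in practice because system statistics are sampled at fixed intervals and rarely line up exactly with the start of a pause.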

3.2 SSD Aware Caching
Promotion failures with huge 25-second pauses during the retention job prompted us to rethink the
caching strategy with SSD. The retention job walks the entire BDB database without any throttling.
With very fast SSD, this translates into rapid 200MB allocations and promotions, which concurrently
kick objects out of the LRU cache in the old generation. Since the server is multitenant, hosting
different object sizes, this leads to heavy fragmentation. Real workloads almost always have
‘hotsets’ that live in the old generation, and any incoming traffic that drastically changes the
hotset is likely to run into this issue. The issue was very difficult to reproduce since it depended
heavily on the state of the old generation, highlighting the need for performance test
infrastructure that can replay real-world traffic. We managed to reproduce the problem by roughly
matching the cache miss rates seen in production. We solved the problem by forcing BDB to evict data
objects brought in by the retention job right away, so that they are collected in the young
generation and never promoted.
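
The idea of keeping scan traffic out of the cache can be sketched with a read-through cache that offers a separate path for background scans. This is an illustration only: ScanAwareCache, getForScan and the loader function are hypothetical names, the loader stands in for a read from SSD, and the sketch simplifies "evict right away" to "never cache". It does not show BDB's own mechanism.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical read-through cache; names are illustrative, not BDB's API.
class ScanAwareCache<K, V> {
    private final Function<K, V> loader;  // stands in for a read from SSD
    private final LinkedHashMap<K, V> lru;

    ScanAwareCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        this.lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Regular traffic: a miss loads the value and keeps it in the cache.
    V get(K key) {
        V value = lru.get(key);
        if (value == null) {
            value = loader.apply(key);
            lru.put(key, value);
        }
        return value;
    }

    // Background scan: always reads through and never caches, so the value
    // stays short-lived and the hot set is left untouched.
    V getForScan(K key) {
        return loader.apply(key);
    }

    boolean isCached(K key) {
        return lru.containsKey(key);  // does not change recency order
    }
}

class ScanAwareCacheSketch {
    public static void main(String[] args) {
        ScanAwareCache<Integer, String> cache =
                new ScanAwareCache<>(2, k -> "value-" + k);
        cache.get(1);
        cache.get(2);                       // hot set is {1, 2}
        for (int k = 100; k < 1_000; k++) {
            cache.getForScan(k);            // retention-style full walk
        }
        System.out.println(cache.isCached(1) + " " + cache.isCached(2)
                + " " + cache.isCached(100));
    }
}
```

The trade-off is that a scan never benefits from earlier scan reads, which is acceptable when the underlying storage is fast enough to serve the walk directly.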

In fact, we plan to cache only the index nodes on the JVM heap, even for regular traffic. This will
help fight fragmentation and achieve predictable multitenant deployments. Lab results have shown that
this approach can deliver comparable performance, thanks to the power of SSD and uniformly sized
index objects. This approach also reduces the promotion rate, increasing the chances that the CMS
initial mark is scheduled after a minor collection, which improves initial mark time as described in
the next section. The approach is applicable even to systems that manage their own memory, since
fragmentation is a general issue.

3.3 Reducing Cost of CMS Initial Mark
Assuming we can control fragmentation, yielding control back to the JVM to schedule CMS adaptively
based on promotion rate can help cut down initial mark times. Even when data objects are evicted
right away, the high SSD read rates can cause heavy promotion of index objects. Under such
circumstances, the CMS initial mark might be scheduled when the young generation is not empty,
resulting in a 1.2-second CMS initial mark pause on a 2GB young generation. We found that by
increasing CMSInitiatingOccupancyFraction to a higher value (90), CMS was scheduled much closer to
minor collections, when the young generation is empty or small, reducing the maximum initial mark
time to 0.4 seconds.
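
Since these settings are easy to lose between lab and production, a startup sanity check can help. The sketch below, with the hypothetical class GcFlagCheck, reads the arguments the running JVM was launched with and reports which of the options named in this paper are absent. It assumes the flags are passed in exactly the spelling shown, and it applies only to JVM versions that still ship the CMS collector.

```java
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GcFlagCheck {
    // Options discussed in the text; the exact spelling is an assumption
    // about how they appear on the command line.
    static final List<String> EXPECTED = Arrays.asList(
            "-XX:+AlwaysPreTouch",
            "-XX:CMSInitiatingOccupancyFraction=90");

    // Returns the expected flags that do not appear in the JVM arguments.
    static List<String> missingFlags(List<String> jvmArgs, List<String> expected) {
        List<String> missing = new ArrayList<>();
        for (String flag : expected) {
            if (!jvmArgs.contains(flag)) {
                missing.add(flag);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Arguments passed to this JVM at launch (excluding main's own args).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("Missing flags: " + missingFlags(jvmArgs, EXPECTED));
    }
}
```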

4. Conclusion
With SSD, we find that garbage collection becomes a very significant bottleneck, especially for
systems that have little control over the storage layer and rely on Java memory management. Big heap
sizes make garbage collection expensive, especially the single-threaded CMS initial mark. We believe
that data systems must revisit their caching strategies with SSDs. In this regard, SSD has provided
an efficient solution for handling fragmentation and moving towards predictable multitenancy.

References
[1] http://project-voldemort.com/
[2] http://www.oracle.com/technetwork/database/berkeleydb/overview/index-093405.html

Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 

Voldemort on Solid State Drives

Vinoth Chandar, Lei Gao, Cuong Tran
LinkedIn Corporation, Mountain View, CA

Abstract
Voldemort is LinkedIn’s open implementation of Amazon Dynamo, providing fast, scalable, fault-tolerant access to key-value data. Voldemort is widely used by applications at LinkedIn that demand lots of IOPS. Solid State Drives (SSD) are becoming an attractive option to speed up data access. In this paper, we describe our experiences with GC issues on Voldemort server nodes after migrating to SSD. Based on these experiences, we provide an intuition for caching strategies with SSD storage.

1. Introduction
Voldemort [1] is a distributed key-value storage system based on Amazon Dynamo. It has a very simple get(k), put(k,v), delete(k) interface that allows for pluggable serialization, routing and storage engines. Voldemort serves a substantial amount of site traffic at LinkedIn for applications like ‘Skills’, ‘People You May Know’, ‘Company Follow’ and ‘LinkedIn Share’, serving an average of 100K operations/sec over roughly 80TB of data. It also has wide adoption in companies such as Gilt Group, EHarmony, Nokia, Jive Software, WealthFront and Mendeley.

Due to the simple key-value access pattern, the performance of a single Voldemort server node is typically bound by IOPS, with plenty of CPU cycles to spare. Hence, Voldemort clusters at LinkedIn were migrated to SSD to increase the capacity of a single server node. The migration has proven fruitful, although it unearthed a set of interesting GC issues, which led us to rethink our caching strategy with SSD. The rest of the paper is organized as follows. Section 2 describes the software stack for a single Voldemort server. Section 3 describes the impact of SSD migration on single server performance and details ways to mitigate Java GC issues. Section 3 also explores leveraging SSD to alleviate caching problems. Section 4 concludes.

2. Single Server Stack
The server uses an embedded, log-structured, Java-based storage engine, Oracle BerkeleyDB JE [2]. BDB employs an LRU cache on top of the JVM heap and relies on Java garbage collection for managing its memory. Loosely, the cache is a bunch of references to index and data objects. Cache eviction happens simply by releasing the references for garbage collection. A single cluster serves a large number of applications, and hence objects of different sizes share the same BDB cache. The server also has a background thread that enforces the data retention policy by periodically deleting stale entries.

3. SSD Performance Implications
With plenty of IOPS at hand, the allocation rates went up, causing very frequent GC pauses and moving the bottleneck from IO to garbage collection. After migrating to SSD, the average latency greatly improved from 20ms to 2ms. The speed of cluster expansion and data restoration has improved 10x. However, the 95th and 99th percentile latencies shot up from 30ms to 130ms and from 240ms to 380ms, respectively, due to a host of garbage collection issues, detailed below.

3.1 Need for End-to-End Correlation
By developing tools to correlate Linux paging statistics from SAR with pauses from GC, we discovered that Linux was stealing pages from the JVM heap, resulting in 4-second minor pauses. Subsequent promotions into the old generation incur page scans, causing the big pauses with a high system time component. Hence, it is imperative to mlock() the server heap to prevent it from being swapped out. Also, we experienced higher system time in lab experiments, since not all of the virtual address space of the JVM heap had been mapped to physical pages. Thus, using the AlwaysPreTouch JVM option is imperative for any ‘Big Data’ benchmarking tool, to reproduce the same memory conditions as in the real world. This exercise stressed the importance of developing performance tools that can identify interesting patterns by correlating performance data across the entire stack.

3.2 SSD Aware Caching
Promotion failures with huge 25-second pauses during the retention job prompted us to rethink the caching strategy with SSD. The retention job walks the entire BDB database without any throttling. With very fast SSD, this translates into rapid 200MB allocations and promotions, which in parallel kick objects out of the LRU cache in the old generation. Since the server is multitenant, hosting different object sizes, this leads to heavy fragmentation. Real workloads almost always have ‘hotsets’ which live in the old generation, and any incoming traffic that drastically changes the hotset is likely to run into this issue. The issue was very difficult to reproduce since it depended heavily on the state of the old generation, highlighting the need for building performance test infrastructures that can replay real world traffic. We managed to reproduce the problem by roughly matching the cache miss rates seen in production.

We solved the problem by forcing BDB to evict data objects brought in by the retention job right away, such that they are collected in the young generation and never promoted. In fact, we plan to cache only the index nodes over the JVM heap even for regular traffic. This will help fight fragmentation and achieve predictable multitenant deployments.
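The eviction idea in Section 3.2 can be pictured with a toy cache. The sketch below is only an illustration in plain Java: it is not BDB JE's API, and the names ScanAwareCache, Kind, admit and fromRetentionScan are hypothetical. It assumes the cache is told, on every read from storage, whether the object is an index node or a data object and whether the read came from the retention scan. Data objects read by the scan are never retained, so the only reference to them is the caller's short-lived one and they can die in the young generation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy LRU cache that refuses to retain data objects read by a retention scan. */
class ScanAwareCache {
    /** Hypothetical classification of cached objects. */
    enum Kind { INDEX, DATA }

    private final LinkedHashMap<String, byte[]> lru;

    ScanAwareCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.lru = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                // Evict the least recently used entry once the cache is over capacity.
                return size() > maxEntries;
            }
        };
    }

    /** Called after a read from storage; decides whether the object is retained. */
    void admit(String key, byte[] value, Kind kind, boolean fromRetentionScan) {
        if (kind == Kind.DATA && fromRetentionScan) {
            return; // keep no reference, so the object never reaches the old generation
        }
        lru.put(key, value);
    }

    boolean contains(String key) {
        return lru.containsKey(key);
    }

    int size() {
        return lru.size();
    }

    public static void main(String[] args) {
        ScanAwareCache cache = new ScanAwareCache(2);
        cache.admit("index:a", new byte[64], Kind.INDEX, true);   // retained
        cache.admit("data:a", new byte[4096], Kind.DATA, true);   // dropped
        cache.admit("data:b", new byte[4096], Kind.DATA, false);  // retained
        System.out.println("index:a cached: " + cache.contains("index:a"));
        System.out.println("data:a cached:  " + cache.contains("data:a"));
        System.out.println("data:b cached:  " + cache.contains("data:b"));
    }
}
```

Removing the fromRetentionScan condition, so that no data object is ever admitted, would correspond to the index-only caching that the section proposes for regular traffic.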
Results in the lab have shown that this approach can deliver comparable performance, due to the power of SSD and uniformly sized index objects. Also, this approach reduces the promotion rate, thus increasing the chances that the CMS initial mark is scheduled after a minor collection. This improves the initial mark time, as described in the next section. This approach is applicable even for systems that manage their own memory, since fragmentation is a general issue.

3.3 Reducing Cost of CMS Initial Mark
Assuming we can control fragmentation, yielding control back to the JVM to schedule CMS adaptively based on promotion rate can help cut down initial mark times. Even when evicting data objects right away, the high SSD read rates could cause heavy promotion for index objects. Under such circumstances, the CMS initial mark might be scheduled when the young generation is not empty, resulting in a 1.2-second CMS initial mark pause on a 2GB young generation. We found that by increasing CMSInitiatingOccupancyFraction to a higher value (90), CMS was scheduled much closer to minor collections, when the young generation is empty or small, reducing the maximum initial mark time to 0.4 seconds.

4. Conclusion
With SSD, we find that garbage collection becomes a very significant bottleneck, especially for systems that have little control over the storage layer and rely on Java memory management. Big heap sizes make garbage collection expensive, especially the single-threaded CMS initial mark. We believe that data systems must revisit their caching strategies with SSDs. In this regard, SSD has provided an efficient solution for handling fragmentation and moving towards predictable multitenancy.

References
[1] http://project-voldemort.com/
[2] http://www.oracle.com/technetwork/database/berkeleydb/overview/index-093405.html
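
The correlation exercise of Section 3.1 can be sketched in a few lines. The paper does not describe the actual tools, so everything below is an assumption for illustration: the class PauseCorrelator, the GcPause and PagingSample types, and the threshold parameter are hypothetical, and the inputs are taken to be already parsed from a GC log and from SAR output into in-memory lists. The sketch simply flags every GC pause that overlaps a sampling interval with a high paging rate.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: flag GC pauses that coincide with heavy paging activity. */
class PauseCorrelator {
    /** One GC pause, with start time and duration in seconds (hypothetical format). */
    static final class GcPause {
        final double startSec;
        final double durationSec;

        GcPause(double startSec, double durationSec) {
            this.startSec = startSec;
            this.durationSec = durationSec;
        }
    }

    /** One paging sample covering [startSec, startSec + intervalSec). */
    static final class PagingSample {
        final double startSec;
        final double intervalSec;
        final double pagesPerSec;

        PagingSample(double startSec, double intervalSec, double pagesPerSec) {
            this.startSec = startSec;
            this.intervalSec = intervalSec;
            this.pagesPerSec = pagesPerSec;
        }
    }

    /** Returns the pauses that overlap at least one sample at or above the threshold. */
    static List<GcPause> pausesDuringPaging(List<GcPause> pauses,
                                            List<PagingSample> samples,
                                            double pagesPerSecThreshold) {
        List<GcPause> suspicious = new ArrayList<>();
        for (GcPause pause : pauses) {
            double pauseEnd = pause.startSec + pause.durationSec;
            for (PagingSample sample : samples) {
                double sampleEnd = sample.startSec + sample.intervalSec;
                // Two half-open intervals overlap when each starts before the other ends.
                boolean overlaps = pause.startSec < sampleEnd && sample.startSec < pauseEnd;
                if (overlaps && sample.pagesPerSec >= pagesPerSecThreshold) {
                    suspicious.add(pause);
                    break; // one matching sample is enough to flag this pause
                }
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        List<GcPause> pauses = new ArrayList<>();
        pauses.add(new GcPause(12.0, 4.0));   // long pause during heavy paging
        pauses.add(new GcPause(25.0, 0.05));  // short pause during a quiet interval

        List<PagingSample> samples = new ArrayList<>();
        samples.add(new PagingSample(0.0, 10.0, 0.0));
        samples.add(new PagingSample(10.0, 10.0, 9000.0));
        samples.add(new PagingSample(20.0, 10.0, 0.0));

        for (GcPause p : pausesDuringPaging(pauses, samples, 1000.0)) {
            System.out.println("pause at " + p.startSec + "s lasting " + p.durationSec + "s");
        }
    }
}
```

A real tool would also need to align the clocks of the two data sources and choose the threshold from observed baselines, neither of which is attempted here.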