Cassandra Data modeling
Practical considerations
Nitish Korla
Why Cassandra?
• High Availability / Fully distributed
• Scalability (Linear)
• Write performance
• Simple to install and operate
• Multi-region replication support (bi-directional)
Cassandra footprint @ Netflix
• 60+ Cassandra clusters
• 1600+ nodes holding 100+ TB data
• AWS 500 IOPS -> 100,000 IOPS
• Streaming data completely persisted in Cassandra
• Related Open Source Projects
– Cassandra/Astyanax : in-house committer
– Priam : Cassandra Automation
– Test Tools : JMeter
– http://github.com/netflix
Data Model
keyspace
  column family
    Row
      column
        • name
        • value
        • timestamp

Cassandra          RDBMS Equivalent
KEYSPACE           DATABASE/SCHEMA
COLUMN FAMILY      TABLE
ROW                ROW
FLEXIBLE COLUMNS   DEFINED COLUMNS
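The hierarchy above can be sketched as nested dictionaries. This is only an in-memory illustration, not a Cassandra client: the helper names (`new_keyspace`, `insert`) and the `users` column family are hypothetical, and the "newest timestamp wins" rule is a simplified stand-in for how column timestamps resolve conflicting writes.

```python
import time
from collections import defaultdict

# Hypothetical in-memory stand-in for the hierarchy; not a Cassandra client.
# keyspace -> column family -> row key -> column name -> (value, timestamp)
def new_keyspace():
    return defaultdict(lambda: defaultdict(dict))

def insert(keyspace, column_family, row_key, name, value, timestamp=None):
    """Store one column; the write with the newest timestamp wins."""
    ts = timestamp if timestamp is not None else time.time_ns() // 1000
    row = keyspace[column_family][row_key]
    current = row.get(name)
    if current is None or ts >= current[1]:
        row[name] = (value, ts)

ks = new_keyspace()
insert(ks, "users", "356", "name", "Paul", timestamp=1)
insert(ks, "users", "356", "zip", "94538", timestamp=1)  # rows need not share columns
insert(ks, "users", "54", "name", "kim", timestamp=1)
insert(ks, "users", "54", "name", "Kim", timestamp=2)    # newer timestamp replaces
insert(ks, "users", "54", "name", "stale", timestamp=0)  # older timestamp is ignored
```

Row 356 ends up with two columns and row 54 with one, which is the "flexible columns" row of the table: each row carries only the columns that were written to it.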
Data Model
[Figure: rows whose columns are sorted by comparator]
  Row 356: name = Paul, group = 34567, sex = male
  Row 54: name = kim, group = 34566, sex = female
  Other columns shown in the figure: status = single, zip = 94538
Composite columns
[Figure: row "population" whose columns are sorted by composite comparators]
  US:CA:Fremont = 54353, US:CA:Hayward = 34343, US:CA:San Jose = 987556
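A rough way to picture composite comparators is to treat each composite column name as a tuple that is compared component by component. The sketch below assumes that model; the helper names are hypothetical and the values are the ones from the slide.

```python
# Assumption: a composite column name is modeled as a tuple of components,
# compared left to right, one component at a time.
population = {
    ("US", "CA", "San Jose"): 987556,
    ("US", "CA", "Fremont"): 54353,
    ("US", "CA", "Hayward"): 34343,
}

def sorted_columns(row):
    """Return (name, value) pairs in comparator order."""
    return sorted(row.items(), key=lambda kv: kv[0])

def slice_by_prefix(row, prefix):
    """Return the columns whose name starts with the given components."""
    return [(n, v) for n, v in sorted_columns(row) if n[:len(prefix)] == prefix]

names = [":".join(n) for n, _ in sorted_columns(population)]
ca = slice_by_prefix(population, ("US", "CA"))
```

Because the columns are kept in sorted order, all entries sharing a prefix such as US:CA sit next to each other, which is what makes prefix slices cheap.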
Do your Homework
① Understand your application requirements
② Identify your access patterns
③ Model around these access patterns
④ Denormalization is your new friend but…
⑤ Benchmark – Avoid Surprises
Example 1 : Edge Service
Edge Services Data Model
ROWID                      ALLOCATION   ACTIVE      SCRIPT
Script_location_version    000          YES OR NO   text
/xyz/jkl_1                 000          yes         text
/xyl/jkl_2                 111          yes         text
/xyl/jkl_3                 222          yes         text
[Diagram: the rows above are read by the EDGE SERVICE CLUSTER]
Edge Service Anti-patterns
• High concurrency: Edge servers auto scale
• Range scans: Read all data
• Large payload: ~1MB of data
Result: very high read latency / unstable Cassandra
Solution: inverted index
[Diagram: the client issues two numbered requests (1, 2) against the Script_index and scripts column families]
scripts
ROWID         ALLOC   ACTIVE   SCRIPT
/xyz/jkl_1    000     yes      text
/xyl/jkl_2    111     yes      text
/xyl/tml_3    222     yes      text
Script_index
ROWID      /xyz/jkl   /xyz/jzp   /xyz/plm   /xyz/tml   /xyz/urs   /xyz/zjkl
Index_1    1          2          1          3          1          2
Inverted Index considerations
• Column name can be used as a row key placeholder
• Hotspots!!
• Sharding
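One way to read the hotspot and sharding bullets: a single index row is read by every client, so the index can be split across several rows chosen by a stable hash. The sketch below is illustrative only. The shard count, the `Index_<n>` naming for every shard, and the helper names are assumptions, and plain dictionaries stand in for the column families.

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count

def index_row_key(script_path, num_shards=NUM_SHARDS):
    """Pick the index row for a script path with a stable hash."""
    shard = zlib.crc32(script_path.encode("utf-8")) % num_shards
    return f"Index_{shard}"

def add_to_index(index, script_path, version):
    """Column name = script path, column value = version."""
    index.setdefault(index_row_key(script_path), {})[script_path] = version

def script_row_keys(index):
    """Read every index shard and build the row keys of the script rows."""
    keys = []
    for shard in range(NUM_SHARDS):
        for path, version in index.get(f"Index_{shard}", {}).items():
            keys.append(f"{path}_{version}")
    return sorted(keys)

index = {}
for path, version in [("/xyz/jkl", 1), ("/xyz/jzp", 2), ("/xyz/plm", 1), ("/xyz/tml", 3)]:
    add_to_index(index, path, version)
```

The column name plus its value reconstructs the script row key (for example /xyz/jkl_1), so the client can fetch script rows by key instead of running a range scan.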
Other possible improvement
• Textual Data
• Think compression
Upcoming features
- Hadoop integration
- Solr
Example 2: Ratings
RDBMS -> CASSANDRA
user
  id (primary key)
  name
  alias
  email
movie
  id (primary key)
  title
  description
user_movie_rating
  id (primary key)
  userId (foreign key)
  movieId (foreign key)
  rating
Relationships: user 1 : ∞ user_movie_rating, movie 1 : ∞ user_movie_rating
Queries
Get email of userid 123
Get title and description of movieId 222
List all movie names and corresponding ratings for userId 123
List all users and corresponding rating for movieId 222
CASSANDRA MODEL
user (row key: userId)
  123: name = Nitish Korla, alias = buckwild, email = nk@netflix.com
movie (row key: movieId)
  223: title = Find Nemo, description = Good luck with that
ratingsByMovie (row key: movieId, column name: userId, column value: rating)
  222: 334 = 2, 455 = 5, 544 = 1, 633 = 2, 789 = 2, 999 = 3
ratingsByUser (row key: userId)
  123: 222:rating = 4, 222:title = rockstar, 534:rating = 2, 534:title = Finding Nemo, 888:rating = 1, 888:title = Top Guns
Sequence?
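The denormalized layout above can be sketched with plain dictionaries standing in for the column families. This is a toy illustration under that assumption, not driver code; the function names are hypothetical. Each rating is written twice so that both "ratings for a user" and "ratings for a movie" become single-row reads.

```python
from collections import defaultdict

# Plain dicts stand in for column families (an assumption for illustration).
movies = {222: {"title": "rockstar"}, 534: {"title": "Finding Nemo"}}
ratings_by_user = defaultdict(dict)   # userId -> {(movieId, field): value}
ratings_by_movie = defaultdict(dict)  # movieId -> {userId: rating}

def rate(user_id, movie_id, rating):
    """Write one rating into both query-specific rows (denormalized)."""
    title = movies[movie_id]["title"]
    ratings_by_user[user_id][(movie_id, "rating")] = rating
    ratings_by_user[user_id][(movie_id, "title")] = title  # copied, not joined
    ratings_by_movie[movie_id][user_id] = rating

def movies_rated_by(user_id):
    """List (title, rating) pairs for a user from a single row."""
    row = ratings_by_user[user_id]
    movie_ids = sorted({movie_id for movie_id, _ in row})
    return [(row[(m, "title")], row[(m, "rating")]) for m in movie_ids]

rate(123, 222, 4)
rate(123, 534, 2)
rate(334, 222, 2)
```

The cost of this design is that the movie title is duplicated into every user row, so a title change has to be propagated by the application.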
Example 3 : Viewing History
Viewing History
ROWID: Subscriber_id
Column name format: <Timeuuid> : <movieid> (for example 1234454545 : 5466)
Column value: Playback/Bookmark related SERIALIZED DATA
[Figure: several subscriber rows, each holding many columns named like 3454545_5, 3454546_5, 3454547_5, 3454555_9, 3454560_9, 3454580_9, each with a JSON value]
Viewing History compression
ROWID: Subscriber_id
Column name format: <Timeuuid>_<movieid> (for example 1234454545_5466, 1234454546_5466, 1234454547_5466, 1234454548_5466)
Column value: Playback/Bookmark related SERIALIZED DATA
Steps:
1. Re-sort by movie id
2. Group events per movie: Movie_id:[{playbackevent1,playbackevent2 ...... } ]
3. Compress data
4. Store in separate column family
Results:
• Reduced data size by 7 times
• Operational processes improved by 10 times
• Money saved: $,$$$,$$$
• Improvement in app read latency
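The re-sort, group, and compress steps can be sketched as follows. The deck does not say which serialization or compression was used, so JSON and zlib here are assumptions, as are the function names and the sample events.

```python
import json
import zlib
from collections import defaultdict

def compress_history(columns):
    """columns maps '<timestamp>_<movieid>' to a playback event (a dict)."""
    by_movie = defaultdict(list)
    for name in sorted(columns):                 # keep events in time order
        _, movie_id = name.rsplit("_", 1)
        by_movie[movie_id].append(columns[name])  # group events per movie
    payload = json.dumps(by_movie, sort_keys=True).encode("utf-8")
    return zlib.compress(payload)                # one blob per subscriber

def decompress_history(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Hypothetical sample data: 200 events for one movie, 1 event for another.
history = {f"{1234454545 + i}_5466": {"bookmark": i * 30} for i in range(200)}
history["1234454999_7788"] = {"bookmark": 10}

blob = compress_history(history)
restored = decompress_history(blob)
raw_size = len(json.dumps(history).encode("utf-8"))
```

Grouping by movie id places similar events next to each other, which tends to help a general-purpose compressor; the actual ratio depends on the data.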
Think Data Archival
• Data stores in Netflix grow exponentially
• Have a process in place to archive data
– DSE
– Moving to a separate column family
– Moving to a separate cluster (non SSD)
– Setting the right expectations w.r.t. latencies with historical data
• Cassandra TTLs
Example 4 : Personalized recommendations
read-modify-write pattern
• Data read and written back (even if data was not modified)
• Large BLOBs
Cassandra under IO pressure
Peak traffic – compaction yet to run – high read latency
read-modify-write pattern
• Do you really need to read the data?
• Avoid writing if the data has not changed: every write leads to new immutable SSTables on the backend
• Write with a new row key (limits SSTable scans) and TTL the data
• If it is a batch process, throttle the write rate to let compactions catch up
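Two of the bullets above (skip unchanged writes, throttle batch writes) can be sketched on the client side. This is a hypothetical wrapper, not part of any Cassandra driver: `ChangeAwareWriter` and its parameters are invented for illustration, and it assumes the process can remember a digest of the last payload written per row key.

```python
import hashlib
import time

class ChangeAwareWriter:
    """Wraps a write function and skips payloads identical to the last write."""

    def __init__(self, write_fn, max_writes_per_sec=None):
        self._write_fn = write_fn
        self._digests = {}  # row key -> digest of last payload written
        self._min_interval = 1.0 / max_writes_per_sec if max_writes_per_sec else 0.0
        self._last_write = None
        self.skipped = 0

    def write(self, row_key, payload: bytes):
        digest = hashlib.sha256(payload).digest()
        if self._digests.get(row_key) == digest:
            self.skipped += 1           # unchanged: no write, no new SSTable data
            return False
        if self._last_write is not None:
            wait = self._min_interval - (time.monotonic() - self._last_write)
            if wait > 0:
                time.sleep(wait)        # crude throttle for batch jobs
        self._write_fn(row_key, payload)
        self._digests[row_key] = digest
        self._last_write = time.monotonic()
        return True

store = {}  # stands in for the real datastore
writer = ChangeAwareWriter(store.__setitem__, max_writes_per_sec=1000)
results = [
    writer.write("user:123", b"recs-v1"),
    writer.write("user:123", b"recs-v1"),  # identical payload, skipped
    writer.write("user:123", b"recs-v2"),
]
```

Keeping digests in process memory only works for a single writer; a shared job would need to store the digest somewhere durable, which is a design decision outside this sketch.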
Useful Tools
• Cassandra real-time metrics
• Capture schema changes (automatically)
Observations
• Cassandra scales linearly without any noticeable degradation to the running cluster
• Self-healing : minimal operational noise
• Developers
– Mindset needs to shift from normalization to denormalization
– Need to have a reasonable understanding of Cassandra architecture
– Enjoy the schema change flexibility. No more DDL locks / DBA dependency
Questions
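Reading from Cassandra
[Diagram: client read path involving the row cache, key cache, memtable, and multiple sstables]
Writing to Cassandra
[Diagram: client writes go to the commit log (disk) and the memtable (memory); the memtable is flushed to an sstable; replication factor: 3, so three sstables are shown]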
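The two diagrams can be approximated by a toy single-node model. It is a heavy simplification under stated assumptions: `ToyNode` is hypothetical, the caches and replication are left out, and "newest sstable first" stands in for the timestamp-based merge a real read performs.

```python
class ToyNode:
    """Toy model of one node's storage path; hypothetical, not Cassandra code."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only, stands in for the on-disk log
        self.memtable = {}     # in-memory view of recent writes
        self.sstables = []     # immutable snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))  # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the memtable into a new sstable and start a fresh one."""
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):  # newest sstable first
            if key in sstable:
                return sstable[key]
        return None

node = ToyNode()
node.write("a", 1)
node.write("b", 2)  # reaches the limit and triggers a flush
node.write("a", 3)  # the newer value lives in the memtable
```

The more sstables a key is spread across, the more places a read has to check, which is why the earlier slides stress letting compaction catch up.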

Cassandra Data Modeling - Practical Considerations @ Netflix

  • 1. Cassandra Data modeling Practical considerations Nitish Korla
  • 2. Why Cassandra? • High availability / fully distributed • Scalability (linear) • Write performance • Simple to install and operate • Multi-region replication support (bi-directional)
  • 3. Cassandra footprint @ Netflix • 60+ Cassandra clusters • 1600+ nodes holding 100+ TB data • AWS 500 IOPS -> 100,000 IOPS • Streaming data completely persisted in Cassandra • Related Open Source Projects – Cassandra/Astyanax: in-house committer – Priam: Cassandra automation – Test tools: JMeter – http://github.com/netflix
  • 4. Data Model: keyspace > column family > row > column (name, value, timestamp). Cassandra vs. RDBMS equivalent: keyspace = database/schema; column family = table; row = row; flexible columns = defined columns
  • 5. Data Model [Diagram: rows with columns sorted by comparator, e.g. row 356 (name = Paul, group = 34567, sex = male) and row 54 (name = kim, group = 34566, sex = female); further example columns status = single, zip = 94538. Composite columns sorted by composite comparators, e.g. population columns US:CA:Fremont = 54353, US:CA:Hayward = 34343, US:CA:San Jose = 987556.]
  • 6. Do your Homework ① Understand your application requirements ② Identify your access patterns ③ Model around these access patterns ④ Denormalization is your new friend but… ⑤ Benchmark – Avoid Surprises
  • 7. Example 1 : Edge Service
  • 8. Edge Services Data Model [Diagram: edge service cluster reading the scripts column family. Row ID: Script_location_version (e.g. /xyz/jkl_1, /xyl/jkl_2, /xyl/jkl_3); columns: alloc (allocation, e.g. 000, 111, 222), active (yes or no), script (text).]
  • 9. Edge Service Anti-patterns • High concurrency: edge servers auto scale • Range scans: read all data • Large payload: ~1MB of data. Result: very high read latency / unstable Cassandra
  • 11. Inverted Index considerations • Column name can be used as a row key placeholder • Hotspots!! • Sharding
  • 12. Other possible improvements • Textual data • Think compression. Upcoming features: Hadoop integration, Solr
  • 14. RDBMS -> CASSANDRA. Tables: user (id [primary key], name, alias, email); movie (id [primary key], title, description); user_movie_rating (id [primary key], userId [foreign key], movieId [foreign key], rating), with one-to-many (1:∞) relationships from user and from movie to user_movie_rating. Queries: • Get email of userId 123 • Get title and description of movieId 222 • List all movie names and corresponding ratings for userId 123 • List all users and corresponding ratings for movieId 222
  • 15. CASSANDRA MODEL [Diagram: four column families. user: row 123 with columns name = Nitish Korla, alias = buckwild, email = nk@netflix.com. movie: row 223 with columns title = Find Nemo, description = Good luck with that. ratingsByUser: row key userId (123), columns 222:rating = 4, 222:title = rockstar, 534:rating = 2, 534:title = Finding Nemo, 888:rating = 1, 888:title = Top Guns. ratingsByMovie: row key movieId, one column per userId (222, 334, 455, 544, 633, 789, 999) holding that user's rating. Open question on the slide: userId sequence?]
  • 16. Example 3 : Viewing History
  • 17. Viewing History [Diagram: one row per Subscriber_id (row ID); column name format <Timeuuid>:<movieid>, e.g. 1234454545:5466; each column value holds playback/bookmark-related serialized data (JSON).]
  • 18. Viewing History compression [Diagram: row ID Subscriber_id; column name format <Timeuuid>_<movieid>, e.g. 1234454545_5466, 1234454546_5466; values are playback/bookmark-related serialized data.] Steps: (1) re-sort by movie id; (2) group as Movie_id:[{playbackevent1, playbackevent2, ...}]; (3) compress data; (4) store in a separate column family. Results: reduced data size by 7 times; operational processes improved by 10 times; money saved: $,$$$,$$$; improvement in app read latency
  • 19. Think Data Archival • Data stores in Netflix grow exponentially • Have a process in place to archive data – DSE – Moving to a separate column family – Moving to a separate cluster (non-SSD) – Setting the right expectations w.r.t. latencies with historical data • Cassandra TTLs
  • 20. Example 4 : Personalized recommendations
  • 21. read-modify-write pattern • Data read and written back (even if data was not modified) • Large BLOBs. Result: Cassandra under IO pressure. At peak traffic, with compaction yet to run: high read latency
  • 22. read-modify-write pattern • Do you really need to read data? • Avoid write if data has not changed – SSTable creation – immutable SSTables created at backend • Write with a new row key (limit SSTable scans). TTL data • If a batch process, throttle the write rate to let compactions catch up
  • 23. Useful Tools • Cassandra real-time metrics • Capture schema changes (automatically)
  • 24. Observations • Cassandra scales linearly without any noticeable degradation to the running cluster • Self-healing: minimal operational noise • Developers – mindset needs to shift from normalization to denormalization – Need to have a reasonable understanding of Cassandra architecture – Enjoy the schema change flexibility. No more DDL locks / DBA dependency
  • 27. Writing to Cassandra [Diagram: client -> commit log (disk) and memtable (memory); the memtable is flushed to SSTables; replication factor: 3.]
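
The viewing-history compression steps on slide 18 (re-sort by movie id, group playback events per movie, compress, store the result separately) can be sketched roughly as follows. This is a minimal illustration, not the Netflix implementation: the function names, the `<time>_<movieid>` string format, the JSON payloads and the choice of zlib are all assumptions made for the example.

```python
import json
import zlib
from collections import defaultdict


def compress_viewing_history(columns):
    """Regroup one subscriber's playback columns by movie id and compress them.

    `columns` maps a column name of the (assumed) form "<time>_<movieid>"
    to a JSON string with playback/bookmark data.
    """
    by_movie = defaultdict(list)
    # Walk the columns in column-name order, then regroup them by movie id.
    for name in sorted(columns):
        _time, movie_id = name.rsplit("_", 1)
        by_movie[movie_id].append(json.loads(columns[name]))
    # Serialize the regrouped events as a single document.
    payload = json.dumps(by_movie, sort_keys=True).encode("utf-8")
    # Compress; the blob would then be written to a separate column family.
    return zlib.compress(payload)


def decompress_viewing_history(blob):
    """Inverse of compress_viewing_history: returns {movie_id: [events]}."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical sample data for one subscriber row.
columns = {
    "1234454545_5466": '{"pos": 10}',
    "1234454546_5466": '{"pos": 20}',
    "1234454547_9001": '{"pos": 5}',
}
blob = compress_viewing_history(columns)
restored = decompress_viewing_history(blob)
```

Grouping events for the same movie next to each other puts similar JSON side by side, which tends to help a general-purpose compressor; the actual gain depends on the data.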

Editor's notes

  1. Start with some live example, and then use it as a segue to cover some best practices.
  2. RDBMS background. Keyspace -> DB; CF -> table. A row groups columns. Each column is a triplet. Column naming is not necessary / could be different. The column comparator specifies the sorting. No need to stick to certain rules. Name -> sorted; timestamp -> conflict resolution.
  3. Rows are indexed. Columns are sorted based on the comparator you specify, so use it to your benefit. Keep column names short, as they are repeated. Column size = 15 bytes + size of name + size of value. Don't store empty columns if there is no need (schema-free design). Composite columns: custom inverted search indexes, when you want more control over the CF layout than a secondary index gives; a replacement for super columns, both as a means to offset some of the worst performance penalties associated with them and to extend the model to provide an arbitrary level of nesting; grouping otherwise static skinny rows into wider rows for greater efficiency.
  4. Cassandra is for point queries. Still OK for a small set of rows.
  5. When API servers autoscale or a new push goes out, they need to read the majority of rows in the scripts column family.
  6. Simple but powerful concept, based on the premise that rows are indexed and point lookups are faster. Create another column family and store the list of all required row IDs for faster lookup.
  7. A wide row can reside on only one node, and that can create hot spots. Sharding: application logic / buckets.
  8. 20% performance loss due to parsing. 1.2 Netty protocol.
  9. Start with some live example, and then use it as a segue to cover some best practices.
  10. One-to-one mapping doesn't work. Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others. Multi-valued dependencies.
  11. Sequence in Cassandra?? Index lookup. Denormalization.
  12. We don't have linear growth. TTL is a fascinating feature, coming from an Oracle background. Viewing history data: wide row implementation, compressed data, stored in perpetuity. Some rows have ~20M of data (and growing). App code paginates through columns, which is a good thing. Capacity consideration. Cassandra housekeeping (more data -> repairs/bootstraps).
  14. We don't have linear growth. TTL is a fascinating feature, coming from an Oracle background.
  15. The read is going to drive the latency of the overall request.
  16. architecture to reap the benefits of distributed computing / high performance
  17. 2 digest queries / 1 complete data response. The optimization is only on bandwidth. The number of replicas contacted depends on the consistency level specified. Hinted handoff, read repair, anti-entropy node repair. Don't expect Cassandra to act as a load balancer.
  18. Commit log for durability: sequential write. Memtable: no disk access (no reads or seeks). SSTables are written sequentially to disk. The operational design integrates nicely with the operating system page cache. Because Cassandra does not modify the data, dirty pages that would have to be flushed are not even generated.
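
The write path described in note 18 (append to the commit log, update the memtable, flush to immutable SSTables) can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions: the class `ToyWritePath`, its `memtable_limit` parameter and the list-based "disk" are invented for the example, and replication, timestamps and compaction are left out.

```python
class ToyWritePath:
    """Toy model of a single node's write path; not Cassandra's real code."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []      # stands in for the sequential on-disk log
        self.memtable = {}        # in memory: latest value per key
        self.sstables = []        # immutable, key-sorted snapshots
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # Append to the commit log first for durability, then update memory.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # An SSTable is written once, sorted by key, and never modified.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed data no longer needs its commit log entries.
        self.commit_log.clear()

    def read(self, key):
        # Check memory first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None


node = ToyWritePath(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)   # second write reaches the limit and triggers a flush
node.write("a", 3)   # newer value lives in the memtable, old one stays on "disk"
```

Because an overwrite never touches an existing SSTable, a read may have to consult several tables until compaction merges them, which is why the read-modify-write slides warn about read latency when compaction falls behind.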