7. NoSQL Definition
Next-generation databases whose main characteristics are being non-relational, distributed, open source and horizontally scalable
– nosql-database.org
□ Not "NO SQL!" but "Not Only SQL"
A new form of data store created to overcome the limitations of relational databases
Recent work concentrates on big-data processing and distributed-systems problems
Google Trends - nosql
8. NoSQL Technology Landscape
Gartner's 2012 Hype Cycle for Big Data
Source: Gartner
9. Origin of NoSQL
□ 1998: Carlo Strozzi named his lightweight open-source relational DB "NoSQL" because it did not expose a SQL interface
Simplified the system architecture
Made the system portable to heterogeneous hardware
Could be driven from shell-like tools in any UNIX environment
Offered fewer features than standard commercial products, at a lower price
Removed constraints on data field sizes, number of columns, etc.
□ 2009: Eric Evans of Rackspace reintroduced the term NoSQL for an event on open-source, distributed, non-relational databases, characterizing them by how they differ from traditional relational DBMSs:
Non-relational
Distributed
No ACID support
□ 2011: Work on UnQL (Unstructured Query Language) begins
Source: Wikipedia, Dataversity
10. Why NoSQL?
□ ACID doesn't scale well
□ Web apps have different needs (than the apps that RDBMSs were designed for)
Low and predictable response time (latency)
Scalability & elasticity (at low cost!)
High availability
Flexible schema / semi-structured data
Geographic distribution (multiple datacenters)
□ Web apps can (usually) live without
Transactions / strong consistency / integrity
Complex queries
http://www.slideshare.net/marin_dimitrov/nosql-databases-3584443
11. Problems with Relational Databases I – Scalability
□ Replication - scaling out by copying data
Master-Slave
Every write is copied to each slave, so read capacity grows with the number of slaves, but only up to a point
Reads are fast, but all writes go through a single node, which becomes a bottleneck
Because changes take time to propagate from master to slaves, critical reads must still hit the master, and applications have to be designed with this in mind
With large data volumes every write has to be replicated N times, which is itself a problem; this caps how far the master-slave approach can scale
Master-Master
Adding masters improves write throughput, but conflicts become possible
The cost of conflicts grows on the order of O(N²) to O(N³)
http://research.microsoft.com/~gray/replicas.ps
□ Partitioning (Sharding) - scaling out by splitting data
Scales writes as well as reads, but the application must be aware of the partitioning
The value of an RDBMS lies in relations; partitioning breaks them, and because joins cannot cross partition boundaries, the application layer has to take responsibility for maintaining relationships
In general, manual sharding of an RDBMS is not easy.
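A minimal sketch of what this pushes onto the application (the schema, shard layout and key choice here are hypothetical): routing by key, and reassembling a "join" in code because the database can no longer join across partitions.

```python
# Hypothetical two-shard layout, partitioned by user_id.
SHARDS = [
    {"users": {0: "alice"}, "orders": {100: (0, "book")}},   # shard 0
    {"users": {1: "bob"},   "orders": {101: (1, "lamp")}},   # shard 1
]

def shard_for(user_id: int) -> dict:
    """The application must know the partitioning scheme to route a request."""
    return SHARDS[user_id % len(SHARDS)]

def orders_with_user(user_id: int):
    """Application-level 'join' of users and orders co-located on one shard."""
    shard = shard_for(user_id)
    name = shard["users"][user_id]
    return [(name, item) for uid, item in shard["orders"].values() if uid == user_id]

print(orders_with_user(1))   # [('bob', 'lamp')]
```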
12. Problems with Relational Databases I – Scalability
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
13. Problems with Relational Databases II – Features You May Not Need
□ UPDATE and DELETE
Rarely desirable, because they lose information
History is needed for auditing and re-activation
Generally, "deleted" and "updated" do not exist as domain concepts
UPDATE and DELETE can be modeled as INSERTs plus a version
As data grows, inactive data is archived
An INSERT-only system has two problems:
Database triggers for cascades cannot be used
Queries need to filter out inactive data
□ JOIN
Why avoid it: with large data volumes a JOIN is an expensive, complex operation over a lot of data, and it does not work across partitions
How to avoid it
The purpose of normalization is to make data easy to keep consistent and to reduce storage
De-normalization avoids the JOIN problem; it moves responsibility for consistency from the DB to the application, which is not hard if the system is INSERT-only
14. Problems with Relational Databases III – Features You May Not Need
□ ACID transactions
Atomicity: atomicity across multiple records is usually unnecessary; single-key atomicity is enough
Consistency: most systems need P and A more than C, so strict consistency is not required; eventual consistency is acceptable instead
Isolation: isolation above Read Committed is rarely needed; single-key atomicity is simpler
Durability: durability is still needed, at least until memory becomes cheap enough to hold the whole dataset even when individual nodes fail
□ Fixed schema
In an RDBMS the schema - tables, indexes, etc. - must be defined before the data can be used
Schema changes are the norm: in today's web environment, quickly adding new features and adjusting existing ones inevitably requires schema changes
Schema changes are hard: adding/altering/dropping a column locks rows, and modifying an index locks the whole table
□ Some missing capabilities
Hierarchies and graphs are hard to model
For fast responses it is best to avoid disk and serve data from main memory, but most relational databases are disk-based, so queries run against disk
15. NoSQL Features
□ Scale horizontally for "simple operations"
Key lookups, reads and writes of one record or a small number of records, simple selections
□ Replicate/distribute data over many servers
□ Simple call-level interface (contrast w/ SQL)
□ Weaker concurrency model than ACID
Eventual consistency
BASE
□ Efficient use of distributed indexes and RAM
□ Flexible schema
http://www.cs.washington.edu/education/courses/cse444/12sp/lectures/lecture26-nosql.pdf
16. NoSQL Use Cases
□ Massive data volumes
Massively distributed architecture required to store the data
Google, Amazon, Yahoo, Facebook – 10K ~ 100K servers
□ Extreme query workload
Impossible to do joins efficiently at that scale with an RDBMS
□ Schema evolution
Schema flexibility (migration) is not trivial at large scale
Schema changes can be introduced gradually with NoSQL
17. NoSQL Core Concepts
CAP Theorem
ACID vs. BASE
Isolation Levels
MVCC
Distributed Transactions
18. CAP Theorem I
2000: Eric Brewer, PODC conference keynote
2002: Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2)
□ Three properties we would like a distributed system to guarantee
□ Consistency: every user always sees the same data.
□ Availability: all users can always read and write.
□ Partition tolerance: the system keeps operating across physical network partitions.
□ Within a bounded response time, a distributed system can satisfy only two of the three.
19. CAP Theorem II
(Figure: the CAP triangle - Consistency, Availability, Partition tolerance.)
C + A - relational: RDBMS
A + P - DHT-based: Dynamo, Cassandra
C + P - partitioning-based: BigTable, HBase, MongoDB
It is hard to build a store that satisfies all three properties within an acceptable response time in a distributed environment.
"Network partitions in a distributed system must be dealt with, so in practice there are only two real choices about what to give up." – Werner Vogels (Amazon CTO)
http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf
20. CAP Theorem III – Partition Tolerance vs. Availability
"The network will be allowed to lose arbitrarily many messages sent from one node to another" [...] "For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response"
- Gilbert and Lynch, SIGACT 2002
CP: requests can complete only at nodes that have a quorum
AP: requests can complete at any live node, possibly violating strong consistency
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
21. ACID
Atomicity: All or nothing.
Consistency: Consistent state of data and transactions.
Isolation: Transactions are isolated from each other.
Durability: When the transaction is committed, state
will be durable.
Any data store can achieve atomicity, isolation and durability, but do you always need consistency? No.
By giving up ACID properties, one can achieve higher performance and scalability.
22. BASE – ACID alternative
Basically available: nodes in a distributed environment can go down, but the whole system shouldn't be affected.
Soft state (scalable): the state of the system and its data changes over time, even without input. This is a consequence of the eventual consistency model.
Eventual consistency: given enough time, data will become consistent across the distributed system.
23. ACID vs. BASE
ACID:
□ Strong consistency
□ Isolation
□ Focus on "commit"
□ Nested transactions
□ Less availability
□ Conservative (pessimistic)
□ Difficult evolution (e.g. schema)
BASE:
□ Weak consistency
□ Availability first
□ Best effort
□ Approximated answers
□ Aggressive (optimistic)
□ Simpler!
□ Faster
□ Easier evolution
Source: Brewer
24. Isolation Levels
□ Read Uncommitted
aka (NOLOCK)
Does not issue shared locks, does not honor exclusive locks
Rows can be updated/inserted/deleted before the transaction ends
Least restrictive
□ Read Committed
Holds shared locks
Cannot read uncommitted data, but data can be changed before the end of the transaction, resulting in non-repeatable reads or phantom rows
http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels
25. Isolation Levels
□ Repeatable Read
Locks data being read, prevents updates/deletes
New rows can be inserted during the transaction and will be included in later reads
□ Serializable
aka HOLDLOCK on all tables in SELECT
Locks the range of data being read; no modifications are possible
Prevents updates/deletes/inserts
Most restrictive
http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels
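The slides above describe the levels in InnoDB terms; as one concrete illustration (a sketch only - the connection string and table are made up), this is how a session's isolation level can be selected from Python with psycopg2 against PostgreSQL:

```python
# Hypothetical illustration with psycopg2/PostgreSQL; the slides discuss
# InnoDB, but the idea of choosing a level per session is the same.
import psycopg2
from psycopg2 import extensions

conn = psycopg2.connect("dbname=test")          # connection string is made up
conn.set_session(isolation_level=extensions.ISOLATION_LEVEL_REPEATABLE_READ)

with conn, conn.cursor() as cur:
    cur.execute("SELECT balance FROM accounts WHERE id = %s", (1,))
    first = cur.fetchone()
    # Under REPEATABLE READ, a concurrent committed UPDATE to this row is
    # not visible inside this transaction: re-reading returns the snapshot.
    cur.execute("SELECT balance FROM accounts WHERE id = %s", (1,))
    assert cur.fetchone() == first
```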
31. Multi Version Concurrency Control
(Figure: a B+-tree-like structure - a root node, levels of index nodes, and data pages at the leaves.)
http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels
32. Multi Version Concurrency Control
(Figure: a write builds a new version of the affected index path; the root is switched with a single atomic pointer update, the obsolete nodes are marked for compaction, and reads are never blocked because they keep traversing the old version.)
http://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels
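A rough illustration of the idea in Python (not InnoDB's actual structures): writers build a new version and publish it with a single atomic pointer swap, so readers are never blocked and keep seeing the snapshot they started with.

```python
import threading

class Store:
    """Toy MVCC store: copy-on-write versions, atomic pointer update."""

    def __init__(self):
        self._root = {}                  # currently published version
        self._lock = threading.Lock()    # serialises writers only

    def read(self):
        # Readers just grab the current root pointer; no locking. The
        # snapshot stays valid even if a writer publishes a new version.
        return self._root

    def write(self, key, value):
        with self._lock:
            new_root = dict(self._root)  # copy-on-write
            new_root[key] = value
            self._root = new_root        # atomic pointer update

s = Store()
snapshot = s.read()
s.write("k", 1)
print(snapshot.get("k"), s.read().get("k"))   # None 1 - old snapshot unchanged
```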
33. Distributed Transactions – 2PC
□ Voting phase: each site is polled as to whether the transaction should commit (i.e., whether its sub-transaction can commit)
□ Decision phase: if any site says "abort" or does not reply, then all sites must be told to abort
□ Logging is performed for failure recovery (as usual)
http://www.slideshare.net/atali/2011-db-distributed
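A compact sketch of the two phases (the participant API is invented for illustration): the coordinator polls every site, then broadcasts commit only on a unanimous "yes".

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
    def prepare(self):   # voting phase: "can your sub-transaction commit?"
        return self.can_commit
    def commit(self):
        print(f"{self.name}: commit")
    def abort(self):
        print(f"{self.name}: abort")

def two_phase_commit(participants):
    # Phase 1 (voting): any "no" vote or missing reply means global abort.
    decision = all(p.prepare() for p in participants)
    # (A real coordinator force-logs the decision here for failure recovery.)
    # Phase 2 (decision): broadcast the outcome to every site.
    for p in participants:
        p.commit() if decision else p.abort()
    return decision

two_phase_commit([Participant("A"), Participant("B", can_commit=False)])
# B votes no, so both sites are told to abort.
```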
38. Amazon Dynamo - Motivation
□ Vast distributed system
Tens of millions of customers
Tens of thousands of servers
Failure is a normal case
□ An outage means
Lost customer trust
Financial losses
□ Goal: great customer experience
Always available
Fast
Reliable
Scalable
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
39. Key Features 1/2
□ Amazon, ~2007
□ Highly-available key-value storage system
99.9995% of requests
Targeted at primary-key access and small values (< 1 MB)
□ Scalable and decentralized
□ Gives tight control over the tradeoffs between availability, consistency and performance
□ Data partitioned using consistent hashing
□ Consistency facilitated by object versioning
Quorum-like technique for replica consistency
Decentralized replica synchronization protocol
Eventual consistency
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf, http://www.slideshare.net/kingherc/bigtable-and-dynamo
40. Key Features 2/2
□ Gossip protocol for:
Failure detection
Membership protocol
□ Service Level Agreements (SLAs)
Include the client's expected request-rate distribution and expected service latency
e.g.: response time < 300 ms for 99.9% of requests at a peak load of 500 requests/sec
Example: managing shopping carts - written/read and available across multiple data centers
□ Trusted network, no authentication
□ Incremental scalability
□ Symmetry
□ Heterogeneity, load distribution
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf, http://www.slideshare.net/kingherc/bigtable-and-dynamo
41. Techniques used in Dynamo
Consistent Hashing
Vector Clocks
Gossip Protocols
Hinted Handoffs
Read Repair
Merkle Trees
Problem / Technique / Advantage:
Partitioning / Consistent hashing / Incremental scalability
High availability for writes / Vector clocks with reconciliation during reads / Version size is decoupled from update rates
Handling temporary failures / Sloppy quorum and hinted handoff / Provides high availability and durability guarantees when some of the replicas are not available
Recovering from permanent failures / Anti-entropy using Merkle trees / Synchronizes divergent replicas in the background
Membership and failure detection / Gossip-based membership protocol and failure detection / Preserves symmetry and avoids a centralized registry for storing membership and node liveness information
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
43. Modulo-based Hashing
(Figure: a key being routed to one of servers N1-N4.)
partition = hash(key) % n_servers
The hashes for all entries must be recalculated if n_servers changes
(i.e. full data redistribution when adding/removing a node)
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
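A quick way to see that cost (illustrative Python; MD5 stands in for whatever hash function the store uses):

```python
import hashlib

def server_for(key: str, n_servers: int) -> int:
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_servers   # modulo placement

keys = [f"key-{i}" for i in range(10_000)]
before = {k: server_for(k, 4) for k in keys}
moved = sum(1 for k in keys if server_for(k, 5) != before[k])
print(f"{moved / len(keys):.0%} of keys move when going from 4 to 5 servers")
# roughly 80% here - versus about 1/N of the keys with consistent hashing
```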
44. Consistent Hashing
(Figure: the ring - a key space from 0 to 2^128 - with nodes A-F placed on it.)
The same hash function is used for data and for nodes: idx = hash(key)
Coordinator: the next available node clockwise from hash(key)
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
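A minimal Python sketch of the ring just described (MD5 supplies the 128-bit key space; the clockwise successor search is a binary search over sorted node tokens):

```python
import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)  # 128-bit ring

class Ring:
    def __init__(self, nodes):
        self._tokens = sorted((h(n), n) for n in nodes)  # node positions

    def coordinator(self, key: str) -> str:
        # First node clockwise from hash(key); wrap past 2^128 back to 0.
        idx = bisect.bisect(self._tokens, (h(key), ""))
        return self._tokens[idx % len(self._tokens)][1]

ring = Ring(["A", "B", "C", "D", "E", "F"])
print(ring.coordinator("user:42"))
# Adding or removing a node only remaps the keys between it and its predecessor.
```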
46. Consistent Hashing - Replication
(Figure: the same ring; the arc between two nodes is a key range, e.g. KeyAB.)
Data is replicated to the N-1 clockwise successor nodes: KeyAB is hosted in B, C and D
A single node hosts several ranges, e.g. node C hosts KeyFA, KeyAB and KeyBC
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
47. Consistent Hashing – Node Changes
(Figure: a node joins the ring; ranges KeyAB, KeyFA and KeyEF are copied to their new owners.)
Key membership and replicas are updated when a node joins or leaves the network.
The number of replicas for all data is kept consistent.
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
48. Virtual Nodes
Random assignment of nodes to positions on the ring leads to:
Non-uniform data distribution
Uneven load distribution
Solution: "virtual nodes"
A single node (physical machine) is assigned multiple random positions ("tokens") on the ring.
On failure of a node, its load is dispersed evenly across the others
On joining, a node accepts an equivalent load
The number of virtual nodes assigned to a physical node can be decided based on its capacity.
http://www.slideshare.net/kingherc/bigtable-and-dynamo
49. Virtual Nodes – Load Distribution
(Figure: the ring with tokens A-I spread around it.)
One of several strategies: random tokens per physical node, partition by token value
Node 1: tokens A, E, G
Node 2: tokens C, F, H
Node 3: tokens B, D, I
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
50. Virtual Nodes – Load Distribution
(Figure: comparison chart of load-distribution strategies; see source.)
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
51. Replication & Consistency (Quorum)
N = number of nodes with a replica of the data
W = number of replicas that must acknowledge the update(*)
R = minimum number of replicas that must participate in a successful read operation
(*) but the data will be written to N nodes no matter what
W+R>N: strong consistency (usually N=3, R=W=2)
W=N, R=1: optimised for reads
W=1, R=N: optimised for writes (durability not guaranteed in the presence of failures)
W+R<N: weak consistency
Latency is determined by the slowest of the R replicas for a read, or of the W replicas for a write.
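The rule is easy to state in code (illustrative only):

```python
def classify(n: int, w: int, r: int) -> str:
    """W replicas ack a write, R replicas serve a read, N hold the data."""
    if w + r > n:
        return "strong consistency (read and write sets must overlap)"
    return "weak/eventual consistency (a read may miss the latest write)"

print(classify(3, 2, 2))  # typical Dynamo-style default: strong
print(classify(3, 1, 1))  # fast but weak: overlap not guaranteed
```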
52. Vector Clocks & Conflict Detection
A causality-based partial order over the events that happen in the system.
Document version history: a counter for each node that updated the document.
If all update counters in V1 are smaller than or equal to all update counters in V2, then V1 precedes V2.
http://en.wikipedia.org/wiki/Vector_clock
53. Vector Clocks & Conflict Detection
Vector clocks can detect a conflict; conflict resolution is left to the application or the user.
The application might resolve conflicts by checking relative timestamps, or with other strategies (such as merging the changes).
Vector clocks can grow quite large (!)
http://en.wikipedia.org/wiki/Vector_clock
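A small sketch of the comparison rule from the previous slide, with a clock represented as a dict of per-node update counters:

```python
def happens_before(v1: dict, v2: dict) -> bool:
    """V1 precedes V2 iff every counter in V1 is <= its counter in V2."""
    nodes = set(v1) | set(v2)
    return all(v1.get(n, 0) <= v2.get(n, 0) for n in nodes) and v1 != v2

def conflict(v1: dict, v2: dict) -> bool:
    """Neither clock precedes the other: concurrent updates, i.e. a conflict."""
    return v1 != v2 and not happens_before(v1, v2) and not happens_before(v2, v1)

a = {"node1": 2, "node2": 1}
b = {"node1": 2, "node2": 2}   # b descends from a
c = {"node1": 3, "node2": 1}   # c also descends from a
print(happens_before(a, b), conflict(b, c))   # True True - b and c conflict
```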
54. Gossip Protocol + Hinted Handoff
(Figure: the ring, nodes A-F.)
Gossip: periodic, pairwise, inter-process interactions of bounded size among randomly-chosen peers
http://en.wikipedia.org/wiki/Vector_clock
55. Gossip Protocol + Hinted Handoff
Failure detection by gossip - an example exchange between two peers:
"I can't see B. It might be down, but I need some ACK. My Merkle tree root for range XY is 'ab03Idab4a385afda'."
"I can't see B either, and my Merkle tree root for range XY is different! B must be down then. Let's disable it."
http://en.wikipedia.org/wiki/Vector_clock
56. Gossip Protocol + Hinted Handoff
Hinted handoff - a write redirected to a substitute node:
"My canonical node is supposed to be B."
"I see. Well, I'll take care of it for now, and let B know when B is available again."
http://en.wikipedia.org/wiki/Vector_clock
57. Merkle Trees (Hash Trees)
Leaves: hashes of data blocks.
Nodes: hashes of their children.
Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred data.
http://en.wikipedia.org/wiki/Hash_tree
58. Merkle Trees (Hash Trees)
(Figure: Node A and Node B exchange root hashes via gossip.)
Minimal data transfer
Differences are easy to locate
SHA-1, Whirlpool or Tiger (TTH) hash functions
http://www.slideshare.net/quipo/modern-algorithms-and-data-structures-1-bloom-filters-merkle-trees
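A compact sketch of the structure in Python (SHA-1 via hashlib; duplicating the last node on odd levels is just one possible convention):

```python
import hashlib

def sha1(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def merkle_root(blocks: list) -> bytes:
    level = [sha1(b) for b in blocks]          # leaves: hashes of data blocks
    while len(level) > 1:
        if len(level) % 2:                     # duplicate last node if odd
            level.append(level[-1])
        # inner nodes: hashes of their children
        level = [sha1(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_b = [b"k1=v1", b"k2=v2", b"k3=STALE", b"k4=v4"]
# Equal roots mean the replicas agree; on mismatch, compare subtree hashes to
# locate the divergent block instead of shipping the whole range.
print(merkle_root(replica_a) == merkle_root(replica_b))   # False
```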
59. Read Repair
(Figure sequence: a client issues GET(k, R=2) against the ring.)
Replicas B and C reply with K=XYZ (v.2); replica D still holds the stale K=ABC (v.1).
The coordinator returns the newest version and sends UPDATE(k, XYZ) to the stale replica.
http://en.wikipedia.org/wiki/Vector_clock
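The sequence can be sketched in a few lines (plain integer versions stand in for Dynamo's vector clocks; node names follow the figure):

```python
# Replies collected by the coordinator: node -> (value, version).
replies = {"B": ("XYZ", 2), "C": ("XYZ", 2), "D": ("ABC", 1)}

def get_with_read_repair(store: dict) -> str:
    value, version = max(store.values(), key=lambda vv: vv[1])  # newest wins
    for node, (_, ver) in store.items():
        if ver < version:
            store[node] = (value, version)   # UPDATE(k, XYZ) to stale replica
    return value

print(get_with_read_repair(replies), replies["D"])   # XYZ ('XYZ', 2) - D repaired
```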
62. Google Bigtable
Motivation
Key Features
System Architecture
Building Blocks
Data Model
SSTable/Tablet/Table
IO / Compaction
63. Motivation
□ Lots of (semi-)structured data at Google
URLs: contents, crawl metadata, links, anchors, PageRank, …
Per-user data: user preference settings, recent queries/search results, …
Geographic locations: physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, …
□ Scale is large
Billions of URLs, many versions/page (~20K/version)
Hundreds of millions of users, thousands of queries/sec
100 TB+ of satellite image data
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
64. Key Features
□ Google, ~2006
□ Distributed multi-level map
□ Fault-tolerant, persistent
□ Scalable
Thousands of servers
Terabytes of in-memory data
Petabytes of disk-based data
Millions of reads/writes per second, efficient scans
□ Self-managing
Servers can be added/removed dynamically
Servers adjust to load imbalance
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
66. Building Blocks
□Building blocks:
Google File System (GFS): Raw storage
Scheduler: Google Work Queue, schedules jobs onto
machines
Lock service: Chubby, distributed lock manager
MapReduce: simplified large-scale data processing
□BigTable uses of building blocks:
GFS: stores persistent data (SSTable file format for storage
of data)
Scheduler: schedules jobs involved in BigTable serving
Lock service: master election, location bootstrapping
Map Reduce: often used to read/write BigTable data
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
67. Google File System
□Large-scale distributed “filesystem”
□Master: responsible for metadata
□Chunk servers: responsible for reading and
writing large chunks of data
□Chunks replicated on 3 machines, master
responsible for ensuring replicas exist
□OSDI ’04 Paper
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
68. Chubby
□Distributed Lock Service
□File System {directory/file}, for locking
Coarse-grained locks, can store small amount of data in a lock
□High Availability
5 replicas, one elected as master
Service live when majority is live
Uses Paxos algorithm to solve consensus
□A client leases a session with the service
□Also an OSDI ’06 Paper
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
69. Data model
□“Sparse, distributed, persistent, multidim. sorted map”
□<Row, Column, Timestamp> triple for key - lookup, insert, and
delete API
□Arbitrary “columns” on a row-by-row basis
Column family:qualifier. Family is heavyweight, qualifier lightweight
Column-oriented physical store - rows are sparse!
□Does not support a relational model
No table-wide integrity constraints
No multirow transactions
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf
70. SSTable
□ Immutable, sorted file of key-value pairs
□ Chunks of data plus an index
Index is of block ranges, not values
Triplicated across three machines in GFS
(Figure: an SSTable as a sequence of 64K blocks plus an index.)
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt
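A toy illustration of the lookup path (block contents and sizes invented): binary-search the sparse index of block start keys, then scan within a single block.

```python
import bisect

blocks = [                                  # sorted, immutable "64K blocks"
    [("aardvark", 1), ("ant", 2)],
    [("apple", 3), ("axolotl", 4)],
]
index = [blk[0][0] for blk in blocks]       # index of block ranges, not values

def sstable_get(key: str):
    i = bisect.bisect_right(index, key) - 1  # block that could hold the key
    if i < 0:
        return None                          # key sorts before the first block
    return dict(blocks[i]).get(key)          # scan within the single block

print(sstable_get("apple"))   # 3
```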
71. Tablet
□ Contains some range of rows of the table
Dynamically partitioned range of rows
□ Built out of multiple SSTables
□ Typical size: 100~200 MB
□ Tablets are stored on tablet servers (~100 per server)
□ Unit of distribution and load balancing
(Figure: a tablet covering the row range aardvark..apple, built from two SSTables of 64K blocks plus indexes.)
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt
72. Table
□ Multiple tablets (table segments) make up a table
□ SSTables can be shared between tablets
□ Tablets do not overlap; SSTables can overlap
(Figure: two adjacent tablets - aardvark..apple and apple_two_E..boat - over four SSTables, one of them shared.)
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt
73. Tablets & Splitting
Large tables are broken into tablets at row boundaries.
(Figure: rows aaa.com … zuppa.com/menu.html with columns "language" and "contents"; e.g. cnn.com has language EN and contents "<html>…". The row range is split into tablets.)
http://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt
74. Finding a Tablet
Approach: 3-level hierarchical lookup scheme for tablets
– Location is ip:port of relevant server, all stored in META tablets
– 1st level: bootstrapped from lock server, points to owner of META0
– 2nd level: Uses META0 data to find owner of appropriate META1 tablet
– 3rd level: META1 table holds locations of tablets of all other tables
META1 table itself can be split into multiple tablets
http://labs.google.com/papers/bigtable-osdi06.pdf
75. Servers
Tablet servers manage tablets, multiple tablets per server. Each tablet is 100-200 MB
–Each tablet lives at only one server
–Tablet server splits tablets that get too big
Master responsible for load balancing and fault tolerance
–Uses Chubby to monitor the health of tablet servers, restarts failed servers
–GFS replicates data. Prefer to start a tablet server on the same machine where the data already is
76. BigTable I/O
(Figure: writes go to the tablet log (commit log in GFS) and then into the in-memory memtable; a minor compaction flushes the memtable to an SSTable; SSTables live in GFS, compressed with BMDiff/Zippy; merging/major compaction (GC) folds SSTables together.)
□ The commit log stores the writes
Recent writes are stored in the memtable
Older writes are stored in SSTables
□ A read operation sees a merged view of the memtable and the SSTables
□ Authorization is checked against an ACL stored in Chubby
http://www.slideshare.net/kingherc/bigtable-and-dynamo
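A minimal sketch of that read/write path (plain dicts stand in for the memtable and the SSTables; the commit-log append is omitted):

```python
memtable: dict = {}
sstables: list = []           # newest first; each dict stands in for an SSTable

def write(key, value):
    memtable[key] = value     # commit-log append omitted for brevity

def read(key):
    # Merged view: check the memtable first, then SSTables newest-to-oldest.
    for layer in [memtable] + sstables:
        if key in layer:
            return layer[key]
    return None

def minor_compaction():
    global memtable
    sstables.insert(0, dict(sorted(memtable.items())))  # flush, sorted by key
    memtable = {}

write("row1", "v1"); minor_compaction(); write("row1", "v2")
print(read("row1"))   # v2 - the memtable shadows the older SSTable entry
```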
77. Compactions
Minor compaction – convert the memtable
into an SSTable
Reduce memory usage
Reduce log traffic on restart
Merging compaction
Reduce number of SSTables
Good place to apply policy “keep only N versions”
Major compaction
Merging compaction that results in only one
SSTable
No deletion records, only live data
78. Locality Groups
Group column families together into an SSTable
–Avoid mingling data, i.e. page contents and page metadata
–Can keep some groups entirely in memory
Locality groups can be compressed
Bloom filters on locality groups – avoid searching an SSTable
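A tiny Bloom-filter sketch (toy sizes; salted MD5 as the k hash functions is an arbitrary choice) showing the "definitely absent / maybe present" test that lets a read skip SSTables:

```python
import hashlib

M, K = 1024, 3                       # bits in the array, number of hash probes

def probes(key: str):
    for i in range(K):
        h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
        yield int(h, 16) % M

bits = bytearray(M)

def add(key: str):
    for p in probes(key):
        bits[p] = 1

def might_contain(key: str) -> bool:
    # False means definitely absent (skip this SSTable); True means maybe.
    return all(bits[p] for p in probes(key))

add("row-123")
print(might_contain("row-123"), might_contain("row-999"))  # True, (almost surely) False
```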
82. NoSQL Categories I
□ Key-Value Stores
□ Based on DHTs / Amazon's Dynamo paper
□ Data model: (global) collection of K-V pairs
□ Examples: Voldemort, Tokyo, Riak, Redis
□ Column Stores
□ Based on Google's BigTable paper
□ Data model: big table, column families
□ Examples: HBase, Cassandra, Hypertable
http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases
83. NoSQL Categories II
□ Document Stores
□ Inspired by Lotus Notes
□ Data model: collections of K-V collections
□ Examples: CouchDB, MongoDB
□ Graph Databases
□ Inspired by Euler & graph theory
□ Data model: nodes, relationships, K-V pairs on both
□ Examples: Neo4j, FlockDB
http://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases
84. NoSQL Data Model Comparison
http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/
86. Voldemort AP
□ LinkedIn, 2009, Apache 2.0, Java
□ Model: Key-Value(Data Model), Dynamo(Distributed Model)
□ Main point: Data is automatically replicated and partitioned to multiple
servers
□ Concurrency Control: MVCC
□ Transaction: No
□ Data Storage: BDB, MySQL, RAM
□ Key Features
□ Data is automatically replicated and partitioned to multiple servers
□ Simple Optimistic Locking for multi-row updates
□ Pluggable Storage Engine
□ Multiple read-writes
□ Consistent-hashing for data distribution
□ Data Versioning
□ Major Users: LinkedIn, GILT
□ Best Use: Real-time, large-scale
http://www.slideshare.net/adorepump/voldemort-nosql, http://nosql.findthebest.com/l/5/Voldemort
87. Voldemort Pros / Cons AP
Pros:
Highly customizable - each layer of the stack can be replaced as needed
Data elements are versioned during changes
All nodes are independent - no SPOF
Very, very fast reads
Cons:
Versioning means lots of disk space being used
Does not support range queries
No complex query filters
All joins must be done in code
No foreign key constraints
No triggers
Support can be hard to find
http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
88. Voldemort : Logical Architecture AP
□ Dynamo DHT implementation
□ Consistent hashing, vector clocks
□ Conflicts resolved at read and write time
□ Serialization: JSON, Java String, byte[], Thrift, Avro, Protobuf
LICENSE: Apache 2.0
LANGUAGE: Java
API / PROTOCOL: HTTP / Sockets; Java, Thrift, Avro, ProtoBuf
CONCURRENCY: MVCC; simple optimistic locking for multi-row updates; pluggable storage engine
http://www.project-voldemort.com/voldemort/design.html , http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
90. Riak AP
□ Basho Technologies, 2010, Apache 2.0, Erlang, C
□ Model: Key-Value(Data Model), Dynamo(Distributed Model)
□ Main Point: Fault tolerance
□ Protocol: HTTP/REST or custom binary
□ Transaction: No
□ Data Storage: Plug-in
□ Features
□ Tunable trade-offs for distribution and replication (N, R, W)
□ Pre- and post-commit hooks in JavaScript or Erlang
□ Map/Reduce in JavaScript and Erlang
□ Links & link walking: use it as a graph database
□ Secondary indices: but only one at once
□ Large object support (Luwak)
□ Major Users: Mozilla, GitHub, Comcast, AOL, Ask.com
□ Best Uses: high availability
□ Example Usage: CMS, text search, Point-of-sales data collection.
http://nosql.findthebest.com/l/6/Riak, http://www.slideshare.net/seancribbs/introduction-to-riak-red-dirt-ruby-conf-training
91. Riak Pros / Cons AP
Pros:
All nodes are equal - no SPOF
Horizontal scalability
Full-text search
RESTful interface (and HTTP)
Consistency level tunable on each operation
Secondary indexes available
Map/Reduce (JavaScript & Erlang only)
Cons:
Not meant for small, discrete and numerous datapoints
Getting data in is great; getting it out, not so much
Security is non-existent: "Riak assumes the internal environment is trusted"
Conflict resolution can bubble up to the client if not careful
Erlang is fast, but it's got a serious learning curve
http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
92. Riak AP
Buckets -> K-V
"Links" (~relations)
Targeted JS Map/Reduce
Tunable consistency (one-quorum-all)
LICENSE: Apache 2.0
LANGUAGE: C, Erlang
API / PROTOCOL: REST HTTP, ProtoBuf
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
93. Riak Logical Architecture AP
http://bcho.tistory.com/621, http://basho.com/technology/technology-stack/
94. Redis CP
□ VMWare, 2009, BSD, C/C++
□ Model: Key-Value(Data Model), Master-Slave(Distributed Model)
□ Main Point: Blazing fast
□ Protocol: Telnet-like
□ Concurrency Control: Locks
□ Transaction: Yes
□ Data Storage: RAM (in-memory)
□ Features
□ Disk-backed in-memory database
□ Currently without disk-swap (VM and Diskstore were abandoned)
□ Master-slave replication
□ Pub/Sub lets one implement messaging
□ Major Users: StackOverflow, flickr, GitHub, Blizzard, Digg
□ Best Uses: rapidly changing data, frequently written, rarely read
statistical data
□ Example Usage: Stock prices. Analytics. Real-time data
http://nosql.findthebest.com/l/6/Riak, http://www.slideshare.net/seancribbs/introduction-to-riak-red-dirt-ruby-conf-training
95. Redis Pros / Cons CP
Pros:
Transactional support
Blob storage
Support for sets, lists and sorted sets
Support for Publish-Subscribe (Pub-Sub) messaging
Robust set of operators
Cons:
Entirely in memory
Master-slave replication (instead of master-master)
Security is non-existent: designed to be used in trusted environments
Does not support encryption
Support can be hard to find
http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
96. Redis CP
K-V store as a "Data Structures Server"
Map, Set, Sorted Set, Linked List
Set/Queue operations, counters, Pub-Sub, volatile keys
10-100K ops/sec (whole dataset in RAM + VM)
Persistence via snapshotting (tunable fsync frequency)
Distributed if the client supports consistent hashing
LICENSE: BSD
LANGUAGE: ANSI C, C++
API / PROTOCOL: Telnet-like; clients for many languages
PERSISTENCE: in memory, background snapshots
REPLICATION: master / slave
http://redis.io/presentation/Redis_Cluster.pdf, http://www.slideshare.net/quipo/nosql-databases-why-what-and-when
98. Cassandra AP
□ ASF, 2008, Apache 2.0, Java
□ Model: Column(Data Model), Dynamo(Distributed Model)
□ Main Point: Best of BigTable and Dynamo
□ Protocol: Thrift, Avro
□ Concurrency Control: MVCC
□ Transaction: No
□ Data Storage: Disk
□ Features
□ Tunable trade-offs for distribution and replication (N, R, W)
□ Querying by column, range of keys
□ BigTable-like features: columns, column families
□ Has secondary indices
□ Writes are much faster than reads (!)
□ Map/reduce possible with Apache Hadoop
□ All nodes are similar, as opposed to Hadoop/HBase
□ Major Users: Facebook, Netflix, Twitter, Adobe, Digg
□ Best Uses: write often, read less
□ Example Usage: banking, finance, logging
http://nosql.findthebest.com/l/2/Cassandra,
99. Cassandra Pros / Cons AP
Pros:
Designed to span multiple datacenters
Peer-to-peer communication between nodes
No SPOF
Always writeable
Consistency level is tunable at run time
Supports secondary indexes
Supports Map/Reduce
Supports range queries
Cons:
No joins
No referential integrity
Written in Java - quite complex to administer and configure
Last update wins
http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
100. Cassandra AP
Data model of BigTable, infrastructure of Dynamo
Column: {col_name, col_value, timestamp}
LICENSE: Apache 2.0
LANGUAGE: Java
PROTOCOL: Thrift, Avro
PERSISTENCE: memtable, SSTable
CONSISTENCY: tunable R/W/N
http://nosql.findthebest.com/l/2/Cassandra,
101. Cassandra AP
Data model of BigTable, infrastructure of Dynamo
Super column: {super_column_name: [{col_name, col_value, timestamp}, …]}
http://nosql.findthebest.com/l/2/Cassandra,
102. Cassandra AP
Data model of BigTable, infrastructure of Dynamo
Column family: row_key -> [{col_name, col_value, timestamp}, …]
http://nosql.findthebest.com/l/2/Cassandra,
103. Cassandra AP
Data model of BigTable, infrastructure of Dynamo
Super column family: row_key -> {super_column_name: [{col_name, col_value, timestamp}, …], …}
Access API: keyspace.get("column_family", key, ["super_column"], "column")
http://nosql.findthebest.com/l/2/Cassandra,
104. Cassandra AP
Data model of BigTable, infrastructure of Dynamo
(Figure: a P2P gossip ring of nodes A-F.)
Partitioners: RandomPartitioner (MD5), OrderPreservingPartitioner
Consistency levels per request: ONE, QUORUM, ALL
Range scans, full-text indexing (Solandra)
http://nosql.findthebest.com/l/2/Cassandra,
105. Cassandra - Data Model AP
(Figure: Cassandra as a giant hash table - a hash key maps to an object.)
http://javamaster.wordpress.com/2010/03/22/apache-cassandra-quick-tour/
106. Cassandra - Data Model AP
• Column: a name-value structure
  {name:"emailAddress", value:"cassandra@apache.org"}
  {name:"age", value:"20"}
• Column Family: a set of columns; a hash key maps to a list of columns
  UserProfile={
    Cassandra={emailAddress:"casandra@apache.org", age:"20"}
    TerryCho={emailAddress:"terry.cho@apache.org", gender:"male"}
    Cath={emailAddress:"cath@apache.org", age:"20", gender:"female", address:"Seoul"}
  }
• Super-Column: a column that contains columns, e.g. username -> {firstname, lastname}
  {name:"username"
   value: firstname{name:"firstname", value="Terry"}
   value: lastname{name:"lastname", value="Cho"}
  }
• Super-Column Family: a column family that contains column families
  UserList={
    Cath:{
      username:{firstname:"Cath", lastname:"Yoon"}
      address:{city:"Seoul", postcode:"1234"}}
    Terry:{
      username:{firstname:"Terry", lastname:"Cho"}
      account:{bank:"hana", accounted:"1234"}}
  }
http://javamaster.wordpress.com/2010/03/22/apache-cassandra-quick-tour/
107. Cassandra – Use Case: eBay AP
□ A glimpse of eBay's Cassandra deployment
□ Dozens of nodes across multiple clusters
□ 200 TB+ storage provisioned
□ 400M+ writes & 100M+ reads per day, and growing
□ #1: Social Signals on eBay product & item pages
□ #2: Hunch taste graph for eBay users & items
□ #3: Time series use cases (many):
□ Mobile notification logging and tracking
□ Tracking for fraud detection
□ SOA request/response payload logging
□ RedLaser server logs and analytics
□ Cassandra meets the requirements
□ Needs scalable counters
□ Needs real- (or near-real-) time analytics on collected social data
□ Needs good write performance
□ Reads are not latency sensitive
http://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376
108. HBase CP
□ ASF, 2010, Apache 2.0, Java
□ Model: Column(Data Model), Bigtable(Distributed Model)
□ Main Point: Billions of rows X millions of columns
□ Protocol: HTTP/REST (also Thrift)
□ Concurrency Control: Locks
□ Transaction: Local
□ Data Storage: HDFS
□ Features
□ Query predicate push down via server side scan and get filters
□ Optimizations for real time queries
□ A high performance Thrift gateway
□ HTTP supports XML, Protobuf, and binary
□ Rolling restart for configuration changes and minor upgrades
□ Random access performance is like MySQL
□ A cluster consists of several different types of nodes
□ Major Users: Facebook
□ Best Use: random read write to large database
□ Example Usage: Live messaging
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/10/HBase
109. HBase Pros / Cons CP
Pros:
Map/Reduce support
More of a CA approach than AP
Supports predicate push-down for performance gains
Automatic partitioning and rebalancing of regions
Data is stored in sorted order (not indexed)
RESTful API
Strong and vibrant ecosystem
Cons:
Secondary indexes generally not supported
Security is non-existent
Requires a Hadoop infrastructure to function
http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
110. HBase vs. BigTable Terminology CP
HBase / BigTable:
Table / Table
Region / Tablet
RegionServer / Tablet Server
MemStore / Memtable
HFile / SSTable
WAL / Commit Log
Flush / Minor compaction
Minor Compaction / Merging compaction
Major Compaction / Major compaction
HDFS / GFS
MapReduce / MapReduce
ZooKeeper / Chubby
http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
111. HBase: Architecture CP
• ZooKeeper as coordinator (instead of Chubby)
• HMaster: support for multiple masters
• HDFS, S3, S3N, EBS (with Gzip/LZO column-family compression)
• Data sorted by key but evenly distributed across the cluster
LICENSE: Apache 2.0
LANGUAGE: Java
PROTOCOL: REST, HTTP, Thrift, Avro
PERSISTENCE: memtable, SSTable
http://nosql.findthebest.com/l/10/HBase, http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
114. HBase - Use Case: Facebook Message Service CP
□ New Message Service
□ Combines chat, SMS, email, and Messages into a real-time conversation
□ Data pattern
A short set of temporal data that tends to be volatile
An ever-growing set of data that rarely gets accessed
□ The chat service supports over 300 million users who send over 120 billion messages per month
□ Cassandra's eventual consistency model turned out to be a difficult pattern to reconcile for the new Messages infrastructure
□ HBase meets their requirements
□ Has a simpler consistency model than Cassandra
□ Very good scalability and performance for their data patterns
□ Most feature-rich for their requirements: auto load balancing and failover, compression support, multiple shards per server, etc.
□ HDFS, the filesystem used by HBase, supports replication, end-to-end checksums, and automatic rebalancing
□ Facebook's operational teams have a lot of experience using HDFS, because Facebook is a big user of Hadoop and Hadoop uses HDFS as its distributed file system
http://nosql.findthebest.com/l/10/HBase, http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
115. HBase – Use Case: Adobe CP
□ When we started pushing 40 million records, HBase squeaked and cracked. After 20M inserts it failed so badly it wouldn't respond or restart; it mangled the data completely and we had to start over.
The HBase community turned out to be great: they jumped in and helped us, and upgrading to a new HBase version fixed our problems.
□ In December 2008, our HBase cluster would write data but couldn't answer reads correctly.
I was able to make another backup and restore it on a MySQL cluster.
□ We decided to switch focus in the beginning of 2009. We were going to provide a generic, real-time, structured data storage and processing system that could handle any data volume.
http://hstack.org/why-were-using-hbase-part-1
117. MongoDB CP
□ 10gen, 2009, AGPL, C++
□ Model: Document(Data Model), Bigtable(Distributed Model)
□ Main Point: Full Index Support, Querying, Easy to Use
□ Protocol: Custom, binary(BSON)
□ Concurrency Control: Locks
□ Transaction: No
□ Data Storage: Disk
□ Features
□ Master/slave replication (auto failover with replica sets)
□ Sharding built-in
□ Uses memory mapped files for data storage
□ GridFS to store big data + metadata (not actually an FS)
□ Major Users: Craigslist, Foursquare, SAP, MTV, Disney, Shutterfly, Intuit
□ Example Usage: CMS system, comment storage, voting
□ Best Use: dynamic queries, frequently written, rarely read statistical data
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/10/HBase
118. MongoDB Pros / Cons CP
Pros:
Auto-sharding
Auto-failover
Update in place
Spatial index support
Ad hoc query support
Any field in Mongo can be indexed
Very, very popular (lots of production deployments)
Very easy transition from SQL
Cons:
Does not support JSON: BSON instead
Master-slave replication
Has had some growing pains (e.g. Foursquare outage)
Not RESTful by default
Failures require a manual database repair operation (similar to MySQL)
Replication for availability, not performance
http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
120. MongoDB - Architecture CP
• mongod: the core database process
• mongos: the controller and query router for sharded clusters
LICENSE: AGPL v3
LANGUAGE: C++
PROTOCOL: REST/BSON
PERSISTENCE: B+ trees, snapshots
CONCURRENCY: in-place updates
REPLICATION: master-slave, replica sets
http://sett.ociweb.com/sett/settAug2011.html, http://www.infoq.com/articles/mongodb-java-php-python
121. CouchDB AP
□ ASF, 2005, Apache 2.0, Erlang
□ Model: Document(Data Model), Notes(Distributed Model)
□ Main Point: DB consistency, easy to use
□ Protocol: HTTP, REST
□ Concurrency Control: MVCC
□ Transaction: No
□ Data Storage: Disk
□ Features
□ ACID Semantics
□ Map/Reduce Views and Indexes
□ Distributed Architecture with Replication
□ Built for Offline
□ Major Users: LotsOfWords.com, CERN, BBC,
□ Example Usage: CRM, CMS systems
□ Best Use: accumulating, occasionally changing data with pre-defined
queries
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/3/CouchDB
122. CouchDB Pros / Cons AP
Pros:
Very simple API for development
MVCC support for read consistency
Full Map/Reduce support
Data is versioned
Secondary indexes supported
Some security support
RESTful API, JSON support
Materialized views with incremental update support
Cons:
The simple API for development is somewhat limited
No foreign keys
Conflict resolution devolves to the application
Versioning requires extensive disk space
Versioning places a large load on I/O channels
Replication for performance, not availability
http://www.slideshare.net/cscyphers/big-data-platforms-an-overview
123. CouchDB AP
□ ASF, 2005, Apache 2.0, Erlang
LICENSE: Apache 2.0
LANGUAGE: Erlang
PROTOCOL: REST/JSON
PERSISTENCE: append-only B+ tree
CONCURRENCY: MVCC
CONSISTENCY: crash-only design
REPLICATION: multi-master
http://www.slideshare.net/quipo/nosql-databases-why-what-and-when