A very BIG data Company
The challenge of serving massive batch-computed data sets online
David Gruzman

Serving batch-computed data
► Today we will discuss the case of a multi-terabyte dataset which is periodically recalculated and has to be served in real time.
► SimilarWeb allowed us to reveal the internals of their data flow.
SimilarWeb data flow – the context
► The company assembles billions of events from its panel on a daily basis.
► A fast-growing Hadoop cluster is used to process this data with various kinds of statistical analysis and machine learning.
► The data model is “web scale”. The data derived from the raw events is processed into “top pages”, “demography”, “keywords” and many other metrics the company assembles.
► Problem dimensionality: per domain, per day, per country. More dimensions might appear.
How data is calculated
► Data is imported into HDFS from the farm of application servers.
► A set of MR jobs, as well as Hive scripts, is used to do the data processing.
► The result data has a common key-value structure, where the key is our dimensions or a subset of them. For example:
Key: “cnn.com_01012013_USA”
Value: “Top Pages: Page1, …. statistics:.... “
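A minimal sketch of composing such a composite key. This is our illustration only – the MetricKey helper and the field order (domain, date, country) are inferred from the example above, not SimilarWeb's actual code:

    // Hypothetical helper; mirrors the key layout shown in the example.
    public final class MetricKey {
        private MetricKey() {}

        public static String of(String domain, String date, String country) {
            return domain + "_" + date + "_" + country;
        }
        // MetricKey.of("cnn.com", "01012013", "USA") -> "cnn.com_01012013_USA"
    }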
Abstract schema of the relevant part of SimilarWeb IT
[Diagram: App Servers → Hadoop MapReduce → Hadoop HBase Stage → two HBase Production clusters]
HBase under heavy inserts
► First of all – it does work.
► The question – what was done...
HBase: split storms
► When you insert data evenly into many regions, all of them start splitting at roughly the same time. HBase does not like it... It becomes unavailable, insertion jobs fail, leases expire, etc.
► Solution: pre-split the table and disable automatic splitting (see the sketch below).
► Price: it is hard to achieve an even distribution of the data among regions. Hotspots are possible...
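A minimal sketch of the pre-split approach, using the 0.94-era HBase Java API. The table name "metrics", the column family "d" and the split keys are placeholders; raising the max region file size is one common way to effectively disable automatic splits:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PreSplit {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor desc = new HTableDescriptor("metrics");
            desc.addFamily(new HColumnDescriptor("d"));
            // Regions never reach the split threshold, so they never split on their own.
            desc.setMaxFileSize(Long.MAX_VALUE);

            // Pre-create region boundaries so the load is spread from the first insert.
            byte[][] splitKeys = {
                Bytes.toBytes("c"), Bytes.toBytes("g"),
                Bytes.toBytes("m"), Bytes.toBytes("s")
            };
            admin.createTable(desc, splitKeys);
            admin.close();
        }
    }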
Compaction storms
► Under heavy load on all regions, all of them start a minor compaction at the same time (a couple of tuning knobs are sketched below).
► The results are similar to the split storm... Nothing good.
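As a hedged aside, two HBase configuration knobs that are commonly tuned to spread compactions out; the values below are illustrative only, not something the talk prescribes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class CompactionTuning {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // 0 disables time-triggered major compactions; run them manually off-peak.
            conf.setLong("hbase.hregion.majorcompaction", 0L);
            // Let more store files pile up before a minor compaction kicks in.
            conf.setInt("hbase.hstore.compactionThreshold", 5);
        }
    }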
Inherent problem – delayed work
► HBase does not do ALL the work required during an insert.
► Part of the work is delayed until compaction.
► A system that delays work is inherently problematic under prolonged high load.
► It is good at handling spikes of activity, not a steady heavy load.
Massive insert problem
► There is a lot of overhead in randomly inserting data.
► What happens is that MapReduce produces already sorted data, and HBase sorts it again.
► HBase sorts data constantly, while MR does it in batch, which is inherently more efficient.
► HBase is a strongly consistent system, and under heavy load all kinds of (lease-related) problems happen.
Domino effect
HBase snapshots come to the rescue
► A snapshot is the capability to get a “point in time” state of the table (a minimal sketch follows below).
► Technically, a snapshot is the list of files which constitute the table, so taking a snapshot is a pure metadata operation.
► When files of the table are to be deleted, they are moved to the archive directory.
► Thus all operations like clone and restore are just file renames and metadata changes.
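A minimal sketch using the era's HBaseAdmin API; the snapshot and table names are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class SnapshotExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            // Metadata-only: records the list of files currently backing the table.
            admin.snapshot("metrics_20130101", "metrics");
            // Clone is also metadata-only: the new table references the same files.
            admin.cloneSnapshot("metrics_20130101", "metrics_clone");
            admin.close();
        }
    }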
HBase – snapshot export
[Diagram: a region's pre-snapshot files “Before 1” and “Before 2” are moved/renamed into the archive directory; “File after” is created after the snapshot]
HBase – snapshot export
► There is an additional capability of snapshots – export (see the sketch below).
► Technically it is like DISTCP, and it does not even require a live cluster on the destination side. Only HDFS has to be operational.
► What we gain – DISTCP speed and scalability.
► What happens – the files are copied into the archive directory, and HBase uses this structure when the snapshot is cloned or restored.
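A sketch of driving the export tool from Java; the snapshot name, the destination namenode address and the mapper count are placeholders (the same tool is normally invoked from the hbase command line):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.snapshot.ExportSnapshot;
    import org.apache.hadoop.util.ToolRunner;

    public class ExportToProd {
        public static void main(String[] args) throws Exception {
            // Runs a DISTCP-like MR job; only HDFS must be alive on the destination.
            int rc = ToolRunner.run(HBaseConfiguration.create(), new ExportSnapshot(),
                    new String[] {
                            "-snapshot", "metrics_20130101",
                            "-copy-to", "hdfs://prod-nn:8020/hbase",
                            "-mappers", "16"
                    });
            System.exit(rc);
        }
    }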
So how do snapshots help us?
► As you remember, SimilarWeb has several HBase clusters: one is used as the company data warehouse and two are used to serve production.
► So we prepare the data on the cluster where we have long timeouts, and then move it to the production clusters using snapshots.
So we get to the following solution
[Diagram: App Servers → Hadoop MapReduce → Hadoop HBase Stage → snapshot export → two HBase Production clusters]
Is it ideal?
► We effectively minimized the impact on HBase region servers.
► But we are left with the HBase high-availability problem.
► Currently we have two HBase production clusters to overcome it.
► It works, but it is far from ideal HW utilization.
Conceptual problem
► In production we do not need strong consistency, yet we pay for it with partition tolerance in CAP-theorem terms. In practice, it is an availability problem.
► We do not need random writes, and most of HBase is built for them.
► We actually have a more complex system than we need.
BigTable vs Dynamo
► There are two kinds of NoSQL systems – those built after BigTable (HBase, Hypertable) and those built after Dynamo (Cassandra, Voldemort, …).
► BigTable style – good for a data warehouse, where the capability to scan data ranges is important.
► Dynamo style – good for online serving, since these systems are more highly available.
Evaluation process
► We decided to research which system better suits the need.
► The need was formulated as “to be able to prepare data files offline and copy them into the system at the file level.”
► In addition, high availability is a must, so systems built around the consistent-hashing idea were preferred.
ElephantDB
► https://github.com/nathanmarz/elephantdb
► This is a system created exactly for this case.
► It is capable of serving data from an index prepared offline.
► It is very simple – about 5K lines of code.
► Main drawback – unknown... Very few known usages.
ElephantDB
► Berkeley DB Java Edition is used to serve the local indexes (a lookup sketch follows below). This is common with Voldemort, which also has such an option.
► An MR job (Cascading) is used to prepare the indexes.
► Indexes are cached locally by the servers in the ring.
► There is an MR job for incremental changes of the data.
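To give the flavor of the local lookup each ring server performs, a minimal Berkeley DB JE read sketch; the environment path, database name and key are placeholders, and this is not ElephantDB's actual code:

    import java.io.File;
    import com.sleepycat.je.Database;
    import com.sleepycat.je.DatabaseConfig;
    import com.sleepycat.je.DatabaseEntry;
    import com.sleepycat.je.Environment;
    import com.sleepycat.je.EnvironmentConfig;
    import com.sleepycat.je.LockMode;
    import com.sleepycat.je.OperationStatus;

    public class LocalIndexRead {
        public static void main(String[] args) {
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setReadOnly(true);
            Environment env = new Environment(new File("/data/shard-0"), envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setReadOnly(true);
            Database db = env.openDatabase(null, "index", dbConfig);

            DatabaseEntry key = new DatabaseEntry("cnn.com_01012013_USA".getBytes());
            DatabaseEntry value = new DatabaseEntry();
            // A point lookup against the locally cached, offline-built index.
            if (db.get(null, key, value, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                System.out.println(new String(value.getData()));
            }
            db.close();
            env.close();
        }
    }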
ElephantDB – batch read
► Having the data sitting in the DFS in an MR-friendly format enables us to do scans right there.
► Opposite example – we usually scan an HBase table in order to process it with MR. When there is no filtering / predicate push-down, this is a serious waste of resources.
ElephantDB – drawbacks
► The first one – rare use. We already mentioned it.
► It is read-only. In case we also need random writes, we will need to deploy another NoSQL.
Voldemort...
Project – Voldemort
► NoSQL
► Pluggable storage engines
► Pluggable serialization (TBD)
► Consistent hashing
► Eventual consistency
► Support for batch-computed read-only stores
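A minimal read sketch following the canonical Voldemort client pattern; the bootstrap URL and the store name "metrics" are placeholders:

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;

    public class VoldemortRead {
        public static void main(String[] args) {
            // Bootstrap from any node; the client fetches cluster metadata from it.
            StoreClientFactory factory = new SocketStoreClientFactory(
                    new ClientConfig().setBootstrapUrls("tcp://voldemort-node1:6666"));
            StoreClient<String, String> client = factory.getStoreClient("metrics");
            System.out.println(client.getValue("cnn.com_01012013_USA"));
        }
    }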
Voldemort logical architecture
How building the data works
► The job gets all the cluster configuration as a parameter.
► Therefore it can build the data specific to each node (a simplified sketch follows below).
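A simplified illustration of per-node sharding during the build, assuming a fixed node count. The real build job uses the consistent-hashing routing strategy from the cluster configuration, not this toy CRC32 mapping:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class NodeSharder {
        private final int numNodes;

        public NodeSharder(int numNodes) { this.numNodes = numNodes; }

        // Deterministic key -> node mapping. The real job derives this from
        // the consistent-hashing ring in the cluster configuration.
        public int nodeFor(String key) {
            CRC32 crc = new CRC32();
            crc.update(key.getBytes(StandardCharsets.UTF_8));
            return (int) (crc.getValue() % numNodes);
        }

        public static void main(String[] args) {
            NodeSharder sharder = new NodeSharder(3);
            System.out.println(sharder.nodeFor("cnn.com_01012013_USA"));
        }
    }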
Pull vs Push
► It was an interesting decision of the LinkedIn engineers to implement pull.
► The explanation is that Voldemort, as a system, should be able to throttle the data load in order to prevent performance degradation.
Performance
We tested on a 3-node dedicated cluster with SSDs.
► Throughput – 5-6K reads per second barely change the CPU level. The documentation talks about 20K requests per node.
► Latency – 10-15 milliseconds on non-cached data. We are researching this number; it sounds like too much for an SSD.
► 1-1.5 milliseconds for cached data.
Caching remarks
► Voldemort (like MongoDB) does not develop its own caching mechanism but offloads it to the OS.
► It is done by MMAPing the data files (see the sketch below).
► In my opinion, this is an inferior approach, since the OS does not have application-specific statistics and adds unneeded context switches.
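A minimal illustration of the mmap approach ("store.data" is a placeholder file name): the file is mapped into the process address space, and the OS page cache, not the application, decides which pages stay in RAM:

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MmapRead {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile file = new RandomAccessFile("store.data", "r");
                 FileChannel ch = file.getChannel()) {
                // Reads may page-fault and are served by the kernel's page
                // cache; there is no application-level cache at all.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                if (buf.capacity() > 0) {
                    System.out.println(buf.get(0));
                }
            }
        }
    }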
Voldemort summary
For:
► Easy to install. It took 2 hours to build the cluster, even without an installer.
► Pluggable storage engines.
► Support for efficient import of batch-computed data.
► Open source.
Against:
Method limitation
There is a limit to the pre-computing approach when the number of dimensions grows.
What we are doing – we have a proprietary layer built on LINQ and C# which performs the missing aggregations.
We are also evaluating Jethrodata, which can do it the SQL way. It is an RDBMS engine running on top of HDFS that gives full indexing with join and group-by capability.
ElephantDB information used
► http://www.slideshare.net/nathanmarz/elephantdb
► http://computerhelpkansascity.blogspot.co.il/2012/06
html