FiloDB: Reactive, Real-Time, In-Memory Time Series at Scale

This is not a contribution
Evan Chan ( @evanfchan)
Apple
October 2018

Solving the Time Series Problem

Operational Metrics

Requirements
• Massive scale, billions of metrics
• Resiliency and maximum uptime
• Real time (seconds, not minutes)
• Low latency querying
• High concurrency (thousands of dashboards, alerts)
• Easy debugging - flexible ad-hoc queries

What Users Wanted
• Flexible data model and queries, tag-based querying
• User-defined “tags” on metrics and data
• Prevents abuse of hierarchical system
• Can query across regions, other boundaries
• Flexible rollups
• Longer views of fine-grained data, or flexible retention policies

Design for the Cloud
• Internal cloud @Apple similar to public cloud
• Containers and “stateless” apps
• Use of Docker, etc. promotes more containers = more metrics
• Stateless = more frequent restarts, more UUIDs => more metrics
• Leverage hosted cloud services
• Hosted Cassandra, Kafka, other data services
• Let someone else manage persistent storage

Where are we going?
[Diagram labels: Metrics, Events, Tracing; Dashboards; Real-time Debugging; ???; Real-time ML/AI; Actionable Insights]

(Re)Introducing FiloDB
A Prometheus-compatible, Distributed, In-Memory Time Series Database
OPEN SOURCE!
http://www.github.com/filodb/FiloDB
Built on the proven reactive SMACK stack.

Core Principles
• Designed for Cloud Infrastructure
• Built for Scale and Resiliency
• Flexible Data Model
• Multi-Tenant

Proudly built on the Reactive Stack

In-Memory Time Series

Facebook Gorilla
• Keep most recent time series data IN MEMORY, stored using efficient time series encoding techniques
• Serve queries using separate process
• Allows dense, massively scalable TS storage + very fast, rich queries of recent data
• https://github.com/facebookarchive/beringei

Operational Metrics Flow
[Diagram: Apps and a Collector send metrics through Gateways into Kafka; FiloDB Nodes ingest from Kafka and write to Cassandra; Dashboards and Real-time Debugging query the FiloDB Nodes over HTTP.]

Data Flow on a Node
[Diagram: for each shard (Shard 0, Shard 1), Records arrive through Monix / RX ingestion with back pressure, update the Index and Write buffers, and are then encoded into Chunks.]

Columnar Compression
Row-based:    (timestamp, value) (timestamp, value) (timestamp, value) ...
Column-based: t1 t2 t3 t4 t5 t6 t7 t8
              v1 v2 v3 v4 v5 v6 v7 v8
• Compressing all timestamps together is much more efficient

Delta-Delta Encoding
• Encode increasing numbers (timestamps, counters) as deltas from slope
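
As a rough illustration of "deltas from slope", the sketch below fits a straight line through the first and last values and stores only each value's deviation from that line. The function names and the integer-slope choice are assumptions made for this example; it is not FiloDB's actual encoder.

```python
def delta_delta_encode(values):
    """Encode integers as (base, slope, residuals from the line base + slope * i)."""
    if not values:
        return 0, 0, []
    base = values[0]
    n = len(values)
    # Integer slope through the first and last points (assumption: integer data).
    slope = (values[-1] - base) // (n - 1) if n > 1 else 0
    residuals = [v - (base + slope * i) for i, v in enumerate(values)]
    return base, slope, residuals


def delta_delta_decode(base, slope, residuals):
    """Invert delta_delta_encode."""
    return [base + slope * i + r for i, r in enumerate(residuals)]


# Timestamps in ms, scraped roughly every 10 seconds with a little jitter.
timestamps = [1_000_000, 1_010_000, 1_020_003, 1_029_998, 1_040_000]
base, slope, residuals = delta_delta_encode(timestamps)
# residuals is [0, 0, 3, -2, 0]: small numbers that need only a few bits each.
restored = delta_delta_decode(base, slope, residuals)
```

When samples arrive at a nearly constant rate, the residuals stay close to zero, which is what makes packing them into narrow bit widths worthwhile.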

Results
• Millions of time series and billions of samples per node
• Up to 1 million samples/sec per node ingestion rate peak (measured during recovery)
• Up to 8x better than previous system (Storm/HBase)
• Storage density of ~3 bytes per metric sample
• About 10x better than previous system (HBase)
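
To put the storage density figure in perspective, here is a back-of-the-envelope calculation. The 16-byte uncompressed sample (a 64-bit timestamp plus a 64-bit double) is an assumption for comparison only; it is not the previous system's measured footprint.

```python
BYTES_PER_SAMPLE_COMPRESSED = 3   # "~3 bytes per metric sample" from the slide
BYTES_PER_SAMPLE_RAW = 8 + 8      # assumption: 64-bit timestamp + 64-bit double


def memory_gb(samples, bytes_per_sample):
    """Memory needed for the given number of samples, in decimal gigabytes."""
    return samples * bytes_per_sample / 1e9


samples = 1_000_000_000
compressed_gb = memory_gb(samples, BYTES_PER_SAMPLE_COMPRESSED)  # 3.0
raw_gb = memory_gb(samples, BYTES_PER_SAMPLE_RAW)                # 16.0
```

Under these assumptions, a billion samples fit in about 3 GB of memory, which is what makes holding billions of samples on a single node plausible.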

Tackling Heap Issues
• 60+ second GC pauses / OOM
• Filled up old gen, GC stuck finding tiny bit of free space
• Solution: move as many permanent objects offheap as possible
• Too high rate of allocation on ingest
• Temporary objects only, but producing too many
• Solution: Switch from Protobuf to custom, no-allocation BinaryRecord

Off-heap Data Structures
• BinaryVector - one compressed column of data (say timestamps, or values)
• BinaryRecord - one ingestion data record, variable schema
• OffheapLFSortedIDMap - offheap lightweight sorted map
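
Python has no direct equivalent of JVM off-heap memory, so the following is only an analogy for the idea behind these structures: records are packed at fixed offsets into one preallocated buffer instead of becoming individual objects. The `RecordBuffer` class and its fixed (timestamp, value) layout are hypothetical; the real BinaryRecord supports variable schemas.

```python
import struct

# Hypothetical fixed layout: little-endian int64 timestamp + float64 value.
RECORD = struct.Struct("<qd")


class RecordBuffer:
    """Fixed-capacity buffer of packed records; no per-record objects are retained."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity * RECORD.size)
        self.capacity = capacity
        self.count = 0

    def append(self, timestamp, value):
        if self.count == self.capacity:
            raise OverflowError("buffer full; a real system would flush or roll over")
        RECORD.pack_into(self.buf, self.count * RECORD.size, timestamp, value)
        self.count += 1

    def get(self, i):
        if not 0 <= i < self.count:
            raise IndexError(i)
        return RECORD.unpack_from(self.buf, i * RECORD.size)


buf = RecordBuffer(capacity=4)
buf.append(1_000_000, 0.5)
buf.append(1_010_000, 0.75)
```

Because the buffer is allocated once and reused, appending a record creates no long-lived objects for a garbage collector to trace, which is the property the slides are after.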

Moving Object Graphs
[Diagram: on-heap vs. off-heap object graph. On heap, a TSPartition references a WriteBufferObject, VectorObjects, and a ConcurrentSkipListMap of ID to ChunkSetInfo entries; off heap, a Block holds the Write buffer and Chunks.]

Moving Object Graphs
[Diagram: the same object graph with a new off-heap ChunkMap holding pointers (Ptr) to ChunkSetInfo, keyed by PartID. Other labels: TSPartition, WriteBufferObject, ConcurrentSkipListMap (ID to ChunkSetInfo), VectorObject, Block, Write buffer, Chunks.]

Flexible Distributed Queries

Prometheus Compatible
• Don’t reinvent a popular time series query language
• Prom HTTP API gives out of box Grafana support
sum(http_requests{partition="P2",dc="DC0",job="A0"}) by (host)
• Filtering/indexing on many time series
• Time windowing-based aggregation with multiple windows
• Group by

Queries to Logical Plan
sum(http_requests{partition="P2",dc="DC0",job="A0"}) by (host)
Aggregate(Sum,
  PeriodicSeries(
    RawSeries(
      IntervalSelector(t1, t2, step),
      List(ColumnFilter(partition,Equals(P2)),
           ColumnFilter(dc,Equals(DC0)),
           ColumnFilter(job,Equals(A0)),
           ColumnFilter(__name__,Equals(http_requests))
AST
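
One way to picture the logical plan is as a tree of immutable nodes. The sketch below hand-builds such a tree for the query on this slide; it does not parse PromQL. The class names mirror the slide, but the fields and the `plan_sum_by` helper are simplified assumptions, not FiloDB's actual definitions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ColumnFilter:
    column: str
    equals: str


@dataclass(frozen=True)
class RawSeries:
    start: int
    end: int
    filters: Tuple[ColumnFilter, ...]


@dataclass(frozen=True)
class PeriodicSeries:
    raw: RawSeries
    step: int


@dataclass(frozen=True)
class Aggregate:
    op: str
    series: PeriodicSeries
    by: Tuple[str, ...]


def plan_sum_by(metric, labels, by, start, end, step):
    """Build the plan for sum(metric{labels}) by (by); the metric becomes a __name__ filter."""
    filters = tuple(ColumnFilter(k, v) for k, v in labels.items())
    filters += (ColumnFilter("__name__", metric),)
    return Aggregate("sum", PeriodicSeries(RawSeries(start, end, filters), step), tuple(by))


plan = plan_sum_by(
    "http_requests",
    {"partition": "P2", "dc": "DC0", "job": "A0"},
    by=["host"],
    start=0, end=3600, step=60,  # placeholder time range
)
```

Note how the metric name ends up as just another column filter on `__name__`, as in the plan printed on the slide.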

Physical Plan Execution
• Location transparency of Akka actors is crucial here
[Diagram: on each shard (Shard 0, Shard 1), SelectRawPartitionsExec reads Chunks and applies PeriodicSamplesMapper and AggregateMapReduce; a ReduceAggregateExec (Sum) combines the per-shard results.]
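
The two-level aggregation can be sketched as a per-shard "map" step that produces partial sums and a final "reduce" step that merges them. The function names echo the plan nodes on the slide, but the code is a simplified illustration with made-up data, not FiloDB's execution engine.

```python
from collections import defaultdict


def aggregate_map_reduce(shard_series, by):
    """Per-shard step: partial sums keyed by the value of the group-by tag."""
    partial = defaultdict(float)
    for tags, value in shard_series:
        partial[tags[by]] += value
    return dict(partial)


def reduce_aggregate(partials):
    """Final step: merge the partial sums from all shards."""
    total = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Hypothetical samples for http_requests, already filtered and windowed.
shard0 = [({"host": "h1"}, 10.0), ({"host": "h2"}, 5.0)]
shard1 = [({"host": "h1"}, 7.0), ({"host": "h3"}, 1.0)]

result = reduce_aggregate(aggregate_map_reduce(s, "host") for s in (shard0, shard1))
# result == {"h1": 17.0, "h2": 5.0, "h3": 1.0}
```

Only the small partial results cross the network, and the reduce step does not care where it runs, so it can be placed on any node.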

Physical Plan Execution
• Look ma, plan change, no code changes!
[Diagram: the same plan with ReduceAggregateExec (Sum) placed in a separate Query Service, combining AggregateMapReduce output from the Chunks of Shard 0 and Shard 1.]

Actor Hierarchy
[Diagram: Node 1 and Node 2 each run a NodeCoordinatorActor with an IngestionActor and a QueryActor on top of a MemStore, plus an HTTP endpoint; the nodes are also reachable via CLI / Akka Remote.]

Comparisons
• Queries possible on FiloDB and not on old system:
• Tag-based querying (filter, group by, etc. based on flexible tags)
• Histograms and quantiles
• Group by and topK queries
• Flexible time series joins
• Hundreds of millions of samples queried per second

Datasets and Data Model

What Kind of Data Works?
• High cardinality of individual time series (operational metrics, devices, business metrics)
• Many data points in each series, append only
[Diagram: Series1 {k1=v1, k2=v2} plotted as data points along a Time axis]

Flexible Tags
• Each time series is defined by a metric name and a unique combination of keys-values
• Index on tags allows filter/search by combo of tags
memstore_partitions_queried {
dataset=timeseries,
host=MacBook-Pro-229.local,
shard=0
}
memstore_partitions_queried {
dataset=timeseries,
host=MacBook-Pro-229.local,
shard=1
}
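
A tag index can be pictured as an inverted index from each tag=value pair to the set of series that carry it; filtering by several tags is then a set intersection. The `TagIndex` class below is a toy stand-in written for this example, using the two series shown above. It does not reflect how FiloDB's index is actually implemented.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index: (tag, value) -> set of series IDs."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.series = []

    def add(self, metric, tags):
        """Register a series and return its ID; the metric name is indexed as __name__."""
        series_id = len(self.series)
        all_tags = dict(tags, __name__=metric)
        self.series.append(all_tags)
        for pair in all_tags.items():
            self.postings[pair].add(series_id)
        return series_id

    def find(self, **filters):
        """Return the IDs of series matching every tag=value filter."""
        sets = [self.postings.get(pair, set()) for pair in filters.items()]
        return set.intersection(*sets) if sets else set(range(len(self.series)))


index = TagIndex()
for shard in ("0", "1"):
    index.add(
        "memstore_partitions_queried",
        {"dataset": "timeseries", "host": "MacBook-Pro-229.local", "shard": shard},
    )

both = index.find(dataset="timeseries")               # {0, 1}
only_shard1 = index.find(dataset="timeseries", shard="1")  # {1}
```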

Flexible Schemas and Datasets
• Datasets allow for namespacing different schemas, ingestion sources, SLAs, # shards, and offheap memory isolation
• Main dataset with 2-day retention
• Pre-aggregates dataset with 1-week retention
• Histograms dataset with schema for efficient histogram storage
• OpenTracing dataset: start, end, span duration, etc.
• Historical data using different schema

The Hard Stuff: Recovery and Persistence

What is persisted?
• Raw time series data, in a custom format designed for efficient ingestion & recovery, is stored in and ingested from Apache Kafka
• Compressed, columnar time series data is written periodically to a ColumnStore, typically Cassandra
• Time series metadata for reconstructing each node’s index is persisted as well

Ingestion and Sharding
[Diagram: a Gateway writes to Kafka partitions Shard0 to Shard4; four FiloDB Nodes in an Akka Cluster own shards S0 to S4 and ingest from the matching Kafka partitions.]
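
The sharding idea can be sketched in two steps: hash each series key to a shard number, then assign shards to nodes. Both the CRC32-based hash and the round-robin assignment below are assumptions chosen for simplicity, and the node names are made up; the slides do not describe FiloDB's actual scheme.

```python
import zlib

NUM_SHARDS = 5  # the diagram shows Shard0 to Shard4


def shard_for(metric, tags, num_shards=NUM_SHARDS):
    """Map a series key to a shard with a stable hash (hypothetical scheme)."""
    key = metric + "," + ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return zlib.crc32(key.encode("utf-8")) % num_shards


def assign_shards(nodes, num_shards=NUM_SHARDS):
    """Round-robin assignment of shards to nodes."""
    return {shard: nodes[shard % len(nodes)] for shard in range(num_shards)}


assignment = assign_shards(["node-a", "node-b", "node-c", "node-d"])
# With 5 shards on 4 nodes, node-a owns shards 0 and 4.

shard = shard_for("http_requests", {"dc": "DC0", "job": "A0"})
```

Sorting the tags before hashing makes the shard independent of tag order, so a gateway and a query node agree on where a series lives.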

Recovery
[Diagram: one FiloDB Node fails; a new FiloDB Node takes over shard S2 from Kafka, and older chunks are loaded through on-demand paging from Cassandra (CassandraChunkSink); queries for S2 can be sent to a FiloDB Node in another DC.]

Recovery
• The most recent raw data - before encoding - is recovered by replaying Apache Kafka partitions
• Index metadata is recovered
• Compressed data is loaded on-demand from Cassandra. This works because most data written is never queried.
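
The replay step can be sketched as reading a shard's log from just after the last flushed position and rebuilding the in-memory write buffers from it. The in-memory list standing in for a Kafka partition and the `checkpoint_offset` parameter are assumptions for illustration; the slides do not specify how FiloDB tracks what has already been persisted.

```python
def recover_shard(log, checkpoint_offset):
    """Rebuild write buffers by replaying log records after the checkpoint."""
    write_buffers = {}
    for offset, (series, timestamp, value) in enumerate(log):
        if offset <= checkpoint_offset:
            continue  # assumed already encoded and flushed to the column store
        write_buffers.setdefault(series, []).append((timestamp, value))
    return write_buffers


# Hypothetical contents of one Kafka partition (one shard).
log = [
    ("cpu{host=a}", 1000, 0.1),
    ("cpu{host=b}", 1000, 0.4),
    ("cpu{host=a}", 2000, 0.2),
    ("cpu{host=b}", 2000, 0.5),
]

recovered = recover_shard(log, checkpoint_offset=1)
# Only the records at offsets 2 and 3 are replayed.
```

Everything at or before the checkpoint is left in the column store and paged in only if a query asks for it.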

FiloDB vs Alternatives

vs Prometheus
• FiloDB supports PromQL, HTTP query API
• Prometheus is single-node only
• FiloDB is multi-schema and multi-tenant
• FiloDB is designed to run as a resilient, distributed, high-uptime cloud service
• Open-source FiloDB is not as feature-rich (yet)

vs InfluxDB
• FiloDB data model is very close to Influx: multi-schema, multiple columns, namespaces, tags on series

                 FiloDB                     InfluxDB
Clustering       Peer to peer distributed   Single node (OSS), clustered ($$)
Query language   PromQL                     SQL (PromQL coming)
Maturity         New                        Established

vs Cassandra
• C*: Very well established and widely used, robust
• Like FiloDB: real time, distributed, low-latency
• C*: Very simple queries ideally to one partition
• FiloDB: complex PromQL queries, topK, groupBy, time series joins and windowing
• FiloDB: much higher storage density and ingestion throughput for time series

vs Druid
• Druid and FiloDB have different data models
• Druid is an OLAP database with an explicit time dimension. Dimensions are fixed.
• FiloDB supports millions/billions of time series with flexible tags
• FiloDB stores raw data, Druid stores roll ups

Tradeoffs and Lessons

Tradeoffs of using the JVM
• Pluses: solid, proven libraries for building distributed and data systems
• Apache Lucene
• Akka Cluster
• Minuses: Lack of low-level memory layout and control
• The devil you know best

JVM Production Tips
• Get to know different GCs, Eden, OldGen, G1GC, etc. really really well
• SJK (https://github.com/aragozin/jvm-tools)
• Runtime visibility
• Multiple APIs to access cluster state
• JMX beans
• Measure measure measure!! (Use JMH)

Current Status
• Development at github.com/filodb/FiloDB
• Time/value schema ingestion and querying is stable
• Looking for partners to work together, add integrations, etc.

Try it out today
• Ingest data using https://github.com/influxdata/telegraf
• Expose a Prometheus HTTP read endpoint in your apps
• Use Grafana to visualize metrics

Roadmap
• Speed and efficiency improvements in core FiloDB database
• Histogram optimizations
• Improved cluster state management
• Support for Spark/ML/AI jobs and metrics. How can we improve observability for data engineers?
• Support for non-metrics schemas
• Long term storage

Thank you!
• Note: We are hiring! If you love reactive systems, distributed systems, love to push the performance envelope... there’s a place for you.

Extra Slides

On Heap vs Off Heap
[Diagram: on-heap vs. off-heap placement of Records, Index, Write buffers, Chunks, ChunkMap, TSPartition, PartitionMap, and Lucene MMap index files.]
