Presented at the PyData London meetup on June 7.
Stephen Etheridge (@datalemming), solution architect at Basho, the makers of Riak TS, the enterprise-grade scalable time series database, demoed via Jupyter a live integration between Riak TS and PySpark.
3. BASHO SNAPSHOT
Distributed Systems Software for Big Data, IoT and Hybrid Cloud applications
2011 Creators of Riak Distributed Systems
• Riak KV: Resilient NoSQL database
• Riak S2: Large Object Storage
2015 New Products
• Basho Data Platform: Integrated NoSQL databases, caching, in-memory analytics, and search
• Riak TS: Only Enterprise NoSQL database optimized for Time Series data
100+ employees
Global Offices
• Seattle (HQ), Washington DC, London, Tokyo
Over 1/3 of the Fortune 50
4. MEETING THE NEEDS OF THE ENTERPRISE
PRIORITIZED NEEDS
High Availability - Critical Data
High Scale - Heavy Reads & Writes
Geo Locality - Multiple Data Centers
Operational Simplicity – Resources
Don’t Scale as Clusters
Data Accuracy – Write Conflict Options
TIME SERIES
USE CASES
IoT/Devices
Financial/Economic
Scientific Observations
RIAK KV USE CASES
User Data
Session Data
Profile Data
Real-time Data
Log Data
5. 20 TERABYTES OF DATA PER
DAY BILLIONS OF MOBILE
DEVICES
10 BILLION data transactions a
day – 150,000 a second – Apple
Forecasting 2.8 BILLION locations
around the world
Generates 4GB OF DATA every
second
We’re focusing on helping
people make better decisions
with the weather.
6. WHAT IS NEEDED FOR TIME SERIES?
Efficient way to store & retrieve
time series data
Query language that supports
range queries
High data volume
Enterprise scale solution
High availability
Basho Technologies | 7
7. What is Riak TS?
Riak TS is Riak KV (a complete Riak KV build is included in Riak TS) with the
following additional features optimized to handle time series use cases:
• Tables – Riak TS introduces tables built on top of the underlying K/V
structure
• SQL – Riak TS supports a subset of standard SQL to create and query
time series data.
• Data Locality – Keys co-located by quanta to enable querying data across
time bounded series.
8. Riak TS Quanta
The quantum function in Riak TS takes three parameters:
• The name of a field in the table definition of type timestamp;
• A numeric quantity;
• One of the units of time from the list below:
• Days – ‘d’
• Hours – ‘h’
• Minutes – ‘m’
• Seconds – ‘s’
Important: A query covering more than a certain number of quanta (5 by default) will
generate too many sub-queries and the query system will refuse to run it. Assuming a
default quantum of 15 minutes, the maximum query time range is 75 minutes.
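The arithmetic behind the 75-minute figure can be sketched in a few lines of Python. This is only an illustration of the rule stated above: the function names are hypothetical, the limit of 5 quanta is the default quoted on the slide, and counting both end points of the range as touched is an assumption about how the boundaries are handled.

```python
# Milliseconds per unit, keyed by the unit letters accepted by quantum().
UNIT_MS = {"d": 86400000, "h": 3600000, "m": 60000, "s": 1000}


def quantum_ms(quantity, unit):
    """Length of one quantum in milliseconds, e.g. (15, 'm') -> 900000."""
    return quantity * UNIT_MS[unit]


def max_query_range_ms(quantity, unit, max_quanta=5):
    """Longest quantum-aligned range that stays within the quanta limit."""
    return quantum_ms(quantity, unit) * max_quanta


def quanta_spanned(start_ms, end_ms, quantity, unit):
    """Number of quanta touched by the range [start_ms, end_ms] (assumed inclusive)."""
    size = quantum_ms(quantity, unit)
    return end_ms // size - start_ms // size + 1


# 15-minute quanta with the default limit of 5 give 75 minutes.
print(max_query_range_ms(15, "m") // UNIT_MS["m"])
```

Note that a range which is not aligned to a quantum boundary can touch one more quantum than its length alone suggests, so checking `quanta_spanned` before issuing a query is the safer test.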
9. Supported Aggregate Functions
Riak TS supports aggregate functions including:
• COUNT() - Returns the number of entries that match the specified criteria.
• SUM() - Returns the sum of entries that match the specified criteria.
• MEAN() & AVG() - Return the average of entries that match the specified criteria.
• MIN() - Returns the smallest value of entries that match the specified criteria.
• MAX() - Returns the largest value of entries that match the specified criteria.
• STDDEV() - Returns the population standard deviation of all entries that match the
specified criteria.
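To make the meaning of these aggregates concrete, here is a small Python sketch that computes the same quantities locally with the standard library. The temperature values are made up for illustration, and the code does not talk to Riak TS; the point is mainly that STDDEV() is the population standard deviation (`statistics.pstdev`), not the sample one (`statistics.stdev`).

```python
import statistics

# Made-up sample readings standing in for the rows matched by a query.
temperatures = [18.0, 20.0, 22.0, 24.0]

aggregates = {
    "COUNT": len(temperatures),
    "SUM": sum(temperatures),
    "AVG": statistics.mean(temperatures),
    "MIN": min(temperatures),
    "MAX": max(temperatures),
    # Population standard deviation: divides by N rather than N - 1.
    "STDDEV": statistics.pstdev(temperatures),
}

for name, value in aggregates.items():
    print(name, value)
```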
10. Supported Data Types
Riak TS tables support the following data types:
• Varchar - Any string content is valid, including Unicode. Can only be compared using
strict equality, and will not be typecast (e.g., to an integer) for comparison purposes.
Use single quotes to delimit varchar strings.
• Double - This type does not fully comply with the IEEE 754 specification: NaN (not a
number) and INF (infinity) cannot be used.
• Sint64 - Signed 64-bit integer
• Boolean - true or false (any case)
• Timestamps - Timestamps are integer values expressing UNIX epoch time in UTC in
milliseconds. Zero is not a valid timestamp.
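Since timestamps are plain integers, client code usually has to convert from a date/time type first. The following is a minimal Python sketch of that conversion; the helper name is hypothetical, and rejecting naive datetimes is a design choice made here to avoid silently assuming a local time zone.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Convert a timezone-aware datetime to UNIX epoch milliseconds (UTC)."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    # Integer division of timedeltas avoids floating-point rounding.
    millis = (dt - EPOCH) // timedelta(milliseconds=1)
    if millis == 0:
        raise ValueError("zero is not a valid Riak TS timestamp")
    return millis


print(to_epoch_millis(datetime(2016, 1, 19, 17, 30, 10, tzinfo=timezone.utc)))
```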
11. Developing on Riak TS
Riak TS currently supports the Protocol Buffers API and five client
libraries including Java, Ruby, Python, Erlang, and Node.js.
APIs
• Protocol Buffers
Basho Clients
• Java
• Ruby
• Python
• Erlang
• Node.js
• .NET C#
Community Clients
• Not yet!
12. Supported Operations
Riak TS clients currently support the following operations:
• Delete - Deletes a single row by its key values.
• Fetch/Get - Fetches a single row by its key values.
• Query - Allows you to query a Riak TS table with the given query string.
• Store/Put - Stores data in the Riak TS table.
• (Stream) ListKeys - Lists the primary keys of all the rows in a Riak TS
table.
13. The Riak Python Client
• Compatible with Python 2.7 and above
• Can be installed easily with pip
• Pre-requisites
– python-dev
– libffi-dev
– libssl-dev
• Riak TS results object can be turned into a Pandas dataframe easily, otherwise it is a
list of lists!
• Demo with Aarhus data
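As a rough illustration of the "list of lists" shape, the sketch below pairs column names with rows using only the standard library. The column names and values are made up, and how the names and rows are obtained from the client's result object is not shown here. With pandas installed, `pandas.DataFrame(rows, columns=columns)` would build the equivalent DataFrame.

```python
# Hypothetical column names and rows, shaped like a query result.
columns = ["time", "weather", "temperature"]
rows = [
    [1453224610000, "sunny", 21.5],
    [1453224620000, "cloudy", 20.0],
]


def rows_to_records(columns, rows):
    """Turn a list of lists into a list of dicts keyed by column name."""
    return [dict(zip(columns, row)) for row in rows]


def column_values(columns, rows, name):
    """Extract a single column from the list of lists."""
    index = columns.index(name)
    return [row[index] for row in rows]


records = rows_to_records(columns, rows)
print(records[0]["weather"])
print(column_values(columns, rows, "temperature"))
```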
14. Riak Spark Connector
• Enables you to connect Spark
applications to Riak TS with the
Spark RDD and Spark DataFrames
APIs
• Write applications in
– Scala (if you have to),
– Python (yay!),
– and Java (never!).
• Makes it easy to partition Riak data
so multiple Spark workers can
process the data in parallel,
• Has support for failover if a Riak
node goes down while your Spark
job is running.
• Comes as one JAR file that needs to
be pathed in!
– Riak TS 1.2+
– Apache Spark 1.6+
– Scala 2.10
– Java 8
16. Riak TS Tables
Riak TS tables are a new Riak KV bucket type (and there is a one-to-one
mapping of tables to bucket types). Tables are created using the riak-admin
command line or via one of the supported clients:
CREATE TABLE GeoCheckin (
  myfamily varchar not null,
  myseries varchar not null,
  time timestamp not null,
  weather varchar not null,
  temperature double,
  PRIMARY KEY (
    (myfamily, myseries, quantum(time, 15, 'm')),
    myfamily, myseries, time
  )
)
> riak-admin bucket-type create GeoCheckin '{"props": {"table_def": "…"}}'
17. Partition and Local Keys
Riak TS has two types of keys that help determine how to distribute data across
a cluster and within local partitions of data:
• Partition keys – The partition key determines where data is placed within
a cluster (by vnode)
  • Family – class or type of data (e.g. user, device type)
  • Series – identifies the specific instance of the class/type, such as
  username or device ID
  • Quanta – the time interval to group data by
• Local keys – Local keys determine where and how data is written within the
vnode (currently identical to the partition key)
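The grouping effect of the quantum can be sketched in Python as follows. This only illustrates the idea of bucketing timestamps by family, series and quantum; it is not Riak's actual hashing or placement logic. The function names are hypothetical, and the local key shown here follows the (family, series, time) form of the GeoCheckin PRIMARY KEY definition.

```python
UNIT_MS = {"d": 86400000, "h": 3600000, "m": 60000, "s": 1000}


def quantum_start(ts_ms, quantity=15, unit="m"):
    """Start of the quantum (in epoch ms) that contains the timestamp."""
    size = quantity * UNIT_MS[unit]
    return (ts_ms // size) * size


def partition_key(family, series, ts_ms, quantity=15, unit="m"):
    """Rows sharing this tuple fall into the same quantum and are co-located."""
    return (family, series, quantum_start(ts_ms, quantity, unit))


def local_key(family, series, ts_ms):
    """Identifies an individual row within its partition."""
    return (family, series, ts_ms)


# Two readings 880 seconds apart that share one 15-minute quantum.
print(partition_key("family1", "series1", 1453224610000))
print(partition_key("family1", "series1", 1453225490000))
```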
18. Querying Riak TS
select * from WeatherStationData where
time > 1453224610000 and time < 1453225490000 and
device = 'Weather Station 0001' and
deviceId = 'abc-xxx-001-001'
select
MIN(temperature), AVG(temperature), MAX(temperature)
from WeatherStationData where
time > 1453224610000 and time < 1453225490000 and
device = 'Weather Station 0001' and
deviceId = 'abc-xxx-001-001'
select
(temperature * 2), (pressure - 1)
from WeatherStationData where
time > 1453224610000 and time < 1453225490000 and
device = 'Weather Station 0001' and
deviceId = 'abc-xxx-001-001'
Riak TS currently supports a subset of the SQL language that
includes basic aggregate functions and arithmetic operations.
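Queries of this shape are easy to assemble in client code. Below is a minimal Python sketch that builds the first query above from its parts; the helper name is hypothetical, and because the quoting rules of the Riak TS SQL subset are not covered here, values containing a single quote are simply rejected rather than escaped. The resulting string would then be passed to the client's Query operation.

```python
def build_range_query(table, start_ms, end_ms, filters, select="*"):
    """Build a time-bounded select statement as a plain string.

    filters is a list of (field, value) pairs compared with strict equality.
    """
    clauses = ["time > %d" % start_ms, "time < %d" % end_ms]
    for field, value in filters:
        if "'" in value:
            # Escaping is deliberately not attempted in this sketch.
            raise ValueError("single quotes are not supported: %r" % value)
        clauses.append("%s = '%s'" % (field, value))
    return "select %s from %s where %s" % (select, table, " and ".join(clauses))


query = build_range_query(
    "WeatherStationData",
    1453224610000,
    1453225490000,
    [("device", "Weather Station 0001"), ("deviceId", "abc-xxx-001-001")],
)
print(query)
```

Passing `select="MIN(temperature), AVG(temperature), MAX(temperature)"` produces the aggregate form of the same query.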