Python, PySpark and Riak TS
Stephen Etheridge
Lead Solution Architect, EMEA
CONFIDENTIAL
Agenda
• Introduction to Riak TS
• The Riak Python client
• The Riak Spark connector and PySpark
Basho Technologies | 3
Distributed Systems Software for Big Data, IoT and Hybrid Cloud applications
2011 Creators of Riak Distributed Systems
• Riak KV: Resilient NoSQL database
• Riak S2: Large Object Storage
2015 New Products
• Basho Data Platform: Integrated NoSQL
databases, caching, in-memory analytics, and
search
• Riak TS: Only Enterprise NoSQL database
optimized for Time Series data
100+ employees
Global Offices
• Seattle (HQ), Washington DC, London, Tokyo
Over 1/3 of the Fortune 50
BASHO SNAPSHOT
MEETING THE NEEDS OF THE ENTERPRISE
PRIORITIZED NEEDS
High Availability - Critical Data
High Scale - Heavy Reads & Writes
Geo Locality - Multiple Data Centers
Operational Simplicity – Resources
Don’t Scale as Clusters
Data Accuracy – Write Conflict Options
TIME SERIES
USE CASES
IoT/Devices
Financial/Economic
Scientific Observations
RIAK KV USE CASES
User Data
Session Data
Profile Data
Real-time Data
Log Data
20 TERABYTES OF DATA PER
DAY BILLIONS OF MOBILE
DEVICES
• 10 BILLION data transactions a day – 150,000 a second – Apple
• Forecasting 2.8 BILLION locations around the world
• Generates 4GB OF DATA every second
We’re focusing on helping
people make better decisions
with the weather.
WHAT IS NEEDED FOR TIME SERIES?
• Efficient way to store & retrieve time series data
• Query language that supports range queries
• High data volume
• Enterprise scale solution
• High availability
What is Riak TS?
Riak TS is Riak KV (a complete Riak KV build is included in Riak TS) with the
following additional features optimized to handle time series use cases:
• Tables – Riak TS introduces tables built on top of the underlying K/V
structure
• SQL – Riak TS supports a subset of standard SQL to create and query
time series data.
• Data Locality – Keys co-located by quanta to enable querying data across
time bounded series.
Riak TS Quanta
The quantum function in Riak TS takes three parameters:
• The name of a field in the table definition of type timestamp;
• A numeric quantity;
• One of the units of time from the list below:
• Days – ‘d’
• Hours – ‘h’
• Minutes – ‘m’
• Seconds – ‘s’
Important: A query covering more than a certain number of quanta (5 by default) will
generate too many sub-queries and the query system will refuse to run it. Assuming a
default quantum of 15 minutes, the maximum query time range is 5 × 15 = 75 minutes.
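The 75-minute figure follows directly from the quantum size. A minimal Python sketch of that arithmetic, assuming quanta are aligned to multiples of the quantum size counted from the UNIX epoch (an assumption about the bucketing, not something the slides state). The helper names are hypothetical and not part of any Riak API:

```python
# Hypothetical helpers illustrating quantum arithmetic; not a Riak API.
UNIT_MS = {"d": 86400000, "h": 3600000, "m": 60000, "s": 1000}

def quantum_ms(quantity, unit):
    """Length of one quantum in milliseconds."""
    return quantity * UNIT_MS[unit]

def quantum_start(timestamp_ms, quantity, unit):
    """Start of the quantum containing timestamp_ms (assumes epoch-aligned quanta)."""
    size = quantum_ms(quantity, unit)
    return timestamp_ms - (timestamp_ms % size)

def quanta_spanned(start_ms, end_ms, quantity, unit):
    """Number of quanta touched by the inclusive range [start_ms, end_ms]."""
    size = quantum_ms(quantity, unit)
    return (end_ms // size) - (start_ms // size) + 1

MAX_QUANTA = 5  # default limit quoted on the slide

# Longest range that fits in MAX_QUANTA quanta of 15 minutes each.
max_range_minutes = MAX_QUANTA * quantum_ms(15, "m") // 60000
print(max_range_minutes)
```

Under the same assumption, a range that is shorter than 75 minutes can still touch more than five quanta if it is not aligned to a quantum boundary, so a check like `quanta_spanned` before issuing a query may be worthwhile.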
Supported Aggregate Functions
Riak TS supports aggregate functions including:
• COUNT() - Returns the number of entries that match the specified criteria.
• SUM() - Returns the sum of entries that match the specified criteria.
• MEAN() & AVG() - Return the average of entries that match the specified criteria.
• MIN() - Returns the smallest value of entries that match the specified criteria.
• MAX() - Returns the largest value of entries that match the specified criteria.
• STDDEV() - Returns the population standard deviation of all entries that match the
specified criteria.
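Because STDDEV() is the population form, it divides by N rather than N - 1. A small Python sketch of that formula, useful for cross-checking a query result on the client side; the function name and the sample temperatures are made up for illustration:

```python
import math

def population_stddev(values):
    """Population standard deviation: divide the squared deviations by N, not N - 1."""
    n = len(values)
    mean = sum(values) / float(n)
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance)

# Made-up readings: mean is 5.0 and the squared deviations sum to 32.0.
temperatures = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(population_stddev(temperatures))  # 2.0
```

A sample standard deviation over the same eight values would divide by 7 and give a slightly larger number, so the two should not be compared directly.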
Supported Data Types
Riak TS tables support the following data types:
• Varchar - Any string content is valid, including Unicode. Can only be compared using
strict equality, and will not be typecast (e.g., to an integer) for comparison purposes.
Use single quotes to delimit varchar strings.
• Double - This type does not fully comply with the IEEE 754 specification: NaN (not a
number) and INF (infinity) cannot be used.
• Sint64 - Signed 64-bit integer
• Boolean - true or false (any case)
• Timestamps - Timestamps are integer values expressing UNIX epoch time in UTC in
milliseconds. Zero is not a valid timestamp.
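Since timestamps are plain integers in milliseconds, converting to and from Python datetimes is a common first step. A minimal sketch using only the standard library; the function names are hypothetical, and the datetimes are assumed to be naive values that already hold UTC:

```python
import calendar
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def to_epoch_millis(dt_utc):
    """Convert a naive datetime holding UTC to integer epoch milliseconds."""
    millis = calendar.timegm(dt_utc.timetuple()) * 1000 + dt_utc.microsecond // 1000
    if millis == 0:
        # The slide notes that zero is not a valid Riak TS timestamp.
        raise ValueError("zero is not a valid timestamp")
    return millis

def from_epoch_millis(millis):
    """Convert integer epoch milliseconds back to a naive UTC datetime."""
    return EPOCH + timedelta(milliseconds=millis)

print(to_epoch_millis(datetime(2016, 1, 19, 17, 30, 10)))  # 1453224610000
```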
Developing on Riak TS
Riak TS currently supports the Protocol Buffers API and five client
libraries including Java, Ruby, Python, Erlang, and Node.js.
APIs
• Protocol Buffers
Basho Clients
• Java
• Ruby
• Python
• Erlang
• Node.js
• .NET (C#)
Community Clients
• Not yet!
Supported Operations
Riak TS clients currently support the following operations:
• Delete - Deletes a single row by its key values.
• Fetch/Get - Fetches a single row by its key values.
• Query - Allows you to query a Riak TS table with the given query string.
• Store/Put - Stores data in the Riak TS table.
• (Stream) ListKeys - Lists the primary keys of all the rows in a Riak TS
table.
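A rough sketch of how these operations might look from the Python client. It assumes the Basho `riak` package, a node reachable over Protocol Buffers, and an existing GeoCheckin table with the columns (myfamily, myseries, time, weather, temperature); the method names (`table`, `new`, `store`, `ts_query`, `ts_get`, `ts_delete`) are as I recall them from the Basho Python client and should be checked against the installed version. The row values are made up:

```python
from datetime import datetime

TABLE = "GeoCheckin"

def demo_operations(client):
    """Store one row, query it, fetch it by key, then delete it.

    `client` is assumed to be a riak.RiakClient, created for example with
    RiakClient(host="127.0.0.1", pb_port=8087) (host and port are assumptions).
    """
    t0 = datetime(2016, 1, 19, 17, 30, 10)
    rows = [["family1", "series1", t0, "cloudy", 12.5]]  # made-up reading

    # Store/Put: wrap the rows in a time series object and write it.
    client.table(TABLE).new(rows).store()

    # Query: time bounds are epoch milliseconds and stay within one quantum.
    query = ("select * from GeoCheckin where "
             "time > 1453224600000 and time < 1453225500000 and "
             "myfamily = 'family1' and myseries = 'series1'")
    result = client.ts_query(TABLE, query)

    # Fetch/Get and Delete address a single row by its key values.
    key = ["family1", "series1", t0]
    fetched = client.ts_get(TABLE, key)
    client.ts_delete(TABLE, key)

    return result.rows, fetched.rows
```

The function takes the client as a parameter so it can be exercised against a stub without a running cluster.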
The Riak Python Client
• Compatible with Python 2.7 and above
• Can be installed easily with pip
• Pre-requisites
– python-dev
– libffi-dev
– libssl-dev
• Riak TS results object can be turned into a Pandas dataframe easily, otherwise it is a
list of lists!
• Demo with Aarhus data
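To illustrate the "list of lists" point: a sketch of turning result rows into named records and, if pandas happens to be installed, into a DataFrame. The column names follow the GeoCheckin table used elsewhere in this deck, the rows are made up, and `rows_to_records` is a hypothetical helper:

```python
def rows_to_records(columns, rows):
    """Pair each row (a plain list) with the column names, giving a list of dicts."""
    return [dict(zip(columns, row)) for row in rows]

columns = ["myfamily", "myseries", "time", "weather", "temperature"]
rows = [
    ["family1", "series1", 1453224610000, "cloudy", 12.5],
    ["family1", "series1", 1453224670000, "rain", 11.9],
]

records = rows_to_records(columns, rows)
print(records[0]["weather"])  # cloudy

# pandas accepts the list of lists directly when given the column names.
try:
    import pandas as pd
    frame = pd.DataFrame(rows, columns=columns)
except ImportError:
    frame = None  # pandas is optional for this sketch
```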
Riak Spark Connector
• Enables you to connect Spark
applications to Riak TS with the
Spark RDD and Spark DataFrames
APIs
• Write applications in
– Scala (if you have to),
– Python (yay!),
– and Java (never!).
• Makes it easy to partition Riak data
so multiple Spark workers can
process the data in parallel,
• Has support for failover if a Riak
node goes down while your Spark
job is running.
• Comes as one JAR file that needs to
be pathed in!
– Riak TS 1.2+
– Apache Spark 1.6+
– Scala 2.10
– Java 8
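A hedged PySpark sketch of reading a Riak TS table into a Spark DataFrame. The data source name `org.apache.spark.sql.riak` and the option `spark.riak.connection.host` are as I recall them from the connector's documentation and should be verified against the connector version in use; the function name, host, and filter clause are assumptions for illustration:

```python
def load_riak_ts_table(sql_context, table_name, riak_host, where):
    """Load rows from a Riak TS table as a Spark DataFrame.

    sql_context: a pyspark.sql.SQLContext (Spark 1.6) with the connector JAR available
    riak_host:   "host:port" of a Riak Protocol Buffers endpoint (assumed value)
    where:       filter expression; it should bound the time column and fix the
                 family and series fields so it can be pushed down to Riak TS
    """
    return (sql_context.read
            .format("org.apache.spark.sql.riak")
            .option("spark.riak.connection.host", riak_host)
            .load(table_name)
            .filter(where))
```

The connector JAR still has to be supplied when the job is launched, for example through the `--jars` option of `spark-submit` or `pyspark`.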
Riak TS Tables
Riak TS tables are a new Riak KV Bucket Type (and there is a one to one
mapping of tables to bucket types). Tables are created using the riak-admin
command line or via one of the supported clients:
CREATE TABLE GeoCheckin (
  myfamily varchar not null,
  myseries varchar not null,
  time timestamp not null,
  weather varchar not null,
  temperature double,
  PRIMARY KEY (
    (myfamily, myseries, quantum(time, 15, 'm')),
    myfamily, myseries, time
  )
)

> riak-admin bucket-type create GeoCheckin '{"props": {"table_def": "…"}}'
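Hand-writing the JSON argument is error-prone because the table definition itself contains quotes. One way to sidestep that, sketched here with the standard library only, is to let `json.dumps` build the properties string and pass the command as an argument list. The helper names are hypothetical and nothing is executed:

```python
import json

DDL = (
    "CREATE TABLE GeoCheckin ("
    " myfamily varchar not null,"
    " myseries varchar not null,"
    " time timestamp not null,"
    " weather varchar not null,"
    " temperature double,"
    " PRIMARY KEY ((myfamily, myseries, quantum(time, 15, 'm')),"
    " myfamily, myseries, time))"
)

def bucket_type_props(table_def):
    """JSON document expected by `riak-admin bucket-type create`."""
    return json.dumps({"props": {"table_def": table_def}})

def create_command(table_name, table_def):
    """Argument list for riak-admin; as a list, it needs no shell quoting."""
    return ["riak-admin", "bucket-type", "create", table_name,
            bucket_type_props(table_def)]

argv = create_command("GeoCheckin", DDL)
print(argv[-1])
```

The list could then be handed to `subprocess.check_call(argv)` on a machine where `riak-admin` is on the path.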
Partition and Local Keys
Riak TS has two types of keys that help determine how to distribute data across
a cluster and within local partitions of data:
• Partition keys – The partition key determines where data is placed within
a cluster (by vnode)
• Family – class or type of data (i.e. user, device type, etc.)
• Series – identifies the specific instances of the class/type, such as
username or device ID
• Quanta – the time interval to group data by
• Local keys – Local keys determine where and how data is written within the
vnode (currently identical to the partition key)
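To make the family/series/quanta grouping concrete, here is a small Python sketch that computes the partition key for a few made-up rows and groups them. It assumes a 15-minute quantum aligned to the UNIX epoch; the function names and sample values are hypothetical:

```python
from collections import defaultdict

QUANTUM_MS = 15 * 60 * 1000  # 15-minute quantum, as in the GeoCheckin table

def partition_key(family, series, timestamp_ms, quantum_ms=QUANTUM_MS):
    """(family, series, quantum start): rows sharing this tuple land together."""
    return (family, series, timestamp_ms - timestamp_ms % quantum_ms)

def local_key(family, series, timestamp_ms):
    """(family, series, exact timestamp): identifies the row inside its partition."""
    return (family, series, timestamp_ms)

rows = [
    ("family1", "series1", 1453224610000),
    ("family1", "series1", 1453225490000),  # same quantum as the row above
    ("family1", "series1", 1453225500000),  # first millisecond of the next quantum
    ("family1", "series2", 1453224610000),  # different series
]

groups = defaultdict(list)
for family, series, ts in rows:
    groups[partition_key(family, series, ts)].append(local_key(family, series, ts))

print(len(groups))  # 3
```

The first two rows share a partition key, which is why a range query over that window can be answered from one place.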
Querying Riak TS
select * from WeatherStationData where
time > 1453224610000 and time < 1453225490000 and
device = 'Weather Station 0001' and
deviceId = 'abc-xxx-001-001'

select
MIN(temperature), AVG(temperature), MAX(temperature)
from WeatherStationData where
time > 1453224610000 and time < 1453225490000 and
device = 'Weather Station 0001' and
deviceId = 'abc-xxx-001-001'

select
(temperature * 2), (pressure - 1)
from WeatherStationData where
time > 1453224610000 and time < 1453225490000 and
device = 'Weather Station 0001' and
deviceId = 'abc-xxx-001-001'
Riak TS currently supports a subset of the SQL language that
includes basic aggregate functions and arithmetic operations.
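Queries of this shape are easy to assemble from Python. A minimal sketch of a query-string builder follows; the function names are hypothetical, and doubling embedded single quotes is standard SQL escaping that is assumed, not confirmed, to apply to Riak TS:

```python
def sql_quote(value):
    """Wrap a varchar literal in single quotes, doubling any embedded ones."""
    return "'" + value.replace("'", "''") + "'"

def build_select(select_list, table, time_field, start_ms, end_ms, equals):
    """Build a range query; `equals` is a list of (field, varchar value) pairs."""
    conditions = [
        "{0} > {1}".format(time_field, start_ms),
        "{0} < {1}".format(time_field, end_ms),
    ]
    conditions += ["{0} = {1}".format(name, sql_quote(value))
                   for name, value in equals]
    return "select {0} from {1} where {2}".format(
        ", ".join(select_list), table, " and ".join(conditions))

query = build_select(
    ["MIN(temperature)", "AVG(temperature)", "MAX(temperature)"],
    "WeatherStationData", "time", 1453224610000, 1453225490000,
    [("device", "Weather Station 0001"), ("deviceId", "abc-xxx-001-001")],
)
print(query)
```

The resulting string matches the second query above and could be passed to a client's query call unchanged.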
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 

PyData London meetup - Riak TS, PySpark and Python by Stephen Etheridge

  • 1. Python, PySpark and Riak TS Stephen Etheridge Lead Solution Architect, EMEA
  • 2. CONFIDENTIAL Agenda • Introduction to Riak TS • The Riak Python client • The Riak Spark connector and PySpark Basho Technologies | 3
  • 3. BASHO SNAPSHOT
    Distributed Systems Software for Big Data, IoT and Hybrid Cloud applications
      • 2011: Creators of Riak Distributed Systems
          – Riak KV: Resilient NoSQL database
          – Riak S2: Large Object Storage
      • 2015: New Products
          – Basho Data Platform: Integrated NoSQL databases, caching, in-memory analytics, and search
          – Riak TS: Only Enterprise NoSQL database optimized for Time Series data
      • 100+ employees
      • Global Offices: Seattle (HQ), Washington DC, London, Tokyo
      • Over 1/3 of the Fortune 50
  • 4. MEETING THE NEEDS OF THE ENTERPRISE
    PRIORITIZED NEEDS
      • High Availability – Critical Data
      • High Scale – Heavy Reads & Writes
      • Geo Locality – Multiple Data Centers
      • Operational Simplicity – Resources Don’t Scale as Clusters
      • Data Accuracy – Write Conflict Options
    TIME SERIES USE CASES
      • IoT/Devices
      • Financial/Economic
      • Scientific Observations
    RIAK KV USE CASES
      • User Data
      • Session Data
      • Profile Data
      • Real-time Data
      • Log Data
  • 5. 20 TERABYTES OF DATA PER DAY, BILLIONS OF MOBILE DEVICES
      • 10 BILLION data transactions a day – 150,000 a second – Apple
      • Forecasting 2.8 BILLION locations around the world
      • Generates 4GB OF DATA every second
    We’re focusing on helping people make better decisions with the weather.
  • 6. WHAT IS NEEDED FOR TIME SERIES?
      • Efficient way to store & retrieve time series data
      • Query language that supports range queries
      • High data volume
      • Enterprise scale solution
      • High availability
  • 7. What is Riak TS?
    Riak TS is Riak KV (a complete Riak KV build is included in Riak TS) with the following additional features optimized to handle time series use cases:
      • Tables – Riak TS introduces tables built on top of the underlying K/V structure.
      • SQL – Riak TS supports a subset of standard SQL to create and query time series data.
      • Data Locality – Keys co-located by quanta to enable querying data across time bounded series.
  • 8. Riak TS Quanta
    The quantum function in Riak TS takes three parameters:
      • The name of a field in the table definition of type timestamp;
      • A numeric quantity;
      • One of the units of time from the list below:
          – Days – 'd'
          – Hours – 'h'
          – Minutes – 'm'
          – Seconds – 's'
    Important: A query covering more than a certain number of quanta (5 by default) will generate too many sub-queries and the query system will refuse to run it. Assuming a quantum of 15 minutes and the default limit of 5 quanta, the maximum query time range is 5 × 15 = 75 minutes.
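The quanta arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the rule used here for counting quanta (every quantum touched by the range counts, even if only partly covered) is an assumption rather than something the slide states.

```python
# Units accepted by the quantum function, expressed in milliseconds.
UNIT_MS = {"d": 86400000, "h": 3600000, "m": 60000, "s": 1000}

MAX_QUANTA = 5  # default limit quoted on the slide


def quantum_ms(quantity, unit):
    """Length of one quantum in milliseconds."""
    return quantity * UNIT_MS[unit]


def quanta_spanned(start_ms, end_ms, quantity, unit):
    """Number of quanta touched by the range start_ms..end_ms.

    Assumption: a partly covered quantum counts as a whole one.
    """
    size = quantum_ms(quantity, unit)
    return end_ms // size - start_ms // size + 1


def max_range_minutes(quantity, unit, max_quanta=MAX_QUANTA):
    """Longest quantum-aligned range, in minutes, within the limit."""
    return max_quanta * quantum_ms(quantity, unit) // 60000


print(max_range_minutes(15, "m"))
print(quanta_spanned(1453224610000, 1453225490000, 15, "m"))
```

Under that counting rule, a 75 minute range that does not start on a quantum boundary would touch six quanta, so in practice it may be safer to keep ranges a little shorter than the theoretical maximum.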
  • 9. Supported Aggregate Functions
    Riak TS supports aggregate functions including:
      • COUNT() – Returns the number of entries that match the specified criteria.
      • SUM() – Returns the sum of entries that match the specified criteria.
      • MEAN() & AVG() – Return the average of entries that match the specified criteria.
      • MIN() – Returns the smallest value of entries that match the specified criteria.
      • MAX() – Returns the largest value of entries that match the specified criteria.
      • STDDEV() – Returns the statistical standard deviation of all entries that match the specified criteria, using Population Standard Deviation.
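To make the "Population Standard Deviation" remark concrete, here is a small sketch that computes the same aggregates client-side with the Python standard library. The temperature values are made up, and this shows only what the functions mean, not how Riak TS evaluates them.

```python
import statistics

# Hypothetical temperature readings matched by a query.
temperatures = [18.0, 20.0, 22.0, 24.0]

aggregates = {
    "COUNT": len(temperatures),
    "SUM": sum(temperatures),
    "AVG": statistics.mean(temperatures),
    "MIN": min(temperatures),
    "MAX": max(temperatures),
    # Population standard deviation divides by N ...
    "STDDEV": statistics.pstdev(temperatures),
}

# ... whereas the sample standard deviation divides by N - 1,
# so it is slightly larger for the same data.
sample_stddev = statistics.stdev(temperatures)

for name, value in aggregates.items():
    print(name, value)
print("sample stddev", sample_stddev)
```

The distinction matters when comparing a STDDEV() result with one computed in another tool, because some libraries default to the sample form instead.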
  • 10. Supported Data Types
    Riak TS tables support the following data types:
      • Varchar – Any string content is valid, including Unicode. Can only be compared using strict equality, and will not be typecast (e.g., to an integer) for comparison purposes. Use single quotes to delimit varchar strings.
      • Double – This type does not comply with its IEEE specification: NaN (not a number) and INF (infinity) cannot be used.
      • Sint64 – Signed 64 bit integer.
      • Boolean – true or false (any case).
      • Timestamps – Timestamps are integer values expressing UNIX epoch time in UTC in milliseconds. Zero is not a valid timestamp.
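The timestamp and double rules above translate into simple client-side checks. The sketch below uses only the standard library; the helper names `to_epoch_ms` and `is_valid_double` are illustrative and not part of any Riak client.

```python
import math
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to UNIX epoch milliseconds (UTC)."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: attach a timezone first")
    # Integer division of timedeltas avoids float rounding errors.
    ms = (dt - EPOCH) // timedelta(milliseconds=1)
    if ms == 0:
        raise ValueError("zero is not a valid Riak TS timestamp")
    return ms


def is_valid_double(value):
    """NaN and infinity cannot be stored in a double column."""
    return math.isfinite(value)


print(to_epoch_ms(datetime(2016, 1, 19, 17, 30, 10, tzinfo=timezone.utc)))
print(is_valid_double(float("nan")))
```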
  • 11. Developing on Riak TS
    Riak TS currently supports the Protocol Buffers API and five client libraries including Java, Ruby, Python, Erlang, and Node.js.
      • APIs: Protocol Buffers
      • Basho Clients: Java, Ruby, Python, Erlang, Node.js, .NET C#
      • Community Clients: Not yet!
  • 12. Supported Operations
    Riak TS clients currently support the following operations:
      • Delete – Deletes a single row by its key values.
      • Fetch/Get – Fetches a single row by its key values.
      • Query – Allows you to query a Riak TS table with the given query string.
      • Store/Put – Stores data in the Riak TS table.
      • (Stream) ListKeys – Lists the primary keys of all the rows in a Riak TS table.
  • 13. The Riak Python Client
      • Compatible with Python 2.7 and above
      • Can be installed easily with pip
      • Pre-requisites
          – python-dev
          – libffi-dev
          – libssl-dev
      • Riak TS results object can be turned into a Pandas dataframe easily, otherwise it is a list of lists!
      • Demo with Aarhus data
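Since query results arrive as a list of lists, a little reshaping makes them easier to work with. The sketch below uses only the standard library and made-up rows; the column names follow the GeoCheckin table used later in the deck, and how you obtain the rows and column names from the client's result object is left as an assumption.

```python
# Hypothetical result rows, shaped as the list of lists the slide describes.
column_names = ["myfamily", "myseries", "time", "weather", "temperature"]
rows = [
    ["family1", "series1", 1453224610000, "hot", 23.5],
    ["family1", "series1", 1453224620000, "windy", 19.8],
]


def rows_to_records(names, rows):
    """Pair each value with its column name, one dict per row."""
    return [dict(zip(names, row)) for row in rows]


records = rows_to_records(column_names, rows)
temperatures = [record["temperature"] for record in records]

print(records[0]["weather"])
print(max(temperatures))
```

With pandas installed, the same two inputs can go straight into `pandas.DataFrame(rows, columns=column_names)` to get the dataframe mentioned on the slide.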
  • 14. Riak Spark Connector
      • Enables you to connect Spark applications to Riak TS with the Spark RDD and Spark DataFrames APIs
      • Write applications in
          – Scala (if you have to),
          – Python (yay!),
          – and Java (never!).
      • Makes it easy to partition Riak data so multiple Spark workers can process the data in parallel
      • Has support for failover if a Riak node goes down while your Spark job is running
      • Comes as one JAR file that needs to be added to the classpath!
          – Riak TS 1.2+
          – Apache Spark 1.6+
          – Scala 2.10
          – Java 8
  • 15.
  • 16. Riak TS Tables
    Riak TS tables are a new Riak KV Bucket Type (and there is a one to one mapping of tables to bucket types). Tables are created using the riak-admin command line or via one of the supported clients:

      CREATE TABLE GeoCheckin (
        myfamily varchar not null,
        myseries varchar not null,
        time timestamp not null,
        weather varchar not null,
        temperature double,
        PRIMARY KEY (
          (myfamily, myseries, quantum(time, 15, 'm')),
          myfamily, myseries, time
        )
      )

      > riak-admin bucket-type create GeoCheckin '{"props": {"table_def": "…"}}'
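The quoting in the riak-admin call is easy to get wrong by hand, because the table definition sits inside JSON inside a shell argument and itself contains single quotes. Below is a sketch that builds the command string with the standard library. It only assembles the text in the `{"props": {"table_def": ...}}` shape from the slide and does not run anything; the helper name is made up.

```python
import json
import shlex

DDL = (
    "CREATE TABLE GeoCheckin ("
    " myfamily varchar not null,"
    " myseries varchar not null,"
    " time timestamp not null,"
    " weather varchar not null,"
    " temperature double,"
    " PRIMARY KEY ((myfamily, myseries, quantum(time, 15, 'm')),"
    " myfamily, myseries, time))"
)


def bucket_type_create_command(table, ddl):
    """Build the riak-admin command line with JSON and shell quoting applied."""
    props = json.dumps({"props": {"table_def": ddl}})
    # shlex.quote protects the JSON, including the single quotes around 'm'.
    return "riak-admin bucket-type create {} {}".format(table, shlex.quote(props))


command = bucket_type_create_command("GeoCheckin", DDL)
print(command)
```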
  • 17. Partition and Local Keys
    Riak TS has two types of keys that help determine how to distribute data across a cluster and within local partitions of data:
      • Partition keys – The partition key determines where data is placed within a cluster (by vnode)
          – Family – class or type of data (e.g. user, device type, etc.)
          – Series – identifies the specific instances of the class/type, such as username or device ID
          – Quanta – the time interval to group data by
      • Local keys – Local keys determine where and how data is written within the vnode (currently identical to the partition key)
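One way to picture the family/series/quanta grouping is to compute a partition key for a few rows and see which ones end up together. This is an illustration only: Riak TS hashes the partition key to choose a vnode, which the sketch does not attempt. The function names and sample rows are made up, and the key fields follow the GeoCheckin definition from the previous slide.

```python
from collections import defaultdict

UNIT_MS = {"d": 86400000, "h": 3600000, "m": 60000, "s": 1000}


def partition_key(row, quantity=15, unit="m"):
    """(family, series, start of the quantum containing the row's time)."""
    size = quantity * UNIT_MS[unit]
    quantum_start = row["time"] - row["time"] % size
    return (row["myfamily"], row["myseries"], quantum_start)


def local_key(row):
    """(family, series, exact time), as listed in the PRIMARY KEY clause."""
    return (row["myfamily"], row["myseries"], row["time"])


rows = [
    {"myfamily": "f1", "myseries": "s1", "time": 1453224610000},
    {"myfamily": "f1", "myseries": "s1", "time": 1453225490000},
    {"myfamily": "f1", "myseries": "s1", "time": 1453225500000},
]

groups = defaultdict(list)
for row in rows:
    groups[partition_key(row)].append(local_key(row))

for key, members in sorted(groups.items()):
    print(key, len(members))
```

The first two rows fall in the same 15 minute quantum and share a partition key, while the third starts the next quantum and lands in a different group.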
  • 18. Querying Riak TS
    Riak TS currently supports a subset of the SQL language that includes basic aggregate and mathematical functions.

      select * from WeatherStationData
      where time > 1453224610000 and time < 1453225490000
        and device = 'Weather Station 0001' and deviceId = 'abc-xxx-001-001'

      select MIN(temperature), AVG(temperature), MAX(temperature) from WeatherStationData
      where time > 1453224610000 and time < 1453225490000
        and device = 'Weather Station 0001' and deviceId = 'abc-xxx-001-001'

      select (temperature * 2), (pressure - 1) from WeatherStationData
      where time > 1453224610000 and time < 1453225490000
        and device = 'Weather Station 0001' and deviceId = 'abc-xxx-001-001'
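Queries like these are plain strings, so a small helper can assemble them from a time range and a set of equality filters. The sketch below is a hypothetical helper, not part of the Riak Python client. It rejects values containing single quotes rather than guessing at an escaping rule.

```python
def range_query(table, start_ms, end_ms, filters, columns="*"):
    """Build a time-bounded select statement.

    filters is a list of (field, value) pairs compared with strict equality.
    """
    if start_ms >= end_ms:
        raise ValueError("start of range must be before its end")
    clauses = ["time > {}".format(start_ms), "time < {}".format(end_ms)]
    for field, value in filters:
        if "'" in value:
            raise ValueError("single quotes in values are not handled here")
        # Varchar values are delimited with single quotes.
        clauses.append("{} = '{}'".format(field, value))
    return "select {} from {} where {}".format(
        columns, table, " and ".join(clauses)
    )


query = range_query(
    "WeatherStationData",
    1453224610000,
    1453225490000,
    [("device", "Weather Station 0001"), ("deviceId", "abc-xxx-001-001")],
)
print(query)
```

Passing `columns="MIN(temperature), AVG(temperature), MAX(temperature)"` produces the aggregate form of the same query.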