3.
LETGO DATA PLATFORM IN NUMBERS
• 500 GB of data daily
• 530+ event types
• 565M events processed daily
• 180 TB storage (S3)
• 8K events processed per second
• < 1 sec NRT processing time
8.
MOVING TO EVENTS
“A Domain Event captures the memory of something interesting (external stimulus) which affects the domain”
Martin Fowler
Domain Events
Tracking Events
41.
STREAMING: REAL-TIME PATTERN DETECTION
Is it still available?
Is the price negotiable?
What condition is it in?
I offer you….$
Could we meet at……?
42.
STREAMING: REAL-TIME PATTERN DETECTION
Event A + Event B = Complex Event C
Event A + NOTHING in 2 hours = Complex Event D
Some common use cases:
• Fraud detection, scammer detection
• Real-time recommendations
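The two patterns above (B following A within a window, and nothing following A within a timeout) can be sketched in plain Python. This is an illustrative toy, not letgo's actual streaming implementation; the function name and event encoding are invented for the example.

```python
from datetime import datetime, timedelta

def detect_complex_events(events, timeout=timedelta(hours=2)):
    """Toy complex-event detector over a list of (event_type, timestamp) pairs.

    Emits 'C' when some event 'B' follows an 'A' within `timeout`,
    and 'D' when nothing at all follows an 'A' within `timeout`.
    """
    complex_events = []
    events = sorted(events, key=lambda e: e[1])
    for i, (etype, ts) in enumerate(events):
        if etype != "A":
            continue
        # Events observed in the window (ts, ts + timeout]
        followers = [e for e in events[i + 1:] if e[1] <= ts + timeout]
        if any(t == "B" for t, _ in followers):
            complex_events.append("C")  # Event A + Event B = Complex Event C
        elif not followers:
            complex_events.append("D")  # Event A + NOTHING in window = Complex Event D
    return complex_events
```

A real streaming engine evaluates the same patterns incrementally with timers instead of scanning a sorted list.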
43.
GO TO EVENTS: Tips
• INGEST: data ingestion, storage
• PROCESSING: stream, batch
• DISCOVER: query, data exploitation
• Orchestration
49.
BATCH: Geodata enrichment
How we do it:
• We load Cities, States, Zip codes, DMAs… polygons from Well-Known Text (WKT)
• We create indexes using the JTS Topology Suite
• Custom Spark SQL UDF
Tool:
SELECT
  geodata.dma_name,
  geodata.dma_number AS dma_number,
  geodata.city AS city,
  geodata.state AS state,
  geodata.zip_code AS zip_code
FROM (
  SELECT geodata(longitude, latitude) AS geodata
  FROM ...
) AS enriched
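In production the enrichment runs through JTS inside a Spark SQL UDF; as a language-agnostic illustration of the underlying point-in-polygon test, here is a minimal pure-Python ray-casting sketch. The function names are hypothetical, not letgo's UDF, and JTS adds spatial indexing and robustness this toy lacks.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting: a point is inside if a ray from it crosses an odd
    number of polygon edges. `polygon` is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

def enrich(lon, lat, named_polygons):
    """Return the name of the first region containing the point,
    mimicking what a geodata(longitude, latitude) UDF would do per row."""
    for name, poly in named_polygons.items():
        if point_in_polygon(lon, lat, poly):
            return name
    return None
```

With thousands of polygons (cities, zip codes, DMAs), a spatial index over bounding boxes is what makes the per-row lookup cheap; that is the role the JTS indexes play here.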
59.
QUERYING DATA
CREATE EXTERNAL TABLE IF NOT EXISTS
database_name.table_name (
  some_column STRING,
  ...,
  dt DATE
)
USING PARQUET
PARTITIONED BY (`dt`)
LOCATION 's3a://bucket-name/database_name/table_name'
CREATE TABLE IF NOT EXISTS
database_name.table_name(
some_column STRING,
...
dt DATE
)
USING json
PARTITIONED BY (`dt`)
CREATE TABLE IF NOT EXISTS database_name.table_name
USING com.databricks.spark.redshift
OPTIONS (
  dbtable 'schema.redshift_table_name',
  tempdir 's3a://redshift-temp/',
  url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo?user=xxx&password=xxx',
  forward_spark_s3_credentials 'true'
)
CREATE TEMPORARY VIEW table_name
USING org.apache.spark.sql.cassandra
OPTIONS (
table "table_name",
keyspace "keyspace_name"
)
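The Parquet DDL above relies on Hive-style partition layout under the table's LOCATION: each value of the partition column maps to a key prefix. A small sketch of that convention (the helper name and paths are illustrative):

```python
def partition_path(location, **partitions):
    """Build the Hive-style key prefix Spark expects for partition columns.

    CREATE TABLE ... PARTITIONED BY (dt) with LOCATION 's3a://bucket/db/table'
    means files for dt=2019-05-01 live under .../table/dt=2019-05-01/.
    """
    suffix = "/".join(f"{key}={value}" for key, value in partitions.items())
    return f"{location.rstrip('/')}/{suffix}"
```

Keeping writers and external tables agreed on this layout is what lets a partitioned query prune whole prefixes instead of scanning the full bucket.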
67.
QUERYING DATA: Batches with SQL
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
Solution
68.
QUERYING DATA: Batches with SQL
DISTRIBUTE BY (dt):
Only one file, not sorted.
CLUSTER BY (dt, user_id, column_b):
Multiple files.
DISTRIBUTE BY (dt) SORT BY (user_id, column_b):
Only one file, sorted by user_id, column_b.
Good for joins on these columns.
69.
QUERYING DATA: Batches with SQL
INSERT OVERWRITE TABLE database.some_name
PARTITION(dt)
SELECT
user_id,
column_b,
dt
FROM other_table
...
DISTRIBUTE BY (dt) SORT BY (user_id)
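Conceptually, DISTRIBUTE BY dt SORT BY user_id shuffles rows so that each dt value lands in one output group, with rows sorted inside it. A rough local simulation in plain Python (not Spark; names are illustrative):

```python
from collections import defaultdict

def distribute_and_sort(rows, distribute_key, sort_key):
    """Group rows by distribute_key (one 'output file' per value),
    then sort rows within each group by sort_key, like
    DISTRIBUTE BY (distribute_key) SORT BY (sort_key)."""
    files = defaultdict(list)
    for row in rows:
        files[row[distribute_key]].append(row)
    return {k: sorted(group, key=lambda r: r[sort_key])
            for k, group in files.items()}
```

In Spark the grouping happens across executors during the shuffle, but the resulting file-per-partition, sorted-within-file layout is the same idea.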
84.
DATA EXPLOITATION: Data Scientists
Cross-team workshops: BI, data scientists, data engineers.
Create a wiki with: guidelines, tips, links...
Learn from your mistakes!
Well done!
91.
TIPS: Bigger is better
Small instances, 20 slots each (4 × 20 = 80 slots), with 6-slot executors:
• Total used: 3 × 6 × 4 = 72
• Total wasted: 2 × 4 = 8
One big instance, 80 slots, with 6-slot executors:
• Total used: 13 × 6 = 78
• Total wasted: 2
We prefer big Hadoop instances.
We use m4.16xlarge instances.
We are migrating to m5.24xlarge instances.
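The packing arithmetic above (assuming fixed-size 6-slot executors on both instance shapes) can be checked with a few lines:

```python
def executor_packing(slots_per_node, nodes, executor_slots=6):
    """Slots used vs. wasted when packing fixed-size executors onto nodes.

    Each node fits floor(slots_per_node / executor_slots) executors;
    the remainder on every node is wasted capacity.
    """
    executors_per_node = slots_per_node // executor_slots
    used = executors_per_node * executor_slots * nodes
    wasted = slots_per_node * nodes - used
    return used, wasted

print(executor_packing(20, 4))  # four small nodes → (72, 8)
print(executor_packing(80, 1))  # one big node → (78, 2)
```

Same total capacity, but the bigger node leaves one 2-slot remainder instead of four, which is the point of the slide.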