December 15, 2014
Luigi
NYC Data Science meetup
What is Luigi
Luigi is a workflow engine
If you run 10,000+ Hadoop jobs every day, you need one
If you play around with batch processing just for fun, you want one
Doesn’t help you with the code, that’s what Scalding, Pig, or anything else is good at
It helps you with the plumbing of connecting lots of tasks into complicated pipelines,
especially if those tasks run on Hadoop
2
What do we use it for?
Music recommendations
A/B testing
Top lists
Ad targeting
Label reporting
Dashboards
… and a million other things!
3
Currently running 10,000+ Hadoop jobs every day
On average a Hadoop job is launched every 10s
There’s 2,000+ Luigi tasks in production
4
Some history
… let’s go back to 2008!
5
The year was 2008
I was writing my master’s thesis about music recommendations
Had to run hundreds of long-running tasks to compute the output
6
Toy example: classify skipped tracks
$ python subsample_extract_features.py /log/endsongcleaned/2011-01-?? /tmp/subsampled
$ python train_model.py /tmp/subsampled model.pickle
$ python inspect_model.py model.pickle
7
Log d, Log d+1, ..., Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output
Reproducibility matters
…and automation.
The previous code is really hard to run again
8
Let’s make it into a big workflow
9
$ python run_everything.py
Reality: crashes will happen
10
How do you resume this?
Ability to resume matters
When you are developing something interactively, you will try and fail a lot
Failures will happen, and you want to resume once you’ve fixed it
You want the system to figure out exactly what it has to re-run and nothing else
Atomic file operations are crucial for the ability to resume
11
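A minimal sketch of that atomic-write idea (a hypothetical helper, not Luigi’s internals): write to a temporary file in the same directory, then rename it into place, so a crashed run never leaves a half-written output behind.

import os
import tempfile

def atomic_write(path, data):
    # os.rename is atomic on POSIX within one filesystem, so readers
    # either see the old file, the new file, or no file - never a partial one
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.rename(tmp, path)
    except BaseException:
        os.remove(tmp)
        raise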
So let’s make it possible to resume
12
13
But there are still annoying parts
Hardcoded junk
Generalization matters
You should be able to re-run your entire pipeline with a new value for a parameter
Command line integration means you can run interactive experiments
14
… now we’re getting something
15
$ python run_everything.py --date-first 2014-01-01 --date-last 2014-01-31 --n-trees 200
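A hypothetical sketch of what run_everything.py’s parameter handling might have looked like at this stage, assuming plain argparse (this is not the original code):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--date-first", required=True)
parser.add_argument("--date-last", required=True)
parser.add_argument("--n-trees", type=int, default=10)
args = parser.parse_args()
# ... then thread args.date_first, args.date_last and args.n_trees through
# every task, and check which outputs already exist before re-running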
16
… but it’s hardly readable
BOILERPLATE
Boilerplate matters!
We keep re-implementing the same functionality
Let’s factor it out to a framework
17
A lot of real-world data pipelines are a lot more complex
The ideal framework should make it trivial to build up big data pipelines where dependencies are non-trivial (e.g. depend on date algebra)
18
So I started thinking
Wanted to build something like GNU Make
19
What is Make and why is it pretty cool?
Build reusable rules
Specify what you want to build and then backtrack to find out what you need in order to get there
Reproducible runs
20
# the compiler: gcc for C program, define as g++ for C++
CC = gcc

# compiler flags:
# -g adds debugging information to the executable file
# -Wall turns on most, but not all, compiler warnings
CFLAGS = -g -Wall

# the build target executable:
TARGET = myprog

all: $(TARGET)

$(TARGET): $(TARGET).c
	$(CC) $(CFLAGS) -o $(TARGET) $(TARGET).c

clean:
	$(RM) $(TARGET)
We want something that works for a wide range of systems
We need to support lots of systems
“80% of data science is data munging”
21
Data processing needs to interact with lots of systems
Need to support practically any type of task:
Hadoop jobs
Database dumps
Ingest into Cassandra
Send email
SCP file somewhere else
22
My first attempt: builder
Use XML config to build up the dependency graph!
23
Don’t use XML
… seriously, don’t use it
24
Dependencies need code
Pipelines deployed in production often define dependencies between tasks in nontrivial ways
… and many other cases
25
Recursion (and date algebra)
BloomFilter(date=2014-05-01)
BloomFilter(date=2014-04-30)
Log(date=2014-04-30)
Log(date=2014-04-29)
...
Date algebra
Toplist(date_interval=2014-01)
Log(date=2014-01-01)
Log(date=2014-01-02)
...
Log(date=2014-01-31)
Enum types
IdMap(type=artist) IdMap(type=track)
IdToIdMap(from_type=artist, to_type=track)
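Jumping ahead a bit, the recursion example above might look like this sketch in Luigi terms; BloomFilter and Log are the hypothetical task names from the slide, not production code:

import datetime
import luigi

class Log(luigi.ExternalTask):
    # a day of logs, produced outside this pipeline
    date = luigi.DateParameter()

class BloomFilter(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        # recursive dependency: each day's filter needs yesterday's filter
        # and yesterday's log (a real task would also need a base case to
        # stop the recursion at some earliest date)
        yesterday = self.date - datetime.timedelta(days=1)
        return [BloomFilter(date=yesterday), Log(date=yesterday)]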
Don’t ever invent your own DSL
“It’s better to write domain specific code in a general purpose language than to write general purpose code in a domain specific language” – unknown author
Oozie is a good example of how messy it gets
26
2009: builder2
Solved all the things I just mentioned
- Dependency graph specified in Python
- Support for arbitrary tasks
- Error emails
- Support for lots of common data plumbing stuff: Hadoop jobs, Postgres, etc
- Lots of other things :)
27
Graphs!
28
More graphs!
29
Even more graphs!
30
What were the good bits?
Build up dependency graphs and visualize them
Going from development to deployment was a non-event
Built-in HDFS integration but decoupled from the core library
What went wrong?
Still too much boilerplate
Pretty bad command line integration
31
32
Introducing Luigi
A workflow engine in Python
33
Luigi – History at Spotify
Late 2011: Elias Freider and I build it and release it into the wild at Spotify; people start using it
“The Python era”
Late 2012: Open source it
Early 2013: First known company outside of Spotify: Foursquare
34
Luigi is your friendly plumber
Simple dependency definitions
Emphasis on Hadoop/HDFS integration
Atomic file operations
Data flow visualization
Command line integration
35
Luigi Task
36
Luigi Task – breakdown
37
The business logic of the task
Where it writes output
What other tasks it depends on
Parameters for this task
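Slides 36–37 showed the task as a screenshot that didn’t survive this export; here is a minimal reconstruction consistent with the log output on the next slide (the deck’s exact code isn’t recoverable):

import luigi

class SomeOtherTask(luigi.Task):
    param = luigi.IntParameter(default=42)      # parameters for this task

    def output(self):                           # where it writes output
        return luigi.LocalTarget('/tmp/foo/baz-%d.txt' % self.param)

    def run(self):                              # the business logic
        with self.output().open('w') as f:
            f.write('hello\n')

class MyTask(luigi.Task):
    param = luigi.IntParameter(default=42)

    def requires(self):                         # what other tasks it depends on
        return SomeOtherTask(param=self.param)

    def output(self):
        return luigi.LocalTarget('/tmp/foo/bar-%d.txt' % self.param)

    def run(self):
        with self.output().open('w') as f:
            f.write('hello, world\n')

if __name__ == '__main__':
    luigi.run()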
Easy command line integration
So easy that you want to use Luigi for it
38
$ python my_task.py MyTask --param 43
INFO: Scheduled MyTask(param=43)
INFO: Scheduled SomeOtherTask(param=43)
INFO: Done scheduling tasks
INFO: [pid 20235] Running SomeOtherTask(param=43)
INFO: [pid 20235] Done SomeOtherTask(param=43)
INFO: [pid 20235] Running MyTask(param=43)
INFO: [pid 20235] Done MyTask(param=43)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$ cat /tmp/foo/bar-43.txt
hello, world
$
Let’s go back to the example
39
Log d, Log d+1, ..., Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier → Look at the output
Code in Luigi
40
Extract the features
41
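Slide 41’s code was a screenshot lost in this export; the sketch below shows the shape such a task takes, assuming Luigi’s Hadoop contrib API. EndSongCleaned stands in for the external log data and the mapper body is made up:

import luigi
import luigi.contrib.hadoop
import luigi.contrib.hdfs

class EndSongCleaned(luigi.ExternalTask):
    date_interval = luigi.DateIntervalParameter()

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/log/endsongcleaned/%s' % self.date_interval)

class SubsampleFeatures(luigi.contrib.hadoop.JobTask):
    date_interval = luigi.DateIntervalParameter()
    test = luigi.BoolParameter(default=False)

    def requires(self):
        return EndSongCleaned(date_interval=self.date_interval)

    def output(self):
        return luigi.contrib.hdfs.HdfsTarget('/tmp/subsampled/%s' % self.date_interval)

    def mapper(self, line):
        # made-up subsampling: keep ~1% of log lines and emit them as feature rows
        if hash(line) % 100 == 0:
            yield 'features', line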
$ python demo.py SubsampleFeatures --date-interval 2013-11-01
DEBUG: Checking if SubsampleFeatures(test=False, date_interval=2013-11-01) is complete
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Checking if EndSongCleaned(date_interval=2013-11-01) is complete
INFO: Done scheduling tasks
DEBUG: Asking scheduler for work...
DEBUG: Pending tasks: 1
INFO: [pid 24345] Running SubsampleFeatures(test=False, date_interval=2013-11-01)
...
INFO: 13/11/08 02:15:11 INFO streaming.StreamJob: Tracking URL: http://lon2-hadoopmaster-a1.c.lon.spotify.net:50030/jobdetails.jsp?jobid=job_201310180017_157113
INFO: 13/11/08 02:15:12 INFO streaming.StreamJob: map 0% reduce 0%
INFO: 13/11/08 02:15:27 INFO streaming.StreamJob: map 2% reduce 0%
INFO: 13/11/08 02:15:30 INFO streaming.StreamJob: map 7% reduce 0%
...
INFO: 13/11/08 02:16:10 INFO streaming.StreamJob: map 100% reduce 87%
INFO: 13/11/08 02:16:13 INFO streaming.StreamJob: map 100% reduce 100%
INFO: [pid 24345] Done SubsampleFeatures(test=False, date_interval=2013-11-01)
DEBUG: Asking scheduler for work...
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker was stopped. Shutting down Keep-Alive thread
$
Run on the command line
42
Step 2: Train a machine learning model
43
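Slide 43 was also a code screenshot; this is a Python 2-era sketch of what such a task could look like, assuming scikit-learn’s RandomForestClassifier and a hypothetical load_features helper:

import pickle
import luigi
from sklearn.ensemble import RandomForestClassifier

class TrainClassifier(luigi.Task):
    date_interval = luigi.DateIntervalParameter()
    n_trees = luigi.IntParameter(default=10)

    def requires(self):
        return SubsampleFeatures(date_interval=self.date_interval)

    def output(self):
        return luigi.LocalTarget('/tmp/classifier-%s-%d.pickle' % (self.date_interval, self.n_trees))

    def run(self):
        X, y = load_features(self.input())   # hypothetical helper that reads the feature rows
        clf = RandomForestClassifier(n_estimators=self.n_trees)
        clf.fit(X, y)
        with self.output().open('w') as f:  # Python 2 text mode; use a binary target on Python 3
            pickle.dump(clf, f)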
Let’s run everything on the command line from scratch
$ python luigi_workflow_full.py InspectModel --date-interval 2011-01-03
DEBUG: Checking if InspectModel(date_interval=2011-01-03, n_trees=10) is complete
INFO: Scheduled InspectModel(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-03, n_trees=10) (PENDING)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-03) (PENDING)
INFO: Scheduled EndSongCleaned(date=2011-01-03) (DONE)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running
SubsampleFeatures(test=False, date_interval=2011-01-03)
INFO: 14/12/17 02:07:20 INFO mapreduce.Job: Running job: job_1418160358293_86477
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: Job job_1418160358293_86477 running in uber mode : false
INFO: 14/12/17 02:07:31 INFO mapreduce.Job: map 0% reduce 0%
INFO: 14/12/17 02:08:34 INFO mapreduce.Job: map 2% reduce 0%
INFO: 14/12/17 02:08:36 INFO mapreduce.Job: map 3% reduce 0%
INFO: 14/12/17 02:08:38 INFO mapreduce.Job: map 5% reduce 0%
INFO: 14/12/17 02:08:39 INFO mapreduce.Job: map 10% reduce 0%
INFO: 14/12/17 02:08:40 INFO mapreduce.Job: map 17% reduce 0%
INFO: 14/12/17 02:16:30 INFO mapreduce.Job: map 100% reduce 100%
INFO: 14/12/17 02:16:32 INFO mapreduce.Job: Job job_1418160358293_86477 completed successfully
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done
SubsampleFeatures(test=False, date_interval=2011-01-03)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running
TrainClassifier(date_interval=2011-01-03, n_trees=10)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done
TrainClassifier(date_interval=2011-01-03, n_trees=10)
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) running
InspectModel(date_interval=2011-01-03, n_trees=10)
time 0.1335%
ms_played 96.9351%
shuffle 0.0728%
local_track 0.0000%
bitrate 2.8586%
INFO: [pid 23869] Worker Worker(salt=912880805, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern, pid=23869) done
InspectModel(date_interval=2011-01-03, n_trees=10)
44
Let’s make it more complicated – cross validation
45
Log d, Log d+1, ..., Log d+k-1 → Subsample and extract features → Subsampled features → Train classifier → Classifier
Log e, Log e+1, ..., Log e+k-1 → Subsample and extract features → Subsampled features
Classifier + Subsampled features → Cross validation
Cross validation implementation
$ python xv.py CrossValidation --date-interval-a 2012-11-01 --date-interval-b 2012-11-02
46
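Likewise a sketch rather than the deck’s code: load_model and auc are hypothetical helpers, and the requires() structure is inferred from the tasks scheduled in the log on the next slide.

import luigi

class CrossValidation(luigi.Task):
    date_interval_a = luigi.DateIntervalParameter()   # training interval
    date_interval_b = luigi.DateIntervalParameter()   # held-out test interval
    n_trees = luigi.IntParameter(default=10)

    def requires(self):
        return {
            'model': TrainClassifier(date_interval=self.date_interval_a, n_trees=self.n_trees),
            'train': SubsampleFeatures(date_interval=self.date_interval_a),
            'test': SubsampleFeatures(date_interval=self.date_interval_b),
        }

    def run(self):
        model = load_model(self.input()['model'])     # hypothetical helper
        print('%s (train) AUC: %.4f' % (self.date_interval_a, auc(model, self.input()['train'])))
        print('%s ( test) AUC: %.4f' % (self.date_interval_b, auc(model, self.input()['test'])))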
Run on the command line
$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=10) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
INFO: Done scheduling tasks
INFO: Running Worker with 1 processes
INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=18533) running CrossValidation(date_interval_a=2011-01-01,
date_interval_b=2011-01-02)
2011-01-01 (train) AUC: 0.9040
2011-01-02 ( test) AUC: 0.9040
INFO: [pid 18533] Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=18533) done CrossValidation(date_interval_a=2011-01-01,
date_interval_b=2011-01-02)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=752525444, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=18533) was stopped. Shutting down Keep-Alive thread
47
… no overfitting!
More trees!
$ python cross_validation.py CrossValidation --date-interval-a 2011-01-01 --date-interval-b 2011-01-02 --n-trees 100
INFO: Scheduled CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02, n_trees=100) (PENDING)
INFO: Scheduled TrainClassifier(date_interval=2011-01-01, n_trees=100) (PENDING)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-02) (DONE)
INFO: Scheduled SubsampleFeatures(test=False, date_interval=2011-01-01) (DONE)
INFO: Done scheduling tasks
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) running TrainClassifier(date_interval=2011-01-01, n_trees=100)
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) done TrainClassifier(date_interval=2011-01-01, n_trees=100)
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) running CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02,
n_trees=100)
2011-01-01 (train) AUC: 0.9074
2011-01-02 ( test) AUC: 0.8896
INFO: [pid 27835] Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net,
username=erikbern, pid=27835) done CrossValidation(date_interval_a=2011-01-01, date_interval_b=2011-01-02,
n_trees=100)
INFO: Done
INFO: There are no more tasks to run at this time
INFO: Worker Worker(salt=539404294, workers=1, host=lon3-edgenode-a22.lon3.spotify.net, username=erikbern,
pid=27835) was stopped. Shutting down Keep-Alive thread
48
… overfitting!
The nice things about Luigi
50
Overhead for a task is about 5 lines (class def + requires + output + run)
Easy command line integration
Minimal boilerplate
51
Everything is a directed acyclic graph
Makefile style
Tasks specify what they depend on, not what other things depend on them
52
Luigi’s visualizer
53
Dive into any task
54
Run with multiple workers
$ python dataflow.py --workers 3 AggregateArtists --date-interval 2013-W08
55
Error notifications
56
Process synchronization
[Diagram: Luigi worker 1 and Luigi worker 2 each run overlapping task graphs (A, B, C, F), coordinated by the Luigi central planner]
Prevents the same task from being run simultaneously, but all execution is done by the workers.
57
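In practice the central planner is the luigid daemon, and workers point at it instead of using the local scheduler. A sketch of the usual setup (flags from stock Luigi as I recall them, so treat the exact names as an assumption):

$ luigid --port 8082                                        # central scheduler, also serves the visualizer
$ python my_task.py MyTask --param 43 --scheduler-host localhost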
Luigi is a way of coordinating lots of different tasks
… but you still have to figure out how to implement and scale them!
58
Do general-purpose stuff
Don’t focus on a specific platform
… but it comes “batteries included”
59
Built-in support for HDFS & Hadoop
At Spotify we’re abandoning Python for batch processing tasks, replacing it with Crunch and Scalding. Luigi is a great glue!
Our team, the Lambda team: 15 engineers, running 1,000+ Hadoop jobs daily, with 400+ Luigi tasks in production.
Our recommendation pipeline is a good example: Python M/R jobs, ML algos in C++, Java M/R jobs, Scalding, ML stuff in Python using scikit-learn, import stuff into Cassandra, import stuff into Postgres, send email reports, etc.
60
The one time we accidentally deleted 50TB of data
We didn’t have to write a single line of code to fix it – Luigi rescheduled thousands of tasks and ran for 3 days
61
Some things are still not perfect
62
The missing parts
Execution is tied to scheduling – you can’t schedule something to run “in the cloud”
Visualization could be a lot more useful
There’s no built-in scheduling – you have to rely on crontab
These are all things we have in the backlog
63
What are some ideas for the future?
64
Separate scheduling and execution
65
[Diagram: a Luigi central scheduler dispatching work to many slaves]
Luigi in Scala?
66
Luigi implements some core beliefs
The #1 focus is on removing all boiler plate
The #2 focus is to be as general as possible
The #3 focus is to make it easy to go from test to production
67
Join the club!
Questions?
69