Building a modern data platform with Scala, Akka, Apache Beam

I gave a talk at the Scala Meetup on 20 September 2018 on building a modern data platform with Scala, Akka, Apache Beam. The references are as follows:
- Dataflow/Apache Beam (streamingsystems.org/Slides/Eugene Kirpichov - STREAM 2016 Dataflow and Apache Beam.pdf)
- The Dataflow Model (https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)
- MillWheel (https://ai.google/research/pubs/pub41378)
- FlumeJava (https://ai.google/research/pubs/pub35650)
- Why Curiosity Matters (https://hbr.org/2018/09/curiosity)
- Spotify Scio (https://github.com/spotify/scio)
- Typelevel Cats (typelevel.org/cats)
- Verizon Quiver (https://github.com/Verizon/quiver)
- Streaming 101 (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)
- Streaming 102 (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102)
- Beam vs Spark (https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison)
- Hierarchical scheduling in diverse data center workloads (https://people.eecs.berkeley.edu/~alig/papers/h-drf.pdf)
- Beam comparison (https://github.com/dataArtisans/beam_comp)
- Dataflow Pipeline Execution Parameters (https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-local-pipeline-options)

Components
Front-End
Aggregation
Pipeline
Database
Caches
Secrets
Proxies
Machine Learning

*Focal Points
• Black boxes

• Data flow patterns

• Particularly important when you are designing for a migration

• Data correctness requirements

• Resist the temptation to build the “ideal” system
* Might be different for you

You cannot change something if you don’t understand how it works.
More IMPORTANTLY, you cannot change something if you don’t understand why it works the way it does.
- Unknown

Team dynamics
• Know where the team is at

• Know where the team should be (roughly)

• Know how to effect change through the team, effectively

• Training/Re-training

• Changing mindsets is the hardest!

Arming the team
• Recognise that learning requires time
• Recognise that applying the learnt knowledge requires time
• Recognise that being effective at applying knowledge requires time

There is NO perfect data architecture
What you need now is going to be different from what you need in the future

Create a Culture of Learning & Appetite for Adventure
This is really important

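API : Model : Engine
•Proper abstraction to support both streaming and batch processing
•Decomposes a pipeline into
•What results are computed
•Where in event time they are computed
•When in processing time they are materialized
•How refinements of results relate
•Separates data processing logic from the underlying physical implementation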
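
To make the "What" and "Where" questions concrete, here is a minimal sketch in plain Scala collections. It does not use the Beam or Scio APIs; `Event`, `windowStart` and `sumPerWindow` are hypothetical names chosen for illustration, and timestamps are assumed to be non-negative milliseconds. The "When" and "How" questions (triggers and accumulation) are not modelled here.

```scala
object WindowedSum {
  // Hypothetical record: a keyed value stamped with its event time (ms).
  final case class Event(key: String, value: Long, eventTimeMs: Long)

  // "Where": map an event time to the start of its fixed (tumbling) window.
  // Assumes eventTimeMs >= 0 and sizeMs > 0.
  def windowStart(eventTimeMs: Long, sizeMs: Long): Long =
    eventTimeMs - (eventTimeMs % sizeMs)

  // "What": sum the values for each (key, window start) pair.
  def sumPerWindow(events: Seq[Event], sizeMs: Long): Map[(String, Long), Long] =
    events
      .groupBy(e => (e.key, windowStart(e.eventTimeMs, sizeMs)))
      .map { case (keyAndWindow, es) => keyAndWindow -> es.map(_.value).sum }
}
```

Because grouping is driven by event time rather than arrival order, the same function gives the same answer whether the events arrive as one batch or out of order, which is the property the unified model relies on.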

Beam
* Read Google’s VLDB paper - see reference

Why Beam - Pipeline decomposition
* Source: https://data-artisans.com/blog/why-apache-beam

Why Beam - Programming model
* Source: https://data-artisans.com/blog/why-apache-beam

Compute. Scaling Compute. Diverse Workloads

What is Mesos
Read the technical paper; see References

Why Mesos - Part 1
• Our DSL’s scheduling logic is greatly simplified because we don’t have to consider:
• Framework requirements
• Resource availability
• Organizational policies
• Global schedule of tasks
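
As a rough illustration of why the scheduling logic stays small, the sketch below models resource offers in plain Scala. It is not the Mesos scheduler API; `Offer`, `Task` and `place` are hypothetical names, and real offers carry more than CPU and memory. The point is only that the framework reacts to offers it is handed instead of tracking cluster-wide availability itself.

```scala
object OfferMatching {
  // Hypothetical, simplified offer: free resources on one agent.
  final case class Offer(agent: String, cpus: Double, memMb: Int)

  // Hypothetical resource request for a single job.
  final case class Task(name: String, cpus: Double, memMb: Int)

  // Accept the first offer large enough for the task.
  // None means every offer is declined and the task waits for the next round.
  def place(task: Task, offers: List[Offer]): Option[Offer] =
    offers.find(o => o.cpus >= task.cpus && o.memMb >= task.memMb)
}
```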

Why Mesos - Part 2
• Beam pipelines are scheduled by the DSL

• Developers focus on building Beam job(s); the DSL strings the jobs together

• Developers are free from worrying about where resources are; that is solved by the Mesos resource-offer mechanism.
All the architectural decisions should favour enabling the system to adapt to change
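
One way to picture a DSL stringing jobs together is as a dependency-ordered run list. The sketch below is an assumption about the general idea, not the DSL from the talk; `Job` and `schedule` are hypothetical names, and each job stands in for a Beam pipeline.

```scala
import scala.annotation.tailrec

object JobChain {
  // Hypothetical job description: a name plus the jobs it must wait for.
  final case class Job(name: String, dependsOn: Set[String] = Set.empty)

  // Order jobs so that each one runs after all of its dependencies.
  // Returns None if there is a cycle or a dependency that is never defined.
  def schedule(jobs: List[Job]): Option[List[String]] = {
    @tailrec
    def loop(pending: List[Job], done: Vector[String]): Option[List[String]] =
      if (pending.isEmpty) Some(done.toList)
      else {
        // Jobs whose dependencies have all completed can run now.
        val (ready, blocked) =
          pending.partition(job => job.dependsOn.forall(d => done.contains(d)))
        if (ready.isEmpty) None
        else loop(blocked, done ++ ready.map(_.name))
      }
    loop(jobs, Vector.empty)
  }
}
```

With an ordering like this, the developer only declares what each job depends on; where each job runs is left to the cluster manager.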

Observations
• There is NO perfect data architecture
• What you need now is going to be different from what you need in the future
• Build a team that adapts to change; learning is key.

References
• Dataflow / Apache Beam - Eugene Kirpichov

• The Dataflow Model - Tyler Akidau, Sam Whittle et al

• MillWheel : Fault-Tolerant Stream Processing at Internet Scale - Tyler Akidau, Sam Whittle et al

• FlumeJava : Easy, Efficient Parallel Data Pipelines - Craig Chambers, Nathan Weizenbaum et al

• Mesos - Matei Zaharia et al

• Why Curiosity Matters - Harvard Business Review September 2018

• Spotify Scio - Spotify’s Scala API around Apache Beam

• Typelevel Cats - Lightweight, modular, and extensible library for functional programming in Scala

• Verizon Quiver : A reasonable library for modelling multi-graphs in Scala

• Scala - The Scala Programming Language

References
• Apache Beam VLDB paper - Tyler Akidau et al @ Google

• Streaming 101

• Streaming 102

• Beam vs Spark

• Hierarchical scheduling in diverse data center workloads : Bhattacharya, Ali Ghodsi, Ion Stoica et al

• Beam Comparison : Data Artisans

• Dataflow/Beam & Spark : A programming model comparison : Tyler Akidau et al @ Google

• Dataflow Pipeline Execution Parameters

1. Design. Build. Data Pipelines

2. Nature of our systems API API

6. Really, don’t

8. Deciding what to do

14. Systems Design in Data Pipeline

16. REALISE THIS IS THIS

29. DSL <=> Beam pipeline

30. DSL <=> Beam pipeline

31. Data types

32. Patterns

33. Monads ∈ DSL

34. Monads ∈ DSL

35. Monad Transformers ∈ DSL

37. Data Architecture
