Building a modern data platform with scala, akka, apache beam

I gave a talk at the Scala Meetup on 20 September 2018 on building a modern data platform with Scala, Akka, and Apache Beam.

The references are as follows:
- Dataflow/Apache Beam (streamingsystems.org/Slides/Eugene Kirpichov - STREAM 2016 Dataflow and Apache Beam.pdf)
- The Dataflow Model (https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)
- MillWheel (https://ai.google/research/pubs/pub41378)
- FlumeJava (https://ai.google/research/pubs/pub35650)
- Why Curiosity Matters (https://hbr.org/2018/09/curiosity)
- Spotify Scio (https://github.com/spotify/scio)
- Typelevel Cats (typelevel.org/cats)
- Verizon Quiver (https://github.com/Verizon/quiver)
- Streaming 101 (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)
- Streaming 102 (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102)
- Beam vs Spark (https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison)
- Hierarchical scheduling in diverse data center workloads (https://people.eecs.berkeley.edu/~alig/papers/h-drf.pdf)
- Beam comparison (https://github.com/dataArtisans/beam_comp)
- Dataflow Pipeline Execution Parameters (https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-local-pipeline-options)

  1. 1. Design. Build. Data Pipelines
  2. 2. Nature of our systems API API
  3. 3. Components Front-End Aggregation Pipeline Database Caches Secrets Proxies Machine Learning
  4. 4. Really, don’t
  5. 5. Deciding what to do
  6. 6. Systems Design in Data Pipeline
  7. 7. *Focal Points • Black boxes • Data flow patterns • Particularly important when you are designing for a migration • Data correctness requirements • Resist the temptation to build the “ideal” system * Might be different for you
  8. 8. REALISE THIS IS THIS
  9. 9. You cannot change something if you don’t understand how it works. More IMPORTANTLY, you cannot change something if you don’t understand why it works the way it does. - unknown
  10. 10. Team dynamics • Know where the team is at • Know where the team should be (roughly) • How to effect changes by the team, effectively • Training/Re-training • Changing mindsets is hardest !
  11. 11. Arming the team •Recognise that learning requires time •Recognise that applying the learnt knowledge requires time •Recognise that being effective at applying knowledge requires time
  12. 12. There is NO perfect data architecture What you need now is going to be different from what you need in the future
  13. 13. Create a Culture of Learning & Appetite for Adventure This is really important
  14. 14. API : Model : Engine •Proper abstraction to support both streaming and batching •Decomposes pipeline into •What •Where •When •How •Separate data processing from the underlying physical implementation
  15. 15. Beam * Read Google’s VLDB paper - see reference
  16. 16. Why Beam - Pipeline decomposition * source: https://data-artisans.com/blog/why-apache-beam
  17. 17. Why Beam - Programming model Source: https://data-artisans.com/blog/why-apache-beam
  18. 18. DSL <=> Beam pipeline
  19. 19. DSL <=> Beam pipeline
  20. 20. Data types
  21. 21. Patterns
  22. 22. Monads ∈ DSL
  23. 23. Monads ∈ DSL
  24. 24. Monad Transformers ∈ DSL
  25. 25. Compute. Scaling Compute. Diverse Workloads
  26. 26. Data Architecture
  27. 27. What is Mesos? Read the technical paper; see reference
  28. 28. Why Mesos - Part 1 • Our DSL’s scheduling logic is greatly simplified because we don’t have to consider: • Framework requirement • Resource availability • Organizational policies • Global schedule of tasks
  29. 29. Why Mesos - Part 2 • Beam pipelines are scheduled by the DSL • Developers focus on building Beam job(s); jobs are strung together by the DSL • Developers are free from worrying about where resources are; that is solved by the Mesos resource-offering framework. All the architectural decisions should favour enabling the system to adapt to change
  30. 30. Observations •There is NO perfect data architecture •What you need now is going to be different from what you need in the future •Build a team that adapts to change; learning is key.
  31. 31. References • Dataflow / Apache Beam - Eugene Kirpichov • The Dataflow Model - Tyler Akidau, Sam Whittle et al • MillWheel : Fault-Tolerant Stream Processing at Internet Scale - Tyler Akidau, Sam Whittle et al • FlumeJava : Easy, Efficient Parallel Data Pipelines - Craig Chambers, Nathan Weizenbaum et al • Mesos - Matei Zaharia et al • Why Curiosity Matters - Harvard Business Review September 2018 • Spotify Scio - Spotify’s Scala API around Apache Beam • Typelevel Cats : Lightweight, modular, and extensible library for functional programming in Scala • Verizon Quiver : A reasonable library for modelling multi-graphs in Scala • Scala - The Scala Programming Language
  32. 32. References • Apache Beam VLDB paper - Tyler Akidau et al @ Google • Streaming 101 • Streaming 102 • Beam vs Spark • Hierarchical scheduling in diverse data center workloads : Bhattacharya, Ali Ghodsi, Ion Stoica et al • Beam Comparison : Data Artisans • Dataflow/Beam & Spark : A programming model comparison : Tyler et al @ Google • Dataflow Pipeline Execution Parameters
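
Slide 14 decomposes a pipeline into What, Where, When and How. The toy model below, written in plain Scala with no libraries, is one way to picture those four questions. It is a minimal sketch under assumptions: it is not the Beam or Scio API and not the DSL from the talk, and every name in it (Event, Window, FourQuestions, fixedWindow, sumPerKeyAndWindow, ready, refine) is hypothetical. It also assumes non-negative event times in milliseconds.

```scala
// Toy model of the What / Where / When / How questions.
// NOT the Beam API: all names here are hypothetical illustrations.
final case class Event(key: String, value: Long, eventTimeMs: Long)
final case class Window(startMs: Long, endMs: Long) // covers [startMs, endMs)

sealed trait Accumulation
case object Discarding extends Accumulation   // each pane reports only new data
case object Accumulating extends Accumulation // each pane reports the running total

object FourQuestions {
  // Where (in event time): assign an event to a fixed, non-overlapping window.
  // Assumes eventTimeMs >= 0 and sizeMs > 0.
  def fixedWindow(sizeMs: Long)(e: Event): Window = {
    val start = e.eventTimeMs - (e.eventTimeMs % sizeMs)
    Window(start, start + sizeMs)
  }

  // What: the computation itself, here a sum of values per key and window.
  def sumPerKeyAndWindow(events: Seq[Event], sizeMs: Long): Map[(String, Window), Long] =
    events
      .groupBy(e => (e.key, fixedWindow(sizeMs)(e)))
      .map { case (keyAndWindow, es) => keyAndWindow -> es.map(_.value).sum }

  // When (in processing time): emit a window only after the watermark passes its end.
  def ready(results: Map[(String, Window), Long], watermarkMs: Long): Map[(String, Window), Long] =
    results.filter { case ((_, w), _) => w.endMs <= watermarkMs }

  // How: relate a new pane to the previous one for the same key and window.
  def refine(mode: Accumulation, previous: Long, delta: Long): Long = mode match {
    case Discarding   => delta
    case Accumulating => previous + delta
  }

  def main(args: Array[String]): Unit = {
    val events = Seq(
      Event("a", 1L, 1000L),
      Event("a", 2L, 59000L),
      Event("a", 5L, 61000L),
      Event("b", 3L, 30000L)
    )
    val sums = sumPerKeyAndWindow(events, sizeMs = 60000L)
    // With the watermark at one minute, only the first window is complete.
    ready(sums, watermarkMs = 60000L).foreach(println)
  }
}
```

Because the four functions are independent, the same sum could be re-windowed or re-triggered without touching the aggregation logic, which is the separation of data processing from physical execution that the slide describes.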
