Gave a talk at the Scala Meetup on 20 September 2018 on building a modern data platform with Scala, Akka, and Apache Beam.
The references are as follows:
- Dataflow/Apache Beam (streamingsystems.org/Slides/Eugene Kirpichov - STREAM 2016 Dataflow and Apache Beam.pdf)
- The Dataflow Model (https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf)
- MillWheel (https://ai.google/research/pubs/pub41378)
- FlumeJava (https://ai.google/research/pubs/pub35650)
- Why Curiosity Matters (https://hbr.org/2018/09/curiosity)
- Spotify Scio (https://github.com/spotify/scio)
- Typelevel Cats (typelevel.org/cats)
- Verizon Quiver (https://github.com/Verizon/quiver)
- Streaming 101 (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)
- Streaming 102 (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102)
- Beam vs Spark (https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison)
- Hierarchical scheduling in diverse data center workloads (https://people.eecs.berkeley.edu/~alig/papers/h-drf.pdf)
- Beam comparison (https://github.com/dataArtisans/beam_comp)
- Dataflow Pipeline Execution Parameters (https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-local-pipeline-options)
15. *Focal Points
• Black boxes
• Data flow patterns
• Particularly important when you are designing for a migration
• Data correctness requirements
• Resist the temptation to build the “ideal” system
* Might be different for you
17. You cannot change something if you don’t understand how it works.
More IMPORTANTLY, you cannot change something if you don’t understand why it works the way it does.
- unknown
18. Team dynamics
• Know where the team is at
• Know where the team should be (roughly)
• How to effect change within the team effectively
• Training/Re-training
• Changing mindsets is hardest!
19. Arming the team
• Recognise that learning requires time
• Recognise that applying the learnt knowledge requires time
• Recognise that being effective at applying knowledge requires time
20. There is NO perfect data architecture
What you need now is going to be different from what you need in the future
22. Create a Culture of Learning & Appetite for Adventure
This is really important
25. API : Model : Engine
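• Proper abstraction to support both streaming and batch processing
• Decomposes a pipeline into four questions:
• What results are computed
• Where in event time they are computed
• When in processing time they are materialised
• How refinements of results relate
• Separates data processing from the underlying physical implementation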
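The decomposition above can be illustrated without any particular engine. What follows is a minimal plain-Scala sketch, not Beam or Scio code: the names `Event`, `WindowSketch`, `fixedWindow` and `sumPerWindow` are hypothetical, chosen only for this illustration. It covers "what" (a per-key sum) and "where" (fixed event-time windows), and reduces "when" and "how" to a single result per window produced once all input has been seen.

```scala
// Hypothetical illustration only; this is not the Beam or Scio API.
final case class Event(key: String, value: Long, eventTimeMs: Long)

object WindowSketch {
  // Where: map an event to the start of the fixed event-time window it falls in.
  def fixedWindow(sizeMs: Long)(e: Event): Long =
    Math.floorDiv(e.eventTimeMs, sizeMs) * sizeMs

  // What: sum the values per (key, window start).
  // When/How: simplified to one result per window, computed after all input is seen.
  def sumPerWindow(events: Seq[Event], sizeMs: Long): Map[(String, Long), Long] =
    events
      .groupBy(e => (e.key, fixedWindow(sizeMs)(e)))
      .map { case (keyAndWindow, es) => keyAndWindow -> es.map(_.value).sum }
}
```

Because grouping uses the event timestamp rather than arrival order, an event that arrives late still lands in the window it belongs to. A real engine additionally has to decide when to emit a window and how to revise it, which this sketch leaves out.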
39. Why Mesos - Part 1
• Our DSL’s scheduling logic is greatly simplified because we don’t have to consider:
• Framework requirement
• Resource availability
• Organizational policies
• Global schedule of tasks
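Mesos hands resources to frameworks as offers, which is one reason the framework-side logic can stay small: it only has to react to what it is offered. The sketch below is an assumption-laden illustration of that idea in plain Scala and does not use the Mesos API; `Offer`, `Task`, `OfferSketch` and `onOffer` are hypothetical names, and first-fit is just one simple placement choice.

```scala
// Hypothetical illustration only; this is not the Mesos scheduler API.
final case class Offer(agent: String, cpus: Double, memMb: Long)
final case class Task(name: String, cpus: Double, memMb: Long)

object OfferSketch {
  // First-fit: launch each pending task that still fits in the offer;
  // tasks that do not fit wait for a later offer.
  // Returns (tasks to launch, tasks still waiting), both in input order.
  def onOffer(offer: Offer, pending: List[Task]): (List[Task], List[Task]) = {
    val start = (List.empty[Task], List.empty[Task], offer.cpus, offer.memMb)
    val (launched, waiting, _, _) = pending.foldLeft(start) {
      case ((l, w, cpu, mem), t) =>
        if (t.cpus <= cpu && t.memMb <= mem) (t :: l, w, cpu - t.cpus, mem - t.memMb)
        else (l, t :: w, cpu, mem)
    }
    (launched.reverse, waiting.reverse)
  }
}
```

The function never asks where machines are or how much capacity the cluster has in total; those concerns stay with the resource manager.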
40. Why Mesos - Part 2
• Beam pipelines are scheduled by the DSL
• Developers focus on building Beam jobs; the DSL strings the jobs together
• Developers are free from worrying about where resources are; that is solved by the Mesos resource-offer framework
All architectural decisions should favour enabling the system to adapt to change.
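The idea of a DSL that strings jobs together can be sketched as a small dependency graph with a topological ordering. This is not the DSL from the talk: `Job`, `Workflow`, `chain` and `executionOrder` are hypothetical names, and the ordering uses Kahn's algorithm as one straightforward choice. Where each job runs is deliberately absent, since placement is left to the resource manager.

```scala
import scala.annotation.tailrec

// Hypothetical DSL sketch; not the speaker's actual DSL.
final case class Job(name: String)

final case class Workflow(edges: Set[(Job, Job)], jobs: Set[Job]) {
  // Declare that `after` may start only once `before` has finished.
  def chain(before: Job, after: Job): Workflow =
    Workflow(edges + (before -> after), jobs + before + after)

  // Kahn's algorithm: repeatedly emit the jobs that have no unfinished upstream job.
  // Returns None when the declared dependencies contain a cycle.
  def executionOrder: Option[List[Job]] = {
    @tailrec
    def loop(remaining: Set[Job], es: Set[(Job, Job)], acc: List[Job]): Option[List[Job]] =
      if (remaining.isEmpty) Some(acc.reverse)
      else {
        val ready = remaining.filter(j => !es.exists(_._2 == j))
        if (ready.isEmpty) None
        else {
          val next = ready.toList.sortBy(_.name) // sort by name for a deterministic order
          loop(remaining -- ready, es.filterNot(e => ready(e._1)), next.reverse ::: acc)
        }
      }
    loop(jobs, edges, Nil)
  }
}

object Workflow {
  val empty: Workflow = Workflow(Set.empty, Set.empty)
}
```

A developer would declare only the jobs and their order, for example `Workflow.empty.chain(Job("ingest"), Job("clean"))`, and the scheduler derives a valid execution order from that.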
42. Observations
• There is NO perfect data architecture
• What you need now is going to be different from what you need in the future
• Build a team that adapts to change; learning is key.
43. References
• Dataflow / Apache Beam - Eugene Kirpichov
• The Dataflow Model - Tyler Akidau, Sam Whittle et al
• MillWheel : Fault-Tolerant Stream Processing at Internet Scale - Tyler Akidau, Sam Whittle et al
• FlumeJava : Easy, Efficient Parallel Data Pipelines - Craig Chambers, Nathan Weizenbaum et al
• Mesos - Matei Zaharia et al
• Why Curiosity Matters - Harvard Business Review September 2018
• Spotify Scio - Spotify’s Scala API around Apache Beam
• Typelevel Cats : Lightweight, modular, and extensible library for functional programming in Scala
• Verizon Quiver : A reasonable library for modelling multi-graphs in Scala
• Scala - The Scala Programming Language
44. References
• Apache Beam VLDB paper - Tyler Akidau et al @ Google
• Streaming 101
• Streaming 102
• Beam vs Spark
• Hierarchical scheduling in diverse data center workloads : Bhattacharya, Ali Ghodsi, Ion Stoica et al
• Beam Comparison : Data Artisans
• Dataflow/Beam & Spark : A programming model comparison : Tyler Akidau et al @ Google
• Dataflow Pipeline Execution Parameters