SlideShare une entreprise Scribd logo
1  sur  37
Automating Apache
Cassandra Operations with
Apache Airflow
Go beyond cron jobs to manage ETL, Data Hygiene, Data
Import/Export
Rahul Xavier Singh Anant Corporation | Cassandra.Link
Data Engineer’s Lunch 11/14/2022
Playbook
Design
Framework
Airflow
Spark
Approach
Airflow/Spark
Cassandra ETL Spark in Airflow
Bonus: Deleting Data in
Cassandra at Scale in Airflow
Code/Demos
Cassandra Operations with Google
Dataproc / Spark in Airflow (Astra)
SQL Queries with Presto and
Cassandra in Airflow
Airflow and Spark
Agenda
We help platform owners
reach beyond their
potential to serve a global
customer base that
demands
Everything, Now.
We design with our
Playbook, build with our
Framework, and manage
platforms with our Approach
so our clients
Think & Grow Big.
Customer Success
Challenge
Business
Platform
Playbook
Framework
Approach
Technology
Management
Solutions
[Data] Services Catalog
Fully Managed Service
Subscriptions
We offer Professional Services to engineer Solutions and
offer Managed Services to clients where it makes sense, after an
Assessment
7
Business / Platform Dream
Enterprise
Consciousness :
- People
- Processes,
- Information
- Systems
Connected /
Synchronized.
Business has been chasing
this dream for a while. As
technologies improve, this
becomes more accessible. Image Source: Digital Business
Technology Platforms, Gartner 2016
Modern
Open Data
Platform
Playbook
9
Thinking about Cassandra as a Data Fabric
XDCR: Cross datacenter
replication is the
ultimate data fabric.
Resilience,
performance,
availability, and scale.
Made widely available
by Cassandra and
Couchbase
10
Generic Data Platform Operations
Distributed Realtime Components
To create globally distributed and real time platforms, we
need to use distributed realtime technologies to build your
platform. Here are some. Which ones should you choose?
12
How do you choose from the landscape?
Lots and lots of components in the
Data & AI Landscape. Which ones are
the right ones for your business?
13
So Many Different “Modern Stacks?”
Lots of “reference” architectures
available. They tend not to think about
the speed layer since they are
focusing on analytics. Many don’t
mention realtime databases… but we
can learn from them.
Playbook /
Framework
Framework Components
● Major Components
○ Persistent Queues ( RAM/BUS)
○ Queue Processing & Compute ( CPU)
○ Persistent Storage (DISK/RAM)
○ Reporting Engine (Display)
○ Orchestration Framework (Motherboard)
○ Scheduler (Operating System)
● Strategies
○ Cloud Native on Google
○ Self-Managed Open Source
○ Self-Managed Commercial Source
○ Managed Commercial Source
Customers want options, so we decided to
create a Framework that can scale with
whatever Infrastructure and Software strategy
they want to use.
16
Playbook for Modern Open Data Platform
Platform Design Evaluate Framework
Cloud
- Public
- Private
- Hybrid
Data
- Data:Object
- Data:Stream
- Data:Table
- Data:Index
- Processor:Batch
- Processor:Stream
DataOps
- ETL/ELT/EtLT
- Reverse ETL
- Orchestration
DevOps
- Infrastructure as
Code
- Systems
Automation
- Application CICD
Architecture (Design)
- Cloud
- Data
- DevOps
- DataOps
Engineering
- Configuration
- Scripting
- Programming
Operation
- Setup / Deploy
- Monitoring/Alerts
- Administration
User Experience
- No-Code/Low Code Apps/Form Builders
- Automatic API Generator/Platform
- Customer App/API Framework
Execute Approach
Discovery (Inventory)
- People
- Process
- Information (Objects)
- Systems (Apps)
17
Framework
Data Modernization / Automation / Integration
In addition to vastly scalable tools, there are also modern
innovations that can help teams automate and maximize
human capital by making data platform management easier.
Playbook /
Approach
Approach
20
Apache Airflow +
Apache Spark +
Spark Python/Scala/Java/R +
Airflow Python DAG =
DataOps for Apache Cassandra
Good enough for rock and roll.
● Scheduling and automating workflows and tasks
● Automating repeated processes
○ Common ETL tasks
○ Machine learning model training
○ Data hygiene
○ Delta migrations
● Write workflows in Python
○ Anything Python compatible works
○ Dependencies for workflow sections
○ Workflows are a DAG of tasks
● Recurring, One-time Scheduled or Adhoc
○ Cron-like syntax or frequency tags
○ “Only run again if data changed”
● Monitor tasks and collect/view logs
Apache Airflow
Apache Spark
● Unified analytics engine
● High performance batch and streaming data
● Also has a DAG, scheduler, a query optimizer,
and a physical execution engine.
● Offers over 80 high-level operators that make
it easy to build parallel apps. And you can use
it interactively from the Scala, Python, R, and
SQL shells. C# also available.
● Powers a stack of libraries including SQL and
DataFrames, MLlib for machine learning,
GraphX, and Spark Streaming.
● You can run Spark using its standalone cluster
mode, on EC2, on Hadoop YARN, on Mesos, or
on Kubernetes. Access data in basically
anything.
Bonus Round
25
Coldish
● S3
● HDFS
● ADLS
● GFS
Warm
● Hive / *
● Data Warehouse
● Data Lakehouse
Big Data Options
Hot
● Cassandra*
● Datastax*
● Scylla*
● Yugabyte*
● Mongo
● REDIS
● …
Hot*
● Astra*
● Scylla Cloud*
● YugaByte Cloud*
● Azure CosmosDB*
● AWS Keyspaces*
● AWS Dynamo
● Google BigTable
● …
* PSSST. These all use CQL!!!
26
Cleaning Big Data: Same $h1t Different Day
Data Cleaning as part of Data Engineering
- Step 1: Remove duplicate or irrelevant
observations
- Step 2: Fix structural errors
- Step 3: Filter unwanted outliers
- Step 4: Handle missing data
- Step 5: Validate and QA
https://www.tableau.com/learn/articl
es/what-is-data-cleaning
Data Cleaning after the Fact
- Enforce a custom data retention policy (TTL)
- Enforce GDPR / Right to be Forgotten
- Move application, customer, user from one
system to another
- Remove x “versions” or “revisions” of data
- Remove test data from a stress test
27
Cleaning Big Data: In SQL ….
Data Cleaning in SQL
- Find what do you want to delete.
- Delete it.
28
Cleaning Big Data: In Spark SQL ….
Data Cleaning in SPARK SQL
- What do you want to delete.
- Delete it.
WARN: Doesn’t work with all data in
Spark SQL. Only if connector supports
Table Delete
https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-delete-from.html
https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/SupportsDelete.html
29
Cleaning Big Data: Cleaning data in Spark / SQL
Data Cleaning in Spark
for Cassandra
- What do you want
to delete.
- Delete it.
https://stackoverflow.com/questions/28563809/delete-from-cassandra-table-in-spark
30
Cleaning Big Data: Deduping in Spark SQL
Deduping Data in
Spark for Cassandra
- What do you want
to dedupe.
- Do some
deduping.
- What you want to
delete.
- Delete it.
31
Airflow DAG to Migrate
Cassandra Data
Airflow can help us take any data process, compiled, interpreted etc, coordinate the
steps as a DAG (“Directed Acyclic Graph”) and then to make it even more awesome,
parametrize it either via the Airflow GUI or our own table somewhere.
github/scylladb/scylla-migrator
32
Airflow DAG to Clean Cassandra Data
Since we write abstracted code, we can replace the “Migrator” process with a Delete,
Dedupe, Validate. Whatever.
Airflow allows us to reuse conventions we set in a team for large scale operations, and
most importantly… make it easy for people to run Data Operations like this without
being Cassandra , Spark, Python experts.
Demo
● https://github.com/Anant/example-cassandra-etl-with-airflow-and-spark
● Astra: Set up Astra Account / Database / Keyspace / Access
● Gitpod: Set up Airflow and Spark
● Airflow: Connect Airflow and Spark
● Trigger DAG with PySpark jobs
via Airflow UI
● Confirm data in Astra
Other Demos with Airflow
● https://github.com/anant?q=airflow
● Most have videos / blogs
○ See “Cassandra.Lunch” Repo
○ See anant.us/blog
● Airflow + Google Dataproc + Astra
● Airflow + DBT + Great Expectations
● Airflow + Cassandra + Presto
● Airflow + Cassandra
● Airflow + Spark
● Airflow + Amundsen + Cassandra
(DSE)
35
Considerations for Spark/Airflow Solution
Considerations for Airflow
- Figure out if you are going to manage it / run it.
- Figure out for whom you are going to run it
(Platform, Environment, Stack, App, Customer?)
- Not all DAGs just work. Sometimes they need
tweaking across Environments, Stacks, Apps,
Customers.
- The same DAG may fail over time. Need to watch
execution times.
- Who has access to it?
Considerations for Spark
- Figure out if you want to manage it / run it.
- Not all Spark code is created equally.
- Not all Spark languages run the same.
- Compiled Jobs with input parameters can work
better in the long run, less room for code drift.
- Don’t let people do adhoc delete operations until
and unless it’s absolutely necessary.
- Who has access to it?
36
Key Takeaways for Cassandra Data Operations
- You can look, but Apache Spark is basically
it. Look no further.
- Learn Spark, Python / Scala is fine. Just
start using Apache Spark.
- Airflow, Jenkins, Luigi, Prefect, any
scheduler can work, but Airflow has been
proven for this.
- Airflow works with more than just Apache
Cassandra, Apache Spark.. There are
numerous Connections and Operators.
Don’t reinvent the wheel.
Use Apache Spark
Use a Scheduler ( Apache
Airflow w/ Python )
37
Thank you and Dream Big.
Hire us
- Design Workshops
- Innovation Sprints
- Service Catalog
Anant.us
- Read our Playbook
- Join our Mailing List
- Read up on Data Platforms
- Watch our Videos
- Download Examples
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | #301 | Washington, DC 20037

Contenu connexe

Similaire à Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache Airflow

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache SparkMammoth Data
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdfMaheshPandit16
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformYao Yao
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Djamel Zouaoui
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAlluxio, Inc.
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in SparkDatabricks
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkIke Ellis
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!Edureka!
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development Spark Summit
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentationtestSri1
 
Managing data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudManaging data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudKaran Singh
 

Similaire à Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache Airflow (20)

Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
New Developments in Spark
New Developments in SparkNew Developments in Spark
New Developments in Spark
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
5 things one must know about spark!
5 things one must know about spark!5 things one must know about spark!
5 things one must know about spark!
 
Intro to Spark development
 Intro to Spark development  Intro to Spark development
Intro to Spark development
 
Rapids: Data Science on GPUs
Rapids: Data Science on GPUsRapids: Data Science on GPUs
Rapids: Data Science on GPUs
 
NVIDIA Rapids presentation
NVIDIA Rapids presentationNVIDIA Rapids presentation
NVIDIA Rapids presentation
 
Managing data analytics in a hybrid cloud
Managing data analytics in a hybrid cloudManaging data analytics in a hybrid cloud
Managing data analytics in a hybrid cloud
 

Plus de Anant Corporation

QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137Anant Corporation
 
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdfKono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdfAnant Corporation
 
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache PinotData Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache PinotAnant Corporation
 
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...Anant Corporation
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAnant Corporation
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapAnant Corporation
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowAnant Corporation
 
Cassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward TalksCassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward TalksAnant Corporation
 
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with ArcionData Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with ArcionAnant Corporation
 
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Anant Corporation
 
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & FutureCassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & FutureAnant Corporation
 
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...Anant Corporation
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackAnant Corporation
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsApache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsAnant Corporation
 
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraApache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraAnant Corporation
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessAnant Corporation
 
Data Engineer’s Lunch #67: Machine Learning - Feature Selection
Data Engineer’s Lunch #67: Machine Learning - Feature SelectionData Engineer’s Lunch #67: Machine Learning - Feature Selection
Data Engineer’s Lunch #67: Machine Learning - Feature SelectionAnant Corporation
 

Plus de Anant Corporation (20)

QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
QLoRA Fine-Tuning on Cassandra Link Data Set (1/2) Cassandra Lunch 137
 
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdfKono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
Kono.IntelCraft.Weekly.AI.LLM.Landscape.2024.02.28.pdf
 
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache PinotData Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
Data Engineer's Lunch 96: Intro to Real Time Analytics Using Apache Pinot
 
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
NoCode, Data & AI LLM Inside Bootcamp: Episode 6 - Design Patterns: Retrieval...
 
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPTAutomate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT
 
YugabyteDB Developer Tools
YugabyteDB Developer ToolsYugabyteDB Developer Tools
YugabyteDB Developer Tools
 
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer RoadmapEpisode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
Episode 2: The LLM / GPT / AI Prompt / Data Engineer Roadmap
 
Machine Learning Orchestration with Airflow
Machine Learning Orchestration with AirflowMachine Learning Orchestration with Airflow
Machine Learning Orchestration with Airflow
 
Cassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward TalksCassandra Lunch 130: Recap of Cassandra Forward Talks
Cassandra Lunch 130: Recap of Cassandra Forward Talks
 
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with ArcionData Engineer's Lunch 90: Migrating SQL Data with Arcion
Data Engineer's Lunch 90: Migrating SQL Data with Arcion
 
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
Data Engineer's Lunch 89: Machine Learning Orchestration with AirflowMachine ...
 
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & FutureCassandra Lunch 129: What’s New:  Apache Cassandra 4.1+ Features & Future
Cassandra Lunch 129: What’s New: Apache Cassandra 4.1+ Features & Future
 
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
Data Engineer's Lunch #86: Building Real-Time Applications at Scale: A Case S...
 
Data Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data StackData Engineer's Lunch #85: Designing a Modern Data Stack
Data Engineer's Lunch #85: Designing a Modern Data Stack
 
CL 121
CL 121CL 121
CL 121
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOpsApache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
Apache Cassandra Lunch 120: Apache Cassandra Monitoring Made Easy with AxonOps
 
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache CassandraApache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
Apache Cassandra Lunch 119: Desktop GUI Tools for Apache Cassandra
 
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise ConsciousnessData Engineer's Lunch #60: Series - Developing Enterprise Consciousness
Data Engineer's Lunch #60: Series - Developing Enterprise Consciousness
 
Data Engineer’s Lunch #67: Machine Learning - Feature Selection
Data Engineer’s Lunch #67: Machine Learning - Feature SelectionData Engineer’s Lunch #67: Machine Learning - Feature Selection
Data Engineer’s Lunch #67: Machine Learning - Feature Selection
 

Dernier

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 

Dernier (20)

08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 

Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache Airflow

  • 1. Automating Apache Cassandra Operations with Apache Airflow Go beyond cron jobs to manage ETL, Data Hygiene, Data Import/Export Rahul Xavier Singh Anant Corporation | Cassandra.Link Data Engineer’s Lunch 11/14/2022
  • 2. Playbook Design Framework Airflow Spark Approach Airflow/Spark Cassandra ETL Spark in Airflow Bonus: Deleting Data in Cassandra at Scale in Airflow Code/Demos Cassandra Operations with Google Dataproc / Spark in Airflow (Astra) SQL Queries with Presto and Cassandra in Airflow Airflow and Spark Agenda
  • 3. We help platform owners reach beyond their potential to serve a global customer base that demands Everything, Now.
  • 4. We design with our Playbook, build with our Framework, and manage platforms with our Approach so our clients Think & Grow Big.
  • 6. Challenge Business Platform Playbook Framework Approach Technology Management Solutions [Data] Services Catalog Fully Managed Service Subscriptions We offer Professional Services to engineer Solutions and offer Managed Services to clients where it makes sense, after an Assessment
  • 7. 7 Business / Platform Dream Enterprise Consciousness : - People - Processes, - Information - Systems Connected / Synchronized. Business has been chasing this dream for a while. As technologies improve, this becomes more accessible. Image Source: Digital Business Technology Platforms, Gartner 2016
  • 9. 9 Thinking about Cassandra as a Data Fabric XDCR: Cross datacenter replication is the ultimate data fabric. Resilience, performance, availability, and scale. Made widely available by Cassandra and Couchbase
  • 11. Distributed Realtime Components To create globally distributed and real time platforms, we need to use distributed realtime technologies to build your platform. Here are some. Which ones should you choose?
  • 12. 12 How do you choose from the landscape? Lots and lots of components in the Data & AI Landscape. Which ones are the right ones for your business?
  • 13. 13 So Many Different “Modern Stacks?” Lots of “reference” architectures available. They tend not to think about the speed layer since they are focusing on analytics. Many don’t mention realtime databases… but we can learn from them.
  • 15. Framework Components ● Major Components ○ Persistent Queues ( RAM/BUS) ○ Queue Processing & Compute ( CPU) ○ Persistent Storage (DISK/RAM) ○ Reporting Engine (Display) ○ Orchestration Framework (Motherboard) ○ Scheduler (Operating System) ● Strategies ○ Cloud Native on Google ○ Self-Managed Open Source ○ Self-Managed Commercial Source ○ Managed Commercial Source Customers want options, so we decided to create a Framework that can scale with whatever Infrastructure and Software strategy they want to use.
  • 16. 16 Playbook for Modern Open Data Platform Platform Design Evaluate Framework Cloud - Public - Private - Hybrid Data - Data:Object - Data:Stream - Data:Table - Data:Index - Processor:Batch - Processor:Stream DataOps - ETL/ELT/EtLT - Reverse ETL - Orchestration DevOps - Infrastructure as Code - Systems Automation - Application CICD Architecture (Design) - Cloud - Data - DevOps - DataOps Engineering - Configuration - Scripting - Programming Operation - Setup / Deploy - Monitoring/Alerts - Administration User Experience - No-Code/Low Code Apps/Form Builders - Automatic API Generator/Platform - Customer App/API Framework Execute Approach Discovery (Inventory) - People - Process - Information (Objects) - Systems (Apps)
  • 18. Data Modernization / Automation / Integration In addition to vastly scalable tools, there are also modern innovations that can help teams automate and maximize human capital by making data platform management easier.
  • 21. Apache Airflow + Apache Spark + Spark Python/Scala/Java/R + Airflow Python DAG = DataOps for Apache Cassandra Good enough for rock and roll.
  • 22. ● Scheduling and automating workflows and tasks ● Automating repeated processes ○ Common ETL tasks ○ Machine learning model training ○ Data hygiene ○ Delta migrations ● Write workflows in Python ○ Anything Python compatible works ○ Dependencies for workflow sections ○ Workflows are a DAG of tasks ● Recurring, One-time Scheduled or Adhoc ○ Cron-like syntax or frequency tags ○ “Only run again if data changed” ● Monitor tasks and collect/view logs Apache Airflow
  • 23. Apache Spark ● Unified analytics engine ● High performance batch and streaming data ● Also has a DAG, scheduler, a query optimizer, and a physical execution engine. ● Offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python, R, and SQL shells. C# also available. ● Powers a stack of libraries including SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming. ● You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, on Mesos, or on Kubernetes. Access data in basically anything.
  • 25. 25 Coldish ● S3 ● HDFS ● ADLS ● GFS Warm ● Hive / * ● Data Warehouse ● Data Lakehouse Big Data Options Hot ● Cassandra* ● Datastax* ● Scylla* ● Yugabyte* ● Mongo ● REDIS ● … Hot* ● Astra* ● Scylla Cloud* ● YugaByte Cloud* ● Azure CosmosDB* ● AWS Keyspaces* ● AWS Dynamo ● Google BigTable ● … * PSSST. These all use CQL!!!
  • 26. 26 Cleaning Big Data: Same $h1t Different Day Data Cleaning as part of Data Engineering - Step 1: Remove duplicate or irrelevant observations - Step 2: Fix structural errors - Step 3: Filter unwanted outliers - Step 4: Handle missing data - Step 5: Validate and QA https://www.tableau.com/learn/articl es/what-is-data-cleaning Data Cleaning after the Fact - Enforce a custom data retention policy (TTL) - Enforce GDPR / Right to be Forgotten - Move application, customer, user from one system to another - Remove x “versions” or “revisions” of data - Remove test data from a stress test
  • 27. 27 Cleaning Big Data: In SQL …. Data Cleaning in SQL - Find what do you want to delete. - Delete it.
  • 28. 28 Cleaning Big Data: In Spark SQL …. Data Cleaning in SPARK SQL - What do you want to delete. - Delete it. WARN: Doesn’t work with all data in Spark SQL. Only if connector supports Table Delete https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-delete-from.html https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/connector/catalog/SupportsDelete.html
  • 29. 29 Cleaning Big Data: Cleaning data in Spark / SQL Data Cleaning in Spark for Cassandra - What do you want to delete. - Delete it. https://stackoverflow.com/questions/28563809/delete-from-cassandra-table-in-spark
  • 30. 30 Cleaning Big Data: Deduping in Spark SQL Deduping Data in Spark for Cassandra - What do you want to dedupe. - Do some deduping. - What you want to delete. - Delete it.
  • 31. 31 Airflow DAG to Migrate Cassandra Data Airflow can help us take any data process, compiled, interpreted etc, coordinate the steps as a DAG (“Directed Acyclic Graph”) and then to make it even more awesome, parametrize it either via the Airflow GUI or our own table somewhere. github/scylladb/scylla-migrator
  • 32. 32 Airflow DAG to Clean Cassandra Data Since we write abstracted code, we can replace the “Migrator” process with a Delete, Dedupe, Validate. Whatever. Airflow allows us to reuse conventions we set in a team for large scale operations, and most importantly… make it easy for people to run Data Operations like this without being Cassandra , Spark, Python experts.
  • 33. Demo ● https://github.com/Anant/example-cassandra-etl-with-airflow-and-spark ● Astra: Set up Astra Account / Database / Keyspace / Access ● Gitpod: Set up Airflow and Spark ● Airflow: Connect Airflow and Spark ● Trigger DAG with PySpark jobs via Airflow UI ● Confirm data in Astra
  • 34. Other Demos with Airflow ● https://github.com/anant?q=airflow ● Most have videos / blogs ○ See “Cassandra.Lunch” Repo ○ See anant.us/blog ● Airflow + Google Dataproc + Astra ● Airflow + DBT + Great Expectations ● Airflow + Cassandra + Presto ● Airflow + Cassandra ● Airflow + Spark ● Airflow + Amundsen + Cassandra (DSE)
  • 35. 35 Considerations for Spark/Airflow Solution Considerations for Airflow - Figure out if you are going to manage it / run it. - Figure out for whom you are going to run it (Platform, Environment, Stack, App, Customer?) - Not all DAGs just work. Sometimes they need tweaking across Environments, Stacks, Apps, Customers. - The same DAG may fail over time. Need to watch execution times. - Who has access to it? Considerations for Spark - Figure out if you want to manage it / run it. - Not all Spark code is created equally. - Not all Spark languages run the same. - Compiled Jobs with input parameters can work better in the long run, less room for code drift. - Don’t let people do adhoc delete operations until and unless it’s absolutely necessary. - Who has access to it?
  • 36. 36 Key Takeaways for Cassandra Data Operations - You can look, but Apache Spark is basically it. Look no further. - Learn Spark, Python / Scala is fine. Just start using Apache Spark. - Airflow, Jenkins, Luigi, Prefect, any scheduler can work, but Airflow has been proven for this. - Airflow works with more than just Apache Cassandra, Apache Spark.. There are numerous Connections and Operators. Don’t reinvent the wheel. Use Apache Spark Use a Scheduler ( Apache Airflow w/ Python )
  • 37. 37 Thank you and Dream Big. Hire us - Design Workshops - Innovation Sprints - Service Catalog Anant.us - Read our Playbook - Join our Mailing List - Read up on Data Platforms - Watch our Videos - Download Examples www.anant.us | solutions@anant.us | (855) 262-6826 3 Washington Circle, NW | #301 | Washington, DC 20037

Notes de l'éditeur

  1. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  2. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  3. Challenge Currently the components are broken up in to different vendors and parts. Similar to building a computer every time for every client.
  4. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  5. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  6. Challenge Currently the components are broken up in to different vendors and parts. Similar to building a computer every time for every client.
  7. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  8. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  9. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  10. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  11. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  12. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  13. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  14. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  15. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  16. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  17. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.
  18. What makes a good story? Once you get good at it, presenting becomes easy. Shared stories with people we’ve bonded with (community for example). This format is not good for Metastories.