SlideShare a Scribd company logo
1 of 61
Download to read offline
Full Stack Deep Learning
Josh Tobin, Sergey Karayev, Pieter Abbeel
Data Management
Full Stack Deep Learning
“All-in-one”
Deployment
or or
Development Training/Evaluation
or
Data
or
Experiment Management
Labeling
Interchange
Storage
Monitoring
?
Hardware /
Mobile
Software Engineering
Frameworks
Database
Hyperparameter TuningDistributed Training
Web
Versioning
Processing
CI / Testing
?
Resource Management
Data Management - overview
Full Stack Deep Learning 3
https://veekaybee.github.io/2019/02/13/data-science-is-different/
Data Management - overview
Full Stack Deep Learning
In this module
4
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
In this module
5
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
• Most DL applications require lots of labeled data

• Exceptions: RL, GANs, "semi-supervised" learning

• Publicly available datasets = No competitive advantage

• But can serve as starting point
6
Sources
Data Management - sources
Full Stack Deep Learning
Usually: spend $$$ and time to label own data
7
https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg
Data Management - sources
Full Stack Deep Learning 8Data Management - sources
Data flywheel
Enables rapid improvement with user labels
Full Stack Deep Learning
9
Data Management - sources
https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html
Semi-supervised learning
Fig. 4. Illustration of self-supervised learning by predicting the relative position of two random patches. (Image source:
Doersch et al., 2015)
Fig. 1. A great summary of how self-supervised learning tasks can be constructed (Image source: LeCun’s talk)
Use parts of data to label other parts
Full Stack Deep Learning
Image data augmentation
• Must do for training vision models

• Keras, fast.ai provide functions that do this

• Hook into your Dataset class to do in CPU
workers in parallel to GPU training
10
https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c
Data Management - sources
Full Stack Deep Learning
Other data augmentation
• Tabular

• Delete some cells to simulate missing
data

• Text

• No well established techniques, but
replace words with synonyms, change
order of things.

• Speech/video

• Change speed, inserts pauses, etc
11
https://github.com/makcedward/nlpaug
Data Management - sources
Full Stack Deep Learning 12
https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/
Data Management - sources
Synthetic data
Underrated idea that is almost
always worth starting with
Full Stack Deep Learning 13
This can get pretty deep!
Andrew Moffat - https://github.com/amoffat/metabrite-receipt-tests
Data Management - sources
Full Stack Deep Learning 14
https://microsoft.github.io/AirSim/
Data Management - sources
https://openai.com/blog/ingredients-for-robotics-research/
Especially for driving and robotics
Full Stack Deep Learning
Questions?
15
Full Stack Deep Learning
In this module
16
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
Data Labeling
1. User Interfaces

2. Sources of labor

3. Service companies
17Data Management - labeling
Full Stack Deep Learning 18
Standard set of features:

- bounding boxes,
segmentations,
keypoints, cuboids

- set of applicable
classes
Data Management - labeling
Full Stack Deep Learning 19
Training the annotators is crucial
Quality assurance is key
Data Management - labeling
Full Stack Deep Learning 20
• Hire own annotators, promote best ones to quality control

• Pros: secure, fast (once hired), less QC needed

• Cons: expensive, slow to scale, admin overhead

• ...or, crowdsource (Mechanical Turk)
• Pros: cheaper, more scalable

• Cons: not secure, significant QC effort required

• ...or, full-service data labeling companies
Sources of Labor
Data Management - labeling
Full Stack Deep Learning
Service Companies
21
• Data labeling requires separate software stack, temporary labor, and
quality assurance. Makes sense to outsource.
• Dedicate several days to selecting the best one for you:

• Label gold standard data yourself

• Sales calls with several contenders, ask for work sample on same data

• Ensure agreement with your gold standard, and evaluate on value
Data Management - labeling
Full Stack Deep Learning 22
FigureEight is the original AI data labeling company
Data Management - labeling
Full Stack Deep Learning 23
Scale.ai is a dominant up-and-comer
Data Management - labeling
Full Stack Deep Learning 24
And there are a ton of others
Data Management - labeling
Full Stack Deep Learning 25
And there are a ton of others
Data Management - labeling
Full Stack Deep Learning
Software
26
• Full-service data labeling is always pricy

• But some companies offer their software without labor
Data Management - labeling
Full Stack Deep Learning 27
Prodigy
Data Management - labeling
Full Stack Deep Learning 28
Prodigy
Data Management - labeling
Full Stack Deep Learning 29
• Conclusions

• outsource to full-service company if you can afford it

• if not, then at least use existing software

• hiring part-time makes more sense than trying to make crowdsourcing
work
Data Management - labeling
Full Stack Deep Learning
Questions?
30
Full Stack Deep Learning
In this module
31
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage
Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
Data Storage
1. Building blocks

- Filesystem

- Object Storage

- Database

- "Data Lake"

2. What goes where

3. Where to learn more
32Data Management - storage
Full Stack Deep Learning
Filesystem
• Foundational layer of storage.

• Fundamental unit is a "file", which can be text or binary, is not versioned,
and is easily overwritten.

• Can be as simple as a locally mounted disk containing all the files you need.

• Can be networked (e.g. NFS): accessible over network by multiple machines.

• Can be distributed (e.g. HDFS): stored and accessed over multiple
machines.

• Be mindful of access patterns! Can be fast, but not parallel.
33Data Management - storage
Full Stack Deep Learning
Object Storage
• An API over the filesystem. GET, PUT, DELETE files to a service, without
worrying where they are actually stored.

• Fundamental unit is an "object". Usually binary: image, sound file, etc.

• Versioning, redundancy can be built into the service.

• Can be parallel, but not fast.

• Amazon S3 is the canonical example.

• For on-prem setup, Ceph is a good solution.
34Data Management - storage
Full Stack Deep Learning
Database
• Persistent, fast, scalable storage and retrieval of structured data.

• Mental model: everything is actually in RAM, but software ensures that
everything is logged to disk and never lost.

• Fundamental unit is a "row": unique id, references to other rows, values in
columns.

• Not for binary data! Store references instead.

• Usually for data that will be accessed again (not logs).

• Postgres is the right choice >90% of the time. Best-in-class SQL, great support
for unstructured JSON, actively developed.
35Data Management - storage
Full Stack Deep Learning
SQL
• SQL is the right interface for
structured data.

• If you avoid using it, you will
eventually reinvent a slow
and crappy subset of it.
36Data Management - storage
Full Stack Deep Learning
"Data Lake"
• Unstructured aggregation of data from multiple sources, e.g. databases,
logs, expensive data transformations.

• "Schema on read": dump everything in, then transform for specific
needs later.

37
https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02
Data Management - storage
Full Stack Deep Learning
What goes where
• Binary data (images, sound files, compressed texts) is stored as
objects.

• Metadata (labels, user activity) is stored in database.

• If need features which are not obtainable from database (logs), set up
data lake and a process to aggregate needed data.

• At training time, copy the data that is needed onto filesystem (local or
networked).
38Data Management - storage
Full Stack Deep Learning
Where to learn more
39
https://dataintensive.net
Data Management - storage
Full Stack Deep Learning
Questions?
40
Full Stack Deep Learning
In this module
41
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning
Dataset is updated through user activity or additional labeling, affects model

5. Processing

Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
Data Versioning
Level 0: unversioned

Level 1: versioned via snapshot at training time

Level 2: versioned as a mix of assets and code

Level 3: specialized data versioning solution
42Data Management - versioning
Full Stack Deep Learning
Level 0
• Data lives are on filesystem/S3 and database

• Problem: Deployments must be versioned. Deployed machine learning
models are part code, part data. If data is not versioned, deployed
models are not versioned.

• Problem you will face: inability to get back to a previous level of
performance
43Data Management - versioning
Full Stack Deep Learning
Level 1
• Data is versioned by storing a snapshot of everything at training time

• This allows you to version deployed models, and to get back to past
performance, but is super hacky.

• Would be far better to be able to version data just as easily as code.
44Data Management - versioning
Full Stack Deep Learning
Level 2
• Data is versioned as a mix of assets and code.

• Heavy files stored in S3, with unique ids. Training data is stored as JSON or
similar, referring to these ids and include relevant metadata (labels, user activity,
etc).

• JSON files can get big, but using git-lfs lets us store them just as easily as code

• Can improve further with "lazydata": only syncing files that are needed.

• The git signature + of the raw data file defines the version of the dataset

• Often helpful to add timestamp
45Data Management - versioning
Full Stack Deep Learning
Level 3
• Specialized solutions for versioning data.

• Avoid these until you can fully explain how they will improve your project.

• Leading solutions are DVC, Pachyderm, Quill.
46Data Management - versioning
Full Stack Deep Learning
DVC
47
1
2 3
4
Data Management - versioning
Full Stack Deep Learning
Pachyderm
48Data Management - versioning
Full Stack Deep Learning
Dolt
49
A nice simple solution for
versioning databases,
that speaks SQL.
Data Management - versioning
Full Stack Deep Learning
Dolt
50Data Management - versioning
Full Stack Deep Learning
Questions?
51
Full Stack Deep Learning
In this module
52
1. Sources

Where training data comes from

2. Labeling

How to label proprietary data at scale

3. Storage

Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately

4. Versioning

Dataset is updated through user activity or additional labeling, affects model

5. Processing
Aggregating and converting raw data and metadata
Data Management - overview
Full Stack Deep Learning
Motivational Example
•We have to train a photo popularity predictor every night.

• For each photo, training data must include:

• Metadata such as posting time, title, location

• Some features of the user, such as how many times they
logged in today.

• Outputs of photo classifiers (content, style)
53
In database
Need to compute
from logs
Need to run classifiers
Data Management - processing
Full Stack Deep Learning 54
Task Dependencies
• Some tasks can't be
started until other tasks
are finished.

• Finishing a task should
"kick off" its dependencies
Data Management - processing
Full Stack Deep Learning
• Run everything from scratch. If things exist, no need to recompute.
55
https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418
Data Workflows
Data Management - processing
Full Stack Deep Learning
Makefile limitations
• What if re-computation needs to depend on content, not on date?

• What if the dependencies are not files, but disparate programs and
databases?

• What if the work needs to be spread over multiple machines?

• What if there are 100s of "Makefiles" all executing at the same time, with
shared dependencies?
56Data Management - processing
Full Stack Deep Learning
Airflow
57
https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418
Data Management - processing
Full Stack Deep Learning
Distributing work
• The workflow manager has a queue for the tasks, and manages workers
that pull from it, restarting jobs if they fail.
58
http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/
Data Management - processing
Full Stack Deep Learning
Try to keep things simple
• Don't overengineer

• e.g. unix: powerful parallelism,
streaming, highly optimized
59
https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
26 minutes
70 seconds
18 seconds
in parallel
in parallel
Data Management - processing
Full Stack Deep Learning
Questions?
60
Full Stack Deep Learning
Thank you!
61

More Related Content

What's hot

MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflowDatabricks
 
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...Kishor Datta Gupta
 
Comparing Incremental Learning Strategies for Convolutional Neural Networks
Comparing Incremental Learning Strategies for Convolutional Neural NetworksComparing Incremental Learning Strategies for Convolutional Neural Networks
Comparing Incremental Learning Strategies for Convolutional Neural NetworksVincenzo Lomonaco
 
Transfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningTransfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningSebastian Ruder
 
Python and SQL Server
Python and SQL ServerPython and SQL Server
Python and SQL ServerJulie Smith
 
[딥논읽] Meta-Transfer Learning for Zero-Shot Super-Resolution paper review
[딥논읽] Meta-Transfer Learning for Zero-Shot Super-Resolution paper review[딥논읽] Meta-Transfer Learning for Zero-Shot Super-Resolution paper review
[딥논읽] Meta-Transfer Learning for Zero-Shot Super-Resolution paper reviewtaeseon ryu
 
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflowCharmi Chokshi
 
Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Hady Elsahar
 
Distributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowDistributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowJan Wiegelmann
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Databricks
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataModel serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataGetInData
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMsSylvainGugger
 
An Introduction to Neural Architecture Search
An Introduction to Neural Architecture SearchAn Introduction to Neural Architecture Search
An Introduction to Neural Architecture SearchBill Liu
 

What's hot (20)

MLOps Using MLflow
MLOps Using MLflowMLOps Using MLflow
MLOps Using MLflow
 
Siamese networks
Siamese networksSiamese networks
Siamese networks
 
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
Robust Filtering Schemes for Machine Learning Systems to Defend Adversarial A...
 
What is MLOps
What is MLOpsWhat is MLOps
What is MLOps
 
Comparing Incremental Learning Strategies for Convolutional Neural Networks
Comparing Incremental Learning Strategies for Convolutional Neural NetworksComparing Incremental Learning Strategies for Convolutional Neural Networks
Comparing Incremental Learning Strategies for Convolutional Neural Networks
 
Transfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine LearningTransfer Learning -- The Next Frontier for Machine Learning
Transfer Learning -- The Next Frontier for Machine Learning
 
Python and SQL Server
Python and SQL ServerPython and SQL Server
Python and SQL Server
 
[딥논읽] Meta-Transfer Learning for Zero-Shot Super-Resolution paper review
[딥논읽] Meta-Transfer Learning for Zero-Shot Super-Resolution paper review[딥논읽] Meta-Transfer Learning for Zero-Shot Super-Resolution paper review
[딥논읽] Meta-Transfer Learning for Zero-Shot Super-Resolution paper review
 
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
Lecture 10: ML Testing & Explainability (Full Stack Deep Learning - Spring 2021)
 
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
Lecture 6: Infrastructure & Tooling (Full Stack Deep Learning - Spring 2021)
 
MLOps in action
MLOps in actionMLOps in action
MLOps in action
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflow
 
Neural Language Generation Head to Toe
Neural Language Generation Head to Toe Neural Language Generation Head to Toe
Neural Language Generation Head to Toe
 
Distributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlowDistributed Deep Learning with Hadoop and TensorFlow
Distributed Deep Learning with Hadoop and TensorFlow
 
Intro to LLMs
Intro to LLMsIntro to LLMs
Intro to LLMs
 
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
Building a Pipeline for State-of-the-Art Natural Language Processing Using Hu...
 
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
Lecture 4: Transformers (Full Stack Deep Learning - Spring 2021)
 
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInDataModel serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
Model serving made easy using Kedro pipelines - Mariusz Strzelecki, GetInData
 
Fine tuning large LMs
Fine tuning large LMsFine tuning large LMs
Fine tuning large LMs
 
An Introduction to Neural Architecture Search
An Introduction to Neural Architecture SearchAn Introduction to Neural Architecture Search
An Introduction to Neural Architecture Search
 

Similar to Data Management Overview

ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...DATAVERSITY
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And HadoopEdureka!
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...DATAVERSITY
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera, Inc.
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game ChangerCaserta
 
Geek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsGeek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsIDERA Software
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarKognitio
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMichael Hiskey
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeeling Cheung
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurgeRTTS
 
ADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsDATAVERSITY
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesDATAVERSITY
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.pptRutujaPatil247341
 

Similar to Data Management Overview (20)

ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
ADV Slides: Platforming Your Data for Success – Databases, Hadoop, Managed Ha...
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
Understanding Big Data And Hadoop
Understanding Big Data And HadoopUnderstanding Big Data And Hadoop
Understanding Big Data And Hadoop
 
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
ADV Slides: The Evolution of the Data Platform and What It Means to Enterpris...
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer5 Things that Make Hadoop a Game Changer
5 Things that Make Hadoop a Game Changer
 
Geek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure EnvironmentsGeek Sync | Deployment and Management of Complex Azure Environments
Geek Sync | Deployment and Management of Complex Azure Environments
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
GDSC Cloud Jam.pptx
GDSC Cloud Jam.pptxGDSC Cloud Jam.pptx
GDSC Cloud Jam.pptx
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data TorrentSeagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent
 
Testing Big Data: Automated Testing of Hadoop with QuerySurge
Testing Big Data: Automated  Testing of Hadoop with QuerySurgeTesting Big Data: Automated  Testing of Hadoop with QuerySurge
Testing Big Data: Automated Testing of Hadoop with QuerySurge
 
ADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic SolutionsADV Slides: Comparing the Enterprise Analytic Solutions
ADV Slides: Comparing the Enterprise Analytic Solutions
 
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data LakesADV Slides: Building and Growing Organizational Analytics with Data Lakes
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
 
unit 1 big data.pptx
unit 1 big data.pptxunit 1 big data.pptx
unit 1 big data.pptx
 
data analytics lecture 3.2.ppt
data analytics lecture 3.2.pptdata analytics lecture 3.2.ppt
data analytics lecture 3.2.ppt
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 

More from Sergey Karayev

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Sergey Karayev
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Sergey Karayev
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Sergey Karayev
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Sergey Karayev
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningSergey Karayev
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningSergey Karayev
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningSergey Karayev
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019Sergey Karayev
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Sergey Karayev
 

More from Sergey Karayev (16)

Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
Lecture 13: ML Teams (Full Stack Deep Learning - Spring 2021)
 
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
Lecture 12: Research Directions (Full Stack Deep Learning - Spring 2021)
 
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
Lecture 9: AI Ethics (Full Stack Deep Learning - Spring 2021)
 
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
Lecture 8: Data Management (Full Stack Deep Learning - Spring 2021)
 
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
Lecture 7: Troubleshooting Deep Neural Networks (Full Stack Deep Learning - S...
 
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
Lecture 5: ML Projects (Full Stack Deep Learning - Spring 2021)
 
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
Lecture 3: RNNs - Full Stack Deep Learning - Spring 2021
 
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
Lecture 2.B: Computer Vision Applications - Full Stack Deep Learning - Spring...
 
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
Lecture 2.A: Convolutional Networks - Full Stack Deep Learning - Spring 2021
 
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
Lab 1: Intro and Setup - Full Stack Deep Learning - Spring 2021
 
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
Lecture 1: Deep Learning Fundamentals - Full Stack Deep Learning - Spring 2021
 
Machine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep LearningMachine Learning Teams - Full Stack Deep Learning
Machine Learning Teams - Full Stack Deep Learning
 
Research Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep LearningResearch Directions - Full Stack Deep Learning
Research Directions - Full Stack Deep Learning
 
Infrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep LearningInfrastructure and Tooling - Full Stack Deep Learning
Infrastructure and Tooling - Full Stack Deep Learning
 
AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019AI Masterclass at ASU GSV 2019
AI Masterclass at ASU GSV 2019
 
Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.Attentional Object Detection - introductory slides.
Attentional Object Detection - introductory slides.
 

Recently uploaded

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 

Recently uploaded (20)

DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

Data Management Overview

  • 1. Full Stack Deep Learning Josh Tobin, Sergey Karayev, Pieter Abbeel Data Management
  • 2. Full Stack Deep Learning “All-in-one” Deployment or or Development Training/Evaluation or Data or Experiment Management Labeling Interchange Storage Monitoring ? Hardware / Mobile Software Engineering Frameworks Database Hyperparameter TuningDistributed Training Web Versioning Processing CI / Testing ? Resource Management Data Management - overview
  • 3. Full Stack Deep Learning 3 https://veekaybee.github.io/2019/02/13/data-science-is-different/ Data Management - overview
  • 4. Full Stack Deep Learning In this module 4 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 5. Full Stack Deep Learning In this module 5 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 6. Full Stack Deep Learning • Most DL applications require lots of labeled data • Exceptions: RL, GANs, "semi-supervised" learning • Publicly available datasets = No competitive advantage • But can serve as starting point 6 Sources Data Management - sources
  • 7. Full Stack Deep Learning Usually: spend $$$ and time to label own data 7 https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg Data Management - sources
  • 8. Full Stack Deep Learning 8Data Management - sources Data flywheel Enables rapid improvement with user labels
  • 9. Full Stack Deep Learning 9 Data Management - sources https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html Semi-supervised learning Fig. 4. Illustration of self-supervised learning by predicting the relative position of two random patches. (Image source: Doersch et al., 2015) Fig. 1. A great summary of how self-supervised learning tasks can be constructed (Image source: LeCun’s talk) Use parts of data to label other parts
  • 10. Full Stack Deep Learning Image data augmentation • Must do for training vision models • Keras, fast.ai provide functions that do this • Hook into your Dataset class to do in CPU workers in parallel to GPU training 10 https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c Data Management - sources
  • 11. Full Stack Deep Learning Other data augmentation • Tabular • Delete some cells to simulate missing data • Text • No well established techniques, but replace words with synonyms, change order of things. • Speech/video • Change speed, inserts pauses, etc 11 https://github.com/makcedward/nlpaug Data Management - sources
  • 12. Full Stack Deep Learning 12 https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/ Data Management - sources Synthetic data Underrated idea that is almost always worth starting with
  • 13. Full Stack Deep Learning 13 This can get pretty deep! Andrew Moffat - https://github.com/amoffat/metabrite-receipt-tests Data Management - sources
  • 14. Full Stack Deep Learning 14 https://microsoft.github.io/AirSim/ Data Management - sources https://openai.com/blog/ingredients-for-robotics-research/ Especially for driving and robotics
  • 15. Full Stack Deep Learning Questions? 15
  • 16. Full Stack Deep Learning In this module 16 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 17. Full Stack Deep Learning Data Labeling 1. User Interfaces 2. Sources of labor 3. Service companies 17Data Management - labeling
  • 18. Full Stack Deep Learning 18 Standard set of features: - bounding boxes, segmentations, keypoints, cuboids - set of applicable classes Data Management - labeling
  • 19. Full Stack Deep Learning 19 Training the annotators is crucial Quality assurance is key Data Management - labeling
  • 20. Full Stack Deep Learning 20 • Hire own annotators, promote best ones to quality control • Pros: secure, fast (once hired), less QC needed • Cons: expensive, slow to scale, admin overhead • ...or, crowdsource (Mechanical Turk) • Pros: cheaper, more scalable • Cons: not secure, significant QC effort required • ...or, full-service data labeling companies Sources of Labor Data Management - labeling
  • 21. Full Stack Deep Learning Service Companies 21 • Data labeling requires separate software stack, temporary labor, and quality assurance. Makes sense to outsource. • Dedicate several days to selecting the best one for you: • Label gold standard data yourself • Sales calls with several contenders, ask for work sample on same data • Ensure agreement with your gold standard, and evaluate on value Data Management - labeling
  • 22. Full Stack Deep Learning 22 FigureEight is the original AI data labeling company Data Management - labeling
  • 23. Full Stack Deep Learning 23 Scale.ai is a dominant up-and-comer Data Management - labeling
  • 24. Full Stack Deep Learning 24 And there are a ton of others Data Management - labeling
  • 25. Full Stack Deep Learning 25 And there are a ton of others Data Management - labeling
  • 26. Full Stack Deep Learning Software 26 • Full-service data labeling is always pricy • But some companies offer their software without labor Data Management - labeling
  • 27. Full Stack Deep Learning 27 Prodigy Data Management - labeling
  • 28. Full Stack Deep Learning 28 Prodigy Data Management - labeling
  • 29. Full Stack Deep Learning 29 • Conclusions • outsource to full-service company if you can afford it • if not, then at least use existing software • hiring part-time makes more sense than trying to make crowdsourcing work Data Management - labeling
  • 30. Full Stack Deep Learning Questions? 30
  • 31. Full Stack Deep Learning In this module 31 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 32. Full Stack Deep Learning Data Storage 1. Building blocks - Filesystem - Object Storage - Database - "Data Lake" 2. What goes where 3. Where to learn more 32Data Management - storage
  • 33. Full Stack Deep Learning Filesystem • Foundational layer of storage. • Fundamental unit is a "file", which can be text or binary, is not versioned, and is easily overwritten. • Can be as simple as a locally mounted disk containing all the files you need. • Can be networked (e.g. NFS): accessible over network by multiple machines. • Can be distributed (e.g. HDFS): stored and accessed over multiple machines. • Be mindful of access patterns! Can be fast, but not parallel. 33Data Management - storage
  • 34. Full Stack Deep Learning Object Storage • An API over the filesystem. GET, PUT, DELETE files to a service, without worrying where they are actually stored. • Fundamental unit is an "object". Usually binary: image, sound file, etc. • Versioning, redundancy can be built into the service. • Can be parallel, but not fast. • Amazon S3 is the canonical example. • For on-prem setup, Ceph is a good solution. 34Data Management - storage
  • 35. Full Stack Deep Learning Database • Persistent, fast, scalable storage and retrieval of structured data. • Mental model: everything is actually in RAM, but software ensures that everything is logged to disk and never lost. • Fundamental unit is a "row": unique id, references to other rows, values in columns. • Not for binary data! Store references instead. • Usually for data that will be accessed again (not logs). • Postgres is the right choice >90% of the time. Best-in-class SQL, great support for unstructured JSON, actively developed. 35Data Management - storage
  • 36. Full Stack Deep Learning SQL • SQL is the right interface for structured data. • If you avoid using it, you will eventually reinvent a slow and crappy subset of it. 36Data Management - storage
  • 37. Full Stack Deep Learning "Data Lake" • Unstructured aggregation of data from multiple sources, e.g. databases, logs, expensive data transformations. • "Schema on read": dump everything in, then transform for specific needs later. 37 https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02 Data Management - storage
  • 38. Full Stack Deep Learning What goes where • Binary data (images, sound files, compressed texts) is stored as objects. • Metadata (labels, user activity) is stored in database. • If need features which are not obtainable from database (logs), set up data lake and a process to aggregate needed data. • At training time, copy the data that is needed onto filesystem (local or networked). 38Data Management - storage
  • 39. Full Stack Deep Learning Where to learn more 39 https://dataintensive.net Data Management - storage
  • 40. Full Stack Deep Learning Questions? 40
  • 41. Full Stack Deep Learning In this module 41 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 42. Full Stack Deep Learning Data Versioning Level 0: unversioned Level 1: versioned via snapshot at training time Level 2: versioned as a mix of assets and code Level 3: specialized data versioning solution 42Data Management - versioning
  • 43. Full Stack Deep Learning Level 0 • Data lives are on filesystem/S3 and database • Problem: Deployments must be versioned. Deployed machine learning models are part code, part data. If data is not versioned, deployed models are not versioned. • Problem you will face: inability to get back to a previous level of performance 43Data Management - versioning
  • 44. Full Stack Deep Learning Level 1 • Data is versioned by storing a snapshot of everything at training time • This allows you to version deployed models, and to get back to past performance, but is super hacky. • Would be far better to be able to version data just as easily as code. 44Data Management - versioning
  • 45. Full Stack Deep Learning Level 2 • Data is versioned as a mix of assets and code. • Heavy files stored in S3, with unique ids. Training data is stored as JSON or similar, referring to these ids and include relevant metadata (labels, user activity, etc). • JSON files can get big, but using git-lfs lets us store them just as easily as code • Can improve further with "lazydata": only syncing files that are needed. • The git signature + of the raw data file defines the version of the dataset • Often helpful to add timestamp 45Data Management - versioning
  • 46. Full Stack Deep Learning Level 3 • Specialized solutions for versioning data. • Avoid these until you can fully explain how they will improve your project. • Leading solutions are DVC, Pachyderm, Quill. 46Data Management - versioning
  • 47. Full Stack Deep Learning DVC 47 1 2 3 4 Data Management - versioning
  • 48. Full Stack Deep Learning Pachyderm 48Data Management - versioning
  • 49. Full Stack Deep Learning Dolt 49 A nice simple solution for versioning databases, that speaks SQL. Data Management - versioning
  • 50. Full Stack Deep Learning Dolt 50Data Management - versioning
  • 51. Full Stack Deep Learning Questions? 51
  • 52. Full Stack Deep Learning In this module 52 1. Sources
 Where training data comes from
 2. Labeling
 How to label proprietary data at scale
 3. Storage Data (images, sound files, etc) and metadata (labels, user activity) should be stored appropriately 4. Versioning Dataset is updated through user activity or additional labeling, affects model 5. Processing Aggregating and converting raw data and metadata Data Management - overview
  • 53. Full Stack Deep Learning Motivational Example •We have to train a photo popularity predictor every night. • For each photo, training data must include: • Metadata such as posting time, title, location • Some features of the user, such as how many times they logged in today. • Outputs of photo classifiers (content, style) 53 In database Need to compute from logs Need to run classifiers Data Management - processing
  • 54. Full Stack Deep Learning 54 Task Dependencies • Some tasks can't be started until other tasks are finished. • Finishing a task should "kick off" its dependencies Data Management - processing
  • 55. Full Stack Deep Learning • Run everything from scratch. If things exist, no need to recompute. 55 https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418 Data Workflows Data Management - processing
  • 56. Full Stack Deep Learning Makefile limitations • What if re-computation needs to depend on content, not on date? • What if the dependencies are not files, but disparate programs and databases? • What if the work needs to be spread over multiple machines? • What if there are 100s of "Makefiles" all executing at the same time, with shared dependencies? 56Data Management - processing
  • 57. Full Stack Deep Learning Airflow 57 https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418 Data Management - processing
  • 58. Full Stack Deep Learning Distributing work • The workflow manager has a queue for the tasks, and manages workers that pull from it, restarting jobs if they fail. 58 http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/ Data Management - processing
  • 59. Full Stack Deep Learning Try to keep things simple • Don't overengineer • e.g. unix: powerful parallelism, streaming, highly optimized 59 https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html 26 minutes 70 seconds 18 seconds in parallel in parallel Data Management - processing
  • 60. Full Stack Deep Learning Questions? 60
  • 61. Full Stack Deep Learning Thank you! 61