The problem
Data ingestion
Data parsing
Data cleansing
Scaling out
Building a data processing pipeline in Python
Joe Cabrera
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com
PyGotham, 2015
Joe Cabrera Building a data processing pipeline in Python
Outline
1 The problem
2 Data ingestion
3 Data parsing
4 Data cleansing
5 Scaling out
Poorly formatted data
Largely dispersed across the web
No standard data processing library
Pandas
Bubbles
Data processing
Requests and Futures
Requests makes it easy to send the required parameters
concurrent.futures allows download requests to execute asynchronously
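As a minimal sketch of how these two libraries could be combined (the slides do not show the talk's actual code), the helper below submits one download per job to a thread pool and collects the results as they finish. The names `fetch_with_requests` and `download_all` are hypothetical, and the fetcher is passed in as an argument so the pool logic can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_with_requests(url, params=None, timeout=10):
    """Default fetcher: a single GET request using the Requests library."""
    import requests  # imported lazily so a stub fetcher can be used instead

    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


def download_all(jobs, fetch=fetch_with_requests, max_workers=8):
    """Run fetch(url, params) for every (url, params) job on a thread pool.

    Returns a dict mapping each URL to its body, or to the exception raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its URL so results can be labelled.
        futures = {pool.submit(fetch, url, params): url for url, params in jobs}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # one failed download should not stop the rest
                results[url] = exc
    return results
```

A call might look like `download_all([("https://example.com/search", {"q": "python"})])`, where the URL and parameters are placeholders. Threads suit this step because the work is dominated by waiting on the network rather than by CPU.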
Parsers
Python tokenize
BeautifulSoup
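The slides list the standard-library tokenize module as a parser without saying how it was used. One possible use, assuming the input text happens to be lexically close to Python source (names, numbers, strings, operators), is to split a line into typed tokens. The helper name `lex` is hypothetical.

```python
import io
import tokenize

# Layout-only tokens that carry no data.
_SKIP = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.COMMENT,
    tokenize.ENDMARKER,
}


def lex(text):
    """Return (token_type_name, token_string) pairs for one piece of text."""
    tokens = tokenize.generate_tokens(io.StringIO(text).readline)
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokens
        if tok.type not in _SKIP
    ]


print(lex("price = 4.50, qty=3"))
```

Because tokenize follows Python's lexical rules, it raises an error on text that is not valid at the token level (for example an unterminated string), so it only fits inputs that are close to Python syntax.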
Why BeautifulSoup
More forgiving than standard XML or HTML libraries
Supports regular expressions when searching the parse tree
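Both points can be illustrated with a short sketch. The markup and the price pattern below are invented for the example; the calls used are `BeautifulSoup(..., "html.parser")` and `find_all(string=...)` with a compiled regular expression.

```python
import re

from bs4 import BeautifulSoup

# Deliberately sloppy markup: none of the <tr> or <td> tags are closed.
SLOPPY_HTML = """
<table>
  <tr><td class="name">Widget<td class="price">$4.50
  <tr><td class="name">Gadget<td class="price">$12.00
</table>
"""

PRICE = re.compile(r"\$(\d+(?:\.\d+)?)")


def extract_prices(html):
    """Return every dollar amount found in the text nodes of the document."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex passed as string= selects the text nodes it matches.
    text_nodes = soup.find_all(string=PRICE)
    return [float(PRICE.search(node).group(1)) for node in text_nodes]


print(extract_prices(SLOPPY_HTML))
```

Searching text nodes rather than tags keeps the example independent of how the parser chooses to nest the unclosed cells.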
Celery job scheduling
Each download job is a task
Each parse job is a task
Each cleanse job is a task
Re-insert cleansed data
Clean up data after raw ingest
Separate stores for raw and clean data
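The slides do not name the stores used. As an illustration only, the sketch below uses two in-memory SQLite databases, one for raw rows and one for cleansed rows, with a made-up cleanup rule (collapse whitespace, lowercase). All function names are hypothetical.

```python
import sqlite3


def cleanse(raw_value):
    """Hypothetical cleanup rule: collapse whitespace and lowercase."""
    return " ".join(raw_value.split()).lower()


def make_store():
    """Create an in-memory store with a single records table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, value TEXT)")
    return conn


def ingest_raw(raw_store, values):
    """Write values exactly as received into the raw store."""
    raw_store.executemany(
        "INSERT INTO records (value) VALUES (?)", [(v,) for v in values]
    )
    raw_store.commit()


def reinsert_clean(raw_store, clean_store):
    """Read raw rows, cleanse them, and write them to the clean store."""
    rows = raw_store.execute("SELECT id, value FROM records ORDER BY id").fetchall()
    clean_store.executemany(
        "INSERT OR REPLACE INTO records (id, value) VALUES (?, ?)",
        [(row_id, cleanse(value)) for row_id, value in rows],
    )
    clean_store.commit()
```

Since the raw store is never modified, the cleansing step can be re-run with different rules without downloading the data again.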
Distributed task queue
Distribute data processing jobs to many machines
Distribute jobs on a given machine across many CPUs
SQLAlchemy basic sharding API
Each database has a shard ID
We query for data based on which shard contains the data
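The idea of routing by shard ID can be shown without SQLAlchemy. The sketch below is a plain-Python stand-in, not the SQLAlchemy sharding API: it keeps one in-memory SQLite database per shard ID and picks the shard from a stable hash of the key. The class name `ShardedStore` and the hashing scheme are assumptions made for the example.

```python
import sqlite3
import zlib


class ShardedStore:
    """Route each key to one of several databases by shard ID."""

    def __init__(self, shard_count):
        self.shards = {}
        for shard_id in range(shard_count):
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE records (key TEXT PRIMARY KEY, value TEXT)")
            self.shards[shard_id] = conn

    def shard_id_for(self, key):
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        return zlib.crc32(key.encode("utf-8")) % len(self.shards)

    def put(self, key, value):
        conn = self.shards[self.shard_id_for(key)]
        conn.execute("INSERT OR REPLACE INTO records VALUES (?, ?)", (key, value))
        conn.commit()

    def get(self, key):
        # Query only the shard that can contain the key.
        conn = self.shards[self.shard_id_for(key)]
        row = conn.execute(
            "SELECT value FROM records WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None
```

A lookup touches exactly one database, which is the property the slide describes: the query goes to whichever shard contains the data.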
Questions
Thanks!
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com