Chunked
Edwin de Jonge
Statistics Netherlands / UseR! 2016
What is chunked?
Short answer:
dplyr for data in text files
Process Data in GB-sized Text Files:
(pre)Process text files to:
select columns
filter rows
derive new variables
Save result into:
Another text file
A database
Option 1: Read data with R
Use:
readr::read_csv (rather than read.csv)
data.table::fread
Fast reading of data into memory!
However...
You will need a lot of RAM!
Text files tend to be 1 to 100 GB.
Even though these procedures use memory mapping, the
resulting data.frame does not!
The development cycle of a processing script is long...
Option 2: Use unix tools
Good choice!
sed
awk
grep
fast processing!
However...
It is nice to stay in the R universe (one data-processing tool)
instead of learning at least three extra tools (sed, awk and grep
voodoo).
Does it work on my OS/shell?
I want to use dplyr verbs! (dplyr deprivation...)
Option 3: Import data in DB
Import data into DB
Use DB tool to import data.
Process database with dplyr.
However
It is not really an R solution, but a DB solution.
It may not be efficient.
Process in chunks?
Option 4: Use chunked!
Idea:
Process data chunk by chunk using dplyr verbs
Memory efficient, only one chunk at a time in memory
Lazy processing
Development cycle is short: test on first chunk.
Read (and write) one chunk at a time using R package LaF.
All dplyr verbs on chunk_wise objects are recorded and
replayed when writing.
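The idea above can be sketched in base R, without the chunked package. This is only an illustration of chunk-by-chunk processing, not how chunked or LaF are implemented. The name `process_in_chunks` and its arguments are hypothetical, and the sketch assumes a comma-separated file with a header line and no line breaks inside fields.

```r
# Hypothetical helper: apply `f` to each chunk of a CSV file and write the
# results to `outfile`, holding only one chunk in memory at a time.
process_in_chunks <- function(infile, outfile, f, chunk_size = 5000L) {
  con <- file(infile, open = "r")
  on.exit(close(con))

  # Read the header once so every chunk gets the same column names.
  header <- scan(text = readLines(con, n = 1L), what = "", sep = ",",
                 quiet = TRUE)

  first <- TRUE
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break          # end of file reached

    chunk <- read.csv(text = lines, header = FALSE, col.names = header)
    result <- f(chunk)                      # the per-chunk processing step

    # Write the header with the first chunk only, then append.
    write.table(result, outfile, sep = ",", row.names = FALSE,
                col.names = first, append = !first)
    first <- FALSE
  }
  invisible(outfile)
}
```

Calling `process_in_chunks("my_data.csv", "output.csv", function(d) d[d$col1 > 1, ])` would filter rows chunk by chunk. Column types are inferred separately for each chunk here, which a real implementation would need to handle more carefully.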
Scenario 1: TXT -> TXT
Preprocess a text file with data
library(dplyr)
library(chunked)
read_chunkwise("my_data.csv", chunk_size = 5000) %>%
  select(col1, col2) %>%
  filter(col1 > 1) %>%
  mutate(col3 = col1 + 1) %>%
  write_chunkwise("output.csv")
This code:
evaluates chunk by chunk
allows for column name completion in RStudio!
Scenario 2: TXT -> DB
Insert processed text data in DB
db <- src_sqlite('test.db', create = TRUE)
tbl <-
  read_chunkwise("./large_file_in.csv") %>%
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>%
  mutate(col6 = col1 + col2) %>%
  write_chunkwise(db, 'my_large_table')
Scenario 3: DB -> TXT
Extract a large table from a DB to a text file
tbl <-
  (src_sqlite("test.db") %>%
    tbl("my_table")
  ) %>%
  read_chunkwise(chunk_size = 5000) %>%
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>%
  mutate(col6 = col1 + col2) %>%
  write_chunkwise('my_large_table.csv')
Caveat
Working:
Working on chunks is memory efficient
filter, select, rename, mutate, mutate_each, transmute, do, tbl_vars,
inner_join, left_join, semi_join and anti_join all work,
also with name completion!
However:
summarize and group_by work per chunk (and not on the data
as a whole!)
arrange, right_join and full_join are not supported
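The chunkwise caveat for summarize can be illustrated with a small base R sketch. It does not use chunked itself; the data and the chunk size of 4 are made up for illustration. The point is that a per-chunk summary is only a partial result, so an overall mean has to be rebuilt from per-chunk sums and counts rather than from per-chunk means.

```r
# Hypothetical data, split into chunks of 4 values each.
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)
chunks <- split(values, ceiling(seq_along(values) / 4))

# Per-chunk partial results: keep the sum and the row count of each chunk.
partial <- do.call(rbind, lapply(chunks, function(x) {
  c(sum = sum(x), n = length(x))
}))

# A per-chunk mean describes its own chunk only.
chunk_means <- partial[, "sum"] / partial[, "n"]

# Combining sums and counts recovers the mean over all data.
overall_mean <- sum(partial[, "sum"]) / sum(partial[, "n"])
```

Averaging `chunk_means` directly would weight the last, shorter chunk as heavily as the full ones and give a different answer than `mean(values)`, whereas `overall_mean` agrees with it.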
Thank you!
Interested?
install.packages("chunked")
Or visit http://github.com/edwindj/chunked

 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 

Chunked, dplyr for large text files

  • 1. Chunked Edwin de Jonge Statistics Netherlands / UseR! 2016
  • 2. What is chunked? Short answer: dplyr for data in text files.
  • 3. Process Data in GB-sized Text Files: (pre)Process text files to: select columns; filter rows; derive new variables. Save result into: another text file; a database.
  • 4. Option 1: Read data with R. Use: read.csv (uh...), readr::read_csv, data.table::fread. Fast reading of data into memory! However... You will need a lot of RAM! Text files tend to be 1 to 100 GB. Even though these procedures use memory mapping, the resulting data.frame does not! The development cycle of the processing script is long...
  • 5. Option 2: Use Unix tools. Good choice! sed, awk, grep: fast processing! However... It is nice to stay in the R universe (one data-processing tool) instead of learning at least 3 extra tools: sed, awk and grep voodoo. Does it work on my OS/shell? I want to use dplyr verbs! (dplyr-deprivation...)
  • 6. Option 3: Import data in DB. Import data into a DB: use a DB tool to import the data, then process the database with dplyr. However: it is not really an R solution, but a DB solution. It may not be efficient.
  • 8. Option 4: Use chunked! Idea: process data chunk by chunk using dplyr verbs. Memory efficient: only one chunk at a time in memory. Lazy processing. Development cycle is short: test on the first chunk. Read (and write) one chunk at a time using the R package LaF. All dplyr verbs on chunk_wise objects are recorded and replayed when writing.
  • 9. Scenario 1: TXT -> TXT. Preprocess a text file with data:
      read_chunkwise("my_data.csv", chunk_size = 5000) %>%
        select(col1, col2) %>%
        filter(col1 > 1) %>%
        mutate(col3 = col1 + 1) %>%
        write_chunkwise("output.csv")
    This code: evaluates chunk by chunk; allows for column name completion in RStudio!
  • 10. Scenario 2: TXT -> DB. Insert processed text data in a DB:
      db <- src_sqlite('test.db', create=TRUE)
      tbl <-
        read_chunkwise("./large_file_in.csv") %>%
        select(col1, col2, col5) %>%
        filter(col1 > 10) %>%
        mutate(col6 = col1 + col2) %>%
        write_chunkwise(db, 'my_large_table')
  • 11. Scenario 3: DB -> TXT. Extract a large table from a DB to a text file:
      tbl <-
        ( src_sqlite("test.db") %>% tbl("my_table") ) %>%
        read_chunkwise(chunk_size=5000) %>%
        select(col1, col2, col5) %>%
        filter(col1 > 10) %>%
        mutate(col6 = col1 + col2) %>%
        write_chunkwise('my_large_table.csv')
  • 12. Caveat. Working: working on chunks is memory efficient. filter, select, rename, mutate, mutate_each, transmute, do, tbl_vars, inner_join, left_join, semi_join, anti_join all work, also with name completion! However: summarize and group_by work chunkwise (and not on all data!). No arrange, right_join, full_join.
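
The caveat on slide 12 (summarize and group_by work per chunk, not on all data) can be illustrated with a small base-R sketch. This is not chunked's implementation and it does not use chunked or LaF; the temporary file, the column names `g` and `x`, and `chunk_size = 4` are assumptions made up for the illustration. It reads a CSV a few lines at a time, summarizes each chunk, and then shows that the per-chunk summaries have to be aggregated once more to get the result for the whole file.

```r
# Illustrative base-R sketch of chunk-by-chunk summarizing.
# NOT the chunked package's code; file, columns and chunk size are made up.

# Write a tiny example file: 10 rows, two groups "a" and "b".
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(g = rep(c("a", "b"), times = 5), x = 1:10),
          tmp, row.names = FALSE)

con <- file(tmp, open = "r")

# Read the header line once and strip the quotes that write.csv adds.
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

chunk_size <- 4        # rows per chunk (hypothetical value)
partials <- list()     # one summary data.frame per chunk

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break          # end of file reached

  # Parse only this chunk; just one chunk is held in memory at a time.
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)

  # "Chunkwise" group_by + summarize: sum of x per group within this chunk.
  partials[[length(partials) + 1]] <- aggregate(x ~ g, data = chunk, FUN = sum)
}
close(con)

# The per-chunk results contain each group several times ...
per_chunk <- do.call(rbind, partials)

# ... so a second aggregation is needed for the total over all data.
total <- aggregate(x ~ g, data = per_chunk, FUN = sum)
print(per_chunk)
print(total)
```

A sum can simply be summed again. A statistic such as a mean cannot be re-averaged naively: each chunk would also have to return its row count so that the second step can compute a weighted mean.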