Chunked, dplyr for large text files

This document discusses processing large datasets stored in text files using chunked, an R package that works on data chunk by chunk to overcome memory limitations. It presents chunked as Option 4 for working with large data, after reading everything into memory, using Unix tools, and importing into a database, and describes how it reads and writes data chunk by chunk with lazy processing. Three scenarios are demonstrated: preprocessing a text file and writing the result to another text file (TXT -> TXT), importing data into a database (TXT -> DB), and extracting a database table to a text file (DB -> TXT). Many dplyr verbs work chunkwise; summarize and group_by operate per chunk rather than on all data, and arrange, right_join, and full_join are not supported.

Chunked
Edwin de Jonge
Statistics Netherlands / UseR! 2016

What is chunked?
Short answer:
dplyr for data in text files

Process Data in GB-sized Text Files:
(Pre)process text files to:
select columns
filter rows
derive new variables
Save result into:
Another text file
A database

Option 1: Read data with R
Use:
read.csv, uh, readr::read_csv
data.table::fread
Fast reading of data into memory!
However...
You will need a lot of RAM!
Text files tend to be 1 to 100 GB.
Even though these functions use memory mapping, the resulting data.frame does not!
The development cycle of the processing script is long...

Option 2: Use Unix tools
Good choice!
sed
awk
grep
Fast processing!
However...
It is nice to stay in the R universe (one data-processing tool)
Instead of learning at least 3 extra tools: sed, awk and grep voodoo.
Does it work on my OS/shell?
I want to use dplyr verbs! (dplyr deprivation...)

Option 3: Import data in DB
Import data into DB
Use DB tool to import data.
Process database with dplyr.
However...
It is not really an R solution, but a DB solution.
It may not be efficient.

Process in chunks?

Option 4: Use chunked!
Idea:
Process data chunk by chunk using dplyr verbs
Memory efficient: only one chunk at a time in memory
Lazy processing
Development cycle is short: test on the first chunk.
Read (and write) one chunk at a time using the R package LaF.
All dplyr verbs on chunkwise objects are recorded and replayed when writing.
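
The record-and-replay idea can be sketched in a few lines of base R. This is only a toy illustration, not how chunked is implemented: `lazy_pipeline`, `add_step`, and `replay` are hypothetical names, and plain functions stand in for dplyr verbs.

```r
# Toy sketch of lazy processing: steps are recorded first, executed later.
# All names here are hypothetical; this is not the chunked package's code.
lazy_pipeline <- function() list(steps = list())

# Record a step (a function from data.frame to data.frame) without running it.
add_step <- function(pipeline, fun) {
  pipeline$steps <- c(pipeline$steps, list(fun))
  pipeline
}

# Replay every recorded step, in order, on a single chunk.
replay <- function(pipeline, chunk) {
  Reduce(function(data, step) step(data), pipeline$steps, chunk)
}

p <- lazy_pipeline()
p <- add_step(p, function(d) d[d$col1 > 1, , drop = FALSE])  # like filter(col1 > 1)
p <- add_step(p, function(d) transform(d, col3 = col1 + 1))  # like mutate(col3 = col1 + 1)

# Nothing has been computed yet; the work happens chunk by chunk on replay.
chunks <- list(data.frame(col1 = 1:2), data.frame(col1 = 3:4))
combined <- do.call(rbind, lapply(chunks, function(ch) replay(p, ch)))
```

Because the steps are plain data until `replay` is called, the same pipeline can be tried on the first chunk only, which is what keeps the development cycle short.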

Scenario 1: TXT -> TXT
Preprocess a text file with data
library(dplyr)    # provides the verbs and %>%
library(chunked)  # provides read_chunkwise() and write_chunkwise()
read_chunkwise("my_data.csv", chunk_size = 5000) %>%
  select(col1, col2) %>%
  filter(col1 > 1) %>%
  mutate(col3 = col1 + 1) %>%
  write_chunkwise("output.csv")
This code:
evaluates chunk by chunk
allows for column name completion in RStudio!
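
To make "chunk by chunk" concrete, here is a hand-rolled equivalent in base R. It is a sketch under stated assumptions: `process_in_chunks` is a hypothetical helper, it handles only simple comma-separated files with a header row, and it does not use chunked or LaF.

```r
# Hand-rolled chunkwise processing with base R only (not the chunked API).
process_in_chunks <- function(infile, outfile, chunk_size, fun) {
  con <- file(infile, open = "r")
  on.exit(close(con), add = TRUE)
  # Read the header once; scan() also strips quotes around column names.
  col_names <- scan(text = readLines(con, n = 1), what = "", sep = ",",
                    quiet = TRUE)
  first <- TRUE
  repeat {
    lines <- readLines(con, n = chunk_size)  # at most chunk_size rows in memory
    if (length(lines) == 0) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
    # Write the header with the first chunk only, then append.
    write.table(fun(chunk), outfile, sep = ",", row.names = FALSE,
                col.names = first, append = !first, quote = FALSE)
    first <- FALSE
  }
  invisible(outfile)
}

# Small demonstration on a temporary file.
infile <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
write.csv(data.frame(col1 = 1:10, col2 = 11:20), infile, row.names = FALSE)

process_in_chunks(infile, outfile, chunk_size = 3, fun = function(d) {
  d <- d[d$col1 > 5, c("col1", "col2"), drop = FALSE]  # filter + select
  d$col3 <- d$col1 + 1                                  # mutate
  d
})
out <- read.csv(outfile)
```

The loop holds at most `chunk_size` rows at a time, so memory use depends on the chunk size rather than on the size of the file.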

Scenario 2: TXT -> DB
Insert processed text data in DB
db <- src_sqlite('test.db', create = TRUE)
tbl <- read_chunkwise("./large_file_in.csv") %>%
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>%
  mutate(col6 = col1 + col2) %>%
  write_chunkwise(db, 'my_large_table')
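
For comparison, the same TXT -> DB flow can be written by hand with the DBI and RSQLite packages. This sketch assumes both packages are installed; the sample data, the chunk size of 5, and the reduced column set are made up for illustration, and it bypasses chunked entirely.

```r
library(DBI)

# Sample input file (hypothetical data standing in for a large CSV).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(col1 = 1:20, col2 = 21:40), csv_path, row.names = FALSE)

con <- dbConnect(RSQLite::SQLite(), tempfile(fileext = ".db"))
input <- file(csv_path, open = "r")
col_names <- scan(text = readLines(input, n = 1), what = "", sep = ",",
                  quiet = TRUE)

repeat {
  lines <- readLines(input, n = 5)  # one small chunk at a time
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  chunk <- chunk[chunk$col1 > 10, , drop = FALSE]  # like filter(col1 > 10)
  chunk$col6 <- chunk$col1 + chunk$col2            # like mutate(col6 = ...)
  # append = TRUE creates the table on first use and adds rows afterwards.
  if (nrow(chunk) > 0) {
    dbWriteTable(con, "my_large_table", chunk, append = TRUE, row.names = FALSE)
  }
}
close(input)

n_rows <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM my_large_table")$n
dbDisconnect(con)
```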

Scenario 3: DB -> TXT
Extract a large table from a DB to a text file
tbl <- src_sqlite("test.db") %>%
  tbl("my_table") %>%
  read_chunkwise(chunk_size = 5000) %>%
  select(col1, col2, col5) %>%
  filter(col1 > 10) %>%
  mutate(col6 = col1 + col2) %>%
  write_chunkwise('my_large_table.csv')
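
The reverse direction can likewise be sketched with plain DBI calls, fetching a limited number of rows per round trip. It assumes DBI and RSQLite are installed; the in-memory table, the chunk size of 4, and the output path are illustrative only, and this is not the chunked implementation.

```r
library(DBI)

# Hypothetical source table in an in-memory SQLite database.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "my_table", data.frame(col1 = 1:20, col2 = 21:40))

out_path <- tempfile(fileext = ".csv")
res <- dbSendQuery(con, "SELECT col1, col2 FROM my_table WHERE col1 > 10")

first <- TRUE
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 4)  # fetch at most 4 rows per iteration
  if (nrow(chunk) == 0) break
  chunk$col6 <- chunk$col1 + chunk$col2
  # Header goes out with the first chunk only; later chunks are appended.
  write.table(chunk, out_path, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  first <- FALSE
}
dbClearResult(res)
dbDisconnect(con)

written <- read.csv(out_path)
```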

Caveat
Working:
Working on chunks is memory efficient
filter, select, rename, mutate, mutate_each, transmute, do, tbl_vars, inner_join, left_join, semi_join, anti_join all work, also with name completion!
However:
summarize and group_by work chunkwise (and not on all data!)
No arrange, right_join, full_join
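
Why chunkwise summaries need care can be shown with a small base-R example. The numbers are made up: the mean of per-chunk means differs from the overall mean when chunks have unequal sizes, so a second aggregation step over per-chunk sums and counts is needed.

```r
x <- c(1, 2, 3, 4, 100)
chunks <- split(x, c(1, 1, 1, 2, 2))  # two chunks of unequal size

# Summarising each chunk separately and averaging the results is not
# the same as summarising all data.
chunk_means <- vapply(chunks, mean, numeric(1))
naive <- mean(chunk_means)

# Carry sufficient statistics (sum and count) per chunk, then combine.
partial <- lapply(chunks, function(ch) c(sum = sum(ch), n = length(ch)))
totals <- Reduce(`+`, partial)
overall <- totals[["sum"]] / totals[["n"]]
```

Here `overall` matches `mean(x)` while `naive` does not, which is the sense in which summarize works "chunkwise (and not on all data)".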

Thank you!
Interested?
install.packages("chunked")
Or visit http://github.com/edwindj/chunked
