iotools: High-Performance I/O Tools for R

The iotools package provides a set of tools for Input/Output (I/O) intensive datasets processing in R (R Core Team, 2014). Efficent parsing methods are included which minimize copying and avoid the use of intermediate string representations whenever possible. Functions for applying chunk-wise operations allow for computing on streaming input as well as arbitrarily large files. We present a set of example use cases for iotools, as well as extensive benchmarks comparing comparable functions provided in both core-R as well as other contributed packages.


Introduction
When processing large data sets on a single machine the performance bottleneck is often getting data from the hard drive to the format required by the programming environment. The associated latency comes from a combination of two sources. First, there is hardware latency from moving data from the hard-drive to RAM. This is especially the case with "spinning" disk drives, which can have throughput speeds several orders of magnitude less than those of RAM. Hardware approaches for addressing latency have been an active area of research and development since hard-drives have existed. Solid state drives and redundant arrays of inexpensive disks (RAID) now provide throughput comparable to RAM; they are readily available on commodity systems; and they continue to improve. The second source comes from the software latency associated with transforming data from its representation on the disk to the format required by the programming environment. This translation drags performance for many R users, especially in the context of larger data sources.
The code below uses the microbenchmark (Mersmann, 2014) package to compare the time needed to read, parse, and create a data.frame with the time needed to simply read data from disk. The file contains comma-separated value (CSV) file with 29 columns and 7,009,728 rows. It takes about 20 times longer to perform the former compared to the latter indicating there may be room for improvement.
While these functions are sufficient for processing relatively small data sets, larger ones require a different approach. For large files, data are often processed on a single machine by extracting consecutive rows or "chunks" from the filesystem, a chunk is processed, and then the next chunk is retrieved. The results from processing each chunk are then aggregated and returned. Small, manageable subsets are streamed from the disk to the processor and only require the memory needed to represent a single chunk.
This approach is common not only on single machine but also in distributed environements with technologies like Spark (Zaharia et al., 2010) and Hadoop MapReduce (Dean and Ghemawat, 2008). Clusters of commodity machines are able to process vast amounts of data one chunk at a time. Statistical methodology is compatible with this computational approach and is justified in a variety of statistical/machine learning contexts including Hartigan (1975), Kleiner et al. (2011), Guha et al. (2012), and Matloff (2014) to name a few.
However, R currently does not address this common computing pattern. Packages such as bigmemory (Kane et al., 2013) and ff (Adler et al., 2014)  binary format on a disk. They make use of memory-mapped files that may be stored on disk. The data structures they provide are not native R objects. They do not exhibit copy-on-write behavior. And, in general, they cannot be seamlessly integrated with R's plethora of user contributed packages. The readr package (Wickham and Francois, 2015) provides fast importing of data.frame objects but it does not support chunk-wise operations for arbitrarily large files. The foreach package (Weston and Revolution Analytics, 2014), and it's associated iterators package (Revolution Analytics, 2014), provide a general framework for chunked processing but does not provide the low-level connection-based utilities for transforming binary data stored on a disk to those native to R.
The iotools package provides in-place stream processing for any data source represented as a connection. Users of the package can import text and binary data into R and process large data sets as chunks. The package can be several orders of magnitude faster when compared to R's native facilities. The package provides general tools for quickly processing large data sets in consecutive chunks, both in-and out-of-core, and provides a basis for speeding up distributed computing frameworks including Hadoop Streaming (The Apache Software Foundation, 2013) and Spark.
The rest of this paper introduces the use of the iotools package for quickly importing data from disk to R and processing those data. Examples center around calculation of OLS slope coefficients via the normal equations. This particular calculation was chosen because it balances read/write times with processing time.

A note on the data used in this paper
Examples in this paper make use of the "Airline on-time performance" data set (RITA, 2009), which was released for the 2009 American Statistical Association (ASA) Section on Statistical Computing and Statistical Graphics biannual data exposition. The data set includes commercial flight arrival and departure information from October 1987 to April 2008 for those carriers with at least 1% of domestic U.S. flights in a given year. In total, there is information for over 120 million flights, with 29 variables related to flight time, delay time, departure airport, arrival airport, and so on. In total, the uncompressed data set is 12 gigabytes (GB) in size.
It should be noted the 12 GB Airline On-time data set will likely not be considered "big" to many readers. However, in designing the examples two principles were considered before sheer data size. First, the data set is publicly available. The code included in the Supplemental Material of this paper is capable of downloading the data set and running the benchmarks. Users are encouraged to engage the data themselves by trying the code examples and developing their own analyses. Second, the data set is large enough to investigate the performance properties of iotools along with it's associated scaling behavior. Together, the data set and the code available with this paper provide a set of accessible and reproducible examples forming a basis for instruction and subsequent development.

I/O Methods and Formatters
R's file operations make use of Standard C input/output operations includig fread and fwrite. Data are read in, elements are parsed, and parsed values populate data structures. The iotools package also uses the Standard C library but it makes use of "bulk" binary operations including memchr and strchr. These functions make use of hardware specific, single instruction, multiple data operations (SIMD) and tend to be faster than their Standard I/O counterpart, which uses fread with search functions in the user-space. As a result iotools is able to find and retrieve data at a higher rate. In addition, an entire data set or chunk is buffered rather than scanned and transformed line-by-line as in the read.table function. Thus, by buffering chunks of data and making use of low-level, system functions iotools is able to provide more performant data ingestion than what is available in base Ras well as other packages.
Importing data with read.csv.raw and dstrsplit In this section the iotools import functionality is applied to the airline data files, each of which is csv-formatted. Files begin with header and column types consistent across each of the 22 files. Each file corresponds to a full year of data, except the first year (1987), where the data starts on October 14th. Importing 1987 flights with iotools is shown below. The readAsRaw function takes either connection or file name and returns the contents as a raw type. The dstrsplit function parses the a raw vector according to the specified column types and returns a data.frame. Since these functions may be considered "lower-level," the read.csv.raw was written for importing data in a manner similar to read.  The performance of read.csv and dstrsplit is compared in Figure 1. The benchmark measures the import times for 1,000,000 to 7,000,000 lines 1 . The visualization shows importing data using read.csv takes about five times longer than dstrsplit.
The dstrsplit function takes either a raw or character vector and splits it into a data frame according to a specified separator. The columns may be of type logical, integer, numeric, character, raw, complex, POSIXct, and NA where NA indicates the column should be skipped in the output. It may be considered a building block for both read.csv.raw as well as other computing infrastructures including Hadoop, pipes, and database connections to name a few.
It should be noted factor types are not supported. It will be shown later dstrsplit can be used in a streaming context and in this case data are read sequentially. As a result, the set of factor levels cannot be deduced until the entire sequence is read. However, in most cases, a caller knows the schema and is willing to specify factor levels or the caller is willing to use a single pass through the data to find factor levels.  Figure 2 shows the time needed to import a data file with 1,000,000 rows and 25 columns using load, dstrsplit, read_csv (from the readr package), and read.table. Imports were performed for each of R's native types to see how their different size requirements affect performance. The benchmarks show that, except for the POSIXct type, load is fastest. This is unsurprising since load stores the binary representation of an R object and importing consists of copying the file to memory and registering the object in R.
The read_csv's performance is very close to those of iotools. When comparing with read_csv we have found three things of note. First, The difference in times are constant. As the number of lines to read increases the slope of the read times are the same. Second, where read_csv maintains a slight edge in supported numerical types on OS X, iotools has a slight edge on the Linux machines we tested. Third, the read_csv function was provided a connection explicitly in these benchmarks so all functions being examined are provided the same input. When a file name is provided to read_csv, it achieves slightly better performance than the values shown here since it imports from a memory-mapping of the file.

Processing and Checkpointing with as.output
While optimized I/O operations are a convenience when performing explorations and analyses on small data sets fitting in RAM, they are an imperative when working with big data. In this class of data challenges we often deal with individual files whose aggregate is too large to fit in RAM. Furthermore, in distributed applications we may need to load and process subsets from different machines, later combining the results in some useful way.
Let us assume we are tasked with finding the slope coefficents for the linear regresssion ArrDelay ∼ DayO f Week + DepTime + Month + DepDelay. (1) The slope estimates are formed by creating the model matrix, and applying the normal equations to derive the coefficients. As a first task, we perform the simple preprocessing step of aggregating all of the files into a single file holding the model matrix of the entire airline data set. The slope coefficients will be calculated in a second, separate step. Separate processing and model fitting in this case are mostly for the sake of example. However, in many real-world data challenges it is a good idea. Separated steps provide checkpointing and if a problem arises while fitting the model, whether from a bug in the code or an interruption in computing services, the model matrix does not need to be recalculated. Also, the analysis can be changed based on the transformed data, thereby saving a step for similar analyses. In the case of our regression we can derive many different models involving the described variables by including or excluding them in the model fitting step.
The example below shows how to write the model matrices to a single file, named airline_mm.io. However, we could have processed sets of files just as easily with iotools. To emphasize iotools is complementary to existing packages, we will show its use with "pipes" included in the tidyr package (Wickham, 2016). The code reads each of the airline files into a data frame using readAsRaw and dstrsplit, normalizes the categorical variables and transforms the departure times, creates a model matrix from the resulting data.frame, strips the row names of the model matrix, creates the text output representation, and writes it to the output file. The output connection is recycled in each iteration of the loop thereby appending each year's data.

Fitting the model with mstrsplit and chunk.apply
With the model matrices created, The next step is to estimate the slope coefficients β in the model where Y, ε ∈ R n , and β ∈ R d , n ≥ d; each element of ε is an i.i.d. random variable with mean zero; and X is a matrix in R n×d with full column rank. The analytic solution for estimating the slope coefficients, β, is Consider the row-wise partitioning (or chunking) of Equation 2: where Y 1 , Y 2 , ..., Y r , X 1 , X 2 , ..., X r and ε 1 , ε 2 , ..., ε r are data partitions where each chunk is composed of subsets of rows of the model matrix. Then Equation 3 can be expressed aŝ The matrices X T i X i and X T Y can be calculated on each chunk and then summed to calculated the slope coefficients. We remark computed solutions rarely use Equation 3 directly but rather use QR decompositions of X for numerical stability. In practice we have found the amount of numerical stability gained does not warrant the QR calculation, especially when distinguishing nearly colinear variables is not critical.
Code to fit the model will need to read from airline_mm.io in chunks and where before data were read into a data frame, now we would like to read data into a numeric matrix. Interestingly enough, this functionality is not provided in base R or the Matrix (Bates and Maechler, 2014) package. Traditionally, users who wanted to read matrices from disk either used the load/dget function, forcing them to write using save/dput, or they could be read in as a data frame and then converted using the as.matrix function. The former approach allows an R user to quickly import and export data but is not easily accessed from other computing environments. The latter requires a redundant copy of the data. The iotools package fills this gap by providing the mstrsplit, a matrix import function similar to dstrsplit An implementation to fit a linear model out-of-core linear model shown below. The chunk.apply function reads and processes chunks -in this case contiguous groups of rows in the model matrix. The function takes as an argument a connection or file, a function with a single parameter, and a number of parallel processors to use. The function parameter requires a single argument corresponding to the raw vector to be parsed by dstrsplit or mstrsplit. > # Get the factor expansion of the variables. > mm_col_names = data_files[1] %>% read.csv.raw(header=TRUE, nrows=2) %>% + normalize_df %>% model.matrix(form, .) %>% colnames > > ne_chunks = chunk.apply("airline_mm.io", + function(x) { + mm = mstrsplit(x, sep=",", type="numeric") + colnames ( Figure 3 compares the performance of mstrsplit with read.table followed by a call to as.matrix along binary importing using load. As with dstrsplit mstrsplit outperforms the base R's read.table benchmarks by an order of magnitude and even outperforms load for character data.

Parallel Processing of Chunks
In the example above xtx and xty for each chunk are calculated independently of any other chunk. The chunk.apply function includes a parameter, parallel, allowing the user to specify the number of parallel processes, taking advantage of the embarrassingly parallel nature of these calculations. However, it is worth noting parallelism in the chunk.apply function is slightly different than other functions such as mclapply.
Most parallel functions in R work by having worker processes receive data and an expression to compute. The master process initiates the computations and waits for them to complete. For I/Ointensive computations this means either the master loads data before initiating the computation or the worker processes load the data. The former case is supported in iotools through iterator functions (idstrsplit and imstrsplit), which are compatible with the foreach package. However, in this case, new tasks cannot be started until data has been loaded for each of the workers. Loading the data on master process may become a bottleneck and it may require much more time to load the data than to process it. The latter approach is also supported in iotools and ensures the master process is not a bottleneck but if multiple worker processes on a single machine load a large amount of data from the same disk then resource contention at the system level may also cause excessive delays. The operating system has to service multiple requests for data from the same disk having limited I/O capability.
A third option, implemented in chunk.apply, provides pipeline parallelism where the master process sequentially loads data and then calls mcparallel to initiate the parallel computation. When the maximum number of worker processes has been reached the master process pre-fetches the next chunk and then blocks on the result of the running worker processes. When the result is returned a newly created worker begins processing the pre-fetched data. In this way the master does not wait idly for worker processing and there is no resource contention since only the master is retrieving data.
Pipeline parallelism increases execution throughput when the computation time is around the same order as the load time. When the overhead involved in initiating worker processes and getting their results may overwhelm the computation time and parallel processing yield less performant results.  Figure 4 shows the times required to calculate X T X and X T Y from the normal equations in the regression described above using the three approaches described: all workers read, only the master reads, and pipeline parallel. Pipeline parallelism performs best followed by all workers reading. It should be noted all workers reading will only be able to keep pace with pipeline parallelism as long as there is sufficient hard-drive bandwidth and little contention from multiple reads. As a result, the pipeline parallel approach is likely a more general and therefore preferred strategy.

Conclusion
This paper presents the iotools package for the processing of data out-of-core and explores its use analyzing the Airline On-time data set. The examples emphasize computing on a single machine. However it should be noted iotools is by no means limited to this configuration. The "chunk" functions are compatible with any object derived from a connection and could therefore be used with compressed files or even pipes and sockets. In fact, our current work uses iotools as a building block for more tightly integrating R in the Hadoop Streaming and Spark frameworks. Early results show iotools achieves better performance in processing terabyte-and even petabyte-scale data when compared to other existing packages.