RcppMsgPack: MessagePack Headers and Interface Functions for R

MessagePack (MsgPack for short; the short name also refers to the implementation) is an efficient binary serialization format for exchanging data between different programming languages. The RcppMsgPack package provides R with both the MessagePack C++ header files and the ability to access, create and alter MessagePack objects directly from R. The R interface is driven by two main functions, msgpack_pack and msgpack_unpack. The function msgpack_pack serializes R objects to a raw MessagePack message, and msgpack_unpack de-serializes MessagePack messages back into R objects. Several helper functions are available to aid in processing and formatting data, including msgpack_simplify, msgpack_format and msgpack_map.


Introduction
MessagePack (MsgPack for short; the short name also refers to the actual implementation) is a binary serialization format made for exchanging data between different programming languages (Furuhashi, 2018). Unlike related formats such as JSON, MsgPack is a binary format, which makes it more efficient in terms of (disk or memory) space, transfer speed (increasingly important for large data sets moved across networks) and potentially also precision (as textual representations rarely go to the full length of binary precision). As shown on the project homepage at https://msgpack.org, several major projects including Redis, Pinterest, Fluentd and Treasure Data use MsgPack to transfer data or to represent internal data structures (Furuhashi, 2018). Other binary serialization formats similar to MsgPack include BSON (MongoDB, 2018) and ProtoBuf (Google, 2018), which have their own advantages and disadvantages in serialization speed, memory usage, compression and the requirement of descriptive schemas (Hamida et al., 2015; Dawborn and Curran, 2014). R support for these formats is available via the packages mongolite (Ooms, 2014) and RProtoBuf (Eddelbuettel et al., 2016); Redis is also accessible from R through the RcppRedis package (Eddelbuettel, 2018). RcppMsgPack (Ching et al., 2018) brings support for the MsgPack specification to R.
The MsgPack specification describes a number of common data types (Booleans, Integers, Floats, Strings, Binary data, Arrays, Maps, and user-defined extension types) and has been implemented in most major programming languages. RcppMsgPack aims to provide an efficient and easy-to-use implementation by relying on the official C++ MsgPack code and the Rcpp package (Eddelbuettel, 2013; Eddelbuettel et al., 2018). The package provides users with the MsgPack header files, which can be used to integrate MsgPack more directly into R projects through C++ code. It also provides the ability to serialize and de-serialize data directly to and from R (e.g., through pipes, file handlers, sockets or binary object files). These capabilities can be used to efficiently transfer data between various programming languages and between separate R instances.
In this manuscript, we describe the main interface functions used to serialize and deserialize MsgPack messages, and the conversion between R data types and MsgPack data types. We then describe the helper functions contained in RcppMsgPack and present several use cases and examples of how RcppMsgPack can be used in practice, benchmarking several common approaches for transferring data between processes.

Interface functions
The functions msgpack_pack and msgpack_unpack allow serialization and de-serialization of R objects, respectively (Figure 1). Here, msgpack_pack takes in any number of R objects and generates a single serialized message in the form of a raw vector. Conversely, msgpack_unpack takes a serialized message as input and returns the R object(s) contained in the message. Moreover, msgpack_format is a helper function to properly format R objects for input, and msgpack_simplify is a helper function to simplify output from a MsgPack conversion. One of the main goals of MsgPack is the transfer of data across processes and/or hosts. Therefore, we also define two helper functions, msgpack_write and msgpack_read, which facilitate writing and reading of MsgPack objects to and from files, pipes or any connection object.
The data types in MsgPack map less directly onto R data types than onto those of most other languages. For example, basic R "atomic" types such as integers and strings are inherently vectorized, which is not true in C++, Python or most other languages. That is, in R there is no distinction between a single integer and a vector of integers of length 1. R also has multiple non-value types, such as NULL or NA for integer, string, numeric, etc. Because of these complexities, the conversion processes using these interface functions are described in detail below.
The R Journal Vol.10/2, December 2018 ISSN 2073-4859
R integers are converted into MsgPack integers, which are automatically reduced in size depending on the value of the integer. MsgPack integers are converted back into R integers. Because R does not natively support 64-bit integers, whereas MsgPack supports integers of up to 64 bits (signed or unsigned), MsgPack integers that exceed the signed 32-bit range supported by R are coerced to R numeric values, with potential loss of precision. The integer NA value in R is represented by its bit value in C++ (0x80000000) and requires no special treatment. R numeric variables (i.e., doubles) conform to the IEEE 754 double-precision standard (IEEE Standards Committee, 2008).
Currently, the MsgPack specification includes one official extension type: timestamps. Timestamps are a MsgPack extension type with extension value -1 and can be converted to and from R POSIXct objects using msgpack_timestamp_decode and msgpack_timestamp_encode, respectively. MsgPack timestamps can encode nanosecond precision. R POSIXct objects rely on numeric values, and therefore conversion may lose some precision (unless a package such as nanotime (Eddelbuettel and Silvestri, 2018) is used, which is left as a future extension).
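The two precision caveats above can be made concrete with a small sketch. Python is used here only because its float type is the same IEEE 754 double as an R numeric; the timestamp values and variable names are made up for illustration, and the 64-bit payload layout (30-bit nanoseconds above 34-bit seconds) is taken from the timestamp 64 format of the MsgPack specification rather than from the RcppMsgPack source.

```python
import struct

# A 64-bit integer just above 2**53 cannot be represented exactly as a double,
# which is the situation when a large MsgPack integer becomes an R numeric.
big = 2**53 + 1
as_double = float(big)
lost_integer_precision = int(as_double) != big

# Hypothetical timestamp: whole seconds plus nanoseconds since the epoch.
seconds, nanos = 1_545_264_000, 123_456_789

# MsgPack "timestamp 64": fixext 8 header (0xd7), extension type -1, then
# nanoseconds in the upper 30 bits and seconds in the lower 34 bits.
message = struct.pack(">BbQ", 0xD7, -1, (nanos << 34) | seconds)

# Decoding the integer payload recovers both fields exactly.
payload = struct.unpack(">Q", message[2:])[0]
decoded_seconds = payload & ((1 << 34) - 1)
decoded_nanos = payload >> 34

# Storing the same instant as a single double (as POSIXct does) rounds the
# fractional part to the nearest representable value.
as_posix_double = seconds + nanos / 1e9
recovered_nanos = round((as_posix_double - seconds) * 1e9)

print(lost_integer_precision)
print(decoded_seconds == seconds, decoded_nanos == nanos)
print(abs(recovered_nanos - nanos), "ns rounding error")
```

At this magnitude of seconds since the epoch, adjacent doubles are roughly a quarter of a microsecond apart, so sub-microsecond detail cannot survive the conversion.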
The MsgPack specification defines two container objects: arrays and maps. MsgPack arrays are sequential container objects. The length of the array is defined in its message header. Arrays can contain any other MsgPack types, including other arrays or maps.
MsgPack arrays are naturally analogous to R unnamed list objects. However, because lists have a large memory footprint, R atomic vectors (with length of 0 or greater than or equal to 2) are also allowed as input for serialization to arrays.
MsgPack maps are an ordered sequence of key and value pairs, where each key and value can be any MsgPack object. There is no requirement for unique keys. Maps do not have an analogous data type in R. Therefore, maps are implemented by creating an object of class map, which is also a data.frame with key and value columns. As input to serialization, these columns can also be lists and can therefore contain any other R object, not only a single type. The function msgpack_map is a simple helper function that takes two lists and returns a map, which can be serialized into a MsgPack object with msgpack_pack.
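The reason a key/value table is used rather than a named structure can be seen in a short sketch. Python stands in for R here, and the list of pairs plays the role of the key and value columns; the variable names and values are illustrative only.

```python
# A MsgPack map is an ordered sequence of (key, value) pairs. Keys may be of
# any type and may repeat, so a list of pairs represents it faithfully.
pairs = [("id", 1), ("tag", "a"), ("tag", "b"), (42, [1, 2, 3])]

keys = [k for k, _ in pairs]      # analogous to the "key" column
values = [v for _, v in pairs]    # analogous to the "value" column

# A dictionary keeps only one value per key, so one "tag" entry is dropped.
as_dict = dict(pairs)

print(len(pairs), len(as_dict))
```

A structure that enforces unique keys would therefore not round-trip every valid MsgPack map, whereas two parallel columns do.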
In order to support as much generality as possible in serialization and deserialization, the use of lists to represent arrays and maps is necessary. However, it is often the case in R that one wants to deal with large vectors or matrices of a single type without the computational and memory overhead of lists. Two approaches are given to deal with this type of scenario. First, msgpack_simplify can be used after a call to msgpack_unpack to recursively simplify lists to vectors when only a single type is included within a list. (For lists of characters or logicals, this may also include NULLs.) Second, msgpack_unpack can be called with the simplify=TRUE parameter, which performs the same task as msgpack_simplify within C++ and is therefore much faster. The second approach can drastically improve speed and memory usage compared to the first approach.
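The idea behind simplification can be sketched outside R as well. The function below is a hypothetical Python analogue, not the package's implementation: it walks a nested list and replaces any sub-list whose elements are all integers or all floats with a compact typed array, leaving mixed lists untouched. The handling of NULLs and of character or logical vectors, which msgpack_simplify also covers, is omitted.

```python
from array import array

def simplify(obj):
    """Recursively replace homogeneous numeric lists with typed arrays."""
    if not isinstance(obj, list):
        return obj
    items = [simplify(x) for x in obj]
    # type(x) is int (rather than isinstance) keeps booleans out of int arrays
    if items and all(type(x) is int for x in items):
        return array("q", items)    # 64-bit signed integers
    if items and all(type(x) is float for x in items):
        return array("d", items)    # IEEE 754 doubles
    return items                    # mixed or empty: keep as a generic list

nested = [[1, 2, 3], [0.5, 1.5], ["a", 1], [[4, 5], "b"]]
result = simplify(nested)
print(result)
```

Only the innermost homogeneous lists are converted, so the outer structure stays a generic list whenever its elements differ in type.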

Using MsgPack C++ headers through RcppMsgPack and Rcpp
Complex objects or data structures, such as trees, often do not fit into R data types because a tree data structure does not map nicely to an R vector, data.frame, matrix, etc. Storing such a complex object as a MsgPack message will be more performant in terms of serialization speed and memory usage.
The example below demonstrates how MsgPack headers can be integrated into a standard Rcpp workflow. In this example, a prefix tree is created for nucleotide sequences and is serialized through MsgPack to create a persistent tree object in the form of a raw vector in R. The stored tree can be saved to disk, unpacked within R directly using msgpack_unpack, or reconstructed into the prefix tree within C++ using the MsgPack C++ interface. The code below defines a structure for storing the prefix tree data and a function for constructing the tree using sequence data input from R and saving it as a MsgPack object:
The create_prefix_tree function is a C++ function that returns a raw vector, which is a serialization of the prefix tree. From R, the prefix tree can be initialized and serialized by calling the Rcpp function.
    tree <- create_prefix_tree(c("AGCT", "AGCC", "AGC", "ATG", "GACC", "GTCT"))
Because the resulting message is a standard MsgPack object, the tree can be unpacked directly in R using msgpack_unpack. Additionally, the following code reconstructs the prefix tree from the MsgPack message from within C++ using the C++ interface:
Called from within R, the tree is printed to the console:
    tree_nested_list <- msgpack_unpack(tree, simplify = FALSE)
    print_newick_tree(tree)

Writing MsgPack objects to disk and reading data from the internet
The following example shows how to create a MsgPack serialized message as a binary file and read it into R from the internet. The first step is to read the data into R as usual and serialize it using the msgpack_pack function.

Transferring large datasets from Python to R and back
To evaluate the performance of MsgPack serialization, we benchmarked the transfer of the MNIST (LeCun et al., 2010) and CIFAR-10 (Krizhevsky, 2009) datasets to and from Python and R. We compared this approach to writing and reading in CSV format, and to writing and reading using feather (Wickham et al., 2016), a cross-platform library and specification for efficiently handling tabular data. (The feather package is no longer actively developed, and will be superseded by Apache Arrow (Apache Arrow Developers, 2018), a more general cross-language development platform for working with in-memory data in a standardized language-independent columnar memory format. However, Arrow is not yet available for R.) In Python, serialization of the data was performed using the msgpack package and written to disk in binary format:
    import msgpack
    # xb holds the MsgPack-serialized bytes of the dataset
    with open("/tmp/dataset.mp", "wb") as file:
        file.write(xb)
    # read the file back and deserialize it
    with open("/tmp/dataset.mp", "rb") as file:
        x = msgpack.unpackb(file.read())
In R, we used the readBin function followed by the msgpack_unpack function to deserialize the dataset. Alternatively, one could use the helper function msgpack_read.
    xb <- readBin(con = "/tmp/dataset.mp", "raw", n = file.info("/tmp/dataset.mp")$size)
    x <- msgpack_unpack(xb, simplify = TRUE)
Subsequently, the data can be re-serialized and written to disk using the msgpack_pack function followed by writeBin:
    xb <- msgpack_pack(x)
    writeBin(xb, "/tmp/dataset.mp", useBytes = TRUE)
For CSV format, we used the fread and fwrite functions in the data.table package (Dowle et al., 2018) to read and write CSV tables in R. In Python, we used the savetxt and loadtxt functions of the numpy package (van der Walt et al., 2011). For the feather format, we used the write_feather and read_feather functions within the feather package in R and Python. All approaches were benchmarked after clearing the system cache and replicated 5 times (Figure 2).
Serialization and de-serialization speeds for MsgPack objects were similar between R and Python, with a slight speed advantage going to Python. Reading MsgPack objects was generally considerably faster than reading CSV format in both R and Python. Comparing MsgPack to feather, feather was generally faster. As feather is designed for columnar binary data, it does not use a header for every element and therefore has an advantage over the (more general) MsgPack in this context.
One surprising result is in the writing of CSVs in R. Using data.table, writing the MNIST dataset was faster than MsgPack, although not quite as fast as feather. The MNIST dataset is formatted by TensorFlow (Abadi et al., 2016) as a single floating point value per pixel. When writing floating point data as CSV, both numpy and the data.table package truncate floating point numbers by default, which causes loss of precision.

MsgPack serialization memory usage
To compare the memory usage of MsgPack to native R serialization, we serialized the datasets found in the datasets R package (Figure 3). This was done by calling the serialize function for native R serialization or the msgpack_pack function for MsgPack serialization. For compression, the memCompress function was called on the results. Some datasets contained formulas or other R-specific attributes that cannot be stored in MsgPack; these attributes were removed prior to serialization and compression.
Memory usage was generally very similar between MsgPack and native R. Although there were some differences in raw serialization, the difference seems to be less apparent after compression. For some uncompressed serializations, MsgPack significantly outperformed R in memory usage. This is attributable to efficient integer storage: MsgPack stores variable-size integers depending on the integer magnitude, and the size can be as low as 8 bits. Native R serialization uses the modified External Data Representation (XDR) standard (Srinivasan, 1995), which uses a fixed 32 bits for integers. Furthermore, attributes of each dataset are stored as lists internally, which can increase the relative overhead for small datasets, as R lists are not memory-efficient objects.
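The size difference for integers can be illustrated with a minimal encoder. The sketch below is written in Python with hypothetical helper names (pack_int, xdr_int); it follows the integer formats of the MsgPack specification and compares the result with a fixed 4-byte big-endian integer of the kind XDR uses. It is not the code used by RcppMsgPack or by R's serialize function.

```python
import struct

# (format byte, struct code, inclusive upper bound) for the unsigned formats
UNSIGNED = ((0xCC, ">B", 0xFF), (0xCD, ">H", 0xFFFF),
            (0xCE, ">I", 0xFFFFFFFF), (0xCF, ">Q", 0xFFFFFFFFFFFFFFFF))
# (format byte, struct code, inclusive lower bound) for the signed formats
SIGNED = ((0xD0, ">b", -0x80), (0xD1, ">h", -0x8000),
          (0xD2, ">i", -0x80000000), (0xD3, ">q", -0x8000000000000000))

def pack_int(n):
    """Encode an integer in the smallest MsgPack integer format that fits."""
    if 0 <= n <= 0x7F:
        return struct.pack("B", n)    # positive fixint: the byte is the value
    if -32 <= n < 0:
        return struct.pack("b", n)    # negative fixint: the byte is the value
    table = UNSIGNED if n > 0 else SIGNED
    for code, fmt, bound in table:
        if (n > 0 and n <= bound) or (n < 0 and n >= bound):
            return bytes([code]) + struct.pack(fmt, n)
    raise OverflowError("integer does not fit in 64 bits")

def xdr_int(n):
    """Encode an integer as a fixed 32-bit big-endian value."""
    return struct.pack(">i", n)

# Print value, MsgPack size in bytes, fixed-width size in bytes.
for value in (5, -1, 200, 1000, 70000):
    print(value, len(pack_int(value)), len(xdr_int(value)))
```

Small values take one to three bytes instead of four, while values above 65535 take five bytes because of the one-byte format header, so the saving depends on the distribution of magnitudes in the data.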
On the other hand, every MsgPack element has a short header of several bits, which increases its memory overhead, particularly for large vectors. This overhead is apparent for larger datasets, where R is typically more memory efficient.

Serialization of large lists
As mentioned, one situation where MsgPack serialization is much more efficient than R serialization is the handling of list objects. Because R lists have a large memory overhead, both compression and writing/reading to disk are faster using the MsgPack format. To show this, we generated a contrived list of DNA sequences as follows:
    x <- replicate(10000, {
      replicate(sample(5, 1), {
        paste(sample(c("G", "C", "A", "T"), size = sample(50, 1), replace = TRUE), collapse = "")
      })
    })
If the data structure of the object is known, using the C++ headers directly can speed up serialization and de-serialization by avoiding the logic involved in serializing generic data structures, although the speed-up is not large. Compression and writing to disk can also be performed directly within C++, for example, by using the zlib C++ library (Gailly and Adler, 2018) to perform standard DEFLATE compression. The example below illustrates MsgPack serialization and subsequent compression using the zlib library:
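A rough Python analogue of this serialize-then-compress pipeline is sketched here for illustration; it is not the C++ code used for the benchmark. It generates a comparable nested list of DNA strings, hand-encodes it with the MsgPack array and string formats, and compresses the result with the standard zlib module. The helper names pack_str and pack_array are hypothetical, and only the short formats needed for this data are handled.

```python
import random
import struct
import zlib

def pack_str(s):
    """Encode a short ASCII string in the MsgPack fixstr or str 8 format."""
    data = s.encode("ascii")
    if len(data) <= 31:
        return bytes([0xA0 | len(data)]) + data       # fixstr
    if len(data) <= 0xFF:
        return b"\xd9" + bytes([len(data)]) + data    # str 8
    raise ValueError("string too long for this sketch")

def pack_array(encoded_items):
    """Wrap already-encoded items in a MsgPack fixarray or array 16 header."""
    n = len(encoded_items)
    if n <= 15:
        header = bytes([0x90 | n])                    # fixarray
    elif n <= 0xFFFF:
        header = b"\xdc" + struct.pack(">H", n)       # array 16
    else:
        raise ValueError("array too long for this sketch")
    return header + b"".join(encoded_items)

random.seed(1)  # fixed seed so the sketch is repeatable
# 10000 entries, each holding 1 to 5 random sequences of 1 to 50 nucleotides
x = [["".join(random.choice("GCAT") for _ in range(random.randint(1, 50)))
      for _ in range(random.randint(1, 5))]
     for _ in range(10000)]

message = pack_array([pack_array([pack_str(s) for s in seqs]) for seqs in x])
compressed = zlib.compress(message)   # DEFLATE inside a zlib container

print(len(message), len(compressed))
```

Because the serialized message is one contiguous byte string, it can be handed to the compressor in a single call, with no per-element object overhead in between.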

Figure 1: A flowchart of the conversion of R objects to MsgPack objects and vice versa.

Figure 2: Transferring the MNIST dataset (left) and the CIFAR-10 dataset (right) to and from R and Python using either MsgPack, feather or CSV format.

Figure 4: Serialization and compression time of a large list of DNA sequences.

R numeric values also require no special treatment. The numeric NA value is a special case of NaN values and is serialized by its bit representation. R strings (i.e., objects of class character) are converted to MsgPack strings. Because C++ and MsgPack do not have missing values for strings, NA characters are converted into MsgPack Nil (similar to NULL in R). R logical values are converted into MsgPack bool. Again, because NA logical values do not exist in C++ or MsgPack, NA logical values are converted into MsgPack Nil. R raw vectors are converted into MsgPack bin. Raw vectors with the "EXT" integer attribute are converted into MsgPack extension types. The EXT attribute should be a positive integer, as negative values are reserved for official extensions.
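These conversion rules can be summarised in a small encoder sketch. Python is used as a stand-in, with None playing the role of an R NA string or NA logical; the helper names pack_value and pack_ext are hypothetical, and only the bin 8 and ext 8 formats of the MsgPack specification (payloads of up to 255 bytes) are handled.

```python
def pack_value(x):
    """Encode None, bool or bytes using the MsgPack nil, bool and bin 8 formats."""
    if x is None:
        return b"\xc0"        # nil: stands in for NA strings and NA logicals
    if x is True:
        return b"\xc3"        # true
    if x is False:
        return b"\xc2"        # false
    if isinstance(x, (bytes, bytearray)) and len(x) <= 0xFF:
        return b"\xc4" + bytes([len(x)]) + bytes(x)   # bin 8: header, length, data
    raise TypeError("unsupported value in this sketch")

def pack_ext(type_code, data):
    """Encode bytes as a MsgPack ext 8 value with an application type code."""
    if not 0 <= type_code <= 127:
        raise ValueError("negative type codes are reserved for official extensions")
    if len(data) > 0xFF:
        raise ValueError("payload too long for ext 8")
    # ext 8 layout: header byte, payload length, type code, payload
    return b"\xc7" + bytes([len(data), type_code]) + bytes(data)

# A logical vector with a missing value: TRUE, NA, FALSE
encoded = [pack_value(v) for v in (True, None, False)]
print(encoded)
print(pack_value(b"\x01\x02"), pack_ext(5, b"abc"))
```

Since Nil is shared by missing strings and missing logicals, the type of a missing value has to be inferred from its neighbours when a message is unpacked.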