Web access logs contain information on HTTP(S) requests and form a key part of both industry and academic explorations of human behaviour on the internet. But the preparation (reading, parsing and manipulation) of that data is just unique enough to make generalized tools unfit for the task, in terms of both programming time and processing time; these costs are compounded when dealing with the large data sets common with web access logs. In this paper we explain and demonstrate a series of packages designed to efficiently read in, parse and munge access log data, allowing researchers to handle URLs and IP addresses easily. These packages are substantially faster than existing R methods, with speedups ranging from 3-500% for file reading to 57,000% for URL parsing.
httr, ApacheLogProcessor, webreadr, readr, microbenchmark, urltools, XML, lubridate, iptools, rgeolocate, Rcpp
WebTechnologies, HighPerformanceComputing, NumericalMathematics, ReproducibleResearch, TimeSeries
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Keyes, et al., "R Packages to Aid in Handling Web Access Logs", The R Journal, 2016
BibTeX citation
@article{RJ-2016-026,
  author = {Keyes, Oliver and Rudis, Bob and Jacobs, Jay},
  title = {R Packages to Aid in Handling Web Access Logs},
  journal = {The R Journal},
  year = {2016},
  note = {https://doi.org/10.32614/RJ-2016-026},
  doi = {10.32614/RJ-2016-026},
  volume = {8},
  issue = {1},
  issn = {2073-4859},
  pages = {360-366}
}