R Packages to Aid in Handling Web Access Logs

Web access logs contain information on HTTP(S) requests and form a key part of both industry and academic explorations of human behaviour on the internet. But the preparation (reading, parsing and manipulation) of that data is just unique enough to make generalized tools unfit for the task, in terms of both programming time and processing time, and these costs are compounded by the large data sets common with web access logs. In this paper we explain and demonstrate a series of packages designed to efficiently read in, parse and munge access log data, allowing researchers to handle URLs and IP addresses easily. These packages are substantially faster than existing R methods, with gains ranging from a 3-500% speedup for file reading to a 57,000% speedup in URL parsing.

Oliver Keyes, Bob Rudis, Jay Jacobs
2016-06-13

CRAN packages used

httr, ApacheLogProcessor, webreadr, readr, microbenchmark, urltools, XML, lubridate, iptools, rgeolocate, Rcpp

CRAN Task Views implied by cited packages

WebTechnologies, HighPerformanceComputing, NumericalMathematics, ReproducibleResearch, TimeSeries

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Figures that have been reused from other sources do not fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Keyes, et al., "R Packages to Aid in Handling Web Access Logs", The R Journal, 2016

BibTeX citation

@article{RJ-2016-026,
  author = {Keyes, Oliver and Rudis, Bob and Jacobs, Jay},
  title = {R Packages to Aid in Handling Web Access Logs},
  journal = {The R Journal},
  year = {2016},
  note = {https://doi.org/10.32614/RJ-2016-026},
  doi = {10.32614/RJ-2016-026},
  volume = {8},
  issue = {1},
  issn = {2073-4859},
  pages = {360-366}
}