openair – Data Analysis Tools for the Air Quality Community

The openair package contains data analysis tools for the air quality community. This paper provides an overview of data importers, main functions, and selected utilities and workhorse functions within the package and the function output class, as of package version 0.4-14. It is intended as an explanation of the rationale for the package and a technical description for those wishing to work more interactively with the main functions or develop additional functions to support ‘higher level’ use of openair and R.

Karl Ropkins (Institute for Transport Studies, University of Leeds), David C. Carslaw (King’s College London)
2012-06-01

Large volumes of air quality data are routinely collected for regulatory purposes, but few of those in local authorities and government bodies tasked with this responsibility have the time, expertise or funds to comprehensively analyse this potential resource (Chow and Watson 2008). Furthermore, few of these institutions can routinely access the more powerful statistical methods typically required to make the most effective use of such data without a suite of often expensive and niche-application proprietary software products. This in turn places large cost and time burdens on both these institutions and others (e.g. academic or commercial) wishing to contribute to this work. In addition, such collaborative working practices can become highly restricted and polarised if data analysis undertaken by one partner cannot be validated or replicated by another because they lack access to the same licensed products.

Being freely distributed under general licence, R has the obvious potential to act as a common platform for those routinely collecting and archiving data and the wider air quality community. This potential has already been proven in several other research areas, and commonly cited examples include the Bioconductor project (Gentleman et al. 2004) and the Epitools collaboration (http://www.medepi.com/epitools). However, what is perhaps most inspiring is the degree of transparency that has been demonstrated by the recent public analysis of climate change data in R and associated open debate (http://chartsgraphs.wordpress.com/category/r-climate-data-analysis-tool/). Anyone affected by a policy decision could potentially have unlimited access to scrutinise both the tools and data used to shape that decision.

1 The openair rationale

With this potential in mind, the openair project was funded by UK NERC (award NE/G001081/1) specifically to develop data analysis tools for the wider air quality community in R as part of the NERC Knowledge Exchange programme (http://www.nerc.ac.uk/using/introduction/).

One potential issue, identified during the very earliest stages of the project, is perhaps worth emphasising for existing R users.

Most R users already have several years of either formal or self-taught experience in statistical, mathematical or computational working practices before they first encounter R. They probably first discovered R because they were already researching a specific technique that they identified as beneficial to their research and saw a reference to a package or script in an expert journal or were recommended R by a colleague. Their first reaction on discovering R, and in particular the packages, was probably one of excitement. Since then they have most likely gone on to use numerous packages, selecting an appropriate combination for each new application they undertook.

Many in the air quality community, especially those associated with data collection and archiving, are likely to be coming to both openair (Carslaw and Ropkins) and R with little or no previous experience of statistical programming. Like other R users, they recognise the importance of highly evolved statistical methods in making the most effective use of their data; but, for them, the step-change to working with R is significantly larger.

As a result, many of the decisions made when developing and documenting the openair package were shaped by this issue.

2 Data structures and importers

Most of the main functions in openair operate on a single data frame. Although it is likely that in future this will be replaced with an object class to allow data unit handling, the data frame was initially adopted for two reasons. Firstly, air quality data is currently collected and archived in numerous formats and keeping the import requirements for openair simple minimised the frustrations associated with data importation. Secondly, restricting the user to working in a single data format greatly simplifies data management and working practices for those less familiar with programming environments.

Typically, the data frame should have named fields, three of which are specifically reserved, namely: date, a field of POSIXt class time stamps, and ws and wd, numeric fields for wind speed and wind direction data. Beyond the standard naming conventions of R, there are no restrictions on the number or names of the other fields. This means that the ‘work up’ to make a new file openair-compatible is minimal: read in the data; reformat and rename date; and rename wind speed and wind direction as ws and wd, if present.
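
The ‘work up’ described above might look like the following in base R. This is a minimal sketch: the raw column names (Timestamp, WindSpeed, WindDir), the date format and the UTC time zone are assumptions made for illustration, not requirements of openair.

```r
# Hypothetical raw data as it might arrive from a logger (column names assumed)
raw <- data.frame(
  Timestamp = c("01/01/2010 00:00", "01/01/2010 01:00", "01/01/2010 02:00"),
  WindSpeed = c(2.1, 3.4, 1.8),
  WindDir   = c(270, 285, 300),
  nox       = c(120, 95, 80),
  stringsAsFactors = FALSE
)

# Reformat the time stamp as POSIXct (a POSIXt class) and store it as 'date'
raw$date <- as.POSIXct(raw$Timestamp, format = "%d/%m/%Y %H:%M", tz = "UTC")
raw$Timestamp <- NULL

# Rename wind speed and wind direction to the reserved names
names(raw)[names(raw) == "WindSpeed"] <- "ws"
names(raw)[names(raw) == "WindDir"]   <- "wd"

str(raw)
```

Any other fields, such as nox here, are carried through unchanged.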

That said, many users new to programming still found class structures, in particular POSIXt, daunting. Therefore, a series of importer functions was developed to simplify this process.

The first to be developed was import, a general purpose function intended to handle comma and tab delimited file types. It defaults to a file browser (via file.choose), and is intended to be used in the common form, e.g.:

newdata <- import()
newdata <- import(file.type = "txt") #etc

(Here, as elsewhere in openair, argument options have been selected pragmatically for users with limited knowledge of file structures or programming conventions. Note that the file.type option is the file extension "txt" that many users are familiar with, rather than either the delim from read.delim or the "\t" separator.)

A wide range of monitoring, data logging and archiving systems are used by the air quality community and many of these employ unique file layouts, including e.g. multi-column date and time stamps, isolated or multi-row headers, or additional information of different dimensions to the main data set. So, import includes a number of arguments, described in detail in ?import, that can be used to fine-tune its operation for novel file layouts.
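
To give a sense of the kind of layout involved, the sketch below handles a multi-row header and a two-column date and time stamp using base R only. It is not the implementation of import, and the file layout, site name and column names are all hypothetical.

```r
# Write a small tab-delimited file with a two-line preamble and
# separate date and time columns (layout is hypothetical)
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Site: Example Road",
  "Units: ug/m3",
  "Date\tTime\tnox\tws\twd",
  "01/01/2010\t00:00\t120\t2.1\t270",
  "01/01/2010\t01:00\t95\t3.4\t285"
), tmp)

# Skip the preamble, then read the tab-delimited body
dat <- read.delim(tmp, skip = 2, stringsAsFactors = FALSE)

# Combine the two stamp columns into a single POSIXct 'date' field
dat$date <- as.POSIXct(paste(dat$Date, dat$Time),
                       format = "%d/%m/%Y %H:%M", tz = "UTC")
dat <- dat[, c("date", "nox", "ws", "wd")]
unlink(tmp)
```

The importer arguments described in ?import are intended to spare new users from writing steps like these by hand.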

Dedicated importers have since been written for some of the file formats and data sources most commonly used by the air quality community in the UK. These operate in the common form:

newdata <- import[Name]() 

And include:

Here, we gratefully acknowledge the very significant help and support of AEAT, King’s College London’s Environmental Research Group (ERG) and CERC in the development of these importers. AEAT and ERG operate the AURN and LondonAir archives, respectively, and both specifically set up dedicated services to allow the direct download of .RData files from their archives. CERC provided extensive access to multiple generations of ADMS file structures and ran an extensive programme of compatibility testing to ensure the widest possible body of ADMS data was accessible to openair users.

3 Example data

The openair package includes one example dataset, mydata. This is a data frame of hourly measurements of air pollutant concentrations, wind speed and wind direction collected at the Marylebone (London) air quality monitoring supersite between 1st January 1998 and 23rd June 2005 (source: London Air Quality Archive; http://www.londonair.org.uk).

The same dataset is available to download as a .csv file from the openair website (http://www.openair-project.org/CSV/OpenAir_example_data_long.csv). This file can be directly loaded into openair using the import function. As a result, many users, especially those new to R, have found it a very useful template when loading their own data.

4 Manuals

Two manuals are available for use with openair. The standard R manual is available alongside the package at its CRAN repository site. An extended manual, intended to provide new users less familiar with either R or openair with a gentler introduction, is available on the openair website: http://www.openair-project.org.

5 Main functions

Most of the main functions within openair share a highly similar structure and, wherever possible, common arguments. Many in the air quality community are very familiar with graphical user interfaces and with data analysis procedures that are largely predefined by the software developers. R allows users the opportunity to really explore their data. However, a command line framework can sometimes feel frustratingly slow and awkward to users more used to a ‘click and go’ style of working. Standardising the argument structure of the main functions both encourages a more interactive approach to data analysis and minimises the amount of typing required of users more used to working with a mouse than a keyboard.

Common openair function arguments include: pollutant, which identifies the data frame field or fields to select data from; statistic, which, where data are grouped (e.g. share common coordinates on a plot), identifies the summary statistic to apply if only a single value is required; and avg.time, which, where data series are to be averaged over longer time periods, identifies the required time resolution. However, perhaps the most important of these common arguments is type, a simplified form of the conditioning term cond widely used elsewhere in R.

Rapid data conditioning is only one of a large number of benefits that R provides, but it is probably the one that has most resonance with air quality data handlers. Most can instantly appreciate its potential power as a data visualisation tool and its flexibility when used in a programming environment like R. However, many new users can struggle with the fine control of cond, particularly with regard to the application of format.POSIX* to time stamps. The type argument therefore uses an openair workhorse function called cutData, which is discussed further below, to provide a robust means of conditioning data using options such as "hour", "weekday", "month" and "year".
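
The idea behind this style of time-based conditioning can be sketched in base R as follows. The helper cut_by_time is hypothetical and is not the cutData implementation; it simply shows how a POSIXct date field can be turned into a conditioning factor with format.

```r
# Hypothetical helper: derive a conditioning factor from a POSIXct 'date' field
cut_by_time <- function(data, type = c("hour", "weekday", "month", "year")) {
  type <- match.arg(type)
  # Format codes: hour of day, weekday number (Monday = 1), month, year
  fmt <- c(hour = "%H", weekday = "%u", month = "%m", year = "%Y")[[type]]
  data[[type]] <- factor(format(data$date, fmt))
  data
}

# Small synthetic series at six-hour intervals (values are made up)
dat <- data.frame(
  date = seq(as.POSIXct("2010-01-01 00:00", tz = "UTC"),
             by = "6 hours", length.out = 8),
  nox  = c(120, 95, 80, 110, 130, 90, 70, 100)
)

# Add an 'hour' factor, then summarise nox within each level
dat <- cut_by_time(dat, "hour")
tapply(dat$nox, dat$hour, mean)
```

Wrapping the format call in a function with a small set of named options is what removes the need for the user to handle POSIXt formatting directly.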

These features are perhaps best illustrated with an example.

The openair function trendLevel is basically a wrapper for the lattice (Sarkar) function levelplot that incorporates a number of built-in conditioning and data handling options based on these common arguments. So, many users will be very familiar with the basic implementation.

The function generates a levelplot of pollutant ~ x * y | type where x, y and type are all cut/factorised by cutData and in each x/y/type case pollutant data is summarised using the option statistic.

When applied to the openair example dataset mydata in its default form, trendLevel uses x = "month" (month of year), y = "hour" (time of day) and type = "year" to provide information on trends, seasonal effects and diurnal variations in mean \(NO_{x}\) concentrations in a single output (Figure 1).

However, x, y, type and statistic can all be user-defined.
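
For readers who want to see the underlying lattice call, the following sketch builds a comparable plot by hand. It uses synthetic data in place of mydata and plain format calls in place of cutData, so it illustrates the general approach rather than the trendLevel source code.

```r
library(lattice)

# Synthetic hourly series standing in for real monitoring data
set.seed(1)
dat <- data.frame(
  date = seq(as.POSIXct("2003-01-01 00:00", tz = "UTC"),
             as.POSIXct("2004-12-31 23:00", tz = "UTC"), by = "hour")
)
dat$nox <- rlnorm(nrow(dat), meanlog = 4, sdlog = 0.5)

# Cut the time stamps into the three conditioning variables
dat$month <- factor(format(dat$date, "%m"))
dat$hour  <- factor(format(dat$date, "%H"))
dat$year  <- factor(format(dat$date, "%Y"))

# Summarise nox for every month/hour/year combination
agg <- aggregate(nox ~ month + hour + year, data = dat, FUN = mean)

# One panel per year, month on x, hour on y, colour showing mean nox
p <- levelplot(nox ~ month * hour | year, data = agg)
# print(p) draws the plot on the active graphics device
```

Replacing FUN = mean with another summary function corresponds to changing statistic, and swapping the conditioning variables corresponds to changing x, y or type.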

Figure 1: openair plot trendLevel(mydata, "nox"). Note: The seasonal and diurnal trends, high in winter months, and daytime hours, most notably early morning and evening, are very typical of man-made sources such as traffic and the general, by-panel, decrease in mean concentrations reflects the effect of incremental air quality management regulations introduced during the last decade.

The function arguments x, y and type can be set to a wide range of time/date options or to any of the fields within the supplied data frame, with numerics being cut into quantiles, characters converted to factors, and factors used as is.
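
These rules for non-date fields can be sketched as a small base R function. The helper condition_field is hypothetical, and the choice of four quantile intervals is an assumption made for this illustration.

```r
# Hypothetical helper mirroring the rules described above
condition_field <- function(x, n = 4) {
  if (is.factor(x)) return(x)              # factors are used as is
  if (is.character(x)) return(factor(x))   # characters become factors
  if (is.numeric(x)) {
    # numerics are cut at the quantiles (n intervals; n = 4 is assumed)
    breaks <- quantile(x, probs = seq(0, 1, length.out = n + 1), na.rm = TRUE)
    return(cut(x, breaks = unique(breaks), include.lowest = TRUE))
  }
  stop("unsupported field type")
}

# Made-up wind speeds and site labels
ws   <- c(0.5, 1.2, 2.8, 3.1, 4.7, 5.0, 6.3, 8.9)
site <- c("roadside", "background", "roadside", "background")

table(condition_field(ws))
levels(condition_field(site))
```

Because the breaks are taken from the data, each interval holds a similar number of observations, which keeps conditioned panels comparably populated.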

Similarly statistic can also be either a pre-coded option, e.g. "mean", "max", etc, or be a user defined function. This ‘tiered approach’ provides both simple, robust access for new users and a very flexible structure for more experienced R users. To illustrate this point, the default trendLevel plot (Figure 1) can be generated using three equivalent calls:

   # predefined
   trendLevel(mydata, statistic = "mean")

   # using base::mean
   trendLevel(mydata, statistic = mean)

   # using local function
   my.mean <- function(x) {
     x <- na.omit(x)
     sum(x) / length(x)
   }
   trendLevel(mydata, pollutant = "nox",
              statistic = my.mean)

The type argument can accept one or two options, depending on function, and in the latter case strip labelling is handled using the latticeExtra (Sarkar and Andrews) function useOuterStrips.

Figure 2: openair plots generated using scatterPlot(mydata, "nox", "no2", ...) and method = "scatter" (default; left), "hexbin" (middle) and "density" (right).

The other main functions include:

6 Utilities and workhorse functions

The openair package includes a number of utilities and workhorse functions that can be directly accessed by users and therefore used more generally. These include:

Figure 4: Trivial example of the use of openColourKey with a lattice plot.

Figure 5: Trivial example of the use of quickText outside openair.

7 Output class

Many of the main functions in openair return an object of "openair" class, e.g.:

#From:
[object] <- openair.function(...)

#object structure
[object] #list[S3 "openair"]
    $call  [function call]
    $data  [data.frame generated/used in plot]
           [or list if multiple part]
    $plot  [plot from function]
           [or list if multiple part]

#Example
ans <- windRose(mydata)
ans

openair object created by:
        windRose(mydata = mydata)

this contains:
        a single data frame:
        $data [with no subset structure]
        a single plot object:
        $plot [with no subset structure]

Associated generic methods (head, names, plot, print, results, summary) have a common structure:

#method structure for openair generics
[generic method].[class] #method.name
     ([object], [class-specific options], 
     [method-specific options]) #options 

As would be expected, most .openair methods work like the associated .default methods, and object and method-specific options are based on those of the .default method. Typically, openair methods return outputs consistent with the associated .default method unless either $data or $plot has multiple parts, in which case outputs are returned as lists of data frames or plots, respectively. The main class-specific option is subset, which can be used to select specific sub-data or sub-plots if these are available. The local method results extracts the data frames generated during the plotting process. Figure 6 shows some trivial examples of current usage.
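
The general S3 pattern described here can be sketched with a toy class. The class name "toyair", the constructor new_toyair and the function name somePlot are hypothetical; the generic is called results after the method described above, but the code is an illustration and not the openair source.

```r
# Constructor for a toy object with the same three components
new_toyair <- function(call, data, plot = NULL) {
  structure(list(call = call, data = data, plot = plot), class = "toyair")
}

# print method: report what the object holds rather than dumping it
print.toyair <- function(x, ...) {
  cat("toyair object created by:\n\t", deparse(x$call), "\n")
  if (is.data.frame(x$data)) {
    cat("contains a single data frame\n")
  } else {
    cat("contains data frames:", paste(names(x$data), collapse = ", "), "\n")
  }
  invisible(x)
}

# results generic and method, with a 'subset' option for multi-part data
results <- function(object, ...) UseMethod("results")
results.toyair <- function(object, subset = "all", ...) {
  if (is.data.frame(object$data) || identical(subset, "all")) {
    return(object$data)
  }
  object$data[[subset]]
}

# A two-part object: one data frame per conditioning variable (values made up)
ans <- new_toyair(
  call = quote(somePlot(mydata)),
  data = list(hour  = data.frame(hour = 0:1, nox = c(80, 75)),
              month = data.frame(month = 1:2, nox = c(120, 110)))
)
ans
results(ans, subset = "hour")
```

With a single-part object the method returns the data frame directly, so code written against the simple case keeps working when subset is ignored.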

Figure 6: Trivial examples of "openair" object handling, with the outputs of plot(ans) and plot(ans, subset = "hour") shown as inserts right top and right bottom, respectively.

8 Conclusions and future directions

As with many other R packages, the feedback process associated with users and developers working in an open-source environment means that openair is subject to continual optimisation. As a result, openair will undoubtedly continue to evolve through future versions. Obviously, the primary focus of openair will remain the development of tools to help the air quality community make better use of their data. However, as part of this work we recognise that there is still work to be done.

One area that is likely to undergo significant updates is the use of classes and methods. The current S3 implementation of the output class is crude, and future options currently under consideration include improvements to plot.openair and print.openair, the addition of an update.openair method (for reworking openair plots), the release of openairApply (a currently un-exported wrapper function for apply-type operations with openair objects), and the migration of the object to S4.

In light of the progress made with the output class, we are also considering the possibility of replacing the current simple data frame input with a dedicated class structure, as this could provide access to extended capabilities such as measurement unit management.

One other particularly exciting component of recent work on openair is international compatibility. The initial focus of the openair package was very much on the air quality community in the UK. However, distribution of the package through CRAN has meant that we now have an international user group with members in Europe, the United States, Canada, Australia and New Zealand. This is obviously great. However, it has also brought with it several challenges, most notably in association with local time stamp and language formats. Here, we gratefully acknowledge the help of numerous users and colleagues who bug-tested and provided feedback as part of our work to make openair less ‘UK-centric’. We will continue to work on this aspect of openair.

We also gratefully acknowledge those in our current user group who were less familiar with programming languages and command lines but who took a real ‘leap of faith’ in adopting both R and openair. We will continue to work to minimise the severity of the learning curves associated with both the uptake of openair and the subsequent move from using openair in a standalone fashion to its much more efficient use as part of R.



CRAN packages used

openair, lattice, latticeExtra, hexbin, grDevices, mgcv, stats, RColorBrewer

CRAN Task Views implied by cited packages

Bayesian, Econometrics, Environmetrics, Hydrology, MixedModels, Spatial, SpatioTemporal


References

D. Carr, N. Lewin-Koh and M. Maechler. hexbin: Hexagonal Binning Routines. R package version 1.26.0, URL http://CRAN.R-project.org/package=hexbin.
D. C. Carslaw and K. Ropkins. openair: Open-source tools for the analysis of air pollution data. R package version 0.4-14, URL http://www.openair-project.org/.
D. C. Carslaw and S. D. Beevers. Estimations of road vehicle primary NO2 exhaust emission fractions using monitoring data in London. Atmospheric Environment 39(1), 2005.
D. C. Carslaw, S. D. Beevers, K. Ropkins and M. C. Bell. Detecting and quantifying aircraft and other on-airport contributions to ambient nitrogen oxides in the vicinity of a large international airport. Atmospheric Environment 40(28), 2006.
J. C. Chow and J. G. Watson. New Directions: Beyond compliance air quality measurements. Atmospheric Environment 42, 2008.
R. B. Cleveland, W. S. Cleveland, J. E. McRae and I. Terpenning. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics 6, 1990.
R. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. H. Yang and J. Zhang. Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology 5: R80, 2004. URL http://www.bioconductor.org/.
D. Helsel and R. Hirsch. R: Statistical methods in water resources. US Geological Survey, URL http://pubs.usgs.gov/twri/twri4a3/.
R. Henry, G. A. Norris, R. Vedantham and J. R. Turner. Source Region Identification Using Kernel Smoothing. Environmental Science & Technology 43(11), 2009.
R. M. Hirsch, J. R. Slack and R. A. Smith. Techniques of trend analysis for monthly water-quality data. Water Resources Research 18(1), 1982.
H. R. Kunsch. The jackknife and the bootstrap for general stationary observations. Annals of Statistics 17(3), 1989.
C. A. McHugh, D. J. Carruthers and H. A. Edmunds. ADMS and ADMS-Urban. International Journal of Environment & Pollution 8: 438–440, 1997.
E. Neuwirth. RColorBrewer: ColorBrewer palettes. R package version 1.0-5, URL http://CRAN.R-project.org/package=RColorBrewer/.
D. Sarkar and F. Andrews. latticeExtra: Extra Graphical Utilities Based on Lattice. R package version 0.6-18, URL http://CRAN.R-project.org/package=latticeExtra.
D. Sarkar. lattice: Lattice Graphics. R package version 0.18-5, URL http://r-forge.r-project.org/projects/lattice/.
D. Sarkar. Lattice: Multivariate Data Visualization with R. Springer, URL http://lmdvr.r-forge.r-project.org/.
S. N. Wood. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC, 2006.
S. N. Wood. Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association 99, 2004.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Ropkins & Carslaw, "openair -- Data Analysis Tools for the Air Quality Community", The R Journal, 2012

BibTeX citation

@article{RJ-2012-003,
  author = {Ropkins, Karl and Carslaw, David C.},
  title = {openair -- Data Analysis Tools for the Air Quality Community},
  journal = {The R Journal},
  year = {2012},
  note = {https://rjournal.github.io/},
  volume = {4},
  issue = {1},
  issn = {2073-4859},
  pages = {20-29}
}