Flexible R Functions for Processing Accelerometer Data , with Emphasis on NHANES 2003 – 2006 by

Accelerometers are a valuable tool for measuring physical activity (PA) in epidemiological studies. However, considerable processing is needed to convert time-series accelerometer data into meaningful variables for statistical analysis. This article describes two recently developed R packages for processing accelerometer data. The package accelerometry contains functions for performing various data processing procedures, such as identifying periods of non-wear time and bouts of activity. The functions are flexible, computationally efficient, and compatible with uniaxial or triaxial data. The package nhanesaccel is specifically for processing data from the National Health and Nutrition Examination Survey (NHANES), years 2003–2006. Its primary function generates measures of PA volume, intensity, frequency, and patterns according to user-specified data processing methods. This function can process the NHANES 2003–2006 dataset in under one minute, which is a drastic improvement over existing software. This article highlights important features of packages accelerometry and nhanesaccel and demonstrates typical usage for PA researchers.


Introduction
The National Health and Nutrition Examination Survey (NHANES) is one of a select few national studies to include objective physical activity (PA) monitoring with accelerometers (Troiano et al., 2008;Hagstromer et al., 2007;Colley et al., 2011).In the 2003-2004 and 2005-2006 waves of the study, a total of 14,631 Americans age ≥ 6 years wore uniaxial ActiGraph GT1M accelerometers at the hip for seven consecutive days.The result is a rich dataset for describing PA levels in Americans and assessing correlates of PA.
In 2007, SAS programs for processing the accelerometer data from NHANES were made available on the National Cancer Institute's (NCI) website (National Cancer Institute, 2007).Using these programs, researchers around the world could access the time-series accelerometer data from NHANES and run SAS programs to derive meaningful PA variables to use in statistical analyses.Over the past seven years, the NCI's SAS programs have facilitated great progress in understanding the distribution of PA behaviors in America (Troiano et al., 2008;Tudor-Locke et al., 2010), and have helped to identify numerous cross-sectional associations between PA and health outcomes (Healy et al., 2011;Sisson et al., 2010;Carson and Janssen, 2011).
While the NCI's SAS programs provide several indicators of overall PA and moderate-to-vigorous PA (MVPA), researchers are increasingly writing customized scripts in programs such as MATLAB (The MathWorks, Inc., 2014), LabVIEW (National Instruments, 2014), and R for more flexibility in data processing (Tudor-Locke et al., 2011).This approach allows researchers to generate more descriptive variables, such as time spent in extended periods of sedentariness or hour-by-hour count averages (Bankoski et al., 2011); to implement non-wear algorithms that might better discriminate between true non-wear and sedentary time (Choi et al., 2012); and to apply activity intensity cut-points specific to certain populations, such as children and older adults (Copeland and Esliger, 2009).
As the level of complexity in data processing methods increases, the learning curve for researchers interested in working with accelerometer data also increases.More generally, researchers with data from studies other than NHANES are tasked with writing their own software to convert the time-series data to a form suitable for statistical analysis.Such efforts are time-consuming and probably inefficient, as many different researchers are likely to write their own code for very similar purposes.The size of accelerometer files causes additional complexity.For example, the accelerometer files in NHANES 2003-2006 are 3.87 GB and contain over 1.25 billion data points.
The purpose of this paper is to describe recently developed software for processing accelerometer data.The software takes the form of two R packages: accelerometry for general functions (Van Domelen, 2014), and nhanesaccel for processing NHANES 2003-2006data (Van Domelen et al., 2014).

Goal Description 1
Allow flexibility in data processing (non-wear algorithm, bouts, etc.).

2
Able to process data from uniaxial or triaxial accelerometers worn at the hip, wrist, or other position.

3
Run efficiently so that large datasets can be processed quickly.

4
Generate a wide variety of physical activity variables, including those that can be used to study weekday/weekend and time-of-day effects.

5
Use reasonable defaults for function inputs so that users with minimal knowledge of accelerometry can still process their data adequately.

R package 'accelerometry' Overview
The package accelerometry has a collection of functions for performing various tasks in accelerometer data processing, as well as composite functions for converting minute-to-minute uniaxial or triaxial accelerometer data into meaningful PA variables.This package is intended for researchers who wish to process accelerometer data from studies other than NHANES.

Design goals
The package was developed according to the design goals in Table 1.Computing speed was considered particularly important because users may have data on thousands or tens of thousands of participants.Such a large volume of data can take hours to process using inefficient software.

Implementation
Using R with embedded C++ code was a natural choice to meet goals 3 and 6, and the functions in the package were written to achieve goals 1-2 and 4-5.The package contains eleven functions for performing various steps in accelerometer data processing.These functions are summarized in Table 2.
For seven of the functions, C++ code was added via the Rcpp package (Eddelbuettel and François, 2011;Eddelbuettel, 2013) to improve efficiency.
Most users will only interact with accel.process.unior accel.process.tri.These functions internally call the other functions, and have inputs to control the non-wear and activity bout algorithms, count cut-points for activity intensities, and other options.Notably, accel.process.triallows the user to choose which accelerometer axis to use for non-wear detection, activity bouts, and intensity classification.This is important because vertical-axis counts are almost always used for intensity classification, but triaxial data is more sensitive to small movements and therefore well-suited for non-wear detection (Choi et al., 2012).Version 2.2.4 of accelerometry is currently available on CRAN.The package requires R version 3.0.0or above due to its dependency on Rcpp, which itself requires R 3.0.0or above.A data dictionary for the variables generated by the functions in accelerometry is available online (Van Domelen, 2013).

Examples
Suppose a researcher wishes to process triaxial data collected from an ActiGraph GT3X+ device (ActiGraph, Pensacola, FL) worn at the hip for a particular study participant.The first step is to load a data file with the accelerometer count values in one-minute intervals ("epochs") into R.After transferring data from the accelerometer to a computer via USB cable, the researcher can use Actigraph's ActiLife software (ActiGraph, 2014) to create a .csvfile and load it into R using read.csv,or use the function gt3xAccFile in the R package pawacc (Geraci, 2014;Geraci et al., 2012) to load the accelerometer file directly into R. Notably, data collected in shorter than one-minute epochs can still be processed using accelerometry, it just has to be converted to one-minute epochs first.This can The R Journal Vol.6/2, December 2014 ISSN 2073-4859  Choi et al. 2011a).
Once the data is in R, the user should have a four-column matrix or data frame where the columns represent counts in the vertical axis, anteroposterior (AP) axis, and mediolateral (ML) axis, as well as steps.A sample matrix with this format is included in the accelerometry package.The next step is to install accelerometry and call accel.process.tri.The following code installs the package and processes the sample dataset in two ways: first using default settings, and then using a different algorithm for non-wear detection, and requesting a larger set of PA variables.
Looking at dailyPA1, it is interesting that counts_vert and counts_ap were very similar on all days except day 6, when counts_ap was more than three times greater than counts_vert.It seems likely that the participant engaged in some non-walking activity on this day that involved more anteroposterior (i.e.front-back) motion than vertical motion.This is supported by the participant having an unusually low number of steps on this day, but only slightly lower than average counts_ap.
A 60-minute non-wear algorithm based on only vertical-axis counts was used to create dailyPA1, while a 90-minute algorithm based on triaxial counts was used to create dailyPA2.The first algorithm identified some non-wear time on days 5 and 6, while the second non-wear algorithm did not mark any time on any day as non-wear.Some disagreement between different non-wear algorithms is to be expected, and different researchers may have different preferences based on validation studies or their own experiences processing accelerometer data.Regardless, it makes little difference which algorithm is used for this particular participant.Average counts per minute during wear time (cpm_vert) for the seven days of monitoring, a standard measure of total PA, was 142.7 using the first non-wear algorithm and 142.0 using the second.
Adding return.form= 1 to the function calls to accel.process.triwould result in daily averages being returned rather than separate values for each day of monitoring.This format is usually more useful for statistical analysis.
For further information on accelerometry, the package documentation includes a description of all of the functions and their inputs and has some additional examples (Van Domelen, 2014).The code from this section is also available in the package as a demo that can be executed by running demo("acceljournal").

R package 'nhanesaccel' Overview
The package nhanesaccel is intended for researchers who want to use the objectively measured PA data in NHANES 2003NHANES -2006.Its primary function, nhanes.accel.process,converts minute-to-minute accelerometer data from NHANES participants into useful PA variables for statistical analysis.

Design goals
Design goals for nhanesaccel are shown in Table 3. Speed and flexibility were primary considerations for this package.

Implementation
The main function in nhanesaccel is nhanes.accel.process.This function relies heavily on the functions in accelerometry, and has similar inputs as accel.process.unifor controlling data processing methods.There are 31 function inputs controlling methods for non-wear detection, activity bouts, inclusion criteria, and the number of PA variables generated.These are described in the package documentation files (Van Domelen et al., 2014).A data dictionary for the variables generated by nhanes.accel.process is available online (Van Domelen, 2013).
The use of C++ code in the functions in accelerometry made it possible for nhanes.accel.process in nhanesaccel to process the full dataset of 14,631 participants in NHANES 2003NHANES -2006 in less than one minute.By including NHANES data in the package, there is no need for the user to download the large raw data files from the NHANES website.Adjusted sample weights, which are needed to ensure that the subset of participants with usable accelerometer data are weighted to match the demographics of the United States as a whole, are calculated by the function nhanes.accel.reweight.

Examples
Users can install nhanesaccel via the install.packagesfunction in R. Mac and Linux users and Windows users not running the most recent version of R (3.1 as of November 28, 2014) need to set type = "source" to install from source files rather than binaries.After a short period of time, the function returns a data frame named nhanes1 with daily averages for basic PA variables.Alternatively, the user may want to write a .csvfile that can be imported into a different statistical software package for analysis.This can be achieved by adding the input write.csv= TRUE to the nhanes.accel.processfunction call.Unless otherwise specified via the directory input, the .csvfile would be written to the current working directory, which can be seen by running getwd() and set by running setwd("<location>").
At this point the user could close R and switch to his or her preferred software package for statistical analysis.For example, the user could load the .csvfile into SAS, merge in the Demographics files (which can be downloaded from the NHANES website), and use survey procedures in SAS to test for associations between demographic variables and PA.
Here we briefly illustrate typical data processing and analysis in R. Suppose we wish to test the hypothesis that American youth are more active on weekdays than on weekend days.First, we process The data frame nhanes2 has 14,631 rows and 17 columns.Each row summarizes the PA of one NHANES participant.The first few variables are participant ID (seqn), NHANES year (wave; 1 for 2003-2004, 2 for 2005-2006), number of valid days of monitoring (valid_days, valid_week_days, and valid_weekend_days), and whether the participant has valid data for statistical analysis (include).Then we have daily averages for accelerometer wear time (valid_min), counts (counts), and counts per minute of wear time (cpm), for all valid days and for weekdays (wk_ prefix) and weekend days (we_ prefix) separately.
The variable cpm is a standard measure of total PA.We will compare weekday cpm (wk_cpm) and weekend cpm (we_cpm) in participants age 6 to 18 years to address our hypothesis.To get age for each participant, we merge a demographics dataset (included in nhanesaccel) to the nhanes2 data frame.Then we calculate for each participant the percent difference between wk_cpm and we_cpm.Positive values for cpm_diff indicate greater cpm on weekdays than on weekend days.
# Load demographics data and merge with nhanes2 data(dem) nhanes2 <-merge(x = nhanes2, y = dem) # Calculate percent difference between weekday and weekend CPM for each participant nhanes2$cpm_diff <-(nhanes2$wk_cpm -nhanes2$we_cpm) / ((nhanes2$wk_cpm + nhanes2$we_cpm) / 2) * 100 Now we are almost ready for statistical analysis.Because NHANES uses a complex multi-stage probability sampling design, analyzing the data as a simple random sample would result in incorrect inference.We need to use functions in the survey package (Lumley, 2014(Lumley, , 2004) ) rather than base R functions like t.test and glm.For those who are not familiar with concepts in survey analysis, reference materials and tutorials are available (Lumley, 2004(Lumley, , 2012(Lumley, , 2010)).
Here we create a survey object using design features of NHANES.
# Create survey object called hanes hanes <-svydesign(id = ~sdmvpsu, strata = ~sdmvstra, weight = ~wtmec4yr_adj, data = nhanes2, nest = TRUE) Next we generate a figure comparing weekday and weekend PA in participants age 6 to 18 years.We use functions in the survey package to calculate mean (95% confidence interval) for cpm_diff in one-year age increments, then plot the results.
Going back to the data processing step, users may wish to implement some non-default settings.To illustrate, the following code generates the full set of PA variables (203 variables), requires at least four valid days of monitoring, and uses a 90-minute rather than 60-minute minimum window for non-wear time.
# Process NHANES 2003-2006 data with four non-default inputs nhanes3 <-nhanes.accel.process(brevity= 3, valid.days= 4, nonwear.window= 90, weekday.weekend= TRUE) The next example provides code to replicate the methods used in the NCI's SAS programs to process NHANES 2003NHANES -2006 data. data.In the first function call, we explicitly set each function input as needed to replicate the NCI's methods; in the second, we use the nci.methods argument as a shortcut.
# Specify count cut-points for moderate and vigorous intensity in youth youthmod <-c(1400, 1515, 1638, 1770, 1910, 2059, 2220, 2393, 2580, 2781, 3000, 3239) youthvig <-c(3758, 3947, 4147, 4360, 4588, 4832, 5094, 5375, 5679, 6007, 6363,   While we prefer using the default function inputs, there are some compelling reasons to use the NCI's methods instead.Doing so is simple, easy to justify in manuscripts, and ensures consistency with prior studies that have used the NCI's SAS programs.On the other hand, certain aspects of the NCI's methods are suboptimal.For example, the non-wear algorithm is applied to each day of monitoring separately, which results in misclassification of non-wear time as sedentary time when participants remove the accelerometer between 11 pm and midnight to go to sleep (Winkler et al., 2012).Of course, it is up to the user to decide whether to use the default settings, the NCI's methods, or some other combination of function inputs.Certainly many different approaches can be justified.
The code from this section of the paper is included in nhanesaccel as a demo that can be executed by running demo("hanesjournal").

Processing times vs. NCI's SAS programs
One of the challenges of working with the NHANES accelerometer dataset is its size.A program written in R or MATLAB using standard loops to generate variables for nearly 15,000 participants can take hours to process.Efficiency was therefore a high priority in the development of nhanesaccel.
Processing times for nhanesaccel and the NCI's SAS programs were compared using R version 3.1.0and SAS Version 9.4, respectively, on a Dell OptiPlex 990 desktop computer with Intel Core i7-2600 CPU at 3.4 GHz and 8 GB of RAM.The system.time function in R was used to record processing times for nhanesaccel, and "real time" output from the SAS log file was used for the NCI's programs.In both cases, elapsed times rather than CPU times were recorded.Results are shown in Table 4.
The package nhanesaccel was 174 times faster than the NCI's SAS programs under default settings, and 87 times faster using settings that replicate the NCI's methods.For nhanesaccel, the "NCI methods" settings have longer processing times than the defaults because the NCI's algorithms for non-wear and activity bout detection are more computationally intensive than the defaults.
Equivalence between nhanesaccel (NCI methods) and NCI's SAS programs was confirmed by direct comparison of the files generated.The only difference was that the NCI's SAS programs had data on nine participants with potentially unreliable data, whereas nhanesaccel excluded those participants entirely.Excluding this data is considered acceptable by the NCI (National Cancer Institute, 2007).
Both the NCI's SAS programs and nhanesaccel require some one-time steps prior to processing.The SAS programs require downloading and unzipping the data files, converting from .xptformat to SAS datasets, and downloading four SAS scripts and making minor changes to each.This process takes about ten minutes.As for nhanesaccel, installation can take a minute or two due to its large size.
Processing the full NHANES 2003-2006 using nhanesaccel takes approximately twice the time to process just NHANES 2003-2004.In five trials, mean (standard deviation) processing times were 17.57 (0.01) s using defaults and 40.26 (0.06) s using the NCI's methods.

Discussion
The nhanesaccel package should increase accessibility to the objectively measured PA data in NHANES 2003-2006.The software is free, efficient, and allows flexibility in data processing without writing original code to process over a billion data points.Proficiency in R is not required, as researchers can obtain a file for statistical analysis in SAS or another software package with just three lines of R code (install package, load package, call nhanes.accel.process).For researchers with data from studies other than NHANES, the functions in accelerometry should greatly simplify the process of converting time-series count data into meaningful PA variables.
There are several limitations of the packages presented in this article.First, the functions are not immediately useful for processing accelerometer data collected in less than one-minute epochs.Researchers with 1-s, 30-Hz, or 80-Hz data could convert the data to one-minute epochs and then use the functions in accelerometry, but doing so would sacrifice any additional information contained in the higher resolution data.Also, the default settings for the functions in accelerometry and nhanesaccel were chosen for data collected from accelerometers worn at the hip.Users who wish The R Journal Vol.6/2, December 2014 ISSN 2073-4859 to process wrist data could still use the functions, but would have to take care to adjust function parameters, particularly int.cuts.
Another limitation is that the NHANES 2003-2006 dataset is 8-11 years old, and the data has been available for about six years.However, NHANES 2003-2006 remains the most recent study with objectively measured PA on a nationally representative sample of Americans.Although population activity levels may have changed since [2003][2004][2005][2006], associations between PA and outcomes should be relatively constant over time.Additionally, data on deaths will continue to be updated, which will enable future studies on PA and 5-or 10-year mortality.There are likely still many opportunities for useful contributions to the PA literature from this dataset.
There are several other R packages for processing accelerometer data.The PhysicalActivity package (Choi et al., 2011b) has functions for converting accelerometer data to longer epochs (dataCollapser), plotting accelerometer data (plotData), and implementing a non-wear algorithm (wearingMarking).Briefly, this algorithm uses a 90-rather than 60-minute window, and allows one or two non-zero count values provided that they are surrounded by 30 minutes of zero counts on both sides.Variants of this non-wear algorithm can be implemented by adjusting function inputs.The GGIR package (van Hees et al., 2014) has functions for processing high-frequency triaxial data from certain model accelerometers, and has a function for generating summary statistics.The variables that GGIR generates are mostly indicators of overall PA, such as mean acceleration over the full day or during time periods within the day.Finally, pawacc has functions for converting to longer epochs, detecting non-wear and activity bouts, generating summary statistics, and plotting data.The pawacc package can also be used to read accelerometer files into R without using ActiGraph's ActiLife software.To our knowledge, nhanesaccel is the only shared software currently available for processing the data from NHANES 2003-2006 other than the NCI's SAS programs.
Looking forward, there is a great need for software to process high-frequency accelerometer data.In particular, wrist-worn accelerometers are being used in NHANES 2011-2014, with the devices recording triaxial data at 80-Hz (Troiano, 2012).Analyzing this data will be challenging, as there will be over 145 million data points per participant, or about two trillion data points total.Aside from the sheer volume of data, methods for generating meaningful activity variables from high-frequency triaxial data are still being developed.It will be interesting to see how this data will be made available to the public, and whether the NCI will provide data processing software similar to the SAS programs for NHANES [2003][2004][2005][2006].Provided that the raw data is made available in some form, we intend to add functions to nhanesaccel or develop a separate R package to process NHANES 2011-2014 data.
We hope that accelerometry and nhanesaccel will make it easier for researchers to work with accelerometer data, and particularly the NHANES 2003-2006 dataset.Researchers who have any problems using either package, or have suggestions for additional features, should not hesitate to contact us.

Figure 1 :
Figure 1: Mean (95% confidence interval) for percent difference between weekday and weekend CPM for NHANES 2003-2006 participants ages 6 to 18 years.Positive values indicate greater physical activity on weekdays.

Table 1 :
Design goals for R package accelerometry.

Table 2 :
Brief description of functions in accelerometry, and whether each function uses C++ code.

Table 3 :
Design goals for R package nhanesaccel.verysimilarlyto the NCI's SAS program reweight.pam(NationalCancerInstitute, 2007), and is called internally by nhanes.accel.process.2.1.1 of nhanesaccel is currently available on R-Forge.Due to its size (88.1 MB) it cannot be hosted on CRAN.The package requires R version 3.0.0or above. operates accelerometer data from NHANES 2003-2006 using three non-default inputs.The first two inputs require that participants have at least one valid weekday and one valid weekend day (rather than just one valid day overall), and the third requests that PA averages are calculated for weekdays and weekend days separately. the

Table 4 :
Mean (standard deviation)for time to process NHANES 2003-2004 data using nhanesaccel and using the NCI's SAS programs, based on five trials for each method.