Retrieval and analysis of Eurostat open data with the eurostat package

The increasing availability of open statistical data resources is providing novel opportunities for research and citizen science. Efficient algorithmic tools are needed to realize the full potential of the new information resources. We introduce the eurostat R package that provides a collection of custom tools for the Eurostat open data service, including functions to query, download, manipulate, and visualize these data sets in a smooth, automated and reproducible manner. The online documentation provides detailed examples on the analysis of these spatio-temporal data collections. This work provides substantial improvements over the previously available tools, and has been extensively tested by an active user community. The eurostat R package contributes to the growing open source ecosystem dedicated to reproducible research in computational social science and digital humanities.


Introduction
Eurostat, the statistical office of the European Union, provides a rich collection of data through its open data service, including thousands of data sets on European demography, economics, health, infrastructure, traffic and other topics. The statistics are often available at fine geographical resolution and include time series spanning several years or decades.
The availability of algorithmic tools to access and analyse open data collections can greatly benefit reproducible research (Gandrud, 2013; Boettiger et al., 2015), as complete analytical workflows spanning from raw data to final publications can be made fully replicable and transparent. Dedicated software packages help to simplify, standardize, and automate analysis workflows, greatly facilitating reproducibility, code sharing, and efficient data analytics. The code for data retrieval needs to be customized to specific data sources to accommodate variations in raw data formats, access details, and typical use cases, so that end users can avoid repetitive programming tasks and save time. A number of packages for governmental and other sources have been designed to meet these demands, including packages for the Food and Agricultural Organization (FAO) of the United Nations (FAOSTAT; Kao et al. 2015), World Bank (WDI; Arel-Bundock 2013), national statistics authorities (pxweb; Magnusson et al. 2014), OpenStreetMap (osmar; Eugster and Schlesinger 2012) and many other sources.
A dedicated R package for the Eurostat open data has been missing. The eurostat package fills this gap. It expands the capabilities of our earlier statfi (Lahti et al., 2013a) and smarterpoland (Biecek, 2015) packages. Since its first CRAN release in 2014, the eurostat package has been developed by several active contributors based on frequent feedback from the user community. We now report a mature version that has been improved and tested by multiple users, and applied in several case studies by us and others (Kenett and Shmueli, 2016). The Eurostat database has three services for programmatic data access: a bulk download, json/unicode, and an SDMX web service. We provide targeted methods for the first two in the eurostat package; generic tools for the SDMX format are available via the rsdmx package (Blondel, 2017). The bulk download provides single files, which is convenient and fast for retrieving major parts of the data. The more lightweight json methods allow data subsetting before download and may be preferred in more specific retrieval tasks, but the query size is limited to 50 categories. Finally, the package can be used to download custom administrative boundaries by EuroGeographics© that allow seamless visualization of the data on the European map.
In summary, the eurostat package provides custom tools for Eurostat open data. Key features such as caching, date formatting, tidy data principles (Wickham, 2014), and the tibble data format support seamless integration with other tools for data manipulation and analysis. This article provides an overview of the core functionality in the current CRAN release version (3.1.1). Comprehensive documentation and source code are available via the package GitHub site.

Search and download commands
To install and load the CRAN release version, type the standard installation commands in R.
> install.packages("eurostat")
> library("eurostat")
The database table of contents is available online, or can be downloaded in R with get_eurostat_toc(). A more focused search is provided by the search_eurostat() function.
> query <- search_eurostat("road accidents", type = "table")
This searches for data on road accidents. The type argument restricts the search to a selected data set type, one of three hierarchical levels: a 'table' resides in a 'dataset', which is in turn stored in a 'folder'. Values in the code column of the search_eurostat() function output provide data identifiers for subsequent download commands. Alternatively, these identifiers can be browsed at the Eurostat open data service; check the codes in the Data Navigation Tree listed after each dataset in parentheses. Let us look at the data identifier and title for the first entry of the query data.
[1] "People killed in road accidents"
Let us next retrieve the data set with this identifier.
> dat <- get_eurostat(id = "tsdtr420", time_format = "num")
Here we used the numeric time format, as it is more convenient for annual time series than the default date format. The transport statistics returned by this function call (Table 1) could be filtered before download with the filters argument, where the list names and values refer to Eurostat variable and observation codes, respectively. To retrieve transport statistics for specific countries, for instance, pass the country codes as the geo entry of the filters list in get_eurostat().
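The search, identification and filtered download steps described above can be combined into a single workflow. The following is a minimal sketch rather than part of the package: the helper name fetch_first_hit() and the example country selection are our own illustrative assumptions, and the function needs the eurostat package and network access when it is called.

```r
# Sketch of the search -> identify -> filtered download workflow.
# fetch_first_hit() is a hypothetical helper, not a eurostat function.
# Calling it requires the eurostat package and network access.
fetch_first_hit <- function(pattern, countries) {
  # Search only among data sets of type 'table'
  query <- eurostat::search_eurostat(pattern, type = "table")
  if (nrow(query) == 0) {
    stop("No tables matched the search pattern: ", pattern)
  }

  # The code column holds the identifier used by the download functions
  id <- as.character(query$code[[1]])
  message("Retrieving '", query$title[[1]], "' (", id, ")")

  # Names of the filter list are Eurostat variable codes (here geo),
  # values are the observation codes to keep (here country codes)
  eurostat::get_eurostat(id, filters = list(geo = countries))
}

# Example call (not run here):
# t1 <- fetch_first_hit("road accidents", c("UK", "SK", "FR", "PL", "ES", "PT"))
```

Filtering before download keeps the transfer small, which matters because the json service limits the query size.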

Utilities
Many entries in Table 1 are not readily interpretable, but a simple call to label_eurostat(dat) converts the original identifier codes into human-readable labels (Table 2) based on translations in the Eurostat database. Labels are available in English, French and German.
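As a hedged illustration of the labelling step, the sketch below wraps label_eurostat() in a small helper that checks the requested language first. The wrapper name label_in() is our own assumption; the lang argument of label_eurostat() is assumed to accept the codes "en", "fr" and "de" for the three languages mentioned above.

```r
# Hypothetical convenience wrapper around eurostat::label_eurostat().
# Calling it with a valid language requires the eurostat package and
# network access, since the label dictionaries are downloaded on demand.
label_in <- function(dat, lang = "en") {
  # Labels are provided in English, French and German only
  stopifnot(lang %in% c("en", "fr", "de"))
  eurostat::label_eurostat(dat, lang = lang)
}

# Example calls (not run here), assuming dat was retrieved with get_eurostat():
# dat_en <- label_in(dat)        # English labels
# dat_de <- label_in(dat, "de")  # German labels
```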
The Eurostat database includes a variety of demographic and health indicators. We see, for instance, that overweight varies remarkably across different age groups (Figure 2A). Sometimes the data sets require more complicated pre-processing. Let us consider, for instance, the distribution of renewable energy sources in different European countries. In order to summarise such data, one first needs to aggregate a multitude of possible energy sources into a smaller number of coherent groups. Then one can use standard R tools to process the data: shorten country names, filter countries depending on production levels, and normalize production within each country. After a series of transformations (see Appendix for the source code) we can finally plot the data to discover that countries vary considerably in their mix of renewable energy sources (Figure 2B). Three-dimensional data sets such as this can be conveniently visualized as triangular maps by using the plotrix (Lemon, 2006) package.
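The aggregation and normalization steps outlined above can be sketched with base R on a toy table. This is not the code from the Appendix: the numbers, the placeholder country codes "AA" and "BB", and the mapping from detailed sources to groups are all made up for illustration and are not Eurostat data.

```r
# Toy production figures (made-up numbers, NOT Eurostat data)
energy <- data.frame(
  geo     = rep(c("AA", "BB"), each = 4),
  product = rep(c("hydro", "wind", "solar_thermal", "solar_pv"), times = 2),
  values  = c(60, 30, 4, 6, 10, 50, 15, 25),
  stringsAsFactors = FALSE
)

# Step 1: map detailed energy sources to coarser groups (hypothetical mapping)
groups <- c(hydro = "Hydro", wind = "Wind",
            solar_thermal = "Solar", solar_pv = "Solar")
energy$group <- unname(groups[energy$product])

# Step 2: sum the production within each country and group
agg <- aggregate(values ~ geo + group, data = energy, FUN = sum)

# Step 3: normalize within country so that the shares sum to one
agg$share <- agg$values / ave(agg$values, agg$geo, FUN = sum)

agg[order(agg$geo, agg$group), ]
```

The resulting three shares per country are the kind of input that a triangular plot expects.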
The data sets are stored in cache by default to avoid repeated downloads of identical data and to speed up the analysis. Storing an exact copy of the retrieved raw data on the hard disk will also support reproducibility when the source database is constantly updated.
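A minimal sketch of explicit cache control follows, assuming the cache, update_cache and cache_dir arguments of get_eurostat(). The helper name fetch_cached() and the choice of a temporary directory are our own illustrative assumptions; a project-specific directory would be the more natural choice when the stored copy should persist between sessions.

```r
# Hypothetical helper that keeps cached copies in a chosen directory.
# Calling it requires the eurostat package and, on first use, network access.
fetch_cached <- function(id,
                         cache_dir = file.path(tempdir(), "eurostat_cache"),
                         refresh = FALSE) {
  # Make sure the cache directory exists before the download
  dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)

  eurostat::get_eurostat(
    id,
    time_format  = "num",
    cache        = TRUE,     # reuse a stored copy when one is available
    update_cache = refresh,  # TRUE forces a fresh download
    cache_dir    = cache_dir
  )
}

# Example calls (not run here):
# dat <- fetch_cached("tsdtr420")                  # downloads, then caches
# dat <- fetch_cached("tsdtr420")                  # served from the cache
# dat <- fetch_cached("tsdtr420", refresh = TRUE)  # refreshes the cache
```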

Map visualizations
The indicators in the Eurostat open data service are typically available as annual time series grouped by country, and sometimes at more refined temporal or geographic levels. Eurostat provides complementary geospatial data on the corresponding administrative statistical units to support visualizations at the appropriate geographic resolution. The geospatial data sets are available as standard shapefiles. Let us look at the disposable income of private households (data identifier tgs00026). This is provided at the level of NUTS2 regions, the intermediate territorial units in the Eurostat regional classifications.

Default country groupings
To facilitate the analysis and visualization of standard European country groups, the eurostat package includes ready-made country code lists. The list of EFTA countries (Table 3), for instance, is retrieved with the data command, data(efta_countries). Similar lists are available for the Euro area (ea_countries), the EU (eu_countries) and the EU candidate countries (eu_candidate_countries). These auxiliary data sets facilitate smooth selection of specific country groups for a closer analysis. The full name and a two-letter identifier are provided for each country according to the Eurostat database. The country codes follow the ISO 3166-1 alpha-2 standard, except that GB and GR are replaced by UK (United Kingdom) and EL (Greece), respectively, in the Eurostat database. Linking these country codes with external data sets can be facilitated by conversions between different country coding standards with the countrycode package (Arel-Bundock, 2014).
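The two exceptions to the ISO standard noted above can be handled with a few lines of base R when only a quick conversion is needed. The helper below is a hypothetical sketch of our own, not a function of the eurostat or countrycode packages, and it covers only the UK and EL exceptions named in the text.

```r
# Hypothetical helper: convert Eurostat two-letter country codes to
# ISO 3166-1 alpha-2 by replacing the two known exceptions.
eurostat_to_iso2 <- function(codes) {
  exceptions <- c(UK = "GB", EL = "GR")
  codes <- as.character(codes)
  hit <- codes %in% names(exceptions)
  codes[hit] <- unname(exceptions[codes[hit]])
  codes
}

eurostat_to_iso2(c("FI", "UK", "EL", "NO"))
# [1] "FI" "GB" "GR" "NO"
```

For conversions between further coding standards, the countrycode package mentioned above is the more complete option.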