Temporal data often has a hierarchical structure, defined by
categorical variables describing different levels, such as political
regions or sales products. The nesting of categorical variables produces
a hierarchical structure. The tsibbletalk package is developed to allow
a user to interactively explore temporal data, relative to the nested or
crossed structures. It can help to discover differences between category
levels, and uncover interesting periodic or aperiodic slices. The
package implements a shared tsibble object that allows for
linked brushing between coordinated views, and a shiny module that aids
in wrapping timelines for seasonal patterns. The tools are demonstrated
using two data examples: domestic tourism in Australia and pedestrian
traffic in Melbourne.
Temporal data typically arrives as a set of many observational units measured over time. Some variables may be categorical and reflect a hierarchy in the collection process, such as measurements taken in different geographic regions or types of products sold by one company. Exploring these multiple features can be daunting. Ensemble graphics (Unwin and Valero-Mora 2018) bundle multiple views of a data set together into one composite figure. These provide an effective approach for exploring and digesting many different aspects of temporal data. Adding interactivity to the ensemble can greatly enhance the exploration process.
This paper describes new software, the tsibbletalk package, for exploring temporal data using linked views and time wrapping. We first provide some background to the approach, based on setting up data structures and workflow, and give an overview of interactive systems in R. The following section introduces the tsibbletalk package. We explain the mechanism for constructing interactivity, which links multiple hierarchical data objects and hence their plots, and describe the setup for interactively slicing and dicing time to wrap a series on itself to investigate periodicities.
The tsibble package (Wang et al. 2020) introduced a unified temporal
data structure, referred to as a tsibble, to represent time series and
longitudinal data in a tidy format (Wickham 2014). A tsibble extends the
data.frame and tibble classes with temporal contextual metadata: index
and key. The index declares a data column that holds time-related
indices. The key, which can comprise multiple columns, identifies a
collection of related series or panels observed over the index-defined
period. An example of a tsibble can be found in the monthly Australian
retail trade turnover data (aus_retail), available in the tsibbledata
package (O’Hara-Wild et al. 2020c), shown below. The Month column holds
year-months as the index. State and Industry identify the 152 series and
together form the key. Note that the column Series ID could be an
alternative option for setting up the key, but State and Industry are
more readable and informative. The index and key are “sticky” columns of
a tsibble, forming critical pieces for fluent downstream temporal data
analysis.
#> # A tsibble: 64,532 x 5 [1M]
#> # Key:       State, Industry [152]
#>   State                        Industry  `Series ID`    Month Turnover
#>   <chr>                        <chr>     <chr>          <mth>    <dbl>
#> 1 Australian Capital Territory Cafes, r… A3349849A   1982 Apr      4.4
#> 2 Australian Capital Territory Cafes, r… A3349849A   1982 May      3.4
#> 3 Australian Capital Territory Cafes, r… A3349849A   1982 Jun      3.6
#> 4 Australian Capital Territory Cafes, r… A3349849A   1982 Jul      4
#> 5 Australian Capital Territory Cafes, r… A3349849A   1982 Aug      3.6
#> # … with 64,527 more rows
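The index and key described above can be inspected directly. The following is a minimal sketch, assuming the tsibble and tsibbledata packages are installed; the object name `retail_by_id` is made up for this illustration.

```r
library(tsibble)
library(tsibbledata)

# Inspect the temporal metadata that makes `aus_retail` a tsibble.
index_var(aus_retail)  # name of the index column
key_vars(aus_retail)   # names of the key columns
n_keys(aus_retail)     # number of distinct series identified by the key

# Alternative set-up: declare `Series ID` as the key instead.
# The tsibble is first dropped to a plain tibble, then rebuilt.
retail_by_id <- as_tsibble(
  tibble::as_tibble(aus_retail),
  key = `Series ID`, index = Month
)
```

Both set-ups identify the same series; the choice only affects how readable the key is downstream.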
In the spirit of tidy data from the tidyverse (Wickham et al. 2019), the
tidyverts suite features tsibble as the foundational data structure, and
helps to build a fluid and fluent pipeline for time series analysis.
Besides tsibble, the feasts (O’Hara-Wild et al. 2020b) and fable
(O’Hara-Wild et al. 2020a) packages fill the role of statistical
analysis and forecasting in the tidyverts ecosystem. Throughout the
steps of a time series analysis, the series of interest, identified by
the key variables, typically persist through trend modeling and
forecasting. We would typically want to examine the series across all of
the keys.
Figure 1 illustrates examining temporal data with many keys. The data
has 152 series, each corresponding to an industry within a state in the
retail data. The multiple series are displayed using an overlaid time
series plot, along with a scatterplot of two variables (trend versus
seasonal strength) from feature space, where each series is represented
by a dot. The feature space is computed using the features() function
from feasts, which summarises the original data for each series using
various statistical features. This function, along with other tidyverts
functions, is tsibble-aware, and outputs a table in a reduced form where
each row corresponds to a series, which can be graphically displayed as
in Figure 1(b).
Figure 1: Plots for the aus_retail data, with the series of strongest seasonal strength highlighted. (a) An overlaid time series plot. (b) A scatterplot drawn from their time series features, where each dot represents a time series from (a).
Figure 1 has also been highlighted to focus on the one series with the
strongest seasonality. To create this highlighting, one needs to first
filter the interesting series from the features table, and then join
back to the original tsibble in order to examine its trend in relation
to the others. This procedure soon grows cumbersome if many series are
to be explored. It illustrates a need to query interesting series on the
fly. Although these two plots are static, we can consider them as linked
views because the common key variables link the two data tables that
produce the two plots. This motivates the work in this package,
described in this paper, to enable interactivity of tsibble and
tsibble-derived objects for rapid exploratory data analysis.
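The filter-and-join procedure described above can be sketched as follows. This is a minimal illustration rather than the code behind Figure 1: it assumes the feat_stl feature set from feasts, with its seasonal_strength_year column as the measure of seasonal strength, and the object names are made up for this example.

```r
library(dplyr)
library(tsibble)
library(tsibbledata)
library(feasts)

# One row of features per series (State by Industry).
retail_feat <- aus_retail %>%
  features(Turnover, feat_stl)

# Step 1: pick the series with the strongest seasonality
# from the features table.
strongest <- retail_feat %>%
  slice_max(seasonal_strength_year, n = 1, with_ties = FALSE)

# Step 2: join back to the original tsibble by the key variables
# to recover the full series for highlighting.
highlighted <- aus_retail %>%
  semi_join(strongest, by = c("State", "Industry"))
```

Both steps must be repeated for every new series of interest, which is the manual effort that interactive linking removes.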
There is a long history of interactive data visualization research and corresponding systems. Within R, the systems can be roughly divided into systems utilizing web technology and those that do not.
R shiny (Chang et al. 2020) and htmlwidgets (Vaidyanathan et al. 2019) provide infrastructure connecting R with HTML elements and JavaScript that support the interactivity. The htmlwidgets package makes it possible to embed JavaScript libraries into R so that users are able to write only R code to generate web-based plots. Many JavaScript charting libraries have been ported to R as HTML widgets, including plotly (Sievert 2020), rbokeh (Hafen and Continuum Analytics, Inc. 2020), and leaflet (Cheng et al. 2019) for maps. Interactions between different widgets can be achieved with shiny or crosstalk (Cheng 2020). The crosstalk package extends htmlwidgets with shared R6 instances to support linked brushing and filtering across widgets, without relying on shiny.
Systems without the web technology include grDevices, loon (Waddell and Oldford 2020), based on Tcl/Tk, and cranvas (Xie et al. 2014) based on Qt. They offer a wide array of pre-defined interactions, such as selecting and zooming, to manipulate plots via mouse action, keyboard strokes, and menus. The cranvastime package (Cheng et al. 2016) is an add-on to cranvas, which provides specialized interactions for temporal data, such as wrapping and mirroring.
The techniques implemented in the work described in this paper utilize web technology, including crosstalk, plotly, and R shiny.
The tsibbletalk package introduces a shared tsibble instance built on a
tsibble. This allows for seamless communication between different plots
of temporal data. The as_shared_tsibble() function turns a tsibble into
a shared instance, SharedTsibbleData, which is a subclass of SharedData
from crosstalk. This is an R6 object driving data transmission across
multiple views, due to its mutable and lightweight properties. The
tsibbletalk package aims to streamline interactive exploration of
temporal data, with a focus on temporal elements and structured linking.
As opposed to one-to-one linking, tsibbletalk defaults to categorical
variable linking, where selecting one or more observations in one
category will broadcast to all other observations in this category. That
is, linking is by key variables: within the time series plot, click on
any data point, and the whole line will be highlighted in response. The
as_shared_tsibble() function uses the tsibble’s key variables to achieve
this type of linking.
The approach can also accommodate temporal data with nesting and
crossing structures. These time series are referred to as hierarchical
and grouped time series in the literature (Hyndman and Athanasopoulos
2017). The aus_retail data above is an example of grouped time series.
The series in the data correspond to combinations of the State and
Industry variables, which means the two variables are intrinsically
crossed with each other. When one key variable is nested within another,
such as regional areas within a state, this is considered to be a
hierarchical structure.
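The distinction between nesting and crossing can be checked mechanically. The sketch below uses base R only; `is_nested()` is a hypothetical helper written for this illustration (it is not part of tsibbletalk), and the toy key table is made up.

```r
# Hypothetical helper: `child` is nested within `parent` when every
# child level occurs under exactly one parent level.
is_nested <- function(child, parent) {
  pairs <- unique(data.frame(child = child, parent = parent))
  !any(duplicated(pairs$child))
}

# Toy key table, made up for illustration.
keys <- data.frame(
  State   = c("A", "A", "A", "A", "B", "B", "B", "B"),
  Region  = c("a1", "a1", "a2", "a2", "b1", "b1", "b2", "b2"),
  Purpose = c("Business", "Holiday", "Business", "Holiday",
              "Business", "Holiday", "Business", "Holiday")
)

is_nested(keys$Region, keys$State)   # each Region sits within one State
is_nested(keys$Purpose, keys$State)  # each Purpose appears in several States
```

In this toy table, Region is nested within State, whereas Purpose appears under every State and is therefore crossed with it.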
The spec argument in as_shared_tsibble() provides a means to construct
hybrid linking that incorporates hierarchical and categorical linking. A
symbolic formula can be passed to the spec argument to define the
crossing and/or nesting relationships among the key variables. Adopting
the notation of Wilkinson and Rogers (1973) for factorial models, the
spec follows the / and * operator conventions to declare nesting and
crossing variables, respectively. The spec for the aus_retail data is
therefore specified as State * Industry or Industry * State, which is
the default in the presence of multiple key variables. If there is a
hierarchy in the data, using / is required to indicate the parent-child
relation, written in the strictly one-directional form parent/child.
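For the crossed aus_retail data, the two ways of declaring the spec described above can be sketched as follows. This is a minimal illustration, assuming the tsibbletalk and tsibbledata packages are installed; the object names are made up.

```r
library(tsibble)
library(tsibbledata)
library(tsibbletalk)

# Rely on the default: multiple key variables are treated as crossed.
retail_shared_default <- as_shared_tsibble(aus_retail)

# Declare the crossing explicitly with the * operator.
retail_shared <- as_shared_tsibble(aus_retail, spec = State * Industry)
```

A nested relation would use / in place of *, with the parent on the left-hand side.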
To illustrate nesting and crossing we use the tourism_monthly dataset
(Tourism Research Australia 2020) packaged in tsibbletalk. It contains
monthly domestic overnight trips across Australia. The key comprises
three identifying variables: State, Region, and Purpose (of the trip),
with Region nested within State, and both crossed with Purpose. This
specification can be translated as follows:
library(tsibble)
library(tsibbletalk)
library(dplyr)
tourism_shared <- tourism_monthly %>%
  # Comment out the next line to run the full example
  filter(State %in% c("Tasmania", "Western Australia")) %>%
  # Shorten the long region names to keep plot labels compact
  mutate(Region = stringr::str_replace(Region, "Australia's ", "WA's ")) %>%
  # Region is nested within State; both are crossed with Purpose
  as_shared_tsibble(spec = (State / Region) * Purpose)
There is a three-level hierarchy: the root node is implicitly Australia,
geographically disaggregated to states, and then to lower-level tourism
regions. The plotly_key_tree() function has been implemented to help
explore the hierarchy. It interprets hierarchies in the shared tsibble’s
spec as a tree view, built with plotly. The following line of code
produces the linked tree diagram (left panel of Figure 2). The visual
for the tree hierarchy detangles a group of related series and provides
a bird’s eye view of the data organization.
p_l <- plotly_key_tree(tourism_shared, height = 800, width = 800)
The tree plot provides the graphics skeleton, upon which the rest of the data plots can be attached. In this example, small multiples of line plots are placed at the top right of Figure 2 to explore the temporal trend across regions by the trip purpose. The shared tsibble data can be directly piped into ggplot2 code to create this.
library(ggplot2)
p_tr <- tourism_shared %>%
  ggplot(aes(x = Month, y = Trips)) +
  geom_line(aes(group = Region), alpha = .5, size = .4) +
  facet_wrap(~ Purpose, scales = "free_y") +
  scale_x_yearmonth(date_breaks = "5 years", date_labels = "%Y")
These line plots are heavily overplotted. To tease apart structure in
the multiple time series, the features() function computes interesting
characteristics, including the measures of trend and seasonality. These
are displayed in the scatterplot at the bottom right, where one dot
represents one series.
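The feature scatterplot can be sketched along the following lines. This is a hedged illustration, not necessarily the code used for Figure 2: it assumes that features() accepts the shared tsibble directly, that the feat_stl feature set from feasts supplies the trend and seasonality measures, and the names `tourism_feat` and `p_br` are made up for this example.

```r
library(tsibble)
library(tsibbletalk)
library(dplyr)
library(feasts)
library(ggplot2)

# Rebuild the shared tsibble so that the example is self-contained.
tourism_shared <- tourism_monthly %>%
  filter(State %in% c("Tasmania", "Western Australia")) %>%
  as_shared_tsibble(spec = (State / Region) * Purpose)

# Assumption: features() works on the shared tsibble and returns
# one row of STL-based features per series.
tourism_feat <- tourism_shared %>%
  features(Trips, feat_stl)

# One dot per series: trend strength against yearly seasonal strength.
p_br <- tourism_feat %>%
  ggplot(aes(x = trend_strength, y = seasonal_strength_year)) +
  geom_point(aes(group = Region), alpha = .8, size = 2)
```

Because the feature table is derived from the shared tsibble, each dot carries the same key variables as the corresponding line in the time series plot.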
There is one final step: composing the three plots into an ensemble of coordinated views for exploration, shown in Figure 2. This is the interactive realization of Figure 1.