langevitour: Smooth Interactive Touring of High Dimensions, Demonstrated with scRNA-Seq Data

langevitour displays interactive animated 2D projections of high-dimensional datasets. Langevin Dynamics is used to produce a smooth path of projections. Projections are initially explored at random. A “guide” can be activated to look for an informative projection, or variables can be manually positioned. After a projection of particular interest has been found, continuing small motions provide a channel of visual information not present in a static scatter plot. langevitour is implemented in Javascript, allowing for a high frame rate and responsive interaction, and can be used directly from the R environment or embedded in HTML documents produced using R. Single cell RNA-sequencing (scRNA-Seq) data is used to demonstrate the widget. langevitour’s linear projections provide a less distorted view of this data than commonly used non-linear dimensionality reductions such as UMAP.

Paul Harrison https://logarithmic.net/pfh/ (Monash Genomics and Bioinformatics Platform, Monash University)
2023-11-01

1 Introduction

Understanding high-dimensional data is difficult. High-dimensional data is data where many variables have been measured at once. There may be complex relationships between variables, and the data may contain clusters and other features with complex shapes. This article introduces a new interactive tool that may be helpful for visualizing and understanding high-dimensional data using animated 2D projections.

High-dimensional data is produced in fields across the breadth of science. This article will focus on a motivating example from biology. Single cell RNA-sequencing (scRNA-Seq) typically measures the expression levels of thousands of genes in tens of thousands of biological cells. We can think of cells as points in a gene-expression space with thousands of dimensions. There is a complex high-dimensional geometry due to differences between biological cell types, variation in expression within cell types, cell developmental trajectories, and treatment responses. Principal Components Analysis (PCA) can find a set of directions in which the data is most variable, allowing scRNA-Seq data to be summarized down to perhaps tens of dimensions while still capturing most of the important geometry. However even a ten-dimensional space is difficult to comprehend.

One way to explore high-dimensional data is using a “tour.” A tour is a sequence of projections of the dataset, most commonly into two dimensions. A Grand Tour (Asimov 1985) is a tour that will eventually visit as close as we like to every possible projection of the data, typically using a sequence of random projections. A Guided Tour, on the other hand, seeks an “interesting” projection by moving toward the maximum of some index function (Cook et al. 1995). The sequence of projections is animated, with smooth interpolation between each successive pair of projections. The software XGobi and GGobi (Swayne et al. 1998) provide an interactive graphical application incorporating tours for exploring high-dimensional data. The more recent R package tourr (Wickham et al. 2011) provides a framework for creating and displaying tours in the R language. Displaying animations directly in R usually does not achieve a high frame rate. It is also not possible to interact with the display as with GGobi. To get around these problems, a recent R package called detourr (Hart and Wang 2022) computes a tour path in R using tourr and then displays it using a Javascript widget (using htmlwidgets) (Vaidyanathan et al. 2021). The widget then provides a high frame rate display and interactive features. However, the projection path itself can not be modified interactively.

This article introduces a new R package, langevitour, that differs from previous tour software by using Langevin Dynamics, a method from physics, to produce a continuous path of projections. This path can be directly used for animation, eliminating the need to interpolate between distinct projections to animate the tour. The package is htmlwidgets-based, with interaction, calculations, and animation performed in Javascript. The projection can be controlled interactively, with the user able to switch between Grand and Guided Tours while also interactively focusing in on particular dimensions of interest.

The methods section describes Langevin Dynamics mathematically, but I will outline its important features here using two physical examples. First, consider modelling the position and velocity of a large particle over time. Many small particles continuously jostle the large particle. This is the original Brownian motion scenario studied by Langevin in 1908 (see translation by Lemons and Gythiel 1997). Rather than modelling every particle, Langevin Dynamics simulates the jostling as small random forces. Langevin’s model includes these random forces and damping of momentum, and we can also add a force field acting on the particle. The particle explores the space it is in, and the force field may cause the particle to spend more time near certain locations.

langevitour applies Langevin Dynamics to an orthonormal projection matrix rather than to a particle’s position. As a second physical example, imagine a two-dimensional disk in a high-dimensional space. The disk represents a projection plane for a high-dimensional dataset. It has a fixed center but can rotate freely. Tiny unseen particles continuously jostle the disk, causing it to spin first one way and then another. The motion of the disk provides the path for a Grand Tour of the dataset. A force field may also draw it toward particular orientations. The force field is specified using a potential energy function. It is used to seek interesting data projections, similar to the index functions used in previous tour software, providing a Guided Tour.

This article begins by demonstrating the widget using data from the palmerpenguins package. The method and implementation are then described in detail. Finally, an extended demonstration using scRNA-Seq data is presented.

2 Palmer Penguins example

The R data package palmerpenguins (Horst et al. 2020) provides body measurements of penguins of three different species from the Palmer Archipelago, Antarctica. The langevitour based visualization is shown in Figure 2. R code to produce this figure is:

library(langevitour)
library(palmerpenguins)

completePenguins <- na.omit(penguins[,c(1,3,4,5,6)])
scale <- apply(completePenguins[,-1], 2, sd)*4

langevitour(completePenguins[,-1], completePenguins$species, 
            scale=scale, pointSize=2)

Figure 1: palmerpenguins data visualized using langevitour. Each dot is a penguin, and the axes are four different penguin measurements. Interact with this figure directly or use the buttons in the text, which will adjust the figure to demonstrate various features.

The widget displays a moving projection of high-dimensional points onto a two-dimensional plane. The current projection is indicated using a collection of axis lines, with the axes labelled by their respective variable names. A second set of variable and group labels appear to the right of the plot area when interacting with the plot. These can be dragged on to the plot to control the projection. Let us now step through some manipulations the langevitour widget allows.

Unchecking all but two or three axes can make the relationship between those particular axes clear. With three axes checked, the eye interprets the display as three-dimensional. A systematic way to examine a dataset is to proceed through all possible combinations of two or three axes.

If a particular interesting projection is found using langevitour it can be brought back into R by pressing the “?” button and copying and pasting R code that is shown. The “?” button also shows a JSON record of the current settings of the widget, including form inputs and label positions. All or some of these settings can be specified to a future call to langevitour(), or applied to a running widget using Javascript code. Example code to control the widget using HTML buttons and Javascript is given in the supplemental file figures-page.Rmd.

3 Method

Say we have a set of \(n\) \(p\)-dimensional data points. A \(2 \times p\) projection matrix from \(p\) to 2 dimensions will be denoted \(\mathbf X\). The two rows of \(\mathbf X\) are called a 2-frame. These rows must be unit vectors and orthogonal to each other. The set of all 2-frames in \(p\) dimensions is a Stiefel manifold.

It will often be necessary to consider all of the elements of the projection matrix concatenated together into a single vector (“melted”), which will be denoted \(\mathbf x\). We will think of \(\mathbf x\) as a simulated physical system’s “position” vector. With \(\mathbf x\) in use as the position of the system, the data points will be called the vectors \(\mathbf y_i\). The projection of point \(i\) into two dimensions is calculated with the matrix product \(\mathbf X \mathbf y_i\).

3.1 Langevin Dynamics overview

Projections are generated using a numerical simulation of Langevin Dynamics but with projections constrained to lie on the Stiefel manifold. This section briefly summarizes Langevin Dynamics. The next section will describe the numerical simulation method and how the constraint is applied.

The description of Langevin Dynamics given here follows Leimkuhler and Matthews (2015), but for simplicity I set the Boltzmann constant to 1 and all masses to 1. We define a system with a position vector \(\mathbf x\) and a velocity vector \(\mathbf v\). We must specify a temperature \(T\), a damping rate \(\gamma\), and a potential energy function \(U(\mathbf x)\). The behavior of the system is then defined by a pair of Stochastic Differential Equations (SDEs):

\[\begin{align} \mathrm{d}\mathbf{v} &= -\gamma \mathbf{v} \mathrm{d}t + \sqrt{2\gamma T}\mathrm{d}\mathbf{W} -\nabla U(\mathbf{x}) \mathrm{d}t \tag{1}\\ \mathrm{d}\mathbf{x} &= \mathbf{v} \mathrm{d}t \tag{2} \end{align}\]

Here \(\mathbf W\) is a vector of Wiener processes. For any positive time-step \(\Delta t\):

\[ \mathbf W(t+\Delta t) - \mathbf W(t) \sim \sqrt{\Delta t} \mathcal N(0,\mathbf I) \]

The total energy of the system, kinetic energy plus potential energy, is called the Hamiltonian:

\[\begin{equation*} H(\mathbf{x},\mathbf{v}) = \frac{1}{2}|\mathbf{v}|^2 + U(\mathbf{x}) \end{equation*}\]

In Equation (1) in a physical system, the first two terms would describe the exchange of kinetic energy with the surrounding environment. In the first term, kinetic energy is lost (damping), while the second term adds randomness to the velocity, increasing the kinetic energy again. The third term applies acceleration according to the gradient of the potential energy function.

If we were to set \(\gamma=0\), we would be doing Hamiltonian Dynamics, and the system’s total energy would remain constant. If \(\gamma>0\) the total energy can fluctuate, and in the long run the process is ergodic (Leimkuhler and Matthews 2015 in section 6.4.4), producing samples with the Gibbs-Boltzmann probability density:

\[\begin{equation*} \rho(\mathbf{x},\mathbf{v}) \propto e^{ -H(\mathbf{x},\mathbf{v})/T } \end{equation*}\]

From this density, it can be seen that each component of the velocity is normally distributed with variance \(T\) and that the position has probability density

\[\begin{equation*} \rho(\mathbf{x}) \propto e^{ -U(\mathbf{x})/T } \end{equation*}\]

The potential energy function completely controls the distribution of positions being produced, providing a great deal of freedom. Here, we will use this to craft suitable potential energy functions to allow the user to control the explored projections.

3.2 Langevin Tour numerical simulation

We are to generate a sequence of animation frames \(i=1,2,...\), each with a projection matrix \(\mathbf X_i\) (written \(\mathbf x_i\) when viewed as a vector). Each frame will also have a velocity vector \(\mathbf v_i\). The time-step between frames can vary depending on the computational load from langevitour and other things happening in the web browser. Call the time-step from frame \(i-1\) to frame \(i\) \(\Delta t_i\).

The Position Based Dynamics method (PBD, Müller et al. 2007) is used to perform the numerical simulation while constraining the system to produce orthonormal projection matrices. PBD is simple to implement and emphasizes stability over accuracy when enforcing constraints, which is appropriate and adequate for this application. Using PBD, in each iteration we will:

  1. Update the velocity.
  2. Update the position based on the velocity.
  3. Fix the updated position to be an orthonormal projection matrix.
  4. Fix the velocity to be consistent with the fixed position.

Step 1. Update the velocity

We will write \(\mathbf v'_i\) and \(\mathbf x'_i\) for the initially proposed velocity and position of the current frame. These will be adjusted in steps 3 and 4 to produce the final position and velocity, \(\mathbf x_i\) and \(\mathbf v_i\). The first step is to calculate

\[\begin{equation} \mathbf v'_i = e^{-\gamma \Delta t_i} \; \mathbf v_{i-1} + \sqrt{ T \left(1-e^{-2 \gamma \Delta t_i}\right) } \; \mathbf r_i - \Delta t_i \; \nabla U(\mathbf x_{i-1}) \tag{3} \end{equation}\]

where the components of \(\mathbf r_i\) follow a standard normal distribution.

In the limit for \(\Delta t_i \rightarrow 0\), Equation (3) matches the rate of change of the mean and rate of added variance in equation (1). Equation (3) has also been carefully chosen to have stable behavior for large \(\Delta t\) or \(\gamma\) or both. The first term decays the existing velocity by a factor of \(e^{-\gamma \Delta t_i}\). If Equation (1) only contained the first term, this would be the exact solution. This decay reduces the variance of the velocity by a factor of \(\left( e^{-\gamma \Delta t_i} \right) ^2\). The second term re-injects variance sufficient to restore the variance of the velocity in every direction (orthogonal to constraints) as \(T\).

A small refinement is made to avoid random rotation in the plane of projection, as this can be unsettling to view. Any part of the random noise \(\mathbf r_i\) within the plane of the projection is subtracted out before the noise is added to the velocity. More precisely, considering the noise in matrix form \(\mathbf R_i\) in the same way as the projection matrix \(\mathbf X_{i-1}\), the projection of each row of \(\mathbf R_i\) onto each row of \(\mathbf X_{i-1}\) is subtracted from that row of \(\mathbf R_i\). Previous tour software has also avoided this type of rotation, but in a different way, by using a “geodesic interpolation” method that operates between planes rather than frames (see Buja et al. 2005).

Step 2. Update the position based on the velocity

The position is advanced according to the velocity and the size of the time-step.

\[\begin{equation*} \mathbf x'_i = \mathbf x_{i-1} + \Delta t_i \mathbf v'_i \end{equation*}\]

Step 3. Fix the updated position to be an orthonormal projection matrix

Position Based Dynamics requires the proposed position \(\mathbf x'_i\) be projected back to a constraint-satisfying position \(\mathbf x_i\). Here, the constraint is \(\mathbf x_i\) represents an orthonormal projection matrix. Doing this arbitrarily might cause unexpected spinning in the projection plane. We must find the nearest valid \(\mathbf x_i\) to \(\mathbf x'_i\).

Considering the proposed position vector as a projection matrix, we take the singular value decomposition and set the singular values to 1.

\[\begin{align} \mathbf U \mathbf S \mathbf V^\top & = \mathbf X'_i \tag{4} \\ \mathbf X_i & = \mathbf U \mathbf V^\top \tag{5} \end{align}\]

Here, \(\mathbf U\) is a \(2\times 2\) orthonormal matrix, \(\mathbf S\) is a \(2\times 2\) diagonal matrix, and \(\mathbf V\) is a \(p\times 2\) matrix with orthonormal columns. Let \(s_j\) be the values along the diagonal of \(\mathbf S\), the singular values, all of which are non-negative.

Equation (5) chooses the closest orthonormal projection matrix in terms of Euclidean distance to \(\mathbf{x'}_i\). Stated another way, this is the matrix \(\mathbf X\) that minimizes the Frobenius norm \(\|\mathbf X' - \mathbf X\|\) with the proposed projection matrix \(\mathbf X'\). To see this, consider first the problem of finding the nearest orthonormal projection matrix to \(\mathbf{ U^\top X' }\).

\[ \mathbf{ U^\top X' = U^\top USV^\top =SV^\top } \]

For each row in \(\mathbf{ U^\top X' }\), \(\mathbf{\left(U^\top X' \right)}_{j,\cdot} = s_j \mathbf V^\top_{j,\cdot}\), the nearest unit vector will be parallel to this vector, namely \(\mathbf V^\top_{j,\cdot}\). We know that the rows of \(\mathbf V^\top\) are orthogonal, so \(\mathbf V^\top\) is the nearest orthonormal projection matrix to \(\mathbf{U^\top X' }\). Multiplying both matrices by an orthonormal matrix does not change the Frobenius norm of their difference, so the nearest orthonormal projection matrix to \(\mathbf{UU^\top X' = X'}\) is \(\mathbf{UV^\top}\).

Step 4. Fix the velocity to be consistent with the fixed position

Position Based Dynamics requires the velocity to match the actual update made to the position, rather than the initially proposed update.

\[\begin{equation*} \mathbf v_i = \frac{\mathbf x_i - \mathbf x_{i-1}}{\Delta t_i} \end{equation*}\]

3.3 Guiding projections using the potential energy function

We can use any function we like for the potential energy \(U(\mathbf x)\), so long as we can calculate its gradient. This is used in langevitour to provide a set of automatic guides and also as a method of interaction.

When an energy function is being used, the temperature \(T\) plays a role analogous to variance in the normal distribution. When the temperature is very low, the system seeks the minimum of the energy function. As the temperature is raised, projections further and further from the minimum are produced.

Linear energy function for interaction

For some choice of vector \(\mathbf a\), we can set the energy function to be the dot product

\[\begin{equation*} U(\mathbf x) = - \mathbf a \mathbf x \end{equation*}\]

This encourages the projection to have a large component parallel to \(\mathbf a\). In langevitour this is used when labels are dragged onto the plot area to control the placement of particular axes of the high-dimensional space or to control the position of the mean of a group of points.

Central force energy function

The Box-Cox power transformation (Box and Cox 1964) provides a useful building block for energy functions.

\[ f(x ; \lambda_1 , \lambda_2) = \begin{cases} \dfrac{(x + \lambda_2)^{\lambda_1} - 1}{\lambda_1} & \text{if } \lambda_1 \neq 0, \\ \ln (x + \lambda_2) & \text{if } \lambda_1 = 0 \end{cases} \]

An energy function creating forces away from or toward the center can be defined using:

\[ U_\text{central}(\mathbf x ; c,\lambda_1,\lambda_2) = \frac{c}{n} \sum_{i=1}^n f\left(|\mathbf X \mathbf y_{i}|^2 ; \lambda_1,\lambda_2\right) \]

langevitour offers a central “push” guide (\(c<0,\lambda_1=0.5,\lambda_2=0.0001\)) and a central “pull” guide (\(c>0,\lambda_1=0.5,\lambda_2=0.0001\)). These cone-shaped energy functions result in a nearly constant magnitude outward or inward force on points, except for a small region close to the center.

Layout by point-point repulsion

It was found that repulsion forces between pairs of points can be used to produce an informative layout. Let \(S\) be a set of pairs of points \((i,j)\). Ideally we would make this the set of all possible pairs of points but, for computational efficiency, langevitour uses a random mini-batch of 5,000 pairs of points per iteration, with a different mini-batch used each time. Using random mini-batches to approximate the gradient injects extra noise into the system (see Mandt et al. 2017). The effect is similar to increasing the temperature slightly.

\[ U_\text{layout}(\mathbf x ; \lambda_1,\lambda_2) = \frac{c}{|S|} \sum_{(i,j) \in S} f\left(|\mathbf X \mathbf y_{j}-\mathbf X \mathbf y_{i}|^2; \lambda_1,\lambda_2\right) \]

The power parameter \(\lambda_1\) determines whether the layout is governed by long-range or short-range forces. langevitour offers “ultra-local” (\(c<0,\lambda_1=-1,\lambda_2=0.0025\)), “local” (\(c<0,\lambda_1=0,\lambda_2=0.0001\)), “PCA” (\(c<0,\lambda_1=1,\lambda_2=0\)), and “outlier” (\(c<0,\lambda_1=2,\lambda_2=0\)) guides. The “local” guide is the preferred default. With this guide, pairs of points that are near to each other exert more force than pairs of points that are far apart. The “ultra-local” guide potentially produces better layouts but is somewhat unstable. The “PCA” guide is equivalent to PCA. The “outlier” guide seeks projections where there are some points that are very far from other points.

Blending energy functions

A sum of energy functions such as the above can be used to produce behavior that blends the behaviors from the individual functions. For example, there could be an active guide and also one or more labels dragged onto the plot.

4 Implementation

langevitour uses the htmlwidgets framework. It was an important design goal that using langevitour be no more difficult than any other plotting function in R. htmlwidgets allows Javascript widgets to be used in most places that conventional R graphical output can be used. The widget may be displayed during an interactive R session or included in a knitted document with a call to the langevitour function. The only required argument is a matrix (or data frame) of numerical data. A grouping of rows is often also given, allowing points to be distinguished by color. There are further optional arguments providing adjustments to the scaling, appearance, and further optional features.

The htmlwidgets scaffoldWidget function was used to scaffold the package, including functions to create the widget (langevitour) and to use the widget in shiny applications (langevitourOutput, renderLangevitour).

langevitour operates without a server, so the R portion of langevitour is limited to sanity-checking all the inputs and ensuring htmlwidgets will translate the data to JSON consistently. In particular, vectors convert to lists, which ensures vectors of length 1 are not unboxed. A Javascript class performs calculation, plotting, and interaction. The D3 Javascript package is used to perform drag-and-drop interaction, color operations, scale operations, and some DOM element manipulation. The SVD-JS Javascript package is used for the singular value decomposition calculation. The jStat Javascript package is used to produce normally distributed random numbers. Besides these packages, calculations are performed using plain Javascript code, following the steps in the previous section.

Gradients need to be calculated in order to use potential energy functions as a guides or for interaction. The necessary partial derivatives were found by hand, and used to implement a collection of gradient functions. To add a new guide, a function to calculate the required gradient can be written, and the source code edited to make it available in the widget interface.

In Javascript, animation frames are scheduled using requestAnimationFrame, allowing the browser to manage the frame rate, co-ordinate multiple animations within a document, and pause animation when the document is not on screen. A typical frame rate for the browser to aim for is 60 frames per second. The frame rate may drop if there is too much calculation and drawing required, such as when there are many points to display. Multiple widgets may be active in a document at once, and even if the document is visible, not all widgets may be visible. To minimize CPU usage the animation is paused if a widget scrolls off-screen or is otherwise hidden. Canvas-based rendering was used.

5 scRNA-Seq example

A dataset by Kang et al. (2018) demonstrates many of the complex high-dimensional features that are found in scRNA-Seq data. In this dataset, peripheral blood mononuclear cells (PBMCs) from eight patients with lupus were pooled. PBMCs are cells from the immune system that circulate in blood, including monocytes, B cells, T cells, and natural killer (NK) cells. These cells were then stimulated with a cytokine, recombinant interferon beta, causing a change in the gene expression of the cells. The dataset contains a sample of unstimulated cells (U), and a separate sample of stimulated cells (S).

Single cell sequencing produces a small proportion of doublets, where two cells end up in a single micro-droplet and appear in the final data as a single cell. A nice feature of this dataset is that doublets containing cells from two different individuals can be identified with certainty due to genetic differences between the individuals.

5.1 Processing steps

Sequencing data was produced using a 10x Chromium Single Cell instrument and an Illumina HiSeq 2500 sequencer. Kang et al. (2018) then processed sequencing reads using the 10x Genomics CellRanger software and provided the resulting RNA molecule count data in the Gene Expression Omnibus (GEO) database as accession number GSE96583. They also provided their annotation of the cells into different types, and doublet detection based on genetic differences between individuals. Slightly simplified annotations are shown in this article. There are 29,065 cells in total. 3,169 of these are identified as doublets.

In the processed data for each of the two samples, there is a matrix giving the number of molecules of RNA associated with each gene within each cell. Normalization by total count per cell, log transformation, and PCA were carried out using the Seurat package (Hoffman 2022). As per Seurat defaults, only the top 2,000 highly variable genes are used. Each resulting Principal Component (PC) has a score for each cell and a loading for each gene.

The top PCs capture as much variation in the data as possible but are not necessarily individually interpretable. To aid biological interpretation, it would be better if each component represented changes in the expression of a distinct set of genes. Each differentially expressed gene should have loadings that are mostly concentrated in a single component, and we prefer the loadings to be positive if possible. With these goals in mind, the varimax rotation of the gene loadings was found using the varimax() function in the built-in stats package, with Kaiser normalization disabled. Both the gene loadings and the cell scores are rotated. Then, for each component, if the loadings have negative skew both the loadings and scores are negated.

Genetic differences can not identify doublets containing cells from the same individual, so Bioconductor package scDblFinder (Germain et al. 2022) was used to impute further doublets using the recoverDoublets() function. This only works between cells of different types, but doublets containing cells of the same type are not a problem. A further 595 doublets were identified this way.

The dataset was sub-sampled down to 10,000 cells to allow a smooth frame rate in langevitour,

The R code used to process the scRNA-Seq data is given in the supplemental file processing.R. Code for figure generation from the processed data is given in figures.R.

5.2 Cell scores

Results from analysis with Seurat are shown in Figure 2. The scree plot has a fat tail with no clear elbow. A large number of PCs potentially contain useful information. The top 10 PCs will be used simply as a manageable number with which to interact. Common practice is to visualize cells using a 2D UMAP layout computed from the PCs, as shown in Figure 2B, to see what clusters exist in the data and try to understand their relationships.