Graphical processing units are rapidly gaining maturity as powerful general parallel computing devices. The package cudaBayesreg uses GPU–oriented procedures to improve the performance of Bayesian computations. The paper motivates the need for devising high-performance computing strategies in the context of fMRI data analysis. Some features of the package for Bayesian analysis of brain fMRI data are illustrated. Comparative computing performance figures between sequential and parallel implementations are presented as well.
A functional magnetic resonance imaging (fMRI) data set consists of time series of volume data in 4D space. Typically, volumes are collected as slices of 64 x 64 voxels. The most commonly used functional imaging technique relies on the blood oxygenation level dependent (BOLD) phenomenon (Sardy 2007). By analyzing the information provided by the BOLD signals in 4D space, it is possible to make inferences about activation patterns in the human brain. The statistical analysis of fMRI experiments usually involve the formation and assessment of a statistic image, commonly referred to as a Statistical Parametric Map (SPM). The SPM summarizes a statistic indicating evidence of the underlying neuronal activations for a particular task. The most common approach to SPM computation involves a univariate analysis of the time series associated with each voxel. Univariate analysis techniques can be described within the framework of the general linear model (GLM) (Sardy 2007). The GLM procedure used in fMRI data analysis is often said to be “massively univariate”, since data for each voxel are independently fitted with the same model. Bayesian methodologies provide enhanced estimation accuracy (Friston et al. 2002). However, since (non-variational) Bayesian models draw on Markov Chain Monte Carlo (MCMC) simulations, Bayesian estimates involve a heavy computational burden.
The programmable Graphic Processor Unit (GPU) has evolved into a highly parallel processor with tremendous computational power and very high memory bandwidth (NVIDIA Corporation 2010b). Modern GPUs are built around a scalable array of multithreaded streaming multiprocessors (SMs). Current GPU implementations enable scheduling thousands of concurrently executing threads. The (CUDA) (NVIDIA Corporation 2010b) is a software platform for massively parallel high-performance computing on NVIDIA manycore GPUs. The CUDA programming model follows the standard single-program multiple-data (SPMD) model. CUDA greatly simplifies the task of parallel programming by providing thread management tools that work as extensions of conventional C/C++ constructions. Automatic thread management removes the burden of handling the scheduling of thousands of lightweight threads, and enables straightforward programming of the GPU cores.
The package
cudaBayesreg
(Ferreira da Silva 2010a) implements a Bayesian multilevel model for the analysis
of brain fMRI data in the CUDA environment. The statistical framework in
cudaBayesreg is
built around a Gibbs sampler for multilevel/hierarchical linear models
with a normal prior (Ferreira da Silva 2010c). Multilevel modeling may be regarded as a
generalization of regression methods in which regression coefficients
are themselves given a model with parameters estimated from data
(Gelman 2006). As in SPM, the Bayesian model fits a linear regression
model at each voxel, but uses uses multivariate statistics for parameter
estimation at each iteration of the MCMC simulation. The Bayesian model
used in
cudaBayesreg
follows a two–stage Bayes prior approach to relate voxel regression
equations through correlations between the regression coefficient
vectors (Ferreira da Silva 2010c). This model closely follows the Bayesian multilevel
model proposed by Rossi, Allenby and McCulloch (Rossi et al. 2005), and
implemented in bayesm
(Rossi and McCulloch. 2008). This approach overcomes several limitations of the
classical SPM methodology. The SPM methodology traditionally used in
fMRI has several important limitations, mainly because it relies on
classical hypothesis tests and
We are interested in the following Bayesian multilevel model, which has
been analyzed by Rossi et al. (2005), and has been implemented as
rhierLinearModel
in
bayesm. Start out with a
general linear model, and fit a set of
In order to tie together the voxels’ regression equations, assume that
the
where
The prior ((2)) can be written using the matrix form of the
multivariate regression model for
where
The proposed Bayesian model can be written down as a sequence of conditional distributions (Ferreira da Silva 2010c),
Running MCMC simulations on the set of full conditional posterior distributions ((4)), the full posterior for all the parameters of interest may then be derived.
In this section, we describe some of the main design considerations
underlying the code implementation in
cudaBayesreg, and
the options taken for processing fMRI data in parallel. Ideally, the GPU
is best suited for computations that can be run on numerous data
elements simultaneously in parallel (NVIDIA Corporation 2010b). Typical textbook
applications for the GPU involve arithmetic on large matrices, where the
same operation is performed across thousands of elements at the same
time. Since the Bayesian model of computation outlined in the previous
Section does not fit well in this framework, some design options had to
be assumed in order to properly balance optimization, memory
constraints, and implementation complexity, while maintaining numerical
correctness. Some design requirements for good performance on CUDA are
as follows (NVIDIA Corporation 2010a): (i) the software should use a large number of
threads; (ii) different execution paths within the same thread block
(warp) should be avoided; (iii) inter-thread communication should be
minimized; (iv) data should be kept on the device as long as possible;
(v) global memory accesses should be coalesced whenever possible; (vi)
the use of shared memory should be preferred to the use of global
memory. We detail below how well these requirements have been met in the
code implementation. The first requirement is easily met by
cudaBayesreg. On
the one hand, fMRI applications typically deal with thousands of voxels.
On the other hand, the package uses three constants which can be
modified to suit the available device memory, and the computational
power of the GPU. Specifically, REGDIM
specifies the maximum number of
regressions (voxels), OBSDIM
specifies the maximum length of the time
series observations, and XDIM
specifies the maximum number of
regression coefficients. Requirements (ii) and (iii) are satisfied by
cudaBayesreg as
well. Each thread executes the same code independently, and no
inter-thread communication is required. Requirement (iv) is optimized in
cudaBayesreg by
using as much constant memory as permitted by the GPU. Data that do not
change between MCMC iterations are kept in constant memory. Thus, we
reduce expensive memory data transfers between host and device. For
instance, the matrix of voxel predictors
The computational model has been specified as a grid of thread blocks of
dimension
The bulk of the MCMC simulations for Bayesian data analysis is
implemented in the kernel (device code). Most currently available RNGs
for the GPU tend to be have too high time- and space-complexity for our
purposes. Therefore, we implemented and tested three different random
number generators (RNGs) in device code, by porting three well-known
RNGs to device code. Marsaglia’s multicarry RNG (Marsaglia 2003) follows the R
implementation, is the fastest one, and is used by default; Brent’s RNG
(Brent 2007) has higher quality but is not-so-fast; Matsumoto’s Mersenne
Twister (Matsumoto and Nishimura 1998) is slower than the others. In addition, we had to ensure
that different threads receive different random seeds. We generated
random seeds for the threads by combining random seeds generated by the
host with the threads’ unique identification numbers. Random deviates
from the normal (Gaussian) distribution and chi-squared distribution had
to be implemented in device code as well. Random deviates from the
normal distribution were generated using the Box-Muller method. In a
similar vein, random deviates from the chi-squared distribution with
The next Sections provide details on how to use
cudaBayesreg
(Ferreira da Silva 2010b) for fMRI data analysis. Two data sets, which are
included and documented in the complementary package
cudaBayesregData
(Ferreira da Silva 2010b), have been used in testing: the fmri
and the
swrfM
data sets. We performed MCMC simulations on these data sets
using three types of code implementations for the Bayesian multilevel
model specified before: a (sequential) R-language version, a
(sequential) C-language version, and a CUDA implementation. Comparative
runtimes for fmri
and swrfM
, are as follows.
Runtimes in seconds for 3000 iterations:
slice R-code C-code CUDA
fmri 3 1327 224 22
swrfM 21 2534 309 41
Speed-up factors between the sequential versions and the parallel CUDA implementation are summarized next.
Comparative speedup factors:
C-vs-R CUDA-vs-C CUDA-vs-R
fmri 6.0 10.0 60.0
swrfM 8.2 7.5 61.8
In these tests, the C-implementation provided, approximately, a
Makefile
. In particular, the optimization flag -O3
is set
there. It is worth noting that the CUDA implementation in
cudaBayesreg
affords much higher speedups. First, the CUDA implementation may easily
be modified to process all voxels of a fMRI volume in parallel, instead
of processing data in a slice-by-slice basis. Second, GPUs with 16
multiprocessors and 512 CUDA cores and
The data set fmri.nii.gz
is available from the FMRIB/FSL site
(www.fmrib.ox.ac.uk/fsl). This
data set is from an auditory–visual experiment. Auditory stimulation
was applied as an alternating “boxcar” with fmri_filtered_func_data.nii
included in
cudaBayesregData
was obtained from fmri.nii.gz
by applying FSL/FEAT pre-preprocessing
tools. For input/output of NIFTI formatted fMRI data sets
cudaBayesreg
depends on the R package
oro.nifti
(Whitcher et al. 2010). The following code runs the MCMC simulation for slice fmri
dataset, and saves the result.
> require("cudaBayesreg")
> slicedata <- read.fmrislice(fbase = "fmri",
+ slice = 3, swap = TRUE)
> ymaskdata <- premask(slicedata)
> fsave <- "/tmp/simultest1.sav"
> out <- cudaMultireg.slice(slicedata,
+ ymaskdata, R = 3000, keep = 5,
+ nu.e = 3, fsave = fsave, zprior = FALSE)
We may extract the posterior probability (PPM) images for the visual
(vreg=2
) and auditory (vreg=4
) stimulation as follows (see
Figures 1 and 2).
> post.ppm(out = out, slicedata = slicedata,
+ ymaskdata = ymaskdata, vreg = 2,
+ col = heat.colors(256))
> post.ppm(out = out, slicedata = slicedata,
+ ymaskdata = ymaskdata, vreg = 4,
+ col = heat.colors(256))
To show the fitted time series for a (random) active voxel, as depicted in Figure 3, we use the code:
> post.tseries(out = out, slicedata = slicedata,
+ ymaskdata = ymaskdata, vreg = 2)
range pm2: -1.409497 1.661774
Summary statistics for the posterior mean values of regression
coefficient vreg=2
, are presented next. The same function plots the
histogram of the posterior distribution for vreg=2
, as represented in
Figure 4.
> post.simul.hist(out = out, vreg = 2)
Call:
density.default(x = pm2)
Data: pm2 (1525 obs.); Bandwidth 'bw' = 0.07947
x y
Min. :-1.6479 Min. :0.0000372
1st Qu.:-0.7609 1st Qu.:0.0324416
Median : 0.1261 Median :0.1057555
Mean : 0.1261 Mean :0.2815673
3rd Qu.: 1.0132 3rd Qu.:0.4666134
Max. : 1.9002 Max. :1.0588599
[1] "active range:"
[1] 0.9286707 1.6617739
[1] "non-active range:"
[1] -1.4094965 0.9218208
hpd (95%)= -0.9300451 0.9286707
An important feature of the Bayesian model used in
cudaBayesreg is the
shrinkage induced by the hyperprior
> nu2 <- 45
> fsave2 <- "/tmp/simultest2.sav"
> out2 <- cudaMultireg.slice(slicedata,
+ ymaskdata, R = 3000, keep = 5,
+ nu.e = nu2, fsave = fsave2,
+ zprior = F)
> vreg <- 2
> x1 <- post.shrinkage.mean(out = out,
+ slicedata$X, vreg = vreg,
+ plot = F)
> x2 <- post.shrinkage.mean(out = out2,
+ slicedata$X, vreg = vreg,
+ plot = F)
> par(mfrow = c(1, 2), mar = c(4,
+ 4, 1, 1) + 0.1)
> xlim = range(c(x1$beta, x2$beta))
> ylim = range(c(x1$yrecmean, x2$yrecmean))
> plot(x1$beta, x1$yrecmean, type = "p",
+ pch = "+", col = "violet",
+ ylim = ylim, xlim = xlim,
+ xlab = expression(beta), ylab = "y")
> legend("topright", expression(paste(nu,
+ "=3")), bg = "seashell")
> plot(x2$beta, x2$yrecmean, type = "p",
+ pch = "+", col = "blue", ylim = ylim,
+ xlim = xlim, xlab = expression(beta),
+ ylab = "y")
> legend("topright", expression(paste(nu,
+ "=45")), bg = "seashell")
> par(mfrow = c(1, 1))
In this Section, we exemplify the analysis of the random effects
distribution swrfM
(argument
fbase="swrfM"
) in the package’s data directory, include mask files
associated with the partition of the fMRI dataset swrfM
in cudaMultireg.slice
with the
argument zprior=TRUE
. This argument will launch read.Zsegslice
, that
reads the segmented images (CSF/GRY/WHT) to build the
> fbase <- "swrfM"
> slice <- 21
> slicedata <- read.fmrislice(fbase = fbase,
+ slice = slice, swap = TRUE)
> ymaskdata <- premask(slicedata)
> fsave3 <- "/tmp/simultest3.sav"
> out <- cudaMultireg.slice(slicedata,
+ ymaskdata, R = 3000, keep = 5,
+ nu.e = 3, fsave = fsave3,
+ zprior = TRUE)
We confirm that areas of auditory activation have been effectively
selected by displaying the PPM image for regression variable vreg=2
.
> post.ppm(out = out, slicedata = slicedata,
+ ymaskdata = ymaskdata, vreg = 2,
+ col = heat.colors(256))
Plots of the draws of the mean of the random effects distribution are presented in Figure 7.
> post.randeff(out)
Random effects plots for each of the
> post.randeff(out, classnames = c("CSF",
+ "GRY", "WHT"))
Plots of the random effects associated with the
> post.randeff(out, classnames = c("CSF",
+ "GRY", "WHT"))
The CUDA implementation of the Bayesian model in
cudaBayesreg has
been able to reduce significantly the runtime processing of MCMC
simulations. For the applications described in this paper, we have
obtained speedup factors of
cudaBayesreg, bayesm, cudaBayesregData, oro.nifti
Bayesian, Cluster, Distributions, Econometrics, MedicalImaging
This article is converted from a Legacy LaTeX article using the texor package. The pdf version is the official version. To report a problem with the html, refer to CONTRIBUTE on the R Journal homepage.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Silva, "cudaBayesreg: Bayesian Computation in CUDA", The R Journal, 2010
BibTeX citation
@article{RJ-2010-015, author = {Silva, Adelino Ferreira da}, title = {cudaBayesreg: Bayesian Computation in CUDA}, journal = {The R Journal}, year = {2010}, note = {https://rjournal.github.io/}, volume = {2}, issue = {2}, issn = {2073-4859}, pages = {48-55} }