In this work, a novel package called nmfgpu4R is presented, which offers the computation of Non-negative Matrix Factorization (NMF) on Compute Unified Device Architecture (CUDA) platforms within the R environment. Benchmarks show a remarkable speed-up in terms of time per iteration by utilizing the parallelization capabilities of modern graphics cards. This makes the application of NMF more attractive for real-world sized problems, because the time to compute a factorization is reduced by an order of magnitude.
Dimension reduction techniques are commonly used in machine learning and data mining tasks. For instance, in text mining a corpus with thousands of words in the vocabulary could be too complex to be learned by Support Vector Machines (SVM) directly. Therefore the most important structure within the data must be extracted prior to the learning process. In the context of text mining, the new data axes ideally represent topics in the corpus, which are used to approximate the original documents. Furthermore, by reducing the feature space of the data, the learning process is less likely to be influenced by the Curse of Dimensionality (CoD) (Bellman 1961).
There are several methods to reduce the dimension of a data matrix, for example Principal Component Analysis (PCA) (Pearson 1901) and Latent Dirichlet Allocation (LDA) (Blei et al. 2003). Another powerful technique, namely Non-negative Matrix Factorization (NMF) (Lee and Seung 1999), will be discussed in the first section of this work. Currently available NMF implementations require a prohibitively long computation time, which makes their use impractical for real-world applications. Therefore we present an implementation using the Compute Unified Device Architecture (CUDA) platform with a binding to the R environment. Furthermore, the package is developed to be platform independent and is compatible with all three major platforms for R: Windows, Linux and Mac OS X.
Let $X \in \mathbb{R}^{n \times m}_{\geq 0}$ be a non-negative data matrix with $n$ attributes and $m$ observations. NMF approximates it by two non-negative matrices $W \in \mathbb{R}^{n \times r}_{\geq 0}$ (basis vectors) and $H \in \mathbb{R}^{r \times m}_{\geq 0}$ (encoding matrix), such that $X \approx W H$.
Contrary to the general convention in data mining, NMF expects columns to represent observations of the dataset instead of attributes (Skillicorn 2007), as visualized in Figure 1. For that reason it is very important to read the data matrix definition in the literature carefully.
Contrary to PCA or Singular Value Decomposition (SVD), the basis vectors are not linearly independent and thus the solution is not unique. However, the reconstruction of the data matrix is purely additive and yields a more natural, parts-based decomposition (Lee and Seung 1999).
As the factorization should represent a compressed form of the original data matrix, one approach is to choose $r \ll \min(n, m)$.
1. Input: data matrix $X$ and factorization rank $r$
2. Initialize matrices $W$ and $H$
3. While the convergence criterion is not met and the maximum number of iterations is not reached:
   a) Fix matrix $H$ and update matrix $W$
   b) Fix matrix $W$ and update matrix $H$
   c) Evaluate error function to check for convergence
   d) Set iteration counter $k \leftarrow k + 1$
Using a good initialization of the matrices can decrease the required
number of iterations and further improve the factorization’s quality.
Depending on the chosen algorithm, either only matrix $W$ or both matrices $W$ and $H$ have to be initialized.
Several different approaches have been presented to execute step 2 of the pseudo-code, the simplest one by Lee and Seung (1999, 2001), namely initializing both matrices with random values. A more complex initialization uses the SVD of the data matrix (Boutsidis and Gallopoulos 2008), a very expensive approach which should only be used if the SVD is already available (Langville et al. 2014). However, this initialization yields a unique factorization because the SVD is also unique.
In general, the convergence theory of NMF is not yet sufficiently researched. For example, Lee and Seung (2001) showed that the multiplicative update rules converge to a local minimum. However, Gonzalez and Zhang (2005) disproved this and state that the algorithm is only proven to converge to, at most, a saddle point. In fact, most of the newer algorithms are only guaranteed to converge to a local minimum. This is mainly because NMF is a non-convex optimization problem (Lee and Seung 2001): in each computation step only one of the two matrices gets updated, independently of the other one. Hence finding a global minimum is unlikely, but multiple local minima do exist. If the execution time of an algorithm is short enough, a Monte-Carlo-like approach can be chosen (Berry et al. 2007): the algorithm is executed multiple times with a different initialization each time, and the factorization with the best quality is kept.
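A minimal sketch of this multi-start strategy in R is given below; `nmf_once` is a hypothetical placeholder for any single-run NMF routine that returns a list with elements `W`, `H` and `error`.

```r
# Sketch of the Monte-Carlo-like multi-start strategy: run an NMF routine
# several times from different random initializations and keep the
# factorization with the smallest error. `nmf_once` is a hypothetical
# placeholder supplied by the user.
multi_start_nmf <- function(X, r, runs = 10, nmf_once) {
  best <- NULL
  for (run in seq_len(runs)) {
    set.seed(run)                        # different initialization per run
    fit <- nmf_once(X, r)
    if (is.null(best) || fit$error < best$error) best <- fit
  }
  best
}
```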
In the literature, different error or loss functions are proposed. The most common are the Kullback-Leibler divergence (Lee and Seung 1999) and the Frobenius norm (Paatero and Tapper 1994; Lee and Seung 2001). Since only the Frobenius norm is used in this work, the Kullback-Leibler divergence will not be discussed further.
In an abstract sense, the Frobenius norm of a matrix $A \in \mathbb{R}^{n \times m}$ is defined as $\|A\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{m} a_{ij}^2}$. In the context of NMF it measures the reconstruction error between the data matrix $X$ and its approximation $W H$, for example as $\frac{1}{2}\,\|X - W H\|_F^2$.
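As a small toy example, this reconstruction error can be computed in plain R with random matrices (purely illustrative):

```r
# Toy example: Frobenius norm of the reconstruction error ||X - W H||_F
set.seed(1)
X <- matrix(runif(100 * 20), 100, 20)   # random non-negative data matrix
W <- matrix(runif(100 * 5), 100, 5)     # random basis matrix (r = 5)
H <- matrix(runif(5 * 20), 5, 20)       # random encoding matrix
norm(X - W %*% H, type = "F")
```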
The following algorithms minimize the Frobenius norm, but corresponding update rules can also easily be derived for other error functions.
Multiplicative update rules were first described by Lee and Seung (1999, 2001) and are the fastest algorithms in terms of computational cost per iteration. In fact, this type of algorithm is a special case of the gradient-descent algorithm with a specific step size (Lee and Seung 2001).
Both update rules for the matrices $W$ and $H$, which minimize the Frobenius norm, are given by
$$H \leftarrow H \circ \frac{W^T X}{W^T W H}, \qquad W \leftarrow W \circ \frac{X H^T}{W H H^T},$$
where $\circ$ denotes element-wise multiplication and the fraction element-wise division. These update rules are applied in the algorithms mu and nsNMF for both matrices, in gdcls only for matrix $W$.
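For illustration, a plain-R reference implementation of these multiplicative updates might look as follows; this is a didactic sketch of the textbook update rules, not the CUDA code used by nmfgpu4R.

```r
# Didactic CPU sketch of the multiplicative update rules (Frobenius norm).
# A small constant eps avoids division by zero.
multiplicative_nmf <- function(X, r, maxiter = 200, eps = 1e-9) {
  n <- nrow(X); m <- ncol(X)
  W <- matrix(runif(n * r), n, r)                      # random initialization
  H <- matrix(runif(r * m), r, m)
  for (k in seq_len(maxiter)) {
    H <- H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)   # fix W, update H
    W <- W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)   # fix H, update W
  }
  list(W = W, H = H, error = norm(X - W %*% H, "F"))
}
```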
Alternating Least Squares (ALS) type algorithms are another approach to solve steps 3a) and 3b) of the NMF pseudo-code. The central idea is that, for one given matrix, the other one can be computed using a least-squares projection (Paatero and Tapper 1994).
Since solving a linear equation system can yield negative values, the non-negativity constraint for both matrices $W$ and $H$ has to be enforced, for example by setting all negative elements to zero after each least-squares step.
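A minimal sketch of one such alternating step in plain R might look as follows; this is a didactic illustration of the basic ALS scheme with the simple set-negatives-to-zero projection, not the package's implementation.

```r
# Didactic sketch of one ALS iteration: solve each least-squares problem via
# the normal equations, then enforce non-negativity by zeroing negative values.
als_step <- function(X, W) {
  H <- solve(t(W) %*% W, t(W) %*% X)      # fix W, least-squares solution for H
  H[H < 0] <- 0
  W <- t(solve(H %*% t(H), H %*% t(X)))   # fix H, least-squares solution for W
  W[W < 0] <- 0
  list(W = W, H = H)
}
```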
Langville et al. (2014) describe ALS extensions like Alternating Constrained Least Squares (ACLS) and Alternating Hoyer Constrained Least Squares (AHCLS), which use additional parameters to provide a sparser factorization. Therefore both diagonal and non-diagonal values of the covariance matrices $W^T W$ and $H H^T$ can be penalized by these parameters. Least-squares updates of this kind are applied in the algorithms als, acls, and ahcls for both matrices, in gdcls only for matrix $H$.
There already exist some approaches to compute the NMF in R, for example the NMF (Gaujoux and Seoighe 2010) and NMFN (Liu 2012) packages on CRAN. However, both packages use the CPU for the computation and, even with parallelization of multiple runs, their use for real-world datasets is limited.
CUDA-based implementations of NMF are already part of the GPUMLib library; however, there are no existing bindings to the R environment.
In this work we propose a new package called nmfgpu4R, which is a binding to a separate C++ library called nmfgpu. This separation is necessary because of the Nvidia nvcc compiler, which only supports one host compiler per platform (nvcc itself is built on top of one compiler). By splitting the package and the C++ library into two separate modules, it is possible to provide both nmfgpu4R and nmfgpu for all three major platforms: Windows, Linux, and Mac OS X.
Modern Graphics Processing Units (GPUs) can also be used as High Performance Computing (HPC) devices using either OpenCL or CUDA. The latter is restricted to Nvidia hardware but is more common and can be integrated directly into C/C++ source code. One advantage of GPU over CPU parallelization is that algorithms have to be developed in a scalable and data-parallel manner. Synchronization and data transfer logic have to be handled by the developer, and therefore these algorithms are able to profit more from new and more powerful hardware generations. For more information about the CUDA platform please visit the Nvidia CUDA website at https://developer.nvidia.com/cuda-zone (last access: 18.04.2016).
Internally the library computes the algorithms using dense matrices, so one option is to pass in a numeric matrix with proper dimensions. Furthermore, the nmfgpu4R package currently supports S4 classes from the Matrix package, developed by Bates and Maechler (2014), and the SparseM package, developed by Koenker and Ng (2015). A complete overview of the supported S4 classes is given in Table 1. It is important to note that the sparse matrices get converted into dense matrices on the GPU side; at the moment, no computation on sparse representations takes place.
| Storage format | Matrix | SparseM |
|---|---|---|
| Dense | "dgeMatrix" | - |
| Coordinate (COO) | "dgTMatrix" | "matrix.coo" |
| Compressed Sparse Column (CSC) | "dgCMatrix" | "matrix.csc" |
| Compressed Sparse Row (CSR) | "dgRMatrix" | "matrix.csr" |
However, this feature allows large sparse matrices to be transferred to and converted in GPU memory much faster. For example, this can be quite useful for Bag-of-Words (BoW) representations in text mining (Salton and Buckley 1988) or Bag-of-Visual-Words (BoVW) representations in image classification / retrieval (Cula and Dana 2001), where the vocabulary is commonly very large but most frequencies are zero.
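As a small illustration, a sparse matrix from the Matrix package can be passed to nmf() directly. The matrix below is random toy data standing in for a term-document matrix, and the chosen rank is arbitrary; a CUDA-capable device and an initialized nmfgpu4R session are assumed.

```r
# Illustrative only: random sparse toy data in place of a real term-document matrix.
library(Matrix)
library(nmfgpu4R)
nmfgpu4R.init()

set.seed(123)
X <- rsparsematrix(5000, 2000, density = 0.01)   # ~1% non-zero entries
X@x <- abs(X@x)                                  # NMF requires non-negative data
model <- nmf(X, 20)                              # dgCMatrix is densified on the GPU
```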
Algorithms for the Non-negative Matrix Factorization solve a non-convex optimization problem. Thus choosing a good initialization can reduce the number of required iterations and yield better results. In the NMF package, four different initialization strategies are implemented. There are different approaches to choose an initialization for both matrices $W$ and $H$.
| Strategy | Matrix $W$ | Matrix $H$ |
|---|---|---|
| CopyExisting | Copy of a user-supplied matrix | Copy of a user-supplied matrix |
| AllRandomValues | Random | Random |
| MeanColumns | Mean of randomly selected data columns | Random |
| k-means/Random | k-means | Random |
| k-means/NonNegativeWTH | k-means | Non-negative $W^T X$ |
| EIn-NMF | k-means | EIn-NMF weighting |
All initializations supported by nmfgpu4R are listed in Table 2. The strategy CopyExisting can be used to provide user-defined initializations for both matrices $W$ and $H$.
There are currently six different algorithms implemented in nmfgpu4R, because NMF models can be computed in different ways and can furthermore be restricted by constraints. Algorithms with extra constraints can be adjusted through additional parameters. Table 3 lists all implemented algorithms and their corresponding publications.
| Method | Name | Publication |
|---|---|---|
| acls | Alternating Constrained Least Squares | Langville et al. (2014) |
| ahcls | Alternating Hoyer Constrained Least Squares | Langville et al. (2014) |
| als | Alternating Least Squares | Paatero and Tapper (1994) |
| gdcls | Gradient Descent Constrained Least Squares | Shahnaz et al. (2006) |
| mu | Multiplicative Update Rules (Frobenius norm) | Lee and Seung (2001) |
| nsnmf | non-smooth Non-negative Matrix Factorization | Pascual-Montano et al. (2006) |
A few of these algorithms will be evaluated in the benchmark section, using two different image datasets. In general the right choice of algorithm depends on the data and noise within the data. For an overview of all required parameters for a specific algorithm, please have a look at the package documentation.
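A sketch of how an algorithm might be selected is shown below; the argument name `algorithm` is an assumption here, the algorithm identifiers are taken from Table 3, and the actual parameter names are documented in the package.

```r
# Hypothetical sketch: the argument name "algorithm" is assumed; `data` and `r`
# stand for a prepared data matrix and the desired factorization rank.
model.mu  <- nmf(data, r, algorithm = "mu")
model.als <- nmf(data, r, algorithm = "als")
```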
Most NMF implementations only use the number of iterations as a convergence test, as this is a very cheap test. However, for a mathematically correct convergence test, an error function has to be computed and observed during the algorithm execution. In the NMF package, four different stopping criteria are implemented, which can also be combined. The nmfgpu4R package implements two criteria: a convergence test based on observing an error function as the primary criterion, and an upper limit on the number of iterations as the secondary criterion.
The threshold value can be set by passing in the parameter threshold. This value is interpreted differently depending on the configured error function; currently the Frobenius norm and the Root Mean Squared Deviation (RMSD) are supported. One advantage of the RMSD error function is that it is normalized by the number of data matrix elements and therefore independent of the data matrix dimension. By passing in the parameter maxiter, the default maximum number of iterations can be overwritten.
result <- nmf(data, r, threshold=0.1, thresholdType="rmsd", maxiter=500)
Depending on the dataset, the ALS-type algorithms are sometimes not stable and the error is therefore not monotonically decreasing. In such a case the convergence test using the threshold value will not work properly.
A simple but effective method to calculate an encoding matrix for unseen data was described by Lopes and Ribeiro (2011), which allows NMF to be used within learning algorithms. Using this method, the training data gets factorized in a normal NMF step. The factorization of the test data, however, reuses the matrix $W$ obtained from the training data and only computes the corresponding encoding matrix $H$. As a result, structures between both training and test data are preserved, while the feature dimension of the encoding matrix is reduced to $r$.
In the following example code the nmf method is used to train the basis vectors on the training dataset. After that, the generic predict method can be used either to retrieve the encoding matrix of the training data or to compute the encoding matrix for the unseen test data. In this example the iris dataset (Fisher 1936) is reduced to 2 dimensions.
# Set seed for reproducible results
set.seed(42)
# Split iris dataset into training and test data
idx <- sample(1:nrow(iris), 100, replace=F)
data.train <- iris[idx,-5]
data.test <- iris[-idx,-5]
# Compute model and retrieve encoding matrix H for both training and test data
library(nmfgpu4R)
nmfgpu4R.init()
model <- nmf(t(data.train), 2)
encoding.train <- t(predict(model)) # Identical: encoding.train <- t(model$H)
encoding.test <- t(predict(model, t(data.test)))
# Use encoding matrices to predict "Species"
library(e1071)
model.svm <- best.svm(x=encoding.train, y=iris$Species[idx])
prediction <- predict(model.svm, encoding.test)
table(iris[-idx,5], prediction)
Using the iris
dataset is just an example and should be replaced with
a much larger dataset to fully utilize the GPU. Furthermore an
improvement in speed and possibly in accuracy over non-reduced data is
more likely to be observed when the dimension is reduced by a larger
magnitude.
This example learns basis vectors from a training dataset and predicts the encoding matrix for the test dataset. To visualize the encoding matrices of both datasets and their relationships, a simple scatter plot can be made with the following code:
# Plot encoding matrices
library(ggplot2)
data.plot <- data.frame(rbind(encoding.train, encoding.test),
                        class=unlist(list(iris[idx,5], iris[-idx,5])),
                        type=c(rep("Train", nrow(data.train)),
                               rep("Test", nrow(data.test))))
ggplot(data.plot, aes(x=r1, y=r2, color=class, shape=type)) + geom_point()
As shown in Figure 3, both datasets share the same structure. Observations from each of the three classes are predicted to belong to the same area as the training observations.
Figure 3: The iris dataset reduced to 2 dimensions by the nmf method (left) and by the prcomp method (right).
The nmfgpu4R package provides a binding to an independent C++ library, which uses the latest C++ features from the C++11 standard. In order to support multiple platforms, deploying an extra library is a necessary step, since the Nvidia CUDA compiler nvcc only supports the Microsoft Visual C++ compiler on Windows. R, however, uses its own compilation tool chain and therefore does not allow the Microsoft Visual C++ compiler.
The main problem is that C++ compilers emit object code which is not compatible with the object code of another compiler. On Windows, R uses g++ from the MinGW tool chain, and therefore the compiled binaries are not link-compatible: virtual tables are only compatible in some situations and struct returns simply do not work. Furthermore, since the object code is not link-compatible, one must fall back to an extern "C" interface, which can then be loaded using native system calls like GetProcAddress on Windows or dlsym on Linux/Mac OS. Such issues do not come up on Linux or Mac OS, because on these platforms R uses the default configured compiler, which is also supported by the nvcc compiler.
In this section multiple benchmarks are described, which were performed on the Yale Face Database (Belhumeur et al. 1997) and the Cropped Extended Yale Face Database B (Lee et al. 2005). As a preprocessing step, all images were scaled to a common height of 64 pixels while preserving the aspect ratio. The resulting matrix dimensions can be taken from Table 4.
| Dataset | Pixels (orig.) | Pixels (scaled) | Count | Matrix dimension |
|---|---|---|---|---|
| Yale Face Database | | | | |
| Extended Yale Face Database B | | | | |
For testing, a server with CentOS 7.2.1511, an Intel Xeon E5-2687W v3 @ 3.10GHz (10 physical cores), 256GB RAM, an Nvidia GeForce GTX Titan X and two Nvidia Tesla K80 was used. R is a custom build of version 3.3.1 using OpenBLAS as its BLAS back-end.
In this benchmark the nmfgpu4R package (version 0.2.5.1) is compared to the CRAN packages NMF (version 0.20.6) and NMFN (version 2.0), which both provide CPU implementations of common NMF algorithms. The NMF package provides optimized C++ algorithms as well as pure R implementations. According to the package documentation, parallelization is only performed using clusters for parallelizing multiple runs of the same algorithm with different initializations. In order to fully utilize the CPU cores, pure R algorithms were benchmarked using an OpenBLAS back-end with OPENBLAS_NUM_THREADS=10. Algorithms from the NMFN package were modified to accept pre-initialized matrices, in order to compare the algorithms with identical starting points. Both the CPU and GPU algorithms were executed 10 times each.
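A minimal sketch of such a timing run in R might look as follows; `faces` (data matrix), `r` (rank) and the iteration count are placeholders for the actual benchmark configuration, and only elapsed wall-clock time is recorded.

```r
# Hypothetical timing sketch: repeat one factorization 10 times and record
# the elapsed wall-clock time of each run.
library(nmfgpu4R)
nmfgpu4R.init()
times <- replicate(10, system.time(nmf(faces, r, maxiter = 100))["elapsed"])
summary(times)
```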
| Device | Package | Algorithm | Elapsed time (Yale Face Database) | Elapsed time (Extended Yale Face Database B+) |
|---|---|---|---|---|
| Intel Xeon E5-2687W v3 @3.10GHz | NMF | .R#brunet | | |
| | | .R#lee | | |
| | | .R#nsNMF ($\theta$ = …) | | |
| | | brunet | | |
| | | lee | | |
| | | nsNMF ($\theta$ = …) | | |
| | NMFN | nnmf_als | | |
| | | nnmf_mm | | |
| Nvidia GeForce GTX TITAN X (Maxwell) | nmfgpu4R (double) | als | | |
| | | gdcls ($\lambda$ = …) | | |
| | | mu | | |
| | | nsNMF ($\theta$ = …) | | |
| | nmfgpu4R (float) | als | | |
| | | gdcls ($\lambda$ = …) | | |
| | | mu | | |
| | | nsNMF ($\theta$ = …) | | |
| Nvidia Tesla K80 (Kepler) | nmfgpu4R (double) | als | | |
| | | gdcls ($\lambda$ = …) | | |
| | | mu | | |
| | | nsNMF ($\theta$ = …) | | |
| | nmfgpu4R (float) | als | | |
| | | gdcls ($\lambda$ = …) | | |
| | | mu | | |
| | | nsNMF ($\theta$ = …) | | |
As already stated in the previous section, Alternating Least Squares algorithms seem to perform poorly on very dense datasets, leading to a non-stable factorization or even no solution at all. The execution times of the ALS algorithms in nmfgpu4R are the highest of all GPU algorithms, but they are still very low compared to the ALS implementation in NMFN, as shown in Figure 4 (top). Furthermore, the optimized C++ algorithms in the NMF package are much slower when computed in sequential mode than the R implementations, which are accelerated by the multi-threaded OpenBLAS back-end.
Overall, the multiplicative update algorithm is the fastest algorithm on both GPU and CPU. Depending on the dataset it might be useful to compute the factorization in single-precision format, because modern GPUs still have more single-precision than double-precision floating point units. As shown in Figure 4, GPUs of Nvidia's GeForce series are optimized for single-precision calculations, which is sufficient for end-user gaming; double-precision throughput is very limited on those cards, whereas the Tesla series provides enough double-precision units for fast calculations. As Table 5 indicates, there is no noticeable difference in factorization quality, but a substantial difference in execution time. Small variations in the error values can be caused by the order of computation and, on the GPU side, by the dispatching of thread blocks.
In this work a new possibility to compute Non-negative Matrix Factorizations (NMF) using CUDA hardware is presented. As shown in the benchmarks, the performance gain is remarkable, and therefore much larger datasets can be reduced without having to wait weeks or even months for completion. Currently the implementation is only limited by the available memory on the device, because all algorithms work directly in device memory without transferring intermediate results back to the host. Possible extensions to this library/package could make use of out-of-core computation and multiple CUDA devices, either to compute one distributed factorization or multiple factorizations with different initializations. Furthermore, more complex algorithms and initialization strategies could be implemented in the future.
CRAN packages used: NMF, NMFN, nmfgpu4R, Matrix, SparseM
CRAN Task Views implied by cited packages: Econometrics, NumericalMathematics
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Koitka & Friedrich, "nmfgpu4R: GPU-Accelerated Computation of the Non-Negative Matrix Factorization (NMF) Using CUDA Capable Hardware", The R Journal, 2016
BibTeX citation
@article{RJ-2016-053,
  author  = {Koitka, Sven and Friedrich, Christoph M.},
  title   = {nmfgpu4R: GPU-Accelerated Computation of the Non-Negative Matrix Factorization (NMF) Using CUDA Capable Hardware},
  journal = {The R Journal},
  year    = {2016},
  note    = {https://rjournal.github.io/},
  volume  = {8},
  issue   = {2},
  issn    = {2073-4859},
  pages   = {382-392}
}