nmfgpu4R: GPU-Accelerated Computation of the Non-Negative Matrix Factorization (NMF) Using CUDA Capable Hardware

In this work, a novel package called nmfgpu4R is presented, which offers the computation of Non-negative Matrix Factorization (NMF) on Compute Unified Device Architecture (CUDA) platforms within the R environment. Benchmarks show a remarkable speed-up in terms of time per iteration by utilizing the parallelization capabilities of modern graphics cards. The application of NMF therefore becomes more attractive for real-world sized problems, because the time to compute a factorization is reduced by an order of magnitude.


Introduction
Dimension reduction techniques are commonly used in machine learning and data mining tasks. For instance, in text mining a corpus with thousands of words in the vocabulary could be too complex to be learned by Support Vector Machines (SVM) directly. Therefore the most important structure within the data must be extracted prior to the learning process. In the context of text mining, the new data axes at best represent topics in the corpus, which are used to approximate the original documents. Furthermore, by reducing the feature space of the data it is less likely to be influenced by the Curse of Dimensionality (CoD) (Bellman, 1961).
There are several methods to reduce the dimension of a data matrix, for example Principal Component Analysis (PCA) (Pearson, 1901) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Another powerful technique, namely Non-negative Matrix Factorization (NMF) (Lee and Seung, 1999), is discussed in the first section of this work. Currently available NMF implementations require a prohibitively long computation time, which makes their usage for real-world applications impractical. Therefore we present an implementation using the Compute Unified Device Architecture (CUDA) platform with a binding to the R environment. Furthermore, the package is developed to be platform independent and is compatible with all three major platforms for R: Windows, Linux and Mac OS X.

Overview of non-negative matrix factorization
Let X ∈ R_+^{n×m} be a matrix with n attributes and m observations in the dataset. The data matrix X is then approximated by the product of two new matrices W and H (Lee and Seung, 2001):

  X ≈ W H

Each column of the matrix W ∈ R_+^{n×r} represents a single basis vector, whereas each column of the matrix H ∈ R_+^{r×m} represents an encoding vector. Therefore a column of the data matrix can be approximated by the linear combination of all basis vectors with one encoding vector (Lee and Seung, 2001). The importance of each basis vector can be seen by analysing the row sums of matrix H: row sums with a low value identify basis vectors with very little influence on the dataset and vice versa (Skillicorn, 2007). It is also important to note that the data matrix as well as both matrices W and H contain only non-negative values.
Contrary to the general convention in data mining, NMF expects columns to represent observations of the dataset instead of attributes (Skillicorn, 2007), as visualized in Figure 1. For that reason it is very important to read the definition of the data matrix in the literature carefully.
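As a small base-R illustration of this convention (not specific to any NMF package), a conventional observations-in-rows matrix is simply transposed before it is used as the data matrix:

# The iris measurements in the common data mining layout:
# 150 observations in rows, 4 attributes in columns
data.rows <- as.matrix(iris[, 1:4])
dim(data.rows)   # 150 x 4

# Layout expected in the NMF literature: attributes in rows, observations in columns
X <- t(data.rows)
dim(X)           # 4 x 150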
Contrary to PCA or Singular Value Decomposition (SVD), the basis vectors are not linearly independent and thus the solution is not unique. However, the reconstruction of the data matrix is purely additive and yields a more natural, parts-based decomposition (Lee and Seung, 1999).
As the factorization should represent a compressed form of the original data matrix, one approach is to choose r depending on the number of rows and columns of the data matrix (Lee and Seung, 2001):

  (n + m) r < n m

In general, one should choose r ≪ m (Shahnaz et al., 2006). However, choosing the right parameter depends on the dataset and the usage of the factorization.

Pseudo-code
1. Choose the number of basis vectors r and an error function
2. Initialize the matrices W^(0) and H^(0) with non-negative values and set k = 0
3. While k < k_max and not converged:
   (a) Fix matrix W^(k) and compute matrix H^(k+1)
   (b) Fix matrix H^(k+1) and compute matrix W^(k+1)
   (c) Evaluate the error function to check for convergence
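To make the alternating structure of step 3 concrete, the following minimal R sketch mirrors the pseudo-code with random initialization and the Frobenius norm as error function. The functions update_H and update_W are placeholders for one of the update rules discussed in the following sections; the sketch is an illustration only, not the implementation used by nmfgpu4R.

nmf.sketch <- function(X, r, k.max = 2000, tol = 1e-4, update_H, update_W) {
  n <- nrow(X); m <- ncol(X)
  W <- matrix(runif(n * r), n, r)              # step 2: non-negative random values
  H <- matrix(runif(r * m), r, m)
  err.old <- Inf
  for (k in seq_len(k.max)) {                  # step 3
    H <- update_H(X, W, H)                     # (a) fix W, compute H
    W <- update_W(X, W, H)                     # (b) fix H, compute W
    err <- norm(X - W %*% H, type = "F")       # (c) evaluate error function
    if (abs(err.old - err) < tol) break        # simple delta-error convergence test
    err.old <- err
  }
  list(W = W, H = H, error = err, iterations = k)
}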

Initialization of factor matrices
Using a good initialization of the matrices can decrease the required number of iterations and further improve the factorization's quality. Depending on the chosen algorithm either only matrix W or both matrices need to be initialized.
Several different approaches have been presented to execute step 2 of the pseudo-code, the simplest one by Lee and Seung (1999, 2001), namely initializing both matrices just with random values. A more complex initialization uses the SVD of the data matrix (Boutsidis and Gallopoulos, 2008), a very expensive approach which should only be used if the SVD is already available (Langville et al., 2014). However, this initialization yields a unique factorization because the SVD is also unique.
In general, the convergence theory of NMF is not researched enough. For example, Lee and Seung (2001) had shown that the multiplicative update rules converge to a local minimum. However, Gonzalez and Zhang (2005) disproved that and clearly state that the algorithm is only proven to converge at most to a saddle point. In fact, most of the newer algorithms are only guaranteed to converge to a local minimum. This is mainly because NMF is a non-convex optimization problem (Lee and Seung, 2001): in each computation step only one of the two matrices gets updated, independently of the other one. Hence finding a global minimum is unlikely, but multiple local minima do exist. If the execution time of an algorithm is short enough, then a Monte-Carlo like approach can be chosen (Berry et al., 2007). That implies executing the algorithm multiple times using a different initialization each time and picking the factorization with the best quality.
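If a single run is fast enough, such a Monte-Carlo like restart strategy takes only a few lines of R. In the sketch below, run_nmf() is a stand-in for any concrete factorization routine (for example the sketch above) that returns a list containing an error element.

# Keep the factorization with the lowest error over several random restarts
best <- NULL
for (run in 1:10) {
  set.seed(run)                                # a different initialization per run
  fit <- run_nmf(X, r = 2)
  if (is.null(best) || fit$error < best$error) best <- fit
}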

Error function
In the literature, different error or loss functions are proposed. The most common are the Kullback-Leibler divergence (Lee and Seung, 1999) and the Frobenius norm (Paatero and Tapper, 1994; Lee and Seung, 2001). Since only the Frobenius norm is used in this work, the Kullback-Leibler divergence won't be discussed.
In an abstract sense, the Frobenius norm of a matrix A ∈ R_+^{n×m} is equal to the Euclidean distance of a vector a ∈ R_+^{n·m}. To be more precise, the Frobenius norm is the square root of the sum of all squared matrix elements (Reinhardt et al., 2013):

  ||A||_F = sqrt( Σ_{i=1..n} Σ_{j=1..m} a_{ij}^2 )

Besides this general definition there exist alternative representations, among others the representation using the trace of a matrix (Reinhardt et al., 2013):

  ||A||_F = sqrt( tr(A^T A) )

For optimized computation the widely used minimization problem min ||X − W H||_F^2 is rearranged using this equivalence:

  ||X − W H||_F^2 = tr(X^T X) − 2 tr(H^T W^T X) + tr(H^T W^T W H)

At first sight this form of the error function seems to be more expensive to compute, but actually most terms get computed during the algorithm execution anyway (Berry et al., 2007; Langville et al., 2014). Furthermore, the trace tr(X^T X) is constant and can be precomputed.
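The equivalence of both representations can be verified numerically with a few lines of R:

A <- matrix(runif(12), nrow = 3)
sqrt(sum(A^2))                  # square root of the sum of all squared elements
sqrt(sum(diag(t(A) %*% A)))     # representation via the trace of A^T A
norm(A, type = "F")             # base R's built-in Frobenius norm, identical value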
The following algorithms minimize the Frobenius norm, but can also easily be derived for other error functions.

Updating with multiplicative update rules
Multiplicative update rules were first described by Lee and Seung (1999, 2001) and are the fastest algorithms in terms of computational cost per iteration. In fact this type of algorithm is a special case of the gradient-descent algorithm with a specific step size (Lee and Seung, 2001). Both update rules for the matrices W and H are applied in an alternating fashion to solve steps 3a) and 3b) of the NMF pseudo-code:

  H^(k+1) = H^(k) ⊗ ( W^(k)T X ) ⊘ ( W^(k)T W^(k) H^(k) )
  W^(k+1) = W^(k) ⊗ ( X H^(k+1)T ) ⊘ ( W^(k) H^(k+1) H^(k+1)T )

where ⊗ denotes the element-wise matrix multiplication and ⊘ the element-wise matrix division. However, it is advised to add a small epsilon to the denominator, e.g. ε ≈ 10^−9 for double precision floating point values, to avoid divisions by zero (Berry et al., 2007). Referring to Table 3 in the implementation section, multiplicative update rules are used in mu and nsNMF for both matrices, and in gdcls only for matrix W.
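Written out in R, one iteration of the multiplicative update rules only needs element-wise operations. This is a plain CPU reference sketch of the Lee-Seung rules with the epsilon safeguard, not the CUDA code path of nmfgpu4R.

eps <- 1e-9   # avoids divisions by zero (Berry et al., 2007)

mu_update_H <- function(X, W, H) H * (t(W) %*% X) / (t(W) %*% W %*% H + eps)
mu_update_W <- function(X, W, H) W * (X %*% t(H)) / (W %*% H %*% t(H) + eps)

# One alternating step, e.g. inside the loop of the pseudo-code sketch above:
# H <- mu_update_H(X, W, H)
# W <- mu_update_W(X, W, H)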

Updating with alternating least squares
Alternating Least Squares (ALS) type algorithms are another approach to solve steps 3a) and 3b) of the NMF pseudo-code. The central idea is that for one given matrix the other one can be computed using a least-squares projection (Paatero and Tapper, 1994):
  W^(k)T W^(k) H^(k+1) = W^(k)T X            (8)
  H^(k+1) H^(k+1)T W^(k+1)T = H^(k+1) X^T    (9)

In the first step, Equation 8 gets solved for H^(k+1), whereby the computation of matrix W becomes possible. Equation 9 then gets solved for W^(k+1)T, followed by transposing the solution to acquire the matrix W^(k+1).
Since solving a linear equation system possibly yields negative values, the non-negativity constraint for both matrices W and H must be enforced after each solving step. One possible solution to this problem is to set all negative values to zero (Langville et al., 2014). Langville et al. (2014) describe ALS extensions like Alternating Constrained Least Squares (ACLS) and Alternating Hoyer-Constrained Least Squares (AHCLS), which use additional parameters to provide a more sparse factorization. Therefore both diagonal and non-diagonal values of the covariance matrices W^T W and H H^T get manipulated. For example, the AHCLS uses the additional parameters λ_W, λ_H, α_W and α_H to solve the following equations:

  ( W^(k)T W^(k) + λ_H β_H I − λ_H E ) H^(k+1) = W^(k)T X
  ( H^(k+1) H^(k+1)T + λ_W β_W I − λ_W E ) W^(k+1)T = H^(k+1) X^T

where I ∈ R^{r×r} denotes the identity matrix and E ∈ R^{r×r} a matrix of ones; furthermore, β_W and β_H are defined as

  β_W = ((1 − α_W) √r + α_W)^2 ,  β_H = ((1 − α_H) √r + α_H)^2

where α_W and α_H should represent the requested percentage of sparsity. As a head start, all four values should be set to 0.5.
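For comparison, one plain ALS step (without the ACLS/AHCLS penalty terms) can be sketched in R as follows. The two least-squares solves correspond to Equations 8 and 9, followed by the projection of negative entries to zero; again this is an illustrative CPU sketch only, not the GPU implementation.

als_step <- function(X, W) {
  # Solve W^T W H = W^T X for H and enforce non-negativity
  H <- solve(t(W) %*% W, t(W) %*% X)
  H[H < 0] <- 0
  # Solve H H^T W^T = H X^T for W^T, transpose and enforce non-negativity
  W <- t(solve(H %*% t(H), H %*% t(X)))
  W[W < 0] <- 0
  list(W = W, H = H)
}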
Once more referring to Table 3 in the implementation section, ALS update rules are used in als, acls, and ahcls for both matrices, in gdcls only for matrix H.

The NMF algorithm for R using CUDA: nmfgpu4R
There already exist some approaches to compute the NMF in R, for example the NMF (Gaujoux and Seoighe, 2010) and NMFN (Liu, 2012) packages on CRAN. However, both packages use the CPU for the computational process, and even with parallelization of multiple runs their usage for real-world datasets is limited.
CUDA-based implementations of NMF are already part of GPUMLib (Lopes and Ribeiro, 2010), which itself contains various machine learning algorithms for CUDA platforms. Currently, as of version 0.3.4, there are two algorithms available, one additive and one multiplicative, for both the Frobenius norm and the Kullback-Leibler divergence. Besides that, no complex initialization strategies or algorithms incorporating constraints are available. Furthermore, the computation of NMF is restricted to single precision format, which might not be suitable for every dataset.
In this work we propose a new package called nmfgpu4R, which is a binding to a separate library called nmfgpu written in C++11 using CUDA (version ≥ 7.0) for Nvidia GPUs with compute capability ≥ 3.0 (Kepler). When using CUDA, different build tools must be chosen depending on the platform. This limitation is induced by Nvidia's nvcc compiler, which only supports one host compiler per platform (nvcc itself is built on top of a host compiler). By splitting the package and the C++ library into two separate modules, it is possible to provide both nmfgpu4R and nmfgpu for all three major platforms: Windows, Linux, and Mac OS X.
Modern Graphics Processing Units (GPUs) can also be used as High Performance Computing (HPC) devices, using either OpenCL or CUDA. The latter is restricted to Nvidia hardware but is more common and can be integrated directly into C/C++ source code. One advantage of GPU over CPU parallelization is that algorithms have to be developed in a scalable and data-parallel way: synchronization and data transfer logic have to be handled explicitly by the developer, and therefore these algorithms are able to profit more from new and more powerful hardware generations. For more information about the CUDA platform please visit the Nvidia CUDA website.

Supported data matrix formats
Internally, the library computes the algorithms using dense matrices, so one option is to pass in a numeric matrix with proper dimensions. Furthermore, the nmfgpu4R package currently supports S4 classes from the Matrix package, developed by Bates and Maechler (2014), and the SparseM package, developed by Koenker and Ng (2015). A complete reference of the supported S4 classes is given in Table 1. It is important to note that sparse matrices get converted into dense matrices on the GPU side; at the moment, a computation using sparse algorithms does not take place at any time.

Storage format                   Matrix        SparseM
Dense                            "dgeMatrix"   -
Coordinate (COO)                 "dgTMatrix"   "matrix.coo"
Compressed Sparse Column (CSC)   "dgCMatrix"   "matrix.csc"
Compressed Sparse Row (CSR)      "dgRMatrix"   "matrix.csr"

However, this feature allows large sparse matrices to be converted much faster in GPU memory. For example, this might be quite useful for Bag-of-Words (BoW) models in text mining (Salton and Buckley, 1988) or Bag-of-Visual-Words (BoVW) models in image classification / retrieval (Cula and Dana, 2001), where the vocabulary is commonly very large but most frequencies are zero.

Customizing the initialization
Algorithms of the Non-negative Matrix Factorization solve a non-convex optimization problem. Thus choosing a good initialization can reduce the number of iterations and yield better results. In NMF four different initialization strategies are implemented. There are different approaches to choose an initialization for both matrices W and H. It is important to keep in mind that when an ALS type algorithm is chosen, only matrix W has to be initialized; matrix H will be computed in the first iteration from matrix W and the data matrix alone. All initializations supported by nmfgpu4R are listed in Table 2. The strategy CopyExisting can be used to provide user-defined initializations for both matrices W and H, which get copied directly into GPU memory. When using AllRandomValues, both matrices W and H get initialized with random values, which is the most common but also the simplest strategy (Pauca et al., 2006). Langville et al. (2014) presented a method called MeanColumns to form initial basis vectors from data columns. The idea behind this initialization is that if the data columns are sparse then the initial basis vectors should be sparse as well. Furthermore, k-means clustering can be used to find initial basis vectors (Gong and Nandi, 2013).

If matrix H has to be initialized in the context of a k-means based initialization, then there are different approaches. The most complex one is the EIn-NMF initialization, which computes the membership degree of each data column (Gong and Nandi, 2013).
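As a usage sketch, the initialization strategy is selected when calling nmf(); the argument name initMethod and the exact strategy strings below are assumptions based on Table 2 and should be checked against the package documentation (?nmf).

data <- t(as.matrix(iris[, 1:4]))   # attributes in rows, observations in columns

# Hypothetical calls -- argument name and strategy strings are assumptions:
# result <- nmf(data, 2, initMethod = "MeanColumns")
# result <- nmf(data, 2, initMethod = "K-Means/EIn-NMF")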

Using different algorithms
There are currently six different algorithms implemented in nmfgpu4R, because NMF models can be computed in different ways and, furthermore, can be restricted by constraints. Those algorithms which do have extra constraints can also be adjusted through parameters. In Table 3 all implemented algorithms and their corresponding publications are listed.
A few of these algorithms will be evaluated in the benchmark section, using two different image datasets. In general, the right choice of algorithm depends on the data and the noise within the data. For an overview of all required parameters for a specific algorithm, please have a look at the package documentation.
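A corresponding usage sketch for the algorithm selection: the calls below assume that the short names listed in Table 3 are passed through an algorithm argument of nmf(), with algorithm-specific constraint parameters supplied alongside; the exact argument and parameter names are assumptions and should be verified in the package documentation.

# Hypothetical calls -- argument and parameter names are assumptions, see ?nmf:
# result <- nmf(data, 2, algorithm = "mu")                   # multiplicative updates
# result <- nmf(data, 2, algorithm = "gdcls", lambda = 0.1)  # GD-CLS with regularization
# result <- nmf(data, 2, algorithm = "ahcls",                # AHCLS sparsity parameters
#               lambdaW = 0.5, lambdaH = 0.5, alphaW = 0.5, alphaH = 0.5)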

Adjusting convergence tests
Most NMF implementations only use the number of iterations as a convergence test, as this is very cheap to evaluate. However, for a mathematically correct convergence test an error function has to be computed and observed during the algorithm execution. In NMF there are four different stopping criteria implemented, which can also be combined. The nmfgpu4R package implements both approaches: the convergence test by observing an error function as the primary criterion, and an upper limit on the number of iterations as the secondary criterion.
Setting the threshold value can be done by passing the parameter threshold. This value is interpreted differently depending on the configured error function. Currently the Frobenius norm and the Root Mean Squared Deviation (RMSD) are supported. One advantage of the RMSD error function is that it is normalized by the number of data matrix elements and therefore independent of the data matrix dimension. By passing the parameter maxiter the maximum number of iterations can be overwritten, which is by default set to 2000. For example, the following call executes the algorithm until the delta error is less than 0.1 regarding the RMSD error function, but for at most 500 iterations:

result <- nmf(data, r, threshold=0.1, thresholdType="rmsd", maxiter=500)

Depending on the dataset, the ALS type algorithms are sometimes not stable and therefore not monotonically decreasing. In such a case the convergence test using the threshold value will not work properly.

Encoding matrix for new unseen data
A simple but effective method to calculate an encoding matrix for unseen data was described by Lopes and Ribeiro (2011), which allows NMF to be used within learning algorithms. Using this method, the training data gets factorized with a normal NMF step. The factorization step for the testing data, however, reuses the matrix W and only updates the matrix H. Thus the resulting matrix H is an encoding of the learned basis vectors from the training data. A complete scheme of the process is visualized in Figure 2.
As a result, the structures of both training and test data are preserved, but the feature dimension in matrix H can be reduced to a fraction of the original dimension. Hence, learning for example a Support Vector Machine (SVM) can be sped up, and furthermore the prediction accuracy can be improved.
In the following example code the nmf method is used to train the basis vectors for the training dataset. After that, the generic predict method can be used either to retrieve the encoding matrix H of the training data or to generate an encoding matrix for a new data matrix. The objective here is to reduce the 4 dimensions of the iris dataset (Fisher, 1936) to 2 dimensions:

# Split the iris dataset into training and test observations
set.seed(42)
idx <- sample(nrow(iris), 100)
data.train <- iris[idx,]
data.test <- iris[-idx,]

# Train the model on the training data (observations as columns)
model <- nmf(t(data.train[,-5]), 2)

# Retrieve the encoding matrix H of the training data and
# predict the encoding matrix H for the test data
encoding.train <- t(predict(model))
encoding.test <- t(predict(model, t(data.test[,-5])))

Using the iris dataset is just an example and should be replaced with a much larger dataset to fully utilize the GPU. Furthermore, an improvement in speed and possibly in accuracy over non-reduced data is more likely to be observed when the dimension is reduced by a larger magnitude. This example learns basis vectors from the training dataset and predicts the encoding matrix for the test dataset. To visualize the encoding matrices of both datasets and their relationships, a simple scatter plot can be produced with the following code:

# Plot encoding matrices
library(ggplot2)
data.plot <- data.frame(rbind(encoding.train, encoding.test),
                        class=unlist(list(iris[idx,5], iris[-idx,5])),
                        type=c(rep("Train", nrow(data.train)),
                               rep("Test", nrow(data.test))))
ggplot(data.plot, aes(x=r1, y=r2, color=class, shape=type)) + geom_point()

As shown in Figure 3, both datasets share the same structure: observations from each of the three classes are predicted to belong to the same area as the training observations.

Issues during development
The nmfgpu4R package provides a binding to an independent C++ library, which uses the latest features from the C++11 standard. In order to support multiple platforms, deploying an extra library is a necessary step, since the Nvidia CUDA compiler nvcc only supports the Microsoft Visual C++ compiler on Windows platforms, whereas R uses its own compilation tool chain and therefore does not allow the Microsoft Visual C++ compiler.
The main problem is that a C++ compiler emits object code which is not compatible with the object code of another compiler. R uses g++ from the MinGW tool chain, and therefore the two compiled binaries are not link-compatible: virtual tables are only compatible in some situations and struct returns simply do not work. Since the object code is not link-compatible, one must fall back to an extern "C" interface, which can then be loaded using native system calls like GetProcAddress on Windows or dlsym on Linux/Mac OS. Such issues do not come up on Linux or Mac OS, because on these platforms R uses the default configured compiler, which is also supported by the nvcc compiler.

Benchmarks
In this section multiple benchmarks are described, which were performed on the Yale Face Database (Belhumeur et al., 1997) and the Cropped Extended Yale Face Database B (Lee et al., 2005). As a preprocessing step all images were scaled to a common height of 64 pixels while preserving the aspect ratio. The resulting matrix dimensions can be taken from Table 4. For testing, a server with CentOS 7.2.1511, an Intel Xeon E5-2687W v3 @ 3.10GHz (10 physical cores), 256GB RAM, an Nvidia GeForce GTX Titan X and two Nvidia Tesla K80 was used. R is a custom build of version 3.3.1 using OpenBLAS as BLAS back-end.
In this benchmark the nmfgpu4R package (version 0.2.5.1) is compared to the CRAN packages NMF (version 0.20.6) and NMFN (version 2.0), which both provide CPU implementations of common NMF algorithms. The NMF package provides optimized C++ algorithms as well as pure R implementations. According to the package documentation, parallelization is only performed using clusters, for parallelizing multiple runs of the same algorithm with different initializations. In order to fully utilize the CPU cores, the pure R algorithms were benchmarked using an OpenBLAS back-end with OPENBLAS_NUM_THREADS=10. Algorithms from the NMFN package were modified to accept preinitialized matrices, so that all algorithms could be compared with identical starting points. Both the CPU and GPU algorithms were executed 10 times each.
As already stated in the previous section, Alternating Least Squares algorithms seem to perform poorly on very dense datasets, leading to a non-stable factorization or even no solution at all. The execution times of the ALS algorithms in nmfgpu4R are the highest of all GPU algorithms, but they are still very low compared to the ALS implementation in NMFN, as shown in Figure 4 (top). Furthermore, the optimized C++ algorithms in the NMF package are much slower when computed in sequential mode than the R implementations, which are accelerated by the multi-threaded OpenBLAS back-end.
Overall the multiplicative algorithm is the fastest algorithm on both GPU and CPU. Depending on the dataset it might be useful to compute the factorization in single precision format, because modern GPUs still have more single precision than double precision floating point units. As shown by Figure 4, GPUs of Nvidia's GeForce series are optimized for single precision calculations, which is sufficient for the end-user gaming experience; double precision computation is very limited on those cards, whereas the Tesla series also provides enough double precision units for fast calculations. As Table 5 indicates, there is no noticeable difference in terms of factorization quality, but a large difference in execution time. Small variations between the error functions can be caused by the computational ordering and, on the GPU side, by the dispatching of thread blocks.

Figure 1: NMF model which approximates the data matrix by a linear combination of basis vectors and an encoding matrix.

Figure 2: (a) Prediction of an encoding matrix for unseen data. The data matrix of the existing NMF model is "extended" by new data, but the basis vectors are fixed. (b) Data flow visualization of the prediction process in the context of an SVM (derived from Lopes and Ribeiro (2011)).

Figure 3: Scatter plots of the encoding matrices for the iris dataset (Fisher, 1936), reduced to 2 dimensions by the nmf method (left) and by the prcomp method (right).

Figure 4: Computation time for one iteration on the Yale Face Database with r = 32 (top) and the Cropped Extended Yale Face Database B with r = 128 (bottom), shown on a logarithmic scale.

Table 1: Supported S4 classes as input data matrix to nmfgpu4R.

Table 2: Supported initialization strategies for initializing matrices W and H.

Table 3: Overview of implemented algorithms in nmfgpu4R.

Table 4: Dimensions of the data matrices which were used to benchmark existing CPU implementations as well as the GPU implementations of the nmfgpu4R package.

Table 5: Benchmark results for the Yale Face Database with r = 32 features and the Cropped Extended Yale Face Database B with r = 128. Each measurement was taken at iteration 2000, with n = 10 computations.