SparkGLRM is a Spark package for modeling and fitting generalized low rank models (GLRMs). A GLRM approximates a matrix as the product of two low rank factor matrices, and generalizes many well known models in machine learning, such as principal components analysis (PCA), matrix completion, robust PCA, nonnegative matrix factorization, and k-means.
For more information on GLRMs, see our paper.
SparkGLRM makes it easy to mix and match loss functions and regularizers to construct a model suitable for a particular data set. In particular, it supports
- using different loss functions for different entries of the matrix, which is useful when data types are heterogeneous
- fitting the model to only some of the entries in the table, which is useful for data tables with many missing (unobserved) entries
- custom regularization of the model
The matrix to be factored is split entry-wise across many machines. The model (the factors X and Y) is replicated and held in memory on every machine. Thus the total computation time required to fit the model is proportional to the number of non-zeros divided by the number of cores, with the restriction that the model should fit in memory on a single machine. Where possible, hardware acceleration is used for local linear algebraic operations, via breeze.
At every iteration, the current model is broadcast to all machines, so that there is only one copy of the model on each machine. This is particularly important on machines with many cores, because it avoids duplicating the model for every core on those machines. Each core on a machine then processes a partition of the input matrix using the local copy of the model.
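The broadcast-and-partition pattern is easy to see in miniature. The sketch below is not taken from SparkGLRM; the function name, parameter layout, and residual computation are all illustrative, showing only how a model held as local arrays can be broadcast once per iteration and read by every partition of the entry-wise-distributed matrix:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import breeze.linalg.DenseVector

// Illustrative only: broadcast a factor model once per iteration and let
// each partition of the entry RDD read the single local copy.
def iterationSketch(sc: SparkContext,
                    entries: RDD[(Int, Int, Double)],
                    x: Array[DenseVector[Double]],
                    y: Array[DenseVector[Double]]): Unit = {
  val bx = sc.broadcast(x) // one copy per machine, shared by all of its cores
  val by = sc.broadcast(y)
  val residuals = entries.mapPartitions { part =>
    part.map { case (i, j, aij) =>
      // Each task reads the broadcast model rather than keeping its own copy
      (i, j, bx.value(i).dot(by.value(j)) - aij)
    }
  }
  residuals.count() // force evaluation in this sketch
}
```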
To compile and run, execute the following from the Spark root directory. To compile:

```
sbt/sbt assembly
```

To run with 4 GB of RAM:

```
./bin/spark-submit --class org.apache.spark.examples.SparkGLRM \
  --executor-memory 4G \
  --driver-memory 4G \
  ./examples/target/scala-2.10/spark-examples-1.1.0-SNAPSHOT-hadoop1.0.4.jar
```
GLRMs form a low rank model for a matrix A with M rows and U columns, which can be input as an RDD of non-zero entries. It is fine if only some of the entries have been observed (i.e., the others are missing); the GLRM will only be fit on the observed entries. The desired model is specified by choosing a rank `rank` for the model, a loss function `lossGrad`, and two regularizers, `moviesProx` and `usersProx`. The data is modeled as XY, where X is an M x `rank` matrix and Y is a `rank` x U matrix. X and Y are found by solving the optimization problem

```
minimize  sum_{(i,j) in A} loss(i, j, X[i,:] Y[:,j], A[i,j])
            + sum_i moviesProx(X[i,:]) + sum_j usersProx(Y[:,j])
```
To fit a GLRM in SparkGLRM, the user specifies

- the data `A`, via its observed entries
- the dimensions `M` and `U`
- the loss function (sub)gradient `lossGrad`, indexed by `i` and `j` (see the sketch after this list)
- the regularizers `moviesProx` and `usersProx`
- the rank `rank`
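The exact signatures are defined in SparkGLRM.scala; the following is only a rough sketch of the shapes involved, with parameter names and types assumed for illustration rather than taken from the package:

```scala
import breeze.linalg.DenseVector

// Assumed shape, for illustration only: a per-entry loss (sub)gradient takes
// the entry's indices, the current prediction x_i . y_j, and the observed
// value, and returns the derivative of the loss with respect to the prediction.
def lossL2squaredGradSketch(i: Int, j: Int, prediction: Double, actual: Double): Double =
  2.0 * (prediction - actual)

// Assumed shape of a proximal operator: map a factor row, after its gradient
// step, to the nearby point favored by the regularizer. For the quadratic
// regularizer r(v) = (regPen / 2) * ||v||^2 this is simple shrinkage.
def proxL2Sketch(v: DenseVector[Double], stepSize: Double, regPen: Double): DenseVector[Double] =
  v / (1.0 + stepSize * regPen)
```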
There are currently several losses and regularizers implemented, including:

- L1 loss: `lossL1Grad`
- squared L2 loss: `lossL2squaredGrad`
- mixed loss: `mixedLossGrad`, a demonstration of a per-entry loss function
- quadratic regularization: `proxL2`
- L1 regularization: `proxL1`
- nonnegativity constraint: `proxNonneg`

Users may also implement their own losses and regularizers; see SparkGLRM.scala for more details.
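As a purely hypothetical example of a user-defined loss (it is not shipped with the package, and it reuses the per-entry signature assumed in the sketch above), a Huber gradient would be a natural fit for data with outliers:

```scala
// Hypothetical user-defined loss, following the assumed (i, j, prediction,
// actual) shape sketched earlier: Huber loss is quadratic near zero and
// linear in the tails, so it is less sensitive to outliers than squared loss.
def lossHuberGrad(i: Int, j: Int, prediction: Double, actual: Double): Double = {
  val delta = 1.0 // transition point between the quadratic and linear regimes
  val err = prediction - actual
  if (math.abs(err) <= delta) err else delta * math.signum(err)
}
```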
For example, the following code builds the parameters and observed entries in preparation for fitting a model:
```scala
// Number of movies
val M = 1000
// Number of users
val U = 1000
// Number of non-zeros per row
val NNZ = 10
// Number of features
val rank = 5
// Number of iterations
val numIterations = 100
// Regularization parameter
val regPen = 0.1
// Number of partitions for the data; set to the number of cores in the cluster
val numChunks = 4

// Build non-zeros: NNZ randomly placed entries per row, with random values
val A = sc.parallelize(0 until M, numChunks).flatMap { i =>
  val inds = new scala.collection.mutable.TreeSet[Int]()
  while (inds.size < NNZ) {
    inds += scala.util.Random.nextInt(U)
  }
  inds.toArray.map(j => (i, j, scala.math.random))
}
```
To fit the model with squared error loss and quadratic regularization with `rank = 5` on the matrix A, call:

```scala
val (ms, us) = fitGLRM(A, M, U, lossL2squaredGrad, proxL2, proxL2, rank, numIterations, regPen)
```
This runs an alternating directions proximal gradient method to find the `ms` and `us` that minimize the objective function.
To see how well the model performs, compute the RMSE over the observed entries:

```scala
// Output RMSE using the learned model
val finalRMSE = math.sqrt(A.map { case (i, j, rij) =>
  val err = ms(i).dot(us(j)) - rij
  err * err
}.mean())
```
The input is a sparse matrix, and so missing data are simply left out of the sparse representation.
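For instance (a minimal illustration with made-up indices and values), a matrix whose only observed entries are three cells would be passed as:

```scala
// Only these three entries are observed; every other (i, j) pair is treated
// as missing and contributes nothing to the loss.
val observed = sc.parallelize(Seq(
  (0, 2, 3.5),
  (1, 0, 1.0),
  (4, 7, 0.5)
))
```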
The algorithm implemented is alternating proximal gradient, with more details in our paper: at each iteration, every row of X takes a gradient step on the loss over its observed entries and then applies its proximal operator, and the columns of Y are updated symmetrically.
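Written out from the objective above (this is the generic alternating proximal gradient step for this problem, not copied from the package; the step size alpha_k and the exact calling convention of `lossGrad` are assumptions), the row update is:

```latex
x_i^{k+1} = \operatorname{prox}_{\alpha_k r_x}\Big( x_i^k
  - \alpha_k \sum_{j \,:\, (i,j)\ \text{observed}}
    \mathrm{lossGrad}\big(i, j,\; x_i^k \cdot y_j^k,\; A_{ij}\big)\, y_j^k \Big)
```

where r_x is the regularizer whose proximal operator is `moviesProx`; the update for y_j uses `usersProx` in the same way.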
The function fitGLRM uses an alternating directions proximal gradient method to minimize the objective. This method is not guaranteed to converge to the optimum, or even to a local minimum. If your code is not converging, or is converging to a model you dislike, there are a number of parameters you can tweak.

The algorithm starts with X and Y initialized randomly. If you have a good guess for a model, try setting them explicitly. If you think you're getting stuck in a local minimum, try reinitializing your GLRM (so as to construct a new initial random point) and see if the model you obtain improves.

The step size controls the speed of convergence: too small a step size slows convergence, while too large a step size can cause divergence.
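One cheap way to act on the restart advice, assuming (as the text above suggests) that each call to fitGLRM draws a fresh random initialization, is to fit several times and keep the factorization with the lowest training RMSE. A sketch, reusing only the calls shown above:

```scala
// Illustrative restart loop: fitGLRM initializes X and Y randomly, so
// repeated calls explore different starting points. Keep the best model
// by training RMSE, computed as in the snippet above.
val candidates = (1 to 3).map { _ =>
  val (ms, us) = fitGLRM(A, M, U, lossL2squaredGrad, proxL2, proxL2,
                         rank, numIterations, regPen)
  val rmse = math.sqrt(A.map { case (i, j, rij) =>
    val err = ms(i).dot(us(j)) - rij
    err * err
  }.mean())
  (rmse, ms, us)
}
val (bestRMSE, bestMs, bestUs) = candidates.minBy(_._1)
```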