Bayesian Non-Negative Matrix Factorization
Being Unsure About What To Recommend
Non-Negative Matrix Factorization
In deterministic NMF each entry of the parameters is a fixed weight, i.e. a point estimate. In Bayesian NMF each entry is instead a distribution, for which we will use the trusty Normal distribution $\mathcal{N}(\mu, \sigma)$. Every entry in the matrices $P, Q$, the vectors $u, i$ and the scalar $b$ thus becomes a distribution, i.e.
\[P_{uk} \sim \mathcal{N}(\mu_{uk}, \sigma_{uk}) \\ Q_{ki} \sim \mathcal{N}(\mu_{ki}, \sigma_{ki}) \\ u_u \sim \mathcal{N}(\mu_{u}, \sigma_{u}) \\ i_i \sim \mathcal{N}(\mu_{i}, \sigma_{i}) \\ b \sim \mathcal{N}(\mu_{b}, \sigma_{b}) \\\]The aim is now to find the optimal $\mu$'s and $\sigma$'s for all the parameters in $P, Q, u, i$ and $b$. In the previous post we optimized the mean squared error between our predicted matrix $\hat{R}$ and the real matrix $R$ with gradient descent. As it turns out, we can train the distributions in the same way by exploiting one of the many conveniences of the Normal distribution: the reparameterization trick. I explained the reparameterization trick in detail in another blog post and will simply refer to it here.
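To make this concrete before we get to training, here is a tiny sketch (sizes made up) of what "every entry is a distribution" means for the factor matrix $P$:
import torch

# Sketch with made-up sizes (5 users, 3 latent factors): every entry of P now has
# its own mean and standard deviation instead of a single point estimate.
P_mu = torch.rand(5, 3)                      # one mean per entry of P
P_sigma = torch.full((5, 3), 0.1)            # one standard deviation per entry of P
P_dist = torch.distributions.Normal(P_mu, P_sigma)
P_sample = P_dist.sample()                   # one possible realization of P
print(P_sample.shape)                        # torch.Size([5, 3])
Note that .sample() is not differentiable with respect to P_mu and P_sigma, which is exactly why we need the reparameterization trick below.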
In order to allow PyTorch (shout out to the devs!) to backpropagate to the variational parameters $\mu$ and $\sigma$, we have to reparameterize every entry in the parameters.
\[\begin{align*} P_{uk} &= \mu_{uk} + \mathcal{E} \cdot \sigma_{uk} \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ Q_{ki} &= \mu_{ki} + \mathcal{E} \cdot \sigma_{ki} \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ u_u &= \mu_u + \mathcal{E} \cdot \sigma_u \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ i_i &= \mu_i + \mathcal{E} \cdot \sigma_i \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ b &= \mu_b + \mathcal{E} \cdot \sigma_b \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ \end{align*}\]We can write this more compactly in matrix notation with the element-wise multiplication operator $\odot$ by combining the $\mu$'s and $\sigma$'s into matrices, e.g. $P_\mu$ and $P_\sigma$:
\[\begin{align*} P &= P_\mu + \mathcal{E} \odot P_\sigma \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ Q &= Q_\mu + \mathcal{E} \odot Q_\sigma \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ u &= u_\mu + \mathcal{E} \odot u_\sigma \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ i &= i_\mu + \mathcal{E} \odot i_\sigma \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ b &= b_\mu + \mathcal{E} \odot b_\sigma \quad ; \mathcal{E} \sim \mathcal{N}(0,1) \\\\ \end{align*}\]We can now sample predictions by first sampling the above matrices and vectors and then combining them into:
\[\begin{align*} \hat{R} &= P \, Q + u \, \mathbf{1}_I^\top + \mathbf{1}_U \, i^\top + b \end{align*}\]with $\hat{R} \in \mathbb{R}^{U \times I}, P \in \mathbb{R}^{U \times K}, Q \in \mathbb{R}^{K \times I}, u \in \mathbb{R}^{U}, i \in \mathbb{R}^I, b \in \mathbb{R}$ and $\mathbf{1}_U, \mathbf{1}_I$ vectors of ones. Pay special attention to the broadcasting: the vector $u \in \mathbb{R}^U$ is added column-wise to the matrix product $PQ$, the vector $i \in \mathbb{R}^I$ is added row-wise, and $b$ is simply added to every entry of $PQ$. The vector $u$ collects a scalar bias for every user, which is added to each of her/his recommendations along the corresponding row of $\hat{R}$. The entries of $i$, on the other hand, are item biases and are added down the corresponding columns of $\hat{R}$.
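Here is a small sketch of one such sampled prediction (sizes made up; as in the full implementation further down, an unconstrained $\rho$ is pushed through a softplus to obtain a positive $\sigma$, and $u$ is kept as a column of shape $[U, 1]$ and $i$ as a row of shape $[1, I]$ so that PyTorch's broadcasting adds them column-wise and row-wise respectively):
import torch
import torch.nn.functional as F

# One reparameterized sample of the prediction matrix, with made-up sizes.
U, I, K = 5, 4, 3
P_mu, P_rho = torch.rand(U, K), torch.zeros(U, K)   # variational parameters of P
Q_mu, Q_rho = torch.rand(K, I), torch.zeros(K, I)   # variational parameters of Q
u_mu, u_rho = torch.rand(U, 1), torch.zeros(U, 1)   # user biases
i_mu, i_rho = torch.rand(1, I), torch.zeros(1, I)   # item biases
b_mu, b_rho = torch.rand(1), torch.zeros(1)         # global bias

def sample(mu, rho):
    # Reparameterization: mu + eps * softplus(rho), with eps ~ N(0, 1)
    return mu + torch.randn_like(mu) * F.softplus(rho)

P = sample(P_mu, P_rho)
Q = sample(Q_mu, Q_rho)
u = sample(u_mu, u_rho)
i = sample(i_mu, i_rho)
b = sample(b_mu, b_rho)

R_hat = P @ Q + u + i + b                            # broadcasting adds u column-wise, i row-wise
print(R_hat.shape)                                   # torch.Size([5, 4])
Since the sample is an affine function of the $\mu$'s and $\rho$'s, gradients of any loss on $\hat{R}$ flow straight back into the variational parameters.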
We can sample multiple prediction matrices $\hat{R}$ by repeatedly sampling the corresponding parameters and combining them into a prediction. From these samples we can estimate the mean and standard deviation:
\[\begin{align*} \mu_{\hat{R}} &= \mathbb{E}[ \hat{R} ] \approx \frac{1}{N} \sum_{n=1}^N \hat{R}_n \\\\ \sigma_{\hat{R}} &= \sqrt{\mathbb{V}[ \hat{R} ]} \approx \sqrt{ \frac{1}{N} \sum_{n=1}^N ( \hat{R}_n - \mu_{\hat{R}})^2} \\\\ \end{align*}\]Once we have the mean and standard deviation of the prediction, we can construct a cost function through the negative log-likelihood of the real matrix $R$ under the estimated distribution of $\hat{R}$:
\[\begin{align*} \mathcal{L}(\hat{R}, R) = - \sum_{u,i}^{U,I} \delta_{ui} \left( \log \frac{1}{\sqrt{2\pi \sigma_{\hat{R}_{ui}}^2}} - \frac{1}{2 \sigma_{\hat{R}_{ui}}^2} (R_{ui} - \mu_{\hat{R}_{ui}})^2 \right) \end{align*}\]where $\delta_{ui}$ is an indicator which is 1 if the entry $R_{ui}$ is in the training data and 0 if the entry $R_{ui}$ is to be predicted. $\delta_{ui}$ thus filters the non-existent data points out of the cost function. Depending on the dimensionality and on whether or not we are also estimating covariances (which we are not in this case), $N=10$ or $N=20$ samples are enough to stabilize the cost function. An important coding tweak is to compute the samples of $\hat{R}$ in parallel by creating a three-dimensional noise tensor $\mathcal{E}$ with an extra 'sample' dimension. That sample dimension is then used to compute $\mu_{\hat{R}}$ and $\sigma_{\hat{R}}$ efficiently without resorting to for-loops, which can be very slow in Python.
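Put together, the Monte Carlo estimates and the masked negative log-likelihood could look roughly like this (a sketch with a toy rating matrix and stand-in samples; `data_mask` plays the role of $\delta_{ui}$):
import numpy as np
import torch

# Toy rating matrix; zeros mark missing entries, data_mask plays the role of delta_ui.
R = torch.tensor([[5., 3., 0.],
                  [0., 2., 4.]])
data_mask = (R != 0).float()

# Stand-in for N sampled prediction matrices stacked along a leading 'sample' dimension.
N = 20
R_hat_samples = 3 + torch.randn(N, *R.shape)

mu = R_hat_samples.mean(dim=0)                # element-wise Monte Carlo mean
sigma = R_hat_samples.std(dim=0)              # element-wise Monte Carlo standard deviation
log_lik = torch.log(1 / torch.sqrt(2 * np.pi * sigma ** 2)) - (R - mu) ** 2 / (2 * sigma ** 2)
loss = -(data_mask * log_lik).sum()           # only observed entries contribute to the loss
print(loss.item())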
After computing the cost function we can let PyTorch do its automatic differentiation magic and optimize the whole thing with an optimizer such as Adam. The full implementation looks like this:
import numpy as np
import torch
import torch.nn.functional as F

FloatTensor = torch.FloatTensor

np.set_printoptions(precision=4)


class BayesNMF(torch.nn.Module):
    def __init__(self, R, K, lr, iterations):
        """
        Perform Bayesian matrix factorization to predict empty
        entries in a matrix.

        Arguments
        - R (FloatTensor)  : user-item rating matrix
        - K (int)          : number of latent dimensions
        - lr (float)       : learning rate
        - iterations (int) : number of gradient descent steps
        """
        super().__init__()
        self.R = R                                # The recommendation matrix
        self.data_mask = torch.zeros_like(R)      # Mask with 1's where we have data points
        self.data_mask[R != 0] = 1
        self.missing_mask = torch.zeros_like(self.R)
        self.missing_mask[R == 0] = 1             # Mask with 1's where we don't have data points
        self.num_users, self.num_items = R.shape  # Number of users and items
        self.K = K                                # Number of latent dimensions
        self.lr = lr                              # Learning rate for the optimizer
        self.iterations = iterations              # Number of gradient descent steps
        exponential_dist_lambda = 5               # Rate of the exponential distribution used to initialize the means with small positive values
        init_std = 0                              # Initial value of rho; softplus(0) = log(2)

        # Parameters with mu and untransformed standard deviations: sigma = log(1 + exp(rho))
        self.P_mu = torch.nn.Parameter(FloatTensor(self.num_users, self.K).exponential_(exponential_dist_lambda))
        self.P_rho = torch.nn.Parameter(FloatTensor(self.num_users, self.K).fill_(init_std))
        self.Q_mu = torch.nn.Parameter(FloatTensor(self.num_items, self.K).exponential_(exponential_dist_lambda))
        self.Q_rho = torch.nn.Parameter(FloatTensor(self.num_items, self.K).fill_(init_std))
        self.b_u_mu = torch.nn.Parameter(FloatTensor(self.num_users, 1).exponential_(exponential_dist_lambda))
        self.b_u_rho = torch.nn.Parameter(FloatTensor(self.num_users, 1).fill_(init_std))
        self.b_i_mu = torch.nn.Parameter(FloatTensor(self.num_items, 1).exponential_(exponential_dist_lambda))
        self.b_i_rho = torch.nn.Parameter(FloatTensor(self.num_items, 1).fill_(init_std))
        self.b_mu = torch.nn.Parameter(FloatTensor([0.1]))
        self.b_rho = torch.nn.Parameter(FloatTensor([init_std]))

        # Optimizer for gradient descent
        self.optim = torch.optim.Adam(self.parameters(), lr=self.lr)

        # Data for stochastic gradient descent updates instead of full matrix updates
        self.samples = [(i, j, self.R[i, j])
                        for i in range(self.num_users)
                        for j in range(self.num_items)
                        if R[i, j] > 0]
        np.random.shuffle(self.samples)

        self.forward_type = 1
        # Number of prediction samples computed in parallel
        self.num_MC_samples = 20

    def forward2(self, _u, _i):
        # Sample multiple parameters for a single (user, item) pair with the reparameterization trick in parallel
        b = self.b_mu + torch.randn(self.num_MC_samples, *self.b_rho.shape) * F.softplus(self.b_rho)
        b_i = self.b_i_mu[_i] + torch.randn(self.num_MC_samples, *self.b_i_rho[_i].shape) * F.softplus(self.b_i_rho[_i])
        b_u = self.b_u_mu[_u] + torch.randn(self.num_MC_samples, *self.b_u_rho[_u].shape) * F.softplus(self.b_u_rho[_u])
        P = self.P_mu[_u] + torch.randn(self.num_MC_samples, *self.P_rho[_u].shape) * F.softplus(self.P_rho[_u])
        Q = self.Q_mu[_i] + torch.randn(self.num_MC_samples, *self.Q_rho[_i].shape) * F.softplus(self.Q_rho[_i])
        prediction = torch.bmm(P.view(self.num_MC_samples, 1, self.K), Q.view(self.num_MC_samples, self.K, 1)).squeeze(-1)
        prediction += b_u
        prediction += b_i
        prediction += b
        return prediction

    def forward1(self):
        '''
        Sample multiple parameters in parallel
        :return: Prediction matrix of shape [N_MC, U, I]
        '''
        b = self.b_mu + torch.randn(self.num_MC_samples, *self.b_rho.shape) * F.softplus(self.b_rho)
        b_i = self.b_i_mu + torch.randn(self.num_MC_samples, *self.b_i_rho.shape) * F.softplus(self.b_i_rho)
        b_u = self.b_u_mu + torch.randn(self.num_MC_samples, *self.b_u_rho.shape) * F.softplus(self.b_u_rho)
        P = self.P_mu + torch.randn(self.num_MC_samples, *self.P_rho.shape) * F.softplus(self.P_rho)
        Q = self.Q_mu + torch.randn(self.num_MC_samples, *self.Q_rho.shape) * F.softplus(self.Q_rho)
        pred = torch.bmm(P, Q.transpose(1, 2))   # [N_MC, U, I]
        pred += b_u                              # user biases, broadcast across the item dimension
        pred += b_i.transpose(1, 2)              # item biases, broadcast across the user dimension
        pred += b.unsqueeze(-1)                  # global bias
        pred *= self.data_mask
        # Add a little noise in order to prevent NaN's during backpropagation;
        # NaN's are caused by masking and the resulting std=0 for missing values
        pred += 1e-10 * torch.randn(*pred.shape) * self.missing_mask
        return pred

    def criterion(self, _pred, _label):
        '''
        :param _pred: prediction matrix of shape [N_MC, U, I]
        :param _label: label matrix of shape [U, I]
        :return: scalar loss for gradient descent
        '''
        mu = _pred.mean(dim=0)
        sigma = _pred.std(dim=0)
        loss = -torch.sum(torch.log(torch.sqrt(1 / (2 * np.pi * sigma.pow(2)))) - 1. / (2 * sigma ** 2) * (_label - mu) ** 2)
        mse_loss = F.mse_loss(_pred, _label.expand_as(_pred))  # MSE loss as an interpretable training signal
        return loss, mse_loss

    def train_params(self):
        # Perform stochastic gradient descent for a number of iterations
        for e in range(self.iterations):
            loss_ = 0
            for (i, j, r) in self.samples:
                self.zero_grad()
                pred = self.forward1()
                loss, mse_loss = self.criterion(pred, self.R)
                # Alternative prediction by sampling a single data point instead of the entire prediction matrix
                # pred = self.forward2(i, j)
                # loss, mse_loss = self.criterion(pred, r)
                loss.backward()
                self.optim.step()
                loss_ += loss.item()
            if e % 1000 == 0 and e > 0:
                self.print_variational_matrix()
            if (e + 1) % 100 == 0:
                print("Iteration: %d ; loss = %.4f ; mse_loss=%.4f" % (e + 1, loss_, mse_loss.item()))

    def full_matrix(self):
        """
        Compute a single full prediction matrix from sampled biases, P and Q
        """
        b = self.b_mu + torch.randn(*self.b_rho.shape) * F.softplus(self.b_rho)
        b_i = self.b_i_mu + torch.randn(*self.b_i_rho.shape) * F.softplus(self.b_i_rho)
        b_u = self.b_u_mu + torch.randn(*self.b_u_rho.shape) * F.softplus(self.b_u_rho)
        P = self.P_mu + torch.randn(*self.P_rho.shape) * F.softplus(self.P_rho)
        Q = self.Q_mu + torch.randn(*self.Q_rho.shape) * F.softplus(self.Q_rho)
        return_ = torch.matmul(P, Q.t())
        return_ += b_u
        return_ += b_i.t()
        return_ += b
        return return_.detach()

    def print_variational_matrix(self):
        '''
        Print the mean and standard deviation of the prediction matrix
        '''
        # Sample 100 different prediction matrices
        Rs = torch.stack([self.full_matrix() for _ in range(100)])
        print('Mean:')
        print(torch.mean(Rs, dim=0).detach().numpy())
        print('Std:')
        print(np.around(torch.std(Rs, dim=0).detach().numpy(), 3))


R = FloatTensor([[5, 3, 5, 0],
                 [4, 5, 5, 5],
                 [3, 0, 0, 1],
                 [0, 0, 0, 0],
                 [0, 1, 0, 4]])

mf = BayesNMF(R, K=3, lr=0.001, iterations=10000)
mf.train_params()
mf.print_variational_matrix()
The recommendation matrix $R$ in the code snippet above has a tweak: the fourth row does not contain any entries, so every item has to be recommended to user 4. Upon inspection of the mean $\mu_{\hat{R}}$ and standard deviation $\sigma_{\hat{R}}$ we can see that the standard deviation in the fourth row is markedly increased:
\[\sigma_{\hat{R}} = \left( \begin{array}{cccc} 0.000 & 0.001 & 0.000 & 0.000 \\ 0.000 & 0.000 & 0.000 & 0.000 \\ 0.918 & 1.358 & 0.833 & 0.974 \\ 0.001 & 0.001 & 0.000 & 0.000 \\ \end{array}\right)\]Because the user hasn't reviewed anything, we cannot leverage any latent structure through Bayesian Non-Negative Matrix Factorization.
Similarly, when predicting the matrix
\[R = \left( \begin{array}{cccc} 5& 3& 5& 0 \\ 4& 5& 5& 0\\ 3& 0& 0& 0\\ 0& 0& 0& 1\\ 0& 1& 0& 0\\ \end{array} \right)\]we will obtain a standard deviation matrix $\sigma_{\hat{R}}$ with increased standard deviations in the fourth row and the last column:
\[\sigma_{\hat{R}} = \left( \begin{array}{cccc} 0 & 0 & 0 & 0.003 \\ 0 & 0 & 0 & 0.005 \\ 0 & 0 & 0 & 0.001 \\ 0.003 & 0.006 & 0.002 & 0.001 \\ 0 & 0 & 0 & 0.001 \\ \end{array} \right)\]Since user 4 does not share any common likes with any other user, the standard deviation is increased for all other movies in the fourth row. Similarly, movie 4, which only user 4 has reviewed, has an increased standard deviation for all other users since not a single one of them has seen it.
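Just to illustrate what these standard deviations could buy us in practice, here is a purely hypothetical post-processing step (tensors and threshold made up): recommend the highest-mean unseen item for every user, but flag the recommendation when its predicted standard deviation is large.
import torch

# Hypothetical post-processing of the estimated mean and std of the predictions.
mu = torch.rand(5, 4)              # stand-in for mu_R_hat from a trained model
sigma = torch.rand(5, 4)           # stand-in for sigma_R_hat
seen = torch.zeros(5, 4).bool()    # True where the user has already rated the item

scores = mu.masked_fill(seen, float('-inf'))            # never re-recommend seen items
best_items = scores.argmax(dim=1)                       # top recommendation per user
too_unsure = sigma[torch.arange(5), best_items] > 0.5   # flag shaky recommendations
print(best_items, too_unsure)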
As a closing remark I just want to say that the gradient descent based variational inference approach presented here is not the best one available. Shinichi Nakajima, a member of my lab, published an analytic solution, which can be somewhat involved. The core motivation of this post was to see whether we could use the reparameterization trick and whether the factorization would exhibit the expected behaviour with regard to independent entries in $R$.