TENSOR DECOMPOSITION WITH MISSING INDICES

Chaoran  Huang
Learning Group


山口 祐人
Yuto Yamaguchi, Ph.D.

Software Engineer, Indeed Japan

林 浩平
Kohei Hayashi, D.Eng

Researcher, AIST, Tokyo, Japan

Motivation

Tensor data in real-world applications are often incomplete

Missing Values
   can be estimated by standard decomposition techniques.


Missing Indices
   only 0.42% of tweets have location information.
         [Cheng et al. 2010]


Incomplete entries are usually discarded

Basics

Assumed knowledge and definitions

Squared-loss tensor decomposition

$\mathcal{X}\in(\mathbb{R}\cup\{\emptyset\})^{I\times J\times K}$ is the partially observed data tensor; the decomposition model $\hat{\mathcal{X}}:\mathcal{P}\rightarrow\mathbb{R}^{I\times J\times K}$ maps parameters $\Theta\in\mathcal{P}$ to a reconstruction.

$\hat{\Theta} = \underset{\Theta\in\mathcal{P}}{\mathrm{argmin}}~L(\mathcal{X};\Theta)$, where
   $L(\mathcal{X};\Theta)= \frac{1}{2}\sum_{(i,j,k) \in D}\big(\mathcal{X}_{i,j,k}-\hat{\mathcal{X}}(\Theta)_{i,j,k}\big)^2$ and $D$ is the set of observed indices


Rank-$R$ CP Decomposition ($R \in \mathbb{N}$)

$\Theta = \{U, V, W\}$
$U\in\mathbb{R}^{I\times{R}}$, $V\in\mathbb{R}^{J\times{R}}$, $W\in\mathbb{R}^{K\times{R}}$

$\hat{\mathcal{X}}(\Theta)_{i,j,k} = \sum_{r=1}^{R}U_{ir}V_{jr}W_{kr}$
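As a quick illustration of the two definitions above, a minimal NumPy sketch (mine, not from the paper) of the rank-$R$ CP reconstruction and the squared loss over observed entries; the variable names and the random toy data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3

# Factor matrices: U is I x R, V is J x R, W is K x R
U = rng.normal(size=(I, R))
V = rng.normal(size=(J, R))
W = rng.normal(size=(K, R))

# Rank-R CP reconstruction: X_hat[i, j, k] = sum_r U[i, r] * V[j, r] * W[k, r]
X_hat = np.einsum('ir,jr,kr->ijk', U, V, W)

# Toy observed tensor; the boolean mask D marks the observed entries
X = X_hat + 0.1 * rng.normal(size=(I, J, K))
D = rng.random((I, J, K)) < 0.7

# Squared loss: L(X; Theta) = 1/2 * sum over observed (i, j, k) of (X - X_hat)^2
loss = 0.5 * np.sum((X[D] - X_hat[D]) ** 2)
print(loss)
```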


For more details, see Learning Group - 19 Jun., 2017

Tensor Decomposition with Missing Indices

Denote the $N \in \mathbb{N}$ observations by $X=\{x_n\}_{n=1}^{N}$ (the values) and $\hat{Z}=\{(\hat{i}_n, \hat{j}_n, \hat{k}_n)\}_{n=1}^{N}$ (the corresponding, possibly missing, indices);
a missing index is denoted by $\emptyset$.

$\hat{Z} \in \big(([I]\cup\{\emptyset\})\times([J]\cup\{\emptyset\})\times([K]\cup\{\emptyset\})\big)^{N}$

Tensor decomposition model: $(\mathcal{P},\hat{\mathcal{X}})$

Oracle: $\mathcal{O}$, which returns the true indices


$\hat{\Theta} = \underset{\Theta\in\mathcal{P}}{\mathrm{argmin}}~L(X,Z;\Theta)$,
   where $Z = \mathcal{O}(\hat{Z})$ and $L(X,Z;\Theta) = \frac{1}{2}\sum_{n}\big(x_n-\hat{\mathcal{X}}(\Theta)_{i_n,j_n,k_n}\big)^2$
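To make the missing-index setting concrete, a small sketch of my own: observations are stored as values plus partially missing index triples (`None` standing in for $\emptyset$), and a hypothetical `oracle` callable plays the role of $\mathcal{O}$.

```python
# Each observation: (value x_n, (i_n or None, j_n or None, k_n or None))
observations = [
    (1.3, (0, 2, 1)),        # all indices observed
    (0.7, (None, 4, 2)),     # first index missing
    (2.1, (3, None, None)),  # two indices missing
]

def loss_with_oracle(observations, oracle, X_hat):
    """L(X, Z; Theta) once the oracle O has resolved each index triple."""
    total = 0.0
    for x_n, z_hat in observations:
        i, j, k = oracle(z_hat)              # Z = O(Z_hat): the true indices
        total += 0.5 * (x_n - X_hat[i, j, k]) ** 2
    return total
```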

Methodology

A probabilistic generative model for tensor decomposition

Proposed Model


Hypothesis

Taking $i$ as an example:
$p_I(i_n|\hat{i}_n) = \begin{cases} \delta(i_n,\hat{i}) &(\hat{i}_n = \hat{i} \in [I])\\ \mathrm{Unif}(1,I) &(\hat{i}_n = \emptyset) \end{cases}$

Similarly, we define $p_J$ and $p_K$.
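A sketch of the hypothesized $p_I(\cdot|\hat{i}_n)$ as a probability vector over $[I]$: a point mass when the index is observed, uniform when it is $\emptyset$. The helper name `p_index` and 0-based indexing are my own choices.

```python
import numpy as np

def p_index(i_hat, size):
    """p_I(. | i_hat): delta(i, i_hat) if observed, uniform over all indices if i_hat is None."""
    if i_hat is None:                 # i_hat = emptyset
        return np.full(size, 1.0 / size)
    probs = np.zeros(size)
    probs[i_hat] = 1.0
    return probs

print(p_index(2, 4))     # [0. 0. 1. 0.]
print(p_index(None, 4))  # [0.25 0.25 0.25 0.25]
```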

Generative Process

Generate $\Theta \sim p_\Theta(\cdot|\lambda)$
For $n$ from $1$ to $N$:
   - Generate $i_n \sim p_I(\cdot|\hat{i}_n)$
   - Generate $j_n \sim p_J(\cdot|\hat{j}_n)$
   - Generate $k_n \sim p_K(\cdot|\hat{k}_n)$
   - Generate $x_n \sim N(\cdot|\hat{\mathcal{X}}_{i_n,j_n,k_n},1)$

where $N(\cdot|\mu,\sigma^2)$ is the Gaussian distribution and $p_\Theta(\cdot|\lambda)$ is the prior distribution of $\Theta$.
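The generative process can be simulated directly; below is a hedged sketch of mine that assumes the factor prior $p_\Theta(\cdot|\lambda)$ is the zero-mean Gaussian used later for CP.

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R, lam = 4, 5, 6, 3, 1.0

# Theta ~ p_Theta(.|lambda): here zero-mean Gaussian factors with precision lambda
U = rng.normal(0.0, lam ** -0.5, size=(I, R))
V = rng.normal(0.0, lam ** -0.5, size=(J, R))
W = rng.normal(0.0, lam ** -0.5, size=(K, R))

def generate(z_hat):
    """One step of the generative process for a single (i_hat, j_hat, k_hat)."""
    i_hat, j_hat, k_hat = z_hat
    i = i_hat if i_hat is not None else rng.integers(I)   # i_n ~ p_I(.|i_hat)
    j = j_hat if j_hat is not None else rng.integers(J)   # j_n ~ p_J(.|j_hat)
    k = k_hat if k_hat is not None else rng.integers(K)   # k_n ~ p_K(.|k_hat)
    x = rng.normal(U[i] @ (V[j] * W[k]), 1.0)              # x_n ~ N(X_hat_{ijk}, 1)
    return x, (i, j, k)

x_n, z_n = generate((None, 4, 2))
```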

The authors state

   $\underset{\Theta\in\mathcal{P}}{\mathrm{argmin}}~L(X,Z;\Theta) = \underset{\Theta\in\mathcal{P}}{\mathrm{argmax}}~p(X|Z;\Theta) $

   $\mathrm{ln}~p(X|Z;\Theta) = -\frac{1}{2}\sum_{n}(x_n-\hat{\mathcal{X}}(\Theta)_{i_n,j_n,k_n})^2 + C $

 

While I would rather say

   $\underset{\Theta\in\mathcal{P}}{\mathrm{argmin}}~L(X,Z;\Theta) \propto \underset{\Theta\in\mathcal{P}}{\mathrm{argmax}}~p(X|Z;\Theta)$

To prevent overfitting, MAP estimation is used here instead of MLE [ref1, ref2]. Since the oracle $\mathcal{O}$ is unavailable in practice, the true indices $Z$ are treated as latent variables to be marginalized out:
$\underset{\Theta\in\mathcal{P}}{\mathrm{argmax}}~p(X|Z;\Theta) \longrightarrow \hat{\Theta}_{MAP} = \underset{\Theta}{\mathrm{argmax}}~p(\Theta|X,\hat{Z},\lambda)$.

Since $p(\Theta|X,\hat{Z},\lambda) \propto p(X,\hat{Z},\Theta|\lambda)$,
working with the log-likelihood, we have $\underset{\Theta}{\mathrm{argmax}}~\mathrm{ln}~p(X,\hat{Z},\Theta|\lambda)$,
and
$\begin{align} & \mathrm{ln}~p(X,\hat{Z},\Theta|\lambda)\\ = & \mathrm{ln}\sum_{Z}p(X,Z,\hat{Z}|\Theta) + \mathrm{ln}~p(\Theta|\lambda) \end{align}$
where
$\begin{align} & \mathrm{ln}\sum_{Z}p(X,Z,\hat{Z}|\Theta) \\= & \sum_{n} \mathrm{ln} \sum_{i_n,j_n,k_n} N(x_n|\hat{\mathcal{X}}_{i_n,j_n,k_n},1)\,p_I(i_n|\hat{i}_n)\,p_J(j_n|\hat{j}_n)\,p_K(k_n|\hat{k}_n) \end{align}$
The log of a sum over the latent indices makes direct optimization difficult, which motivates the variational EM procedure below.
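To see the marginalization concretely, a brute-force sketch of mine for a single term $\mathrm{ln}\sum_{i_n,j_n,k_n} N(x_n|\hat{\mathcal{X}}_{i_n,j_n,k_n},1)\,p_I\,p_J\,p_K$; the prior vectors are assumed to come from something like the earlier hypothetical `p_index` helper.

```python
import numpy as np
from scipy.stats import norm

def marginal_loglik_term(x_n, X_hat, pi, pj, pk):
    """ln sum_{i,j,k} N(x_n | X_hat[i,j,k], 1) p_I(i) p_J(j) p_K(k),
    where pi, pj, pk are the p_I(.|i_hat), p_J(.|j_hat), p_K(.|k_hat) vectors."""
    prior = pi[:, None, None] * pj[None, :, None] * pk[None, None, :]  # shape (I, J, K)
    lik = norm.pdf(x_n, loc=X_hat, scale=1.0)                          # shape (I, J, K)
    return np.log(np.sum(prior * lik))
```

Summing this term over $n$ and adding $\mathrm{ln}~p(\Theta|\lambda)$ gives the MAP objective above.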

Parameter Inference

An algorithm based on Variational MAP-EM [ref]:

Rewrite the log joint distribution as
$\mathrm{ln}~p(X,\hat{Z},\Theta|\lambda) = L[q,\Theta] + KL(q||p)$,
where
$L[q,\Theta] = E_{q(Z)}[\mathrm{ln}~\frac{p(X,Z,\hat{Z},\Theta|\lambda)}{q(Z)}]$ ,
$KL(q||p) = E_{q(Z)}[\mathrm{ln}~\frac{q(Z)}{p(Z|X,\hat{Z},\Theta,\lambda)}]$ ,
and $q$ is any distribution over $Z$.

Since $KL(q||p) \geq 0$, the log joint distribution is always at least $L[q,\Theta]$, which is called the Variational Lower Bound (VLB).

In the E-step:

maximize $L[q,\Theta]$ with respect to $q(Z)$ with $\Theta$ fixed.

Assume a mean-field approximation $q(Z)=\prod_{n}q(i_n)q(j_n)q(k_n)$
and differentiate $L[q,\Theta]$ with respect to
  $q(i_n=i)$, $q(j_n=j)$, and $q(k_n=k)$ accordingly.

Taking $q(i_n=i)$ as an example, the update equation is:

$q(i_n=i) = \frac{p_I(i_n=i|\hat{i}_n)e^{a_{n,i}}}{\sum_{i'}p_I(i_n=i'|\hat{i}_n)e^{a_{n,i'}}}$,
where
$a_{n,i} = - \frac{1}{2}E_{q(j_n,k_n)}[(x_n-\hat{\mathcal{X}}_{i,j_n,k_n})^2]$

Similarly, $q(j_n)$ and $q(k_n)$ are updated and the process is repeated.
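The $q(i_n=i)$ update is just a normalization of $p_I(i|\hat{i}_n)\,e^{a_{n,i}}$ over $i$; a small sketch assuming a vector `a` of the $a_{n,i}$ values and a vector `p` of the $p_I(\cdot|\hat{i}_n)$ probabilities (names are mine).

```python
import numpy as np

def update_q_i(a, p):
    """E-step update: q(i_n=i) proportional to p_I(i|i_hat_n) * exp(a_{n,i}).

    a: vector of a_{n,i} for all i; p: the p_I(.|i_hat_n) vector (delta or uniform)."""
    w = p * np.exp(a - a.max())    # subtracting max(a) only for numerical stability
    return w / w.sum()
```

When $\hat{i}_n$ is observed, the delta prior makes $q(i_n)$ collapse back onto the observed index, as expected.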

In the M-step:

maximize $L[q,\Theta]$ with respect to $\Theta$ with $q(Z)$ fixed as obtained in the E-step.

Differentiating $L[q,\Theta]$ with respect to $\theta\in\Theta$, we have

$\begin{align} &\frac{\partial L[q,\Theta]}{\partial\theta} \\ = & \sum_{n}\sum_{i_n,j_n,k_n}q(i_n,j_n,k_n)(x_n-\mathcal{\hat{X}}_{i_n,j_n,k_n})\frac{\partial\mathcal{\hat{X}}_{i_n,j_n,k_n}}{\partial\theta}+\frac{\partial\mathrm{ln}~p(\Theta|\lambda)}{\partial\theta} \end{align}$

Parameter Inference for CP Decomposition

In CP Decomposition, $\Theta = \{U, V, W\}$ ;

So, in the E-step, we have
$\begin{align} a_{n,i} & = x_n U_i^\top (E_q[V_{j_n}] \circledast E_q[W_{k_n}])\\ & - \frac{1}{2} U_i^\top (E_q[V_{j_n}V_{j_n}^\top] \circledast E_q[W_{k_n} W_{k_n}^\top]) U_i \end{align}$ where $\circledast$ denotes the Hadamard product (see page 7, Learning Group - 19 Jun., 2017).
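In code, $a_{n,i}$ for all $i$ needs only the first and second moments of $V_{j_n}$ and $W_{k_n}$ under $q$; a hedged sketch with names of my own (`EV`, `EW` are the $R$-dimensional means, `EVVt`, `EWWt` the $R\times R$ second moments).

```python
import numpy as np

def a_row(x_n, U, EV, EW, EVVt, EWWt):
    """a_{n,i} for all i at once:
    x_n U_i^T (E[V] . E[W]) - 1/2 U_i^T (E[V V^T] . E[W W^T]) U_i,
    where '.' is the elementwise (Hadamard) product."""
    m = EV * EW            # E_q[V_{j_n}] Hadamard E_q[W_{k_n}], shape (R,)
    S = EVVt * EWWt        # E_q[V V^T] Hadamard E_q[W W^T], shape (R, R)
    return x_n * (U @ m) - 0.5 * np.einsum('ir,rs,is->i', U, S, U)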

In the M-step, we have to define $p_{\Theta}(\Theta|\lambda)$; here it is
$p_{\Theta}(\Theta|\lambda) = \prod_i N(U_i|\mathbf{0}, \lambda^{-1}\boldsymbol{I})\prod_j N(V_j|\mathbf{0}, \lambda^{-1}\boldsymbol{I})\prod_k N(W_k|\mathbf{0}, \lambda^{-1}\boldsymbol{I})$
where $\boldsymbol{I}$ is the identity matrix of appropriate dimension. Taking the same example index $i$ as before:
$U_i = (G_i + \lambda\boldsymbol{I})^{-1} H_i$
where
$G_i = \sum_{n}q(i_n=i)(E_q[V_{j_n}V_{j_n}^\top] \circledast E_q[W_{k_n} W_{k_n}^\top])$
$H_i = \sum_{n}q(i_n=i)\,x_n(E_q[V_{j_n}] \circledast E_q[W_{k_n}])$
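A sketch of the closed-form M-step update for a single row $U_i$, written with the sums over all $n$ weighted by $q(i_n=i)$; the per-observation moment arrays are assumed to be precomputed in the E-step, and all names are my own.

```python
import numpy as np

def update_U_row(i, xs, q_i, EV, EW, EVVt, EWWt, lam, R):
    """M-step: U_i = (G_i + lambda I)^{-1} H_i."""
    G = np.zeros((R, R))
    H = np.zeros(R)
    for n in range(len(xs)):
        w = q_i[n, i]                       # q(i_n = i); zero if index i is impossible for n
        G += w * (EVVt[n] * EWWt[n])        # Hadamard product of R x R second moments
        H += w * xs[n] * (EV[n] * EW[n])    # Hadamard product of R-dimensional means
    return np.linalg.solve(G + lam * np.eye(R), H)
```

The rows of $V$ and $W$ are updated in the same way, alternating with the E-step until convergence.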

Algorithm tables

Experiments

The End