Software Engineer, Indeed Japan
Researcher, AIST, Tokyo, Japan
Missing Values
can be estimated by standard decomposition techniques.
Missing Indices
Only 0.42% of tweets have location information [Cheng et al. 2010].
$\hat{\mathcal{X}}:\mathcal{P}\rightarrow\mathbb{R}^{I\times{J}\times{K}}$; $\mathcal{X}\in\{\mathbb{R}\cup\{\emptyset\}\}^{{I}\times{J}\times{K}}$
$\hat{\Theta} = \underset{\Theta\in\mathcal{P}}{\mathrm{argmin}}~L(X;\Theta)$, where
$L(\mathcal{X};\Theta)= \frac{1}{2}\sum_{(i,j,k) \in D}(\mathcal{X}_{i,j,k}-\mathcal{\hat{X}}(\Theta)_{i,j,k})^2 $
$\Theta = \{U, V, W\}$
$U\in\mathbb{R}^{I\times{R}}$, $V\in\mathbb{R}^{J\times{R}}$, $W\in\mathbb{R}^{K\times{R}}$
$\mathcal{\hat{X}}(\Theta)_{i,j,k} = \sum_{r=1}^{R}U_{ir}V_{jr}W_{kr}$
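The CP reconstruction and the masked loss above can be sketched with numpy `einsum` (a minimal illustration; the dimensions, noise level, and mask are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, R = 4, 5, 6, 3

# Factor matrices of the CP model, Theta = {U, V, W}
U = rng.normal(size=(I, R))
V = rng.normal(size=(J, R))
W = rng.normal(size=(K, R))

# CP reconstruction: Xhat[i, j, k] = sum_r U[i, r] V[j, r] W[k, r]
Xhat = np.einsum('ir,jr,kr->ijk', U, V, W)

# Observed index set D as a boolean mask (entries outside D are missing values)
mask = rng.random((I, J, K)) < 0.3

# Noisy "data" for illustration only
X = Xhat + rng.normal(scale=0.1, size=(I, J, K))

# Squared loss over observed entries only
loss = 0.5 * np.sum((X[mask] - Xhat[mask]) ** 2)
```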
For details, see Learning Group - 19 Jun., 2017.
We denote $N \in \mathbb{N}$ observations by $X=\{x_n\}_{n=1}^{N}$ (the values) and by $\hat{Z}=\{(\hat{i}_n, \hat{j}_n, \hat{k}_n)\}_{n=1}^{N}$ the corresponding, potentially missing, indices;
a missing index is denoted by $\emptyset$.
$\hat{Z} \in \left(\{[I]\cup\{\emptyset\}\}\times{\{[J]\cup\{\emptyset\}\}}\times{\{[K]\cup\{\emptyset\}\}}\right)^{N}$
Tensor decomposition model: $(\mathcal{P},\hat{\mathcal{X}})$
Oracle: $\mathcal{O}$, which returns the true indices
$\hat{\Theta} = \underset{\Theta\in\mathcal{P}}{\mathrm{argmin}}~L(X,Z;\Theta)$,
where $Z = \mathcal{O}(\hat{Z})$ and $L(X,Z;\Theta) = \frac{1}{2}\sum_{n}(x_n-\mathcal{\hat{X}}_{i_n,j_n,k_n})^2 $
Taking $i$ as an example:
$p_I(i_n|\hat{i}_n) =
\begin{cases}
\delta(i_n,\hat{i}) &(\hat{i}_n = \hat{i} \in [I])\\
\mathrm{Unif}(1,I) &(\hat{i}_n = \emptyset)
\end{cases}$
Similarly we define $p_J$ and $p_K$.
Generate $\Theta \sim p_\Theta(\cdot|\lambda)$
For $n$ from $1$ to $N$:
- Generate $i_n \sim p_I(\cdot|\hat{i}_n)$
- Generate $j_n \sim p_J(\cdot|\hat{j}_n)$
- Generate $k_n \sim p_K(\cdot|\hat{k}_n)$
- Generate $x_n \sim N(\cdot|\hat{\mathcal{X}}_{i_n,j_n,k_n},1)$
where $N(\cdot|\mu,\sigma^2)$ is the Gaussian distribution and $p_\Theta(\cdot|\lambda)$ is the prior distribution of $\Theta$.
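The generative process above can be sketched as follows (a hypothetical toy setup; for brevity only $i$ has partially observed indices, the other modes are treated as fully missing, and the prior uses $\lambda=1$):

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, R, N = 4, 5, 6, 3, 100

# Theta ~ p_Theta(.|lambda): zero-mean Gaussian factors (lambda = 1 here)
U, V, W = (rng.normal(size=(d, R)) for d in (I, J, K))
Xhat = np.einsum('ir,jr,kr->ijk', U, V, W)

# Potentially missing indices: None plays the role of the empty-set symbol
i_hat = [rng.integers(I) if rng.random() < 0.7 else None for _ in range(N)]

def sample_index(hat, size, rng):
    """p_I(i_n | i_hat_n): the observed index if present, else Unif over [I]."""
    return hat if hat is not None else rng.integers(size)

x = np.empty(N)
for n in range(N):
    i = sample_index(i_hat[n], I, rng)
    j = rng.integers(J)  # j_hat, k_hat taken as all-missing for brevity
    k = rng.integers(K)
    x[n] = rng.normal(Xhat[i, j, k], 1.0)  # x_n ~ N(. | Xhat_{ijk}, 1)
```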
The authors state
$\underset{\Theta\in\mathcal{P}}{\mathrm{argmin}}~L(X,Z;\Theta) = \underset{\Theta\in\mathcal{P}}{\mathrm{argmax}}~p(X|\bar{Z};\Theta) $
which follows from
$\mathrm{ln}~p(X|\bar{Z};\Theta) = -\frac{1}{2}\sum_{n}(x_n-\mathcal{\hat{X}}_{i_n,j_n,k_n})^2 + C $
i.e. the log-likelihood equals the negative loss up to an additive constant, so the two problems share the same optimizer (strictly speaking, the likelihood itself is only proportional to $e^{-L}$, not equal to it).
To prevent overfitting, MAP estimation is used here instead of MLE.
[ref1,
ref2]
$\underset{\Theta\in\mathcal{P}}{\mathrm{argmax}}~p(X|\bar{Z};\Theta) \longrightarrow \hat{\Theta}_{MAP} = \underset{\Theta}{\mathrm{argmax}}~p(\Theta|X,\hat{Z},\lambda)$.
Since $p(\Theta|X,\hat{Z},\lambda) \propto p(X,\hat{Z},\Theta|\lambda)$,
Taking the log-likelihood, we have $\underset{\Theta}{\mathrm{argmax}}~\mathrm{ln}~p(X,\hat{Z},\Theta|\lambda)$,
and
$\begin{align}
& \mathrm{ln}~p(X,\hat{Z},\Theta|\lambda)\\
= & \mathrm{ln}\sum_{Z}p(X,Z,\hat{Z}|\Theta) + \mathrm{ln}~p(\Theta|\lambda)
\end{align}$
where
$\begin{align}
& \sum_{Z}p(X,Z,\hat{Z}|\Theta) \\= & \prod_{n} \sum_{i_n,j_n,k_n} N(x_n|\mathcal{\hat{X}}_{i_n,j_n,k_n},1)\,p_I(i_n|\hat{i}_n)\,p_J(j_n|\hat{j}_n)\,p_K(k_n|\hat{k}_n)
\end{align}$
Rewrite the log of the predefined joint distribution as
$\mathrm{ln}~p(X,\hat{Z},\Theta|\lambda) = L[q,\Theta] + KL(q||p)$,
where
$L[q,\Theta] = E_{q(Z)}[\mathrm{ln}~\frac{p(X,Z,\hat{Z},\Theta|\lambda)}{q(Z)}]$ ,
$KL(q||p) = E_{q(Z)}[\mathrm{ln}~\frac{q(Z)}{p(Z|X,\hat{Z},\Theta,\lambda)}]$ ,
and $q$ is any distribution over $Z$.
Since $KL(q||p) \geq 0 $, the log joint distribution is always at least $L[q,\Theta]$, which is called the Variational Lower Bound (VLB).
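Spelling out the standard step behind this decomposition, using $p(X,\hat{Z},\Theta|\lambda) = p(X,Z,\hat{Z},\Theta|\lambda)\,/\,p(Z|X,\hat{Z},\Theta,\lambda)$ and the fact that the left-hand side does not depend on $Z$:

$\begin{align}
\mathrm{ln}~p(X,\hat{Z},\Theta|\lambda) & = E_{q(Z)}[\mathrm{ln}~p(X,\hat{Z},\Theta|\lambda)] \\
& = E_{q(Z)}\left[\mathrm{ln}~\frac{p(X,Z,\hat{Z},\Theta|\lambda)}{q(Z)}\right] + E_{q(Z)}\left[\mathrm{ln}~\frac{q(Z)}{p(Z|X,\hat{Z},\Theta,\lambda)}\right] \\
& = L[q,\Theta] + KL(q||p)
\end{align}$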
E-step: maximize $L[q,\Theta]$ with respect to $q(Z)$ with $\Theta$ fixed.
Assuming a mean field approximation $q(Z)=\prod_{n}q(i_n)q(j_n)q(k_n)$
and differentiating $L[q,\Theta]$ by
$q(i_n=i)$, $q(j_n=j)$ and $q(k_n=k)$ accordingly.
Taking $q(i_n=i)$ as an example, we have the updating equation:
$q(i_n=\hat{i}) = \frac{p_I(i_n=\hat{i}|\hat{i}_n)e^{a_{n,\hat{i}}}}{\sum_{i}p_I(i_n=i|\hat{i}_n)e^{a_{n,i}}}$,
where
$a_{n,i} = - \frac{1}{2}E_{q(j_n)q(k_n)}[(x_n-\mathcal{\hat{X}}_{i,j_n,k_n})^2]$
Similarly, $q(j_n)$ and $q(k_n)$ are updated and the process is repeated.
M-step: maximize $L[q,\Theta]$ with respect to $\Theta$ with $q(Z)$ fixed as obtained in the E-step.
Differentiating $L[q,\Theta]$ by $\theta\in\Theta$, we have
$\begin{align} &\frac{\partial L[q,\Theta]}{\partial\theta} \\ = & \sum_{n}\sum_{i_n,j_n,k_n}q(i_n,j_n,k_n)(x_n-\mathcal{\hat{X}}_{i_n,j_n,k_n})\frac{\partial\mathcal{\hat{X}}_{i_n,j_n,k_n}}{\partial\theta}+\frac{\partial\mathrm{ln}~p(\Theta|\lambda)}{\partial\theta} \end{align}$
In CP Decomposition, $\Theta = \{U, V, W\}$ ;
So that, in E-step, we have
$\begin{align}
a_{n,i} & = x_n U_i^\top (E_q[V_{j_n}] \circledast E_q[W_{k_n}])\\
& - \frac{1}{2} U_i^\top (E_q[V_{j_n}V_{j_n}^\top] \circledast E_q[W_{k_n} W_{k_n}^\top]) U_i
\end{align}$
where $\circledast$ denotes the Hadamard product (see page 7, Learning Group - 19 Jun., 2017).
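A possible numpy rendering of this E-step, combining the softmax update for $q(i_n)$ with the CP expression for $a_{n,i}$ (the sizes, the random current factors, and the uniform $p_I$ — i.e. all indices missing — are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K, R, N = 4, 5, 6, 3, 20
U, V, W = (rng.normal(size=(d, R)) for d in (I, J, K))
x = rng.normal(size=N)

# Current mean-field factors q(j_n), q(k_n) (rows sum to 1)
q_j = rng.random((N, J)); q_j /= q_j.sum(1, keepdims=True)
q_k = rng.random((N, K)); q_k /= q_k.sum(1, keepdims=True)

# p_I(i_n = i | i_hat_n): uniform here, i.e. every i_hat_n missing
log_p_i = np.full((N, I), -np.log(I))

# Expectations needed for a_{n,i}
Ev  = q_j @ V                                # E_q[V_{j_n}],      shape (N, R)
Ew  = q_k @ W                                # E_q[W_{k_n}],      shape (N, R)
EVV = np.einsum('nj,jr,js->nrs', q_j, V, V)  # E_q[V V^T],        shape (N, R, R)
EWW = np.einsum('nk,kr,ks->nrs', q_k, W, W)  # E_q[W W^T],        shape (N, R, R)

# a_{n,i} = x_n U_i^T (Ev ∘ Ew) - 1/2 U_i^T (EVV ∘ EWW) U_i
a = (x[:, None] * ((Ev * Ew) @ U.T)
     - 0.5 * np.einsum('ir,nrs,is->ni', U, EVV * EWW, U))

# Softmax update for q(i_n = i), computed stably
logits = log_p_i + a
q_i = np.exp(logits - logits.max(1, keepdims=True))
q_i /= q_i.sum(1, keepdims=True)
```

The constant $-x_n^2/2$ term of $a_{n,i}$ is dropped since it cancels in the softmax.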
In the M-step, we have to define $p_{\Theta}(\Theta|\lambda)$; here it is
$p_{\Theta}(\Theta|\lambda) = \prod_i N(U_i|\mathbf{0}, \lambda^{-1}\boldsymbol{I})\prod_j N(V_j|\mathbf{0}, \lambda^{-1}\boldsymbol{I})\prod_k N(W_k|\mathbf{0}, \lambda^{-1}\boldsymbol{I})$
where $\boldsymbol{I}$ is the identity matrix of appropriate dimension. Taking $i$ as the running example again:
$U_i = ( G_i + \lambda\boldsymbol{I})^{-1} H_i$
here,
$G_i = \sum_{n}q(i_n=i)\,(E_q[V_{j_n}V_{j_n}^\top] \circledast E_q[W_{k_n} W_{k_n}^\top])$
$H_i = \sum_{n}q(i_n=i)\,x_n\,(E_q[V_{j_n}] \circledast E_q[W_{k_n}])$
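The M-step update for $U_i$ might look like this in numpy (again a sketch with made-up sizes and uniform placeholder factors from the E-step; `np.linalg.solve` computes $(G_i+\lambda\boldsymbol{I})^{-1}H_i$):

```python
import numpy as np

rng = np.random.default_rng(3)
I, J, K, R, N, lam = 4, 5, 6, 3, 20, 0.1
V, W = rng.normal(size=(J, R)), rng.normal(size=(K, R))
x = rng.normal(size=N)

# Mean-field factors from the E-step (uniform here, purely for illustration)
q_i = np.full((N, I), 1.0 / I)
q_j = np.full((N, J), 1.0 / J)
q_k = np.full((N, K), 1.0 / K)

Ev  = q_j @ V                                # E_q[V_{j_n}]
Ew  = q_k @ W                                # E_q[W_{k_n}]
EVV = np.einsum('nj,jr,js->nrs', q_j, V, V)  # E_q[V V^T]
EWW = np.einsum('nk,kr,ks->nrs', q_k, W, W)  # E_q[W W^T]

U = np.empty((I, R))
for i in range(I):
    # G_i = sum_n q(i_n=i) (E[V V^T] ∘ E[W W^T]);  H_i = sum_n q(i_n=i) x_n (E[V] ∘ E[W])
    G = np.einsum('n,nrs->rs', q_i[:, i], EVV * EWW)
    H = (q_i[:, i] * x) @ (Ev * Ew)
    U[i] = np.linalg.solve(G + lam * np.eye(R), H)  # U_i = (G_i + λI)^{-1} H_i
```

$G_i + \lambda\boldsymbol{I}$ is positive definite (a Hadamard product of PSD matrices is PSD, plus the ridge term), so the solve is always well posed.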