Is my object in this video?

Reconstruction-based object search in videos / Tan Yu; Jingjing Meng; Junsong Yuan

Chaoran Huang
Reading Group

Fundamental premise

No concern about "when" and "where" the query object appears; the only question is whether it appears at all

Background

Conventional approaches

Steps

1. Produce candidate locations by
     frame-wise sliding windows or
     frame-wise object proposals

2. Compute a matching score between each candidate and the query (a minimal sketch follows below)
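A minimal numpy sketch of this exhaustive matching step, assuming proposal and query features have already been extracted (e.g., CNN embeddings); the cosine-similarity scoring and the function name are illustrative assumptions, not from the paper:

```python
import numpy as np

def exhaustive_match_score(query_feat, proposal_feats):
    """query_feat: (d,); proposal_feats: (m, d), one row per object proposal.
    Returns the best cosine similarity over all candidates."""
    q = query_feat / np.linalg.norm(query_feat)
    P = proposal_feats / np.linalg.norm(proposal_feats, axis=1, keepdims=True)
    return float(np.max(P @ q))  # m can exceed 10k proposals per video
```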


Problems

1. Location information is not needed
     localizing the object is unnecessary for this task
     and inefficient to compute

2. Exhaustive search is not practical
     a 30-second video at 24 fps usually contains more than $30\times 24= 720$ frames and
     10k+ object proposals

Recent studies

Objects typically appear with high redundancy within a single video

$\rightarrow$ select representative objects for each video and match them with the query object.


Problems

Selecting only representative objects may miss the query, yet high recall can be crucial in some cases
     e.g. the scenario on the cover page

The proposed method

Steps

1. Train a compact model for each video to reconstruct all of its object proposals

2. Try to reconstruct the query with that model to answer whether the video has ever seen it


Properties

Only cares whether or not the query object is contained in a video

Object proposals are likely to overlap spatio-temporally, which enlarges the chance of reasonable recall

Offline training, online search

Problem Formulation

Training Stage

We denote by $r_{\theta_i}(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^d$ the reconstruction model learned from the set of object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,...,\mathrm{x}_i^m\}$ of the video $V_i$, where $\theta_i$ denotes the parameters of the reconstruction model, learned by reconstructing all object proposals in the training phase:

$\theta_i= \underset{\theta}{\arg\min}\sum_{x\in\mathcal{S}_i}{||x-r_\theta(x)||_2^2}$
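A minimal PyTorch sketch of this training stage: fit the parameters $\theta_i$ by minimizing the summed squared reconstruction error over the proposal set $\mathcal{S}_i$. The concrete model (subspace projection, auto-encoder, ...) is plugged in later; the training loop and its hyperparameters are illustrative assumptions:

```python
import torch

def train_reconstruction_model(model, proposals, epochs=100, lr=1e-3):
    """proposals: (m, d) tensor of object-proposal features from one video V_i."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        recon = model(proposals)                 # r_theta(x) for every x in S_i
        loss = ((proposals - recon) ** 2).sum()  # sum_x ||x - r_theta(x)||_2^2
        loss.backward()
        opt.step()
    return model  # only theta_i is stored after training
```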

Problem Formulation

Search Stage

In the search phase, for each video $V_i$, we calculate the query's reconstruction error $||r_{\theta_i}(q)-q||_2$ using the reconstruction model learned from that video. We use the reconstruction error as a similarity measure to determine the relevance between the query $q$ and the whole video $V_i$:

$\mathrm{dist}(q, V_i) = ||q - r_{\theta_i}(q)||_2$

The smaller the reconstruction error $||q - r_{\theta_i}(q)||_2$ is, the more relevant the query is to the video $V_i$.
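A minimal sketch of the search stage under the same assumptions: `models` maps each video id to its trained reconstruction model (a hypothetical interface for illustration), and videos are ranked by ascending reconstruction error:

```python
import torch

def rank_videos(query, models):
    """query: (d,) tensor; models: dict video_id -> trained reconstruction model."""
    dists = {}
    with torch.no_grad():
        for vid, model in models.items():
            dists[vid] = torch.norm(query - model(query)).item()  # ||q - r(q)||_2
    return sorted(dists, key=dists.get)  # smaller error => more relevant video
```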

Problem Formulation

Objectives

After training the reconstruction model for the video $V_i$, we no longer rely on the object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,...,\mathrm{x}_i^m\}$ and only need to store the parameters $\theta_i$, which are more compact than $\mathcal{S}_i$.

Rather than comparing the query $q$ with all the object proposals in the set $\mathcal{S}_i$, the reconstruction model only needs to compute $||q - r_{\theta_i}(q)||_2$ to obtain the relevance between $q$ and $\mathcal{S}_i$.

Proposed implementations of the reconstruction model


Subspace Projection

$r_{\theta_i}(x)=\theta_i^\top \theta_i x, \quad \text{s.t. } \theta_i\theta_i^\top=I$
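A minimal numpy sketch of this variant: the rows of $\theta_i$ are an orthonormal basis of a $k$-dimensional subspace, which can be obtained in closed form from an SVD of the proposal matrix (an assumption here; the paper may optimize the objective differently):

```python
import numpy as np

def fit_subspace(proposals, k):
    """proposals: (m, d); returns theta: (k, d) with theta @ theta.T = I."""
    _, _, vt = np.linalg.svd(proposals, full_matrices=False)
    return vt[:k]  # top-k right singular vectors

def subspace_reconstruct(theta, x):
    return theta.T @ (theta @ x)  # r(x) = theta^T theta x
```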


Auto-encoder

$r_{\theta_i}(x)=f_2(\mathit{W}_i^2f_1(\mathit{W}_i^1x+b_i^1)+b_i^2)$
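A minimal PyTorch module matching this formula, $r(x)=f_2(W^2 f_1(W^1 x + b^1) + b^2)$; the hidden size and the choice of activations are illustrative assumptions. It can be fitted with the training loop sketched earlier:

```python
import torch.nn as nn

class VideoAutoEncoder(nn.Module):
    def __init__(self, d, hidden=128):
        super().__init__()
        self.enc = nn.Linear(d, hidden)  # W^1 x + b^1
        self.dec = nn.Linear(hidden, d)  # W^2 h + b^2
        self.f1, self.f2 = nn.ReLU(), nn.Identity()

    def forward(self, x):
        return self.f2(self.dec(self.f1(self.enc(x))))
```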


Sparse Dictionary Learning

$r_{\theta_i}(x)=\theta_i\mathit{h}_{\theta_i}(x)$, where $h_{\theta_i}(x)$ is the sparse code of $x$ over the dictionary $\theta_i$
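A minimal scikit-learn sketch of this variant, with $\theta_i$ as the dictionary and $h_{\theta_i}(x)$ the sparse code of $x$ over it; the number of atoms and the sparsity penalty are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning, sparse_encode

def fit_dictionary(proposals, n_atoms=64, alpha=1.0):
    """proposals: (m, d); returns the dictionary theta as an (n_atoms, d) array."""
    dl = DictionaryLearning(n_components=n_atoms, alpha=alpha)
    return dl.fit(proposals).components_

def sparse_reconstruct(theta, q):
    h = sparse_encode(q[None, :], theta)  # sparse code h_theta(q)
    # sklearn stores the dictionary row-wise, so the product is h @ theta
    # rather than theta @ h as in the formula above.
    return (h @ theta)[0]
```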

Experiments

End