1. Produce candidate locations by
frame-wise sliding windows or
frame-wise object proposals
2. Compute a matching score between each candidate and the query
1. No need for location information
not necessary
not efficient
2. Exhaustive search is not practical
a 30-second video usually contains more than $30\times 24 = 720$ frames and
10k+ object proposals
Objects within a single video are normally highly redundant
$\rightarrow$ select representative objects for each video and match them with the query object.
High recall can be crucial in some cases
e.g. the scenario on the cover page
1. Train a compact model for each video to reconstruct all of its object proposals
2. Try to reconstruct the query to decide whether the model has seen it before
Only cares about whether the query object is contained in a video or not
Object proposals are likely to overlap spatio-temporally, which increases the chance of reasonable recall
Offline training, online search
We denote by $r_{\theta_i}(\cdot): \mathbb{R}^d \rightarrow \mathbb{R}^d$ the reconstruction model learned from the set of object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,\ldots,\mathrm{x}_i^m\}$ of the video $V_i$, where $\theta_i$ denotes the parameters of the model, learned by reconstructing all object proposals in the training phase:
$\theta_i= \underset{\theta}{\arg\min}\sum_{x\in\mathcal{S}_i}{||x-r_\theta(x)||_2^2}$
In the search phase, for each video $V_i$, we calculate the query's reconstruction error $||q - r_{\theta_i}(q)||_2$ using the reconstruction model learned from that video. We use the reconstruction error as a dissimilarity measure to determine the relevance between the query $q$ and the whole video $V_i$:
$\mathrm{dist}(q, V_i) = ||q - r_{\theta_i}(q)||_2$
The smaller the reconstruction error $||q - r_{\theta_i}(q)||_2$ is, the more relevant the query is to the video $V_i$.
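The search phase can be sketched as a ranking by reconstruction error. This is a minimal illustration, assuming each video's trained model is available as a callable. The names `rank_videos` and `axis_projector` and the two toy videos are hypothetical stand-ins for real learned models.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]
Reconstructor = Callable[[Vector], Vector]


def dist(q: Vector, r: Reconstructor) -> float:
    """Reconstruction error ||q - r(q)||_2 of the query under one video's model."""
    return math.dist(q, r(q))


def rank_videos(q: Vector, models: Dict[str, Reconstructor]) -> List[Tuple[str, float]]:
    """Return (video id, error) pairs, most relevant (smallest error) first."""
    scored = [(video_id, dist(q, r)) for video_id, r in models.items()]
    return sorted(scored, key=lambda item: item[1])


def axis_projector(axis: int) -> Reconstructor:
    """Toy stand-in for a learned model: keeps one coordinate, zeroes the rest."""
    def r(x: Vector) -> List[float]:
        return [v if i == axis else 0.0 for i, v in enumerate(x)]
    return r


# Hypothetical per-video models; in practice each is learned offline from S_i.
models = {"video_a": axis_projector(0), "video_b": axis_projector(1)}
ranking = rank_videos([3.0, 0.5], models)
```

Here the query lies mostly along the first axis, so `video_a` reconstructs it with the smaller error and is ranked first.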
After training the reconstruction model for the video $V_i$, we no longer rely on the object proposals $\mathcal{S}_i=\{\mathrm{x}_i^1,\ldots,\mathrm{x}_i^m\}$ and only need to store the parameters $\theta_i$, which are more compact than $\mathcal{S}_i$.
Rather than comparing the query $q$ with every object proposal in the set $\mathcal{S}_i$, we only need to compute $||q - r_{\theta_i}(q)||_2$ to obtain the relevance between $q$ and $\mathcal{S}_i$.
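The saving can be made concrete with a back-of-the-envelope count. This sketch assumes a linear model whose parameters form a $k \times d$ matrix. The feature dimension `d` and code size `k` below are hypothetical values chosen for illustration; only the proposal count follows the "10k+" figure quoted earlier.

```python
m = 10_000   # object proposals per video (the "10k+" figure)
d = 4096     # feature dimension (hypothetical)
k = 100      # rows of theta_i in a linear model (hypothetical)

# Storage, counted in floats: all proposals versus the model parameters.
proposal_storage = m * d
model_storage = k * d
storage_ratio = proposal_storage / model_storage

# Search cost, counted in multiply-adds per query and video:
# exhaustive matching touches every proposal once,
# the linear model applies theta and then its transpose.
exhaustive_cost = m * d
reconstruction_cost = 2 * k * d
compute_ratio = exhaustive_cost / reconstruction_cost
```

Under these assumptions the ratios depend only on `m` and `k`, since the feature dimension cancels out.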
$r_{\theta_i}(x)=\theta_i^\top \theta_i x, \quad \text{s.t. } \theta_i\theta_i^\top=\mathit{I}$
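For this orthonormal linear form, one way to obtain $\theta_i$ is a truncated SVD of the proposal matrix, which minimizes the squared reconstruction error over all matrices with $k$ orthonormal rows. The sketch below uses NumPy on synthetic data; the function names, the dimensions and the choice $k=2$ are assumptions made for illustration, not details from the slides.

```python
import numpy as np


def fit_subspace(proposals: np.ndarray, k: int) -> np.ndarray:
    """Return theta (k x d) with orthonormal rows from an (m x d) proposal matrix.

    No mean-centering is applied, matching r(x) = theta^T theta x, which has no offset.
    """
    _, _, vt = np.linalg.svd(proposals, full_matrices=False)
    return vt[:k]


def reconstruct(theta: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Project x onto the row space of theta: theta^T (theta x)."""
    return theta.T @ (theta @ x)


def reconstruction_error(theta: np.ndarray, x: np.ndarray) -> float:
    return float(np.linalg.norm(x - reconstruct(theta, x)))


# Synthetic "video": 200 proposals lying in a 2-D subspace of R^5.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(5, 5)))   # orthonormal columns
proposals = rng.normal(size=(200, 2)) @ basis[:, :2].T

theta = fit_subspace(proposals, k=2)

seen = basis[:, :2] @ np.array([1.0, -2.0])   # query inside the subspace
unseen = basis[:, 4]                           # unit query orthogonal to it

seen_error = reconstruction_error(theta, seen)
unseen_error = reconstruction_error(theta, unseen)
```

The query drawn from the same subspace as the proposals is reconstructed almost exactly, while the orthogonal query is projected to roughly zero and keeps an error close to its own norm.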
$r_{\theta_i}(x)=f_2(\mathit{W}_i^2f_1(\mathit{W}_i^1x+b_i^1)+b_i^2)$
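The two-layer form above can be trained per video with plain gradient descent on the squared reconstruction error. The following is a minimal sketch, assuming $f_1=\tanh$ and $f_2$ as the identity; the slides do not specify the activations, and the hidden size, learning rate, step count and synthetic data are likewise hypothetical.

```python
import numpy as np


def reconstruct(p: dict, X: np.ndarray):
    """r(x) = f2(W2 f1(W1 x + b1) + b2) with f1 = tanh, f2 = identity (assumed)."""
    H = np.tanh(X @ p["W1"].T + p["b1"])
    return H @ p["W2"].T + p["b2"], H


def train(X: np.ndarray, h: int = 2, lr: float = 0.05, steps: int = 2000, seed: int = 0):
    """Full-batch gradient descent on the mean of ||x - r(x)||_2^2 over the proposals."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    p = {
        "W1": 0.1 * rng.normal(size=(h, d)), "b1": np.zeros(h),
        "W2": 0.1 * rng.normal(size=(d, h)), "b2": np.zeros(d),
    }
    losses = []
    for _ in range(steps):
        R, H = reconstruct(p, X)
        E = R - X
        losses.append(float(np.mean(np.sum(E ** 2, axis=1))))
        # Backpropagate through the output layer, then through tanh.
        dR = 2.0 * E / m
        dZ = (dR @ p["W2"]) * (1.0 - H ** 2)
        grads = {
            "W1": dZ.T @ X, "b1": dZ.sum(axis=0),
            "W2": dR.T @ H, "b2": dR.sum(axis=0),
        }
        for name in p:
            p[name] -= lr * grads[name]
    return p, losses


def reconstruction_error(p: dict, q: np.ndarray) -> float:
    r, _ = reconstruct(p, q[None, :])
    return float(np.linalg.norm(q - r[0]))


# Synthetic "video": 200 proposals near a 2-D subspace of R^4.
rng = np.random.default_rng(1)
basis, _ = np.linalg.qr(rng.normal(size=(4, 4)))
proposals = 0.5 * rng.normal(size=(200, 2)) @ basis[:, :2].T

params, losses = train(proposals)
seen_error = reconstruction_error(params, proposals[0])
unseen_error = reconstruction_error(params, basis[:, 3])
```

Only `params` needs to be kept after training; comparing `seen_error` with `unseen_error` illustrates how a query resembling the training proposals is expected to reconstruct better than one that does not.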
$r_{\theta_i}(x)=\theta_i\mathit{h}_{\theta_i}(x)$