Reading notes on "Object Detection from Video Tubelets with Convolutional Neural Networks"




Detecting objects in images vs. in videos

Appearance changes of objects (more postures need more training data).

The correct recognition result needs to be inferred from information in previous and future frames, because the appearance of an object in video is highly correlated.

Even the same object can look very different in adjacent frames.

To solve this problem, the authors propose a tubelet box perturbation and max-pooling process that increases performance from 37.4% to 45.2%, which is comparable to the performance of image object proposals with only 1/38 the number of boxes.

The blue lines are overlaps with ground truth annotations and the purple lines are the output of the TCN. The detection scores have large temporal variations, while the TCN output is temporally consistent and complies better with the ground truth overlaps.

Compared to CNN-based trackers

However, even CNN-based trackers might still drift in long-term tracking because they mostly use the object appearance information within the video, without semantic understanding of its class.

Get better detection precision from the object proposals provided by the tracker!

Algorithm’s evaluation

For each video clip, algorithms need to produce a set of annotations ($f_i$, $c_i$, $s_i$, $b_i$) with frame number $f_i$, class label $c_i$, confidence score $s_i$ and bounding box $b_i$.
Therefore, we use the conventional mean average precision (mean AP) on all classes as the evaluation metric.
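
As a rough illustration of the metric, the sketch below computes a per-class AP from ($f_i$, $c_i$, $s_i$, $b_i$) tuples and averages it over classes. The IoU threshold of 0.5, the `(x1, y1, x2, y2)` box layout, the function names and the use of non-interpolated AP are all assumptions made for this sketch; the official evaluation protocol may differ in its details.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_thresh=0.5):
    """Non-interpolated AP for one class (assumed variant).

    detections:   list of (frame, score, box)
    ground_truth: dict mapping frame -> list of boxes
    """
    n_gt = sum(len(boxes) for boxes in ground_truth.values())
    if n_gt == 0:
        return 0.0
    matched = {f: [False] * len(boxes) for f, boxes in ground_truth.items()}
    tp_flags = []
    # Visit detections from highest to lowest confidence.
    for frame, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_j = 0.0, -1
        for j, gt_box in enumerate(ground_truth.get(frame, [])):
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        # A detection is a true positive only if it hits an unmatched box.
        if best_iou >= iou_thresh and not matched[frame][best_j]:
            matched[frame][best_j] = True
            tp_flags.append(True)
        else:
            tp_flags.append(False)
    # Sum the precision at the rank of every true positive.
    ap, tp = 0.0, 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            tp += 1
            ap += tp / rank
    return ap / n_gt


def mean_ap(annotations, ground_truth_by_class):
    """annotations: list of (f, c, s, b); ground truth: class -> frame -> boxes."""
    aps = []
    for cls in sorted(ground_truth_by_class):
        class_dets = [(f, s, b) for f, c, s, b in annotations if c == cls]
        aps.append(average_precision(class_dets, ground_truth_by_class[cls]))
    return sum(aps) / len(aps) if aps else 0.0
```

For example, with two ground-truth boxes and three detections ranked true positive, false positive, true positive, this AP is (1/1 + 2/3) / 2 ≈ 0.83.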


The framework consists of two main modules:

  1. Spatio-temporal tubelet proposal module
  2. Tubelet classification and re-scoring module

1. Spatio-temporal tubelet proposal module

A tubelet proposal module that combines object detection and object tracking to generate tubelet object proposals.

The tubelet proposal module has 3 major steps:

  1. Image object proposal (Framework figure (a))
    Use the selective search algorithm to generate object proposals, then remove the easy negatives among them, as in R-CNN.
  2. Object proposal scoring (Framework figure (b))
    Classify each proposal with an SVM. The higher the SVM score, the higher the confidence that the box contains an object of that class. In the figure, the darker the color, the higher the score.
  3. High-confidence object tracking (Framework figure (c))
    Read Chapter 3.2 Step 3 carefully!
    End tracking when:
    • the tracking confidence falls below 0.1 (this parameter can be changed)
    • no remaining detection exceeds the minimum detection score for a new tracking anchor (set to 0); tracking then ends for this class.
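
The anchor-and-track loop with its two stopping rules can be sketched roughly as follows. This is only an illustration under stated assumptions: `track_step` is a hypothetical stand-in for the real tracker, tracking in both directions from the anchor is assumed here, and the rule that drops detections already covered by a tubelet (with the hypothetical `suppress_iou` value) is a guess at the kind of suppression that Chapter 3.2 Step 3 describes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def build_tubelets(detections, track_step, num_frames,
                   min_track_conf=0.1, min_anchor_score=0.0,
                   suppress_iou=0.3):
    """Build tubelets for one class.

    detections: list of (frame, score, box) for this class
    track_step: hypothetical callable (box, frame_from, frame_to)
                returning (new_box, tracking_confidence)
    Returns a list of tubelets, each a dict mapping frame -> box.
    """
    remaining = sorted(detections, key=lambda d: -d[1])
    tubelets = []
    # Stop when no detection exceeds the minimum anchor score.
    while remaining and remaining[0][1] > min_anchor_score:
        anchor_frame, _score, anchor_box = remaining.pop(0)
        tubelet = {anchor_frame: anchor_box}
        for step in (1, -1):  # assumed: track forward, then backward
            frame, box = anchor_frame, anchor_box
            while 0 <= frame + step < num_frames:
                box, conf = track_step(box, frame, frame + step)
                if conf < min_track_conf:
                    break  # tracking confidence too low: end this direction
                frame += step
                tubelet[frame] = box
        tubelets.append(tubelet)
        # Assumed suppression: covered detections cannot become new anchors.
        remaining = [
            d for d in remaining
            if not (d[0] in tubelet and iou(d[2], tubelet[d[0]]) > suppress_iou)
        ]
    return tubelets
```

Because a track stops as soon as the confidence drops, one object can end up covered by several short tubelets rather than a single long one.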

2. Tubelet classification and re-scoring module

A tubelet classification and re-scoring module that performs spatial max-pooling for robust box scoring and temporal convolution for incorporating temporal consistency.

  1. Tubelet box perturbation and max-pooling (Framework figure (d))
    Replace tubelet boxes with boxes of higher confidence. There are two kinds of perturbations in the framework (Step 2.4 offsets Step 1.3):

    • Generate new boxes around each tubelet box on each frame by randomly perturbing the boundaries of the tubelet box.
    • Replace each tubelet box with original object detections that have overlaps with the tubelet box beyond a threshold.

      After the box perturbation step, all augmented boxes and the original tubelet boxes are scored using the same detector in Step 1.2. For each tubelet box, only the augmented box with the maximum detection score is kept and used to replace the original tubelet box.

  2. Temporal convolution and re-scoring (Framework figure (e))
    If tubelet boxes on adjacent frames all have high detection scores, it is very likely that the tubelet box on this frame also has high confidence on the same object.
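
Step 1 above (box perturbation and max-pooling) can be sketched for a single tubelet box as follows. Pooling both kinds of candidates together is a simplification made here, and `score_fn` (standing in for the Step 1.2 detector), `num_samples`, `scale` and `overlap_thresh` are hypothetical names and values, not taken from the paper.

```python
import random


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def perturb_box(box, rng, num_samples=8, scale=0.1):
    """Randomly shift each boundary by up to `scale` of the box size."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    samples = []
    for _ in range(num_samples):
        samples.append((
            x1 + rng.uniform(-scale, scale) * w,
            y1 + rng.uniform(-scale, scale) * h,
            x2 + rng.uniform(-scale, scale) * w,
            y2 + rng.uniform(-scale, scale) * h,
        ))
    return samples


def max_pool_box(tubelet_box, frame_detections, score_fn, rng,
                 overlap_thresh=0.5):
    """Return the highest-scoring candidate for one tubelet box.

    Candidates: the original box, random perturbations of it, and the
    original detections on this frame that overlap it enough.
    """
    candidates = [tubelet_box]
    candidates += perturb_box(tubelet_box, rng)
    candidates += [d for d in frame_detections
                   if iou(d, tubelet_box) > overlap_thresh]
    return max(candidates, key=score_fn)
```

Since the original tubelet box is always among the candidates, the pooled box never scores lower than the box it replaces.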

The TCN is a 1-D convolutional network that operates on tubelet proposals. The inputs are time series including detection scores, tracking scores and anchor offsets. The output values are the probabilities that each tubelet box overlaps the ground truth by more than 0.5; put another way, the probabilities that each tubelet box contains an object of the class (the TCN outputs temporally dense prediction scores for every tubelet box).
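
To make the input and output shapes concrete, here is a minimal sketch of one 1-D convolution over time followed by a sigmoid. It is not the network from the paper: the real TCN is learned from data, whereas the single layer, the hand-set weights, the replicate padding at the tubelet ends and all function names here are assumptions for illustration.

```python
import math


def conv1d_same(series, kernels, bias):
    """1-D convolution over time with replicate padding.

    series:  list of T feature vectors, each with C values
    kernels: list of K weight vectors (K odd), each with C values
    Returns a list of T pre-activation values, one per frame.
    """
    num_frames, width = len(series), len(kernels)
    half = width // 2
    out = []
    for t in range(num_frames):
        acc = bias
        for k in range(width):
            # Clamp the index so the first/last frame is repeated at the ends.
            idx = min(max(t + k - half, 0), num_frames - 1)
            acc += sum(w * x for w, x in zip(kernels[k], series[idx]))
        out.append(acc)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def rescore_tubelet(det_scores, track_scores, anchor_offsets, kernels, bias):
    """Map three per-frame time series to one probability per tubelet box."""
    series = list(zip(det_scores, track_scores, anchor_offsets))
    return [sigmoid(z) for z in conv1d_same(series, kernels, bias)]
```

With kernels that simply average the detection score over three frames, a box with a low detection score between two high-scoring neighbors still gets a probability above 0.5, which is the temporal-consistency effect the notes describe.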

On one hand, object detection produces high-confidence anchors to initiate tracking and reduces tracking failure by spatial max-pooling. On the other hand, tracking also generates new proposals for object detection, and the tracked boxes act as anchors to aggregate existing detections.


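  1. For the video object detection task, see "Algorithm’s evaluation" earlier on this page.
  2. With the method in this paper, the track of a single object in a video may not be one complete line (the top line in Framework figure (c)); it may be split into several segments. This is defined by the tracking rules in Chapter 3.2 of the paper: essentially, the object's information is not distinctive enough to keep tracking into the next frame, so the track stops early. The benefit of segmenting is that the same object is most similar across a few consecutive frames (rather than across all frames), so those frames can be treated as one segment.
  3. Note the special NMS in Chapter 3.2 Step 3.
  4. The concept of tracking confidence is not explained in this paper. The actual tracking method is described in another paper by the authors' team, "Visual Tracking with Fully Convolutional Networks".
  5. For the TCN, the input is the tubelet proposals of each track segment (see point 2 above). A paper that describes the TCN in more detail: "T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos".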




Published: May 9, 2018, 18:05

Last updated: June 5, 2018, 11:06


License: Attribution-NonCommercial-NoDerivatives 4.0 International. When reposting, please keep the original link and author.