Reading notes: "Track-RNN: Joint Detection and Tracking Using Recurrent Neural Networks"


This report comes from the Stanford Computational Vision and Geometry Lab's CS231A course: Computer Vision, From 3D Reconstruction to Recognition.

1. Dataset

Multiple Object Tracking Benchmark

2. Abstract

In this project, we design and implement a tracking pipeline using convolutional neural networks and recurrent neural networks. Our model can handle detection and tracking jointly using appearance and motion features.

The main experiments are on single-object tracking, with particular emphasis on the highly occluded case.

3. Methods

In our model, the detection is used for initializing a tracking trajectory, while for tracking itself no detection is needed.


  • A joint detection and tracking framework using region convolutional neural networks (RCNN) and recurrent neural networks (RNN).

  • A training procedure that trains our model efficiently; it needs to handle data augmentation and data balancing well.

3.1 Markov Decision Process

We formulate the tracking problem as a Markov decision process (MDP).
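The report does not spell out the concrete state and action space of the MDP. As a rough sketch (the state set and threshold below are my own assumptions for illustration), one could model each target's lifecycle as a small set of states, with transitions driven by the best per-frame tracking score:

```python
from enum import Enum, auto

class TargetState(Enum):
    """Hypothetical MDP states for a single tracked target."""
    ACTIVE = auto()   # just detected, trajectory initialized
    TRACKED = auto()  # a proposal scored confidently this frame
    LOST = auto()     # no proposal scored high enough (e.g. occlusion)

def transition(state, best_score, threshold=0.5):
    """Advance a target's state given the best tracking score this frame."""
    if best_score >= threshold:
        return TargetState.TRACKED
    return TargetState.LOST
```

A lost target stays in the trajectory list, so it can return to TRACKED when the object reappears with a high score, which matches the re-identification behavior described in the experiments.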


3.2 Track-RNN Model

Our Track-RNN model is mainly composed of two parts: the detection part and the tracking part. The two parts share the convolutional layers at the bottom.

3.2.1 Detection Part

The detection part is a Fast-RCNN model used for initializing a trajectory. When an object is detected, a new trajectory is added to the trajectory list with an initial RNN hidden state computed from the detected bounding box.
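The mechanics of "an initial RNN hidden state computed from the detected bounding box" are not detailed in the report. A minimal sketch, assuming the hidden state is produced from the detection's RoI feature through a learned projection (`W_init`, the feature and hidden sizes, and the dict layout are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, HIDDEN_DIM = 4096, 512              # assumed sizes, not from the report
W_init = rng.standard_normal((HIDDEN_DIM, FEAT_DIM)) * 0.01

def init_trajectory(roi_feature, bbox):
    """Create a new trajectory entry from a fresh detection."""
    h0 = np.tanh(W_init @ roi_feature)        # initial RNN hidden state
    return {"boxes": [bbox], "hidden": h0}

# A detection fires: start a trajectory from its RoI feature and box.
traj = init_trajectory(rng.standard_normal(FEAT_DIM), (10, 20, 50, 80))
```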

3.2.2 Tracking Part

The tracking part is composed of a motion prior and an appearance comparison network.

  • motion prior

    The motion prior proposes regions in the current frame of a trajectory, given the selective search results and the previous tracking history.

    The motion prior model takes the previous bounding box coordinates (center x, center y, width, and height) and predicts the most likely bounding box coordinates in the next frame.

  • appearance comparison network

    The appearance comparison network outputs a tracking score given a region proposal and the tracking history; we then choose the bounding box with the largest tracking score in each frame.

    The appearance comparison network takes a set of region proposals and the frame image as input. It predicts the intersection over union (IoU) between each region proposal and the ground-truth bounding box as a confidence score.

    Region proposals are still required here, which seems to contradict the earlier statement that "In our model, the detection is used for initializing a tracking trajectory, while for tracking itself no detection is needed." On closer thought, each frame probably runs both a detection process and a tracking process.

    On top, we design a novel recurrent neural network (RNN) to combine the temporal information from previous time steps with image features from the current time step. At each time step, the RNN computes the IoU prediction of each region proposal using the current hidden state.
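One tracking step, as described above, can be sketched as follows. This is a simplified vanilla-RNN version: the hidden state is updated from the current frame's feature, then each proposal's embedding is compared against the hidden state and squashed into [0, 1] so it can be read as a predicted IoU. All weights, dimensions, and the bilinear comparison are illustrative assumptions, not the report's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
H, F = 64, 128                             # assumed hidden / feature sizes
W_h = rng.standard_normal((H, H)) * 0.05   # hidden-to-hidden weights
W_x = rng.standard_normal((H, F)) * 0.05   # frame-feature-to-hidden weights
W_p = rng.standard_normal((H, F)) * 0.05   # proposal-feature embedding

def track_step(hidden, frame_feat, proposal_feats):
    """One time step: fold in the frame, then score each proposal's IoU."""
    hidden = np.tanh(W_h @ hidden + W_x @ frame_feat)
    # compare each proposal embedding against the trajectory's hidden state
    scores = np.array(
        [1.0 / (1.0 + np.exp(-(hidden @ (W_p @ f)))) for f in proposal_feats]
    )
    return int(np.argmax(scores)), scores, hidden
```

The argmax over scores implements the "choose the bounding box with the largest tracking score" rule from the appearance comparison network.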

3.2.3 Track a Single Object

When tracking a single object, we first predict the most likely bounding box in the current frame and sample 256 region proposals from the pool of selective search results. For each region proposal, we predict an IoU score and choose the proposal with the highest score in the current frame.
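The per-frame loop above can be sketched end to end. The constant-velocity motion prior and nearest-center sampling of the selective-search pool are my own simplifications (the report learns the motion prior); `score_fn` stands in for the appearance comparison network:

```python
import numpy as np

def track_frame(prev_boxes, ss_pool, score_fn, n_samples=256):
    """Predict a box from motion, sample nearby proposals, keep the argmax."""
    # constant-velocity prediction on (cx, cy, w, h) -- a stand-in prior
    if len(prev_boxes) >= 2:
        pred = 2 * np.asarray(prev_boxes[-1]) - np.asarray(prev_boxes[-2])
    else:
        pred = np.asarray(prev_boxes[-1], dtype=float)
    # keep the selective-search boxes whose centers are closest to the prediction
    dists = np.linalg.norm(np.asarray(ss_pool)[:, :2] - pred[:2], axis=1)
    idx = np.argsort(dists)[:n_samples]
    proposals = [ss_pool[i] for i in idx]
    scores = [score_fn(p) for p in proposals]
    return proposals[int(np.argmax(scores))], max(scores)

# Toy usage with a stand-in scorer that just prefers larger center x:
box, score = track_frame(
    prev_boxes=[(10, 10, 5, 5), (12, 10, 5, 5)],
    ss_pool=[(14, 10, 5, 5), (100, 100, 5, 5), (13, 10, 5, 5)],
    score_fn=lambda p: p[0],
    n_samples=2,
)
```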

4. The Training Procedure

5. Experiments

We train our model using the Adam optimizer with default hyperparameters on a Titan X GPU. Each iteration takes 0.3 seconds. The training loss went down to 0.0781 after 10 epochs (22,130 iterations).
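For reference, the "default hyperparameters" of Adam are the ones from its original formulation (lr = 1e-3, beta1 = 0.9, beta2 = 0.999, eps = 1e-8); whether the report tweaks any of these is not stated. A single bias-corrected Adam update looks like:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first/second moment estimates."""
    m = b1 * m + (1 - b1) * grad          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2       # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)               # bias correction, t starts at 1
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```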

When the object is occluded by others, none of the boxes gives a high enough score. When the object shows up again, the tracker re-identifies it by predicting a high tracking score around the correct bounding box:

After the target object is occluded, the tracker wrongly switches to the occluder: