In this project, we design and implement a tracking pipeline using convolutional neural networks and recurrent neural networks. Our model can handle detection and tracking jointly using appearance and motion features.
In our model, the detection is used for initializing a tracking trajectory, while for tracking itself no detection is needed.
A joint detection and tracking framework using region convolutional neural networks (RCNN) and recurrent neural networks (RNN).
A training procedure that trains our model efficiently while handling data augmentation and data balancing well.
We formulate the tracking problem as a Markov decision problem.
Our track-rnn model consists of two main parts: the detection part and the tracking part. The two parts share the bottom convolutional layers.
The detection part is a Fast-RCNN model used for initializing a trajectory. When an object is detected, a new trajectory is added to the trajectory list with an initial RNN hidden state computed from the detected bounding box.
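The trajectory list described above can be sketched as a small data structure. This is only an illustration: the names `Trajectory`, `start_trajectories`, and `init_hidden` are hypothetical, and the way the initial hidden state is actually computed from the detected box is not specified here, so it is passed in as a callable.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height)


@dataclass
class Trajectory:
    track_id: int
    hidden_state: List[float]            # RNN hidden state for this track
    boxes: List[Box] = field(default_factory=list)


def start_trajectories(
    detections: List[Box],
    trajectories: List[Trajectory],
    init_hidden: Callable[[Box], List[float]],
) -> List[Trajectory]:
    """Append one new trajectory per detected box.

    init_hidden is a placeholder (an assumption) for whatever maps a
    detected box to an initial hidden state in the real model.
    """
    next_id = len(trajectories)
    for box in detections:
        trajectories.append(Trajectory(next_id, init_hidden(box), [box]))
        next_id += 1
    return trajectories
```

The sketch does not check whether a detection already belongs to an existing trajectory; a real pipeline would need some form of that check.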
The tracking part is composed of a motion prior and an appearance comparison network.
The motion prior proposes regions in the current frame for a trajectory, given the selective search results and the previous tracking history.
The motion prior model takes the previous bounding box coordinates (center x, center y, width, and height) and predicts the most likely bounding box coordinates in the next frame.
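To make the input and output of the motion prior concrete, here is a minimal stand-in that uses constant-velocity extrapolation. This is an assumption made for illustration, not the learned motion prior of the model; the function name `constant_velocity_prior` is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (center x, center y, width, height)


def constant_velocity_prior(history: List[Box]) -> Box:
    """Predict the next box from past boxes (oldest first).

    Stand-in for a learned motion prior: the center moves by the same
    offset as between the last two frames, and the size is kept unchanged.
    """
    if not history:
        raise ValueError("history must contain at least one box")
    if len(history) == 1:
        return tuple(history[-1])
    prev, last = history[-2], history[-1]
    cx = last[0] + (last[0] - prev[0])
    cy = last[1] + (last[1] - prev[1])
    return (cx, cy, last[2], last[3])
```

A learned model would replace the body of this function but keep the same interface: past boxes in, one predicted box out.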
Appearance comparison network
The appearance comparison network outputs a tracking score given a region proposal and the tracking history. In each frame, we choose the bounding box with the largest tracking score.
The appearance comparison network takes a set of region proposals and the frame image as input. For each region proposal, it predicts the intersection over union (IOU) with the ground truth bounding box, which serves as a confidence score.
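The IOU that the network is trained to predict can be computed directly when the ground truth box is known. The sketch below assumes boxes are given as corner coordinates (x1, y1, x2, y2); the helper names `iou` and `iou_targets` are illustrative and not taken from the project.

```python
from typing import List, Tuple

CornerBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(box_a: CornerBox, box_b: CornerBox) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def iou_targets(proposals: List[CornerBox], gt_box: CornerBox) -> List[float]:
    """Regression targets: IOU of every proposal with the ground truth box."""
    return [iou(p, gt_box) for p in proposals]
```

Boxes stored as (center x, center y, width, height) would need to be converted to corners before calling `iou`.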
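Note: region proposals are still needed here, which seems somewhat at odds with the earlier statement "In our model, the detection is used for initializing a tracking trajectory, while for tracking itself no detection is needed." On reflection, it may be that every frame goes through both a detection step and a tracking step.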
On top, we design a novel recurrent neural network (RNN) that uses the temporal information from previous time steps and the image features from the current time step. At each time step, the RNN computes the IOU prediction of each region proposal from the current hidden state.
When tracking a single object, we first predict the most likely bounding box in the current frame and sample 256 region proposals from the pool of selective search results. For each region proposal, we predict an IOU score, and we choose the proposal with the highest score in the current frame.
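The per-frame selection step can be sketched as follows. Several details are assumptions made for illustration: proposals are sampled uniformly from the pool (the actual sampling may depend on the predicted box), the scores are supplied by the caller in place of the network, and the `min_score` cutoff for declaring that no box matches is a hypothetical value.

```python
import random
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]


def sample_proposals(pool: Sequence[Box], k: int = 256,
                     seed: Optional[int] = None) -> List[Box]:
    """Draw up to k proposals from the selective search pool (uniform sampling assumed)."""
    if len(pool) <= k:
        return list(pool)
    return random.Random(seed).sample(list(pool), k)


def select_best(proposals: Sequence[Box], scores: Sequence[float],
                min_score: float = 0.5) -> Tuple[Optional[Box], Optional[float]]:
    """Return the highest-scoring proposal, or None if no score reaches min_score.

    min_score is an assumed threshold, not a value from the project.
    """
    if not proposals:
        return None, None
    best = max(range(len(proposals)), key=lambda i: scores[i])
    if scores[best] < min_score:
        return None, scores[best]
    return proposals[best], scores[best]
```

Returning `None` when every score is low gives the caller a way to keep a trajectory alive without updating its box for that frame.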
We train our model using the Adam optimizer with default hyperparameters on a Titan X GPU. Each iteration takes 0.3 seconds. The training loss decreased to 0.0781 after 10 epochs (22130 iterations).
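As a quick check on these figures, the iteration count and per-iteration time imply 2213 iterations per epoch and roughly 1.8 hours of total training time. The calculation below assumes every iteration takes the same 0.3 seconds and ignores any evaluation or data-loading overhead.

```python
iterations = 22130
epochs = 10
seconds_per_iteration = 0.3

iterations_per_epoch = iterations // epochs          # iterations in one epoch
total_seconds = iterations * seconds_per_iteration   # total wall-clock estimate
total_hours = total_seconds / 3600

print(iterations_per_epoch, round(total_seconds), round(total_hours, 2))
```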
When the object is occluded by others, none of the boxes gives a high enough score. When the object shows up again, the tracker reidentifies it by predicting a high tracking score around the correct bounding box:
A failure case: after the target object is occluded, the tracker wrongly switches to the occluding object: