Video Object Detection

Everything I know about video detection & tracking.
These two topics are NOT identical.
Feature extraction based 🆚 Metric learning based


Mostly used in video understanding, e.g. video anomaly detection, event recognition, content search…
Extract global action & scene information


How to leverage temporal information?

Tracking: extract template features, match them against the feature map, search
Detection in video:

  1. frame by frame

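  2. Use temporal information as evidence for the class decision
    Use an LSTM to propagate temporal information (any context information?)

  3. Use temporal information to predict where the object may appear, with uncertainty
    Fuse detected position + predicted position with uncertainty
    Multi hypothesis tracking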

Detect to Track and Track to Detect Papers:

Detect to Track and Track to Detect
Integrated Object Detection and Tracking with Tracklet-Conditioned Detection

video object segmentation hot topic. datasets: youtube-VOS, DAVIS

  1. Spatiotemporal CNN for Video Object Segmentation (uses LSTM, two branches, attention mechanism)
  2. See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks (applies co-attention)
  3. Fast User-Guided Video Object Segmentation by Interaction-And-Propagation Networks
  4. RVOS: End-To-End Recurrent Network for Video Object Segmentation
  5. BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames (the guidance frame does not have to be the first frame; selects the best frame in the training sets with a frame-ranking mechanism)
  6. FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation (uses pixel-wise embedding and a global & local matching mechanism to transfer information from the first and previous frames to the current frame)
  7. Object Discovery in Videos as Foreground Motion Clustering (models VOS as foreground motion clustering: clusters foreground pixels into different objects; uses an RNN to learn embeddings of foreground pixel trajectories and adds pixel correspondences across frames)
  8. MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation (multiple hypothesis tracking)

Re-ID in video
Attribute-Driven Feature Disentangling and Temporal Aggregation for Video Person Re-Identification (attribute-driven feature disentangling & frame re-weighting)
VRSTC: Occlusion-Free Video Person Re-Identification (uses temporal information to recover occluded frames)

fusion spatial and temporal feature, using weighted sum, optical flow
Accel: A Corrective Fusion Network for Efficient Semantic Segmentation on Video

unsupervised manner add other training signal

weakly-supervised manner use motion and video clue to generate more precise proposals.
You Reap What You Sow: Using Videos to Generate High Precision Object Proposals for Weakly-Supervised Object Detection

graph convolution network perform temporal reasoning

Downsampling is sometimes beneficial in terms of accuracy, by 1) removing unnecessary detail and 2) shrinking overly large objects, which increases confidence
AdaScale: Towards Real-time Video Object Detection Using Adaptive Scaling

Utilize temporal information: 1. warp features with temporal information to generate future features 2. handle partial occlusion and motion blur in video

iteratively refine
STEP: Spatio-Temporal Progressive Learning for Video Action Detection
refine the proposal to action, step by step. Spatial-temporal: spatial displacement + action tube(temporal info)


  1. ImageNet VID: ILSVRC2017
    30 categories
    train 1952 snippets, 405014 (186358+218656) images
    test 458 snippets, 127618 images
    val 281 snippets, 64698 images
    train 3862 snippets, 1122397 images
    test 937 snippets, 315176 images
    val 555 snippets, 176126 images
  2. Youtube-BB
    5.6M bounding boxes
    240k snippets (380k in paper, about 19s long)
    23 categories, NONE category for unseen category
    Annotate video with 1 frame per second
  4. UAVDT
  5. MOT challenge (Design for MOT)


Integrated Object Detection and Tracking with Tracklet-Conditioned Detection
Tracklet-Conditioned Detection+DCNv2+FGFA

Integrate tracking into detection itself, not as post-processing.
Compute embeddings for each tracked trajectory and each detection box, then use embedding-based weights to sum the trajectory category confidence with the detection category confidence.

Weight = f(embeddings)
Update trajectory confidence with new + old
Class confidence = trajectory confidence + det confidence
Output = weighted-sum(weights*Class confidence)

Only the category is affected: it is determined jointly, as a weighted combination of the previous trajectory category and the detected box category.
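
A minimal sketch of the weighted combination described above, not the paper's actual formulation. The softmax over cosine similarity used for `Weight = f(embeddings)`, the running-average update, and all function and parameter names are assumptions made for illustration.

```python
import numpy as np

def tracklet_conditioned_scores(det_scores, det_emb, trk_scores, trk_embs):
    """det_scores: (C,) class confidences of one detection box.
    det_emb: (D,) embedding of that box.
    trk_scores: (T, C) class confidences of T tracklets.
    trk_embs: (T, D) embeddings of the T tracklets."""
    # Weight = f(embeddings): assumed here to be a softmax over cosine similarity.
    norms = np.linalg.norm(trk_embs, axis=1) * np.linalg.norm(det_emb) + 1e-12
    sim = trk_embs @ det_emb / norms
    weights = np.exp(sim - sim.max())
    weights /= weights.sum()
    # Class confidence = trajectory confidence + detection confidence (per tracklet).
    combined = trk_scores + det_scores[None, :]
    # Output = weighted sum over tracklets.
    return weights @ combined, weights

def update_tracklet(old_scores, new_scores, momentum=0.5):
    """Update trajectory confidence with new + old (assumed running average)."""
    return momentum * old_scores + (1.0 - momentum) * new_scores
```

Only class scores are touched here; box coordinates are left to the detector, matching the note that only the category is decided jointly.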

code released Flow-Guided Feature Aggregation for Video Object Detection
mAP=80.1, 2017
code released


  1. No keyframe: use an LSTM to generate the detection result directly
    Every frame is an input image; the LSTM maps it to a hidden layer and outputs the bbox.
  2. Keyframe: run the deep network only on keyframes and warp their features to generate the feature maps of the frames in between (based on optical flow)
    👆 How to get feature map with low cost
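
A minimal sketch of the keyframe idea: instead of running the deep network on an intermediate frame, its feature map is obtained by bilinearly warping the keyframe feature map along a flow field. The flow convention (each current-frame position stores the displacement to the keyframe position it should sample) and the function name are assumptions; a real system would also need a flow estimator.

```python
import numpy as np

def warp_features(key_feat, flow):
    """key_feat: (H, W, C) feature map of the keyframe.
    flow: (H, W, 2), flow[y, x] = (dx, dy) from the current-frame position
    to the keyframe position to sample (assumed convention)."""
    h, w, _ = key_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Sampling positions, clamped to the feature-map border.
    sx = np.clip(xs + flow[..., 0], 0, w - 1)
    sy = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(sx).astype(int)
    y0 = np.floor(sy).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    ax = (sx - x0)[..., None]
    ay = (sy - y0)[..., None]
    # Bilinear interpolation between the four neighbouring feature vectors.
    top = (1 - ax) * key_feat[y0, x0] + ax * key_feat[y0, x1]
    bottom = (1 - ax) * key_feat[y1, x0] + ax * key_feat[y1, x1]
    return (1 - ay) * top + ay * bottom
```

With a zero flow the keyframe features are returned unchanged, which is the degenerate case of a static scene.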

👇 How to get box with previous information

  1. Tracking based: detect by tracking and track by detection

Detection and Tracking

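Doing video detection latex_equ avoiding tracking: static objects, classification, 3D boxes, feature propagation with an LSTM (a single frame performs poorly; multi-frame sequences do better)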

Why temporal information is not leveraged in tracking?

Propagate sharp information across frames to counter motion blur

Topics in Workshop

Large scale surveillance video: GigaVision

Autonomous driving: Workshop on autonomous driving (3D bounding box, Baidu Apollo)

Aerial image (remote sensor): Detecting Objects in Aerial Images (DOAI)
Challenges: 1. Scale variance 2. Small, densely distributed objects 3. Arbitrary orientation

UAVision: UAV 1920x1080, 15m, 2min, no classification

MOT: BMTT MOTChallenge 2019

ReId, Multi-target multi-camera tracking: Target Re-identification and Multi-Target Multi-Camera Tracking

Autonomous driving:
D2-city: 10k video, 1k for tracking, HD
BDD100k: 100k videos, annotation on keyframes, 40s, 720p 30fps (You Can Now Download the World’s Largest Self-Driving Dataset)
nuScenes: 1.4M frames, 3D box annotation
Other autonomous driving datasets: Oxford Robotcar, TorontoCity, KITTI, Apollo Scape (1M), Waymo Open Dataset (16.7h, 600k frame, 22m 2D-bbox)

Papers at ECCV18

Temporal information for classifying: Multi-Fiber Networks for Video Recognition (ECCV18)
All: Fully Motion-Aware Network for Video Object Detection
Video Object Detection with an Aligned Spatial-Temporal Memory
Hard example mining: Unsupervised Hard Example Mining from Videos for Improved Object Detection
Sampling?: Object Detection in Video with Spatiotemporal Sampling Networks
3D Tracking & Trajectory: 3D Vehicle Trajectory Reconstruction in Monocular Video Data Using Environment Structure Constraints

RCNN -> Fast RCNN: replace resizing with RoI pooling, compute the feature map only once (RoI projection), multi-task training (bbox regression and classification trained together)
Fast RCNN -> Faster RCNN: replace selective search with an RPN

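Compared with two-stage detectors, one-stage detectors have no RoI pooling step: once the boxes are obtained, classification and regression are done directly on the feature map of the whole image rather than inside each box. This can lead to feature misalignment.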


Object Detection in Video with Spatiotemporal Sampling Networks





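Deformable Convolution: offsets computed from the data make the receptive field of the convolution variable. Sampling is no longer restricted to the fixed grid around the centre {(-1,-1),(-1,0),(-1,1),...,(1,0),(1,1)}, i.e. latex_equ, but can be latex_equ. Here latex_equ is fractional, so the value is computed with bilinear interpolation.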
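
A minimal sketch of the sampling step, assuming a single-channel feature map, a 3x3 kernel and offsets that are simply passed in (in a real layer they are predicted by an extra convolution). The function names are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Value of the 2-D array feat at a fractional position (y, x).
    Samples that fall outside the map contribute zero."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                value += (1 - abs(y - yy)) * (1 - abs(x - xx)) * feat[yy, xx]
    return value

def deformable_conv_at(feat, kernel, cy, cx, offsets):
    """Response of a 3x3 deformable convolution centred at (cy, cx).
    offsets: (9, 2) array of learned (dy, dx) shifts, one per kernel tap."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        oy, ox = offsets[k]
        # Regular grid position plus the fractional offset.
        out += kernel[dy + 1, dx + 1] * bilinear_sample(feat, cy + dy + oy, cx + dx + ox)
    return out
```

With all offsets set to zero this reduces to an ordinary 3x3 convolution response at that position.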

Spatiotemporal Sampling Network

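The feature maps of the K frames before and after the current frame are selected for fusion; the current frame is the reference frame and the other frames are supporting frames.

  1. Apply deformable convolution four times when computing the features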






And so on...



  2. At fusion time, fuse the K preceding and K following frames.


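A three-layer subnetwork S computes an intermediate representation of g; the weight is the exp of the cosine similarity. A fusion weight is computed for every pixel p of every supporting frame before and after the reference frame.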


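After normalization the features are fused: a weighted sum over the time range t-K to t+K gives the fused feature of every pixel in the reference frame (time t), which is then fed to the detection network.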
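
A minimal sketch of this aggregation, assuming the sampled features g and their embeddings S(g) are already available as arrays for all 2K+1 frames. The array layout and the function name are assumptions.

```python
import numpy as np

def aggregate_features(sampled, embedded, ref_index):
    """sampled:  (T, H, W, C) features g of all T = 2K+1 frames.
    embedded: (T, H, W, D) intermediate representations S(g).
    ref_index: position of the reference frame inside the T frames."""
    ref = embedded[ref_index]
    # Per-pixel cosine similarity between every frame and the reference frame.
    dot = (embedded * ref[None]).sum(axis=-1)
    norm = np.linalg.norm(embedded, axis=-1) * np.linalg.norm(ref, axis=-1)[None] + 1e-12
    weights = np.exp(dot / norm)                      # (T, H, W)
    # Normalize over time so that the weights of each pixel sum to 1.
    weights /= weights.sum(axis=0, keepdims=True)
    fused = (weights[..., None] * sampled).sum(axis=0)
    return fused, weights
```

The reference frame always receives the largest weight at each pixel, because its cosine similarity with itself is the maximum possible value.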


Looking Fast and Slow: Memory-Guided Mobile Video Object Detection

Using memory(LSTM) in object detection

SOTA of ImageNet VID

Concern more on light-weight and low computation time.




Use two feature extractors in parallel (accuracy 🆚 speed)





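latex_equ is the selected feature extractor network, and m is the memory module.


Define latex_equ latex_equ as a hyperparameter; it can also be obtained through the interleaving policy

Other methods: reduce depth to 0.35, lower the resolution to 160x160, SSDLite, restrict the anchor aspect ratios to latex_equ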

memory module


Modified LSTM module👆:

  1. skip connection between the bottleneck and output
  2. grouped convolution process LSTM state groups separately

Ps. standard LSTM👇


To preserve long-term dependencies latex_equ skip state update: when latex_equ is run, always reuse the output state from the last time latex_equ was run


Pretrain LSTM on Imagenet Cls for initialization

Unroll LSTM to six steps

Randomly select the feature extractor

Crop and shift to augment training data

Adaptive Interleaving Policy(RL)

Policy network latex_equ to measure detection confidence, examines LSTM state and decide next feature extractor to run

Train policy network using Double Q-learning(DDQN)

Action space: latex_equ at next step

State space: latex_equ, LSTM states and their changes, action history term latex_equ (binary vector, len=20).

Reward space: a speed reward (positive reward when latex_equ is run) plus an accuracy reward (loss difference relative to the min-loss extractor).


Policy network to decide which extractor to run👇


Generate batches of latex_equ by running the interleaved network in inference mode

Training process👇


Inference Optimization

  1. Asynchronous mode

latex_equ and latex_equ run in separate threads: latex_equ keeps producing detections and latex_equ updates the memory when it finishes its computation. The memory module uses the most recent available memory and does NOT wait for the slow extractor.

Potential weakness: latency/mismatch between calling the large extractor and getting an accurate memory output. When a hard example is encountered, the more powerful memory from the large extractor arrives with a delay; the memory remains less powerful until the large extractor generates a new one.

  2. Quantization



ImageNet VID val👆


👆RL demonstration: red means call large model, blue for small model.

Object detection in videos with Tubelet Proposal Networks

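How to efficiently generate proposals along the temporal dimension (aka tubelets)?
Generate all proposals of a sequence from the keyframe detection results (detect by track). Then classify them with an LSTM.

There are two ways to generate tubelets: 1. Motion-based (only for short-term) 2. Appearance-based (tracking, expensive/?)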


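↖️ First run detection on static images to get detection results, then pool at the same location across different time steps to obtain spatial anchors. This relies on the assumption that the receptive field is large enough to capture the features of the moving object (the centre does not move out of the object box). After alignment, the features are used to predict the object's movement.

A Tubelet Proposal Network (a regression network) predicts the motion relative to the first frame (to prevent drift, i.e. accumulated error, during tracking). The predicted temporal sequence length is omega.

The GT bboxes are taken as the supervision signal for the tubelet proposals. The motion representation is also normalized (the network learns the normalized residuals).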
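
A minimal sketch of how such regression targets could be built from GT boxes. The (cx, cy, w, h) box format, the Faster R-CNN style parametrisation (centre shifts divided by the first-frame size, log size ratios) and the mean/std normalisation are assumptions for illustration.

```python
import numpy as np

def relative_motion(boxes):
    """boxes: (T, 4) GT boxes of one object as (cx, cy, w, h).
    Returns (T, 4) motion of every frame relative to the FIRST frame."""
    boxes = np.asarray(boxes, dtype=float)
    cx1, cy1, w1, h1 = boxes[0]
    return np.stack([
        (boxes[:, 0] - cx1) / w1,      # horizontal shift in units of first-frame width
        (boxes[:, 1] - cy1) / h1,      # vertical shift in units of first-frame height
        np.log(boxes[:, 2] / w1),      # log width ratio
        np.log(boxes[:, 3] / h1),      # log height ratio
    ], axis=1)

def normalize(motion, mean, std):
    """Normalized targets that the regression network learns."""
    return (motion - mean) / std

def decode(normalized, mean, std, first_box):
    """Invert normalize + relative_motion to recover boxes."""
    m = normalized * std + mean
    cx1, cy1, w1, h1 = first_box
    return np.stack([
        m[:, 0] * w1 + cx1,
        m[:, 1] * h1 + cy1,
        np.exp(m[:, 2]) * w1,
        np.exp(m[:, 3]) * h1,
    ], axis=1)
```

Because every target refers to the first frame rather than to the previous frame, an error at one time step is not carried into the next one.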



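First train a TPN that predicts a temporal sequence of length 2, which gives the parameters W_2 and b_2. The motion of the second frame m_2 is predicted from the feature maps of frames 1 and 2, the motion of the third frame from the feature maps of frames 1 and 3, and m_4 from frames 1 & 4. The intermediate frames are not involved, so the predictions are assumed to be similar (1&2 -> m2, 1&3 -> m3), and W_2 and b_2 can be used to initialize part of the blocks of the parameters W_3 and b_3👇

Finally, iterate to generate the tubelet proposals of all static anchors for all frames👇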
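
A minimal sketch of the block initialisation, assuming a single linear regression layer whose input is the concatenation of per-frame features and whose output is one 4-dim motion vector per predicted frame. This layout, the zero fill of the remaining blocks and the function name are assumptions, not the paper's exact parameterisation.

```python
import numpy as np

def init_w3_from_w2(w2, b2, feat_dim):
    """w2: (4, 2 * feat_dim), maps [f1, f2] -> m_2.  b2: (4,).
    Returns w3: (8, 3 * feat_dim) mapping [f1, f2, f3] -> [m_2, m_3], and b3: (8,)."""
    f = feat_dim
    w3 = np.zeros((8, 3 * f))
    # Rows 0..3 predict m_2 from frames 1 and 2: copy W_2 as is.
    w3[0:4, 0:f] = w2[:, 0:f]
    w3[0:4, f:2 * f] = w2[:, f:2 * f]
    # Rows 4..7 predict m_3 from frames 1 and 3: reuse W_2, with the
    # "second frame" block moved to the columns of frame 3.
    w3[4:8, 0:f] = w2[:, 0:f]
    w3[4:8, 2 * f:3 * f] = w2[:, f:2 * f]
    b3 = np.concatenate([b2, b2])
    return w3, b3
```

Right after this initialisation the length-3 network predicts m_3 from frames 1 and 3 exactly as the length-2 network would predict m_2 from frames 1 and 2; further training can then fill in the blocks that were left at zero.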


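↗️ The features of the tubelet proposals after RoI pooling are fed into a one-layer LSTM encoder; the memory and hidden states are then passed to the decoder, which outputs the class predictions in reverse order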

IoU tracker

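For a given frame, and for each trajectory currently being tracked, find the detection in the current frame with the largest IoU. If that IoU is above the threshold, add the detection to the trajectory; if even the largest IoU is not above the threshold, check the length and the highest confidence of the trajectory to decide whether to remove it from T_a and add it to the set of finished trajectories T_f. The object is then considered to have disappeared / tracking is finished.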
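
A minimal sketch of the procedure above. The greedy per-trajectory matching order, the parameter names `sigma_iou`, `sigma_h`, `t_min` and the dictionary layout of a trajectory are assumptions chosen for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def iou_tracker(frames, sigma_iou=0.5, sigma_h=0.8, t_min=2):
    """frames: list of per-frame detections, each detection is (box, score).
    Returns the finished trajectories (the set T_f)."""
    active, finished = [], []          # T_a and T_f
    for dets in frames:
        dets = list(dets)
        still_active = []
        for track in active:
            # Detection with the largest IoU to the last box of this trajectory.
            best = max(dets, key=lambda d: iou(track["boxes"][-1], d[0]), default=None)
            if best is not None and iou(track["boxes"][-1], best[0]) >= sigma_iou:
                track["boxes"].append(best[0])
                track["max_score"] = max(track["max_score"], best[1])
                dets.remove(best)
                still_active.append(track)
            elif track["max_score"] >= sigma_h and len(track["boxes"]) >= t_min:
                # Long and confident enough: move from T_a to T_f.
                finished.append(track)
        # Every unmatched detection starts a new trajectory.
        active = still_active + [{"boxes": [d[0]], "max_score": d[1]} for d in dets]
    # Close the trajectories that are still active after the last frame.
    finished += [t for t in active if t["max_score"] >= sigma_h and len(t["boxes"]) >= t_min]
    return finished
```

Trajectories that are too short or never reach a confident detection are dropped silently, which filters out most false positives without any appearance model.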

Multiple Hypothesis Tracking


The observation of each frame spawns a track tree; observations that fall inside the gating area are added as its child nodes

Mahalanobis Distance

Measure the distance between a vector(point) and a distribution

Why use Mahalanobis distance?

  1. normalized:
    normalize the distribution into latex_equ
  2. considers all the sample points in the distribution, not only its center, which matters especially when the random variables are correlated.

How is Mahalanobis distance different from Euclidean distance?

  1. It transforms the columns into uncorrelated variables
  2. Scale the columns to make their variance equal to 1
  3. Finally, it calculates the Euclidean distance.

$x$ is the observation
$\mu$ is the mean value of the independent variables
$\Sigma^{-1}$ is the inverse of the covariance matrix
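
A minimal sketch showing that the three steps above (decorrelate, rescale to unit variance, take the Euclidean distance) give the same number as the direct form sqrt((x - mean)^T cov^-1 (x - mean)). The function names are hypothetical and the eigendecomposition is just one possible way to whiten.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Direct form: sqrt((x - mean)^T cov^-1 (x - mean))."""
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    return float(np.sqrt(d @ np.linalg.solve(cov, d)))

def mahalanobis_by_whitening(x, mean, cov):
    """Same distance computed with the three steps of the list above."""
    d = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(np.asarray(cov, dtype=float))
    z = eigvecs.T @ d              # 1. rotate into uncorrelated coordinates
    z = z / np.sqrt(eigvals)       # 2. scale every coordinate to unit variance
    return float(np.linalg.norm(z))  # 3. plain Euclidean distance
```

For example, with variances 4 and 1 on the two axes, the point (2, 0) is at distance 1 while (0, 2) is at distance 2, although both are equally far from the mean in Euclidean terms.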
Kalman Filter: an estimation method

Why use a Kalman filter?

Estimate state of a system from different sources that may be subject to noise. Observe external, predict internal
Fuse the observations to estimate

Formulas (P.S. latex_equ denotes the derivative of x)
latex_equ, latex_equ
latex_equ, latex_equ
Multiply the p.d.f. of the predicted position by the p.d.f. of the measured position to form a new Gaussian distribution.
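
A minimal 1-D sketch of this product of two Gaussians, written in Kalman gain form. It covers only the fusion step, not the motion model or the prediction step, and the function name is hypothetical.

```python
def fuse_gaussians(mu_pred, var_pred, mu_meas, var_meas):
    """Mean and variance of the (renormalized) product of two 1-D Gaussians:
    the predicted position N(mu_pred, var_pred) and the measured position
    N(mu_meas, var_meas)."""
    gain = var_pred / (var_pred + var_meas)       # 1-D Kalman gain
    mu = mu_pred + gain * (mu_meas - mu_pred)     # pulled towards the more certain source
    var = (1.0 - gain) * var_pred                 # never larger than either input variance
    return mu, var
```

When both sources are equally uncertain the fused mean is their midpoint and the variance is halved; when the measurement is much more precise than the prediction, the fused estimate follows the measurement.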


latex_equ is the location of instance i at time k, which follows a Gaussian distribution with parameters latex_equ, latex_equ. latex_equ, latex_equ can be estimated via a Kalman filter.
Use the Mahalanobis distance between the observed location and the predicted location to decide whether to add the observation to the trajectory.
The threshold determines the extent of the gating area.
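
A minimal sketch of such a gating test for 2-D locations. The default threshold (the 99% quantile of a chi-square distribution with 2 degrees of freedom, about 9.21) and the function name are assumptions; any other threshold simply makes the gating area larger or smaller.

```python
import numpy as np

CHI2_99_2DOF = 9.21  # assumed default: 99% quantile of chi-square with 2 degrees of freedom

def in_gate(observed, predicted_mean, predicted_cov, threshold=CHI2_99_2DOF):
    """True if the observation lies inside the gating area of a track whose
    predicted location is Gaussian with the given mean and covariance."""
    d = np.asarray(observed, dtype=float) - np.asarray(predicted_mean, dtype=float)
    # Squared Mahalanobis distance between observation and prediction.
    d2 = float(d @ np.linalg.solve(predicted_cov, d))
    return d2 <= threshold
```

A more uncertain prediction (larger covariance) widens the gate, so the same observation can be rejected for a confident track and accepted for an uncertain one.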

Plug & Play Convolutional Regression Tracker for Video Object Detection

Add a light-weight tracker to the detector, reusing the features extracted by the detector
