Video Object Detection

All I know about video det&track.
These two topics are NOT identical.
Feature-extraction based 🆚 metric-learning based

Trends

LSTM
mostly used in video understanding, e.g. video anomaly detection, event recognition, content search…
Extract global action & scene information

Detect+Track

How to leverage temporal information?

Tracking: extract template features, match against feature maps, and locate the target
Detection in video:

  1. frame by frame

  2. Use temporal information as evidence for classification
    Use an LSTM to propagate temporal information (any context information?)

  3. Use temporal cues to predict likely locations, with uncertainty
    Fuse detected and predicted locations with uncertainty
    Multi-hypothesis tracking

Detect to Track and Track to Detect

Papers:

Detect to Track and Track to Detect https://github.com/feichtenhofer/Detect-Track
Integrated Object Detection and Tracking with Tracklet-Conditioned Detection

Video object segmentation is a hot topic. Datasets: YouTube-VOS, DAVIS

  1. Spatiotemporal CNN for Video Object Segmentation: uses an LSTM, a two-branch design, and an attention mechanism
  2. See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks: applies co-attention
  3. Fast User-Guided Video Object Segmentation by Interaction-And-Propagation Networks
  4. RVOS: End-To-End Recurrent Network for Video Object Segmentation
  5. BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames: the guidance frame does not have to be the first frame; select the best frame in the training set via a frame-ranking mechanism
  6. FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation: uses pixel-wise embeddings and a global & local matching mechanism to transfer information from the first and previous frames to the current frame
  7. Object Discovery in Videos as Foreground Motion Clustering: models VOS as foreground motion clustering, grouping foreground pixels into different objects; uses an RNN to learn embeddings of foreground-pixel trajectories and adds pixel correspondences across frames
  8. MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation: multiple hypothesis tracking

Re-ID in video
Attribute-Driven Feature Disentangling and Temporal Aggregation for Video Person Re-Identification: attribute-driven feature disentangling & frame re-weighting
VRSTC: Occlusion-Free Video Person Re-Identification: uses temporal information to recover occluded frames

Fuse spatial and temporal features using weighted sums and optical flow:
Accel: A Corrective Fusion Network for Efficient Semantic Segmentation on Video

Unsupervised manner: add additional training signals

Weakly-supervised manner: use motion and video cues to generate more precise proposals.
You Reap What You Sow: Using Videos to Generate High Precision Object Proposals for Weakly-Supervised Object Detection

Graph convolutional networks perform temporal reasoning

Downsampling is sometimes beneficial for accuracy, by 1) reducing unnecessary detail and 2) shrinking overly large objects and increasing confidence. AdaScale: Towards Real-Time Video Object Detection Using Adaptive Scaling

Utilize temporal information: 1. warp temporal information with features to generate future features; 2. handle partial occlusion and motion blur in video

Iterative refinement
STEP: Spatio-Temporal Progressive Learning for Video Action Detection
refines proposals into actions step by step. Spatio-temporal: spatial displacement + action tubes (temporal info)

Datasets

  1. ImageNet VID: ILSVRC2017
    30 categories
    2015:
    train 1952 snippets, 405014 (186358+218656) images
    test 458 snippets, 127618 images
    val 281 snippets, 64698 images
    2017:
    train 3862 snippets, 1122397 images
    test 937 snippets, 315176 images
    val 555 snippets, 176126 images
  2. Youtube-BB
    5.6M bounding boxes
    240k snippets (380k in paper, about 19s long)
    23 categories, plus a NONE category for unseen categories
    Videos annotated at 1 frame per second
  3. UA-DETRAC
  4. UAVDT
  5. MOT challenge (Design for MOT)

SOTAs

Integrated Object Detection and Tracking with Tracklet-Conditioned Detection
Tracklet-Conditioned Detection+DCNv2+FGFA
mAP=83.5

Integrates tracking into detection rather than as post-processing.
Compute embeddings for the tracking trajectory and the detection box; the trajectory's category confidence is combined with the detection's category confidence via a weighted sum, with weights derived from the embeddings.

Weight = f(embeddings)
Update trajectory confidence with new + old
Class confidence = trajectory confidence + det confidence
Output = weighted-sum(weights*Class confidence)

The category (only) is determined jointly, weighted by the previous trajectory category and the detection box category
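A minimal sketch of this weighted confidence fusion, assuming a hypothetical `weight_net` that maps the concatenated embeddings to a scalar in [0, 1]; names are illustrative, not the paper's code.

```python
import numpy as np

def fuse_class_confidence(det_scores, traj_scores, det_emb, traj_emb, weight_net):
    """det_scores/traj_scores: (num_classes,) softmax scores;
    det_emb/traj_emb: appearance embeddings;
    weight_net: callable mapping the concatenated embeddings to a scalar in [0, 1]."""
    w = weight_net(np.concatenate([det_emb, traj_emb]))  # fusion weight from embeddings
    fused = w * traj_scores + (1.0 - w) * det_scores     # weighted sum of confidences
    return fused / fused.sum()                           # renormalize to a distribution
```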

Flow-Guided Feature Aggregation for Video Object Detection
mAP=80.1, 2017
code released

Thoughts

  1. No keyframe: use an LSTM to directly generate detection results
    Input image -> every frame; the LSTM's hidden layer outputs bboxes.
  2. Keyframe: run the deep network only on keyframes and warp to generate the in-between frames' feature maps (based on optical flow)
    👆 How to get feature maps at low cost

👇 How to get boxes using previous information

  1. Tracking-based: detect by tracking and track by detection

Detection and Tracking

Doing video detection latex_equ while avoiding tracking: assume the object does not move, classify it, use 3D boxes, propagate features with an LSTM (a single frame performs poorly, a multi-frame sequence improves)
Static-image detection

Why temporal information is not leveraged in tracking?

Difficulties: inter-frame information, efficient propagation of temporal information
Propagate clean information and avoid motion blur
tubelets

Topics in Workshop

Large-scale surveillance video: GigaVision

Autonomous driving: Workshop on Autonomous Driving, 3D bounding boxes, Baidu Apollo

Aerial images (remote sensing): Detecting Objects in Aerial Images (DOAI)
Difficulties: 1. scale variance 2. small objects densely distributed 3. arbitrary orientation

UAVision: https://sites.google.com/site/uavision2019/home UAV 1920x1080, 15m, 2min, no classification

MOT: BMTT MOTChallenge 2019

ReId, Multi-target multi-camera tracking: Target Re-identification and Multi-Target Multi-Camera Tracking

Autonomous driving: https://sites.google.com/view/wad2019/challenge
D2-City: 10k videos, 1k for tracking, HD
BDD100K: 100k videos, annotations on keyframes, 40s each, 720p 30fps. You Can Now Download the World's Largest Self-Driving Dataset
nuScenes: 1.4M frames, 3D box annotation
Other autonomous driving datasets: Oxford Robotcar, TorontoCity, KITTI, Apollo Scape (1M), Waymo Open Dataset (16.7h, 600k frame, 22m 2D-bbox) https://scale.com/open-datasets

Papers at ECCV18

Temporal information for classification: Multi-Fiber Networks for Video Recognition (ECCV18)
Fully Motion-Aware Network for Video Object Detection
Video Object Detection with an Aligned Spatial-Temporal Memory
Hard example mining: Unsupervised Hard Example Mining from Videos for Improved Object Detection
Sampling: Object Detection in Video with Spatiotemporal Sampling Networks
3D tracking & trajectory: 3D Vehicle Trajectory Reconstruction in Monocular Video Data Using Environment Structure Constraints

RCNN -> Fast RCNN: use RoI pooling instead of resizing, compute the feature map only once (RoI projection), multi-task training (bbox regression and classification trained together)
Fast RCNN -> Faster RCNN: replace selective search with an RPN


Compared with two-stage detectors, one-stage detectors skip RoI pooling: boxes are classified and regressed directly on the whole-image feature map rather than within each box, which can lead to feature misalignment.


Papers

Object Detection in Video with Spatiotemporal Sampling Networks

Uses an approach similar to FGFA, but adds deformable convolutions and simplifies the step of computing the other frames' features and weights.

Motivation

Removes the optical-flow data needed during training and speeds up (training).

Approach

Deformable Convolution: offsets computed from the data make the convolution's receptive field adaptive. Sampling locations are no longer restricted to the fixed center-based grid {(-1,-1),(-1,0),(-1,1),...,(1,0),(1,1)}, i.e. latex_equ, but can be latex_equ, where the offsets latex_equ are fractional and handled with bilinear interpolation.
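A small sketch of the fractional sampling that deformable convolution relies on: a standalone bilinear lookup (not the full layer); coordinates are assumed to lie inside the feature map.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """feature_map: (H, W) array; (y, x): fractional coordinates inside the map."""
    h, w = feature_map.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # weight the four surrounding integer pixels by their distance to (y, x)
    return ((1 - dy) * (1 - dx) * feature_map[y0, x0]
            + (1 - dy) * dx * feature_map[y0, x1]
            + dy * (1 - dx) * feature_map[y1, x0]
            + dy * dx * feature_map[y1, x1])
```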

Spatiotemporal Sampling Network

Select the feature maps of the K frames before and after for fusion; the current frame is the reference frame and the other frames are supporting frames.

  1. Apply four deformable convolutions when computing features

latex_equ

latex_equ

latex_equ

latex_equ

latex_equ

And so on...

But for the last one, use the original

latex_equ

  2. For fusion, fuse the K frames before and after.

Compute the weight for frame t+k:

A three-layer subnetwork S computes intermediate representations from g, and the weights are the exponentials of the cosine similarities. A fusion weight is computed for every pixel p of each supporting frame before and after.

latex_equ

After normalization, fuse by a weighted sum over the temporal range t-K to t+K, obtaining the fused feature of each pixel in the reference frame (time t), which is fed into the detection network.
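A hedged numpy sketch of this per-pixel weighted aggregation; array names and shapes are assumptions, not the paper's code.

```python
import numpy as np

def aggregate_features(ref_embed, sup_embeds, sup_feats):
    """ref_embed: (C, H, W) embedded reference features;
    sup_embeds, sup_feats: (2K+1, C, H, W) embedded/raw supporting features."""
    ref_n = ref_embed / (np.linalg.norm(ref_embed, axis=0, keepdims=True) + 1e-6)
    sup_n = sup_embeds / (np.linalg.norm(sup_embeds, axis=1, keepdims=True) + 1e-6)
    weights = np.exp((ref_n[None] * sup_n).sum(axis=1))   # exp(cosine sim), (2K+1, H, W)
    weights /= weights.sum(axis=0, keepdims=True)         # normalize over the time axis
    return (weights[:, None] * sup_feats).sum(axis=0)     # fused feature, (C, H, W)
```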

Details


Looking Fast and Slow: Memory-Guided Mobile Video Object Detection

Using memory(LSTM) in object detection

SOTA of ImageNet VID

Concern more on light-weight and low computation time.

Use the lightweight MobileNet to recognize the gist of the scene; fast feature extraction requires maintaining a memory as supplementary information

An accurate feature extractor is used to initialize and maintain the memory, after which fast processing takes over; an LSTM maintains the memory. Reinforcement learning decides whether to use the fast or the slow feature extractor (a tradeoff)


Multi-branch feature extraction

Use two feature extractors in parallel (accuracy 🆚 speed)


Inference procedure

latex_equ

latex_equ

latex_equ is the selected feature-extraction network and m is the memory module.

latex_equ, where d is the SSD detection network

Define latex_equ and latex_equ as hyperparameters; they can also be obtained via the interleaving policy

Other methods: reduce the depth multiplier to 0.35, lower the resolution to 160x160, use SSDLite, constrain anchor aspect ratios to latex_equ

memory module

[figure: modified LSTM module]

Modified LSTM module👆:

  1. skip connection between the bottleneck and output
  2. grouped convolutions process LSTM state groups separately

Ps. standard LSTM👇

[figure: standard LSTM]

To preserve long-term dependencies, latex_equ skips the state update: when latex_equ runs, always reuse the output state from the last time latex_equ was run
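A minimal sketch of this skip-state-update rule, assuming a generic `lstm_cell(features, state) -> (output, state)` interface; not the paper's implementation.

```python
class SkippedStateMemory:
    """Only refresh the memory when the accurate extractor runs; frames handled by
    the fast extractor reuse the state saved at the last accurate-extractor step."""

    def __init__(self, lstm_cell):
        self.lstm_cell = lstm_cell
        self.state = None  # (c, h) from the last accurate-extractor run

    def step(self, features, ran_large_extractor):
        output, new_state = self.lstm_cell(features, self.state)
        if ran_large_extractor:
            self.state = new_state  # update memory only on large-extractor frames
        # otherwise keep self.state unchanged (skip the state update)
        return output
```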

Training

Pretrain the LSTM on ImageNet classification for initialization

Unroll the LSTM to six steps

Randomly select the feature extractor

Crop and shift to augment training data


Adaptive Interleaving Policy(RL)

A policy network latex_equ measures detection confidence, examines the LSTM state, and decides which feature extractor to run next

Train policy network using Double Q-learning(DDQN)

Action space: latex_equ at next step

State space: latex_equ, LSTM states and their changes, action history term latex_equ (binary vector, len=20).

Reward: a positive speed reward when latex_equ is run, plus an accuracy reward given by the loss difference from the minimum-loss extractor.


Policy network to decide which extractor👇

[figure: policy network architecture]

Generate batches of latex_equ by running the interleaved network in inference mode
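A minimal sketch of the decision step and reward described above, assuming a generic `q_network` callable and a made-up speed bonus; a sketch under those assumptions, not the paper's implementation.

```python
import numpy as np

def select_extractor(q_network, lstm_c, lstm_h, action_history, epsilon=0.05):
    """Pick the next feature extractor: 0 = large/accurate, 1 = small/fast."""
    state = np.concatenate([lstm_c.ravel(), lstm_h.ravel(), action_history])
    if np.random.rand() < epsilon:             # epsilon-greedy exploration
        return np.random.randint(2)
    q_values = q_network(state)                # shape (2,)
    return int(np.argmax(q_values))

def reward(action, losses, speed_bonus=1.0):
    """losses[k]: detection loss had extractor k been run on this frame."""
    speed_r = speed_bonus if action == 1 else 0.0   # reward running the fast net
    accuracy_r = -(losses[action] - min(losses))    # penalty vs. the best extractor
    return speed_r + accuracy_r
```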

Training process👇

[figure: training process]


Inference Optimization

  1. Asynchronous mode

latex_equ and latex_equ run in separate threads: latex_equ keeps detecting while latex_equ updates the memory once its computation finishes. The memory module uses the most recent available memory and does NOT wait for the slow extractor.

Potential weakness: latency/mismatch between calling the large extractor and receiving its accurate memory update. When a hard example appears, there is a delay before the large extractor produces a more powerful memory; until it does, the memory remains weaker.
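A rough sketch of the asynchronous mode using Python threads; `large_net`, `small_net`, `lstm.update`/`lstm.step`, and `detector` are assumed interfaces, not the paper's API.

```python
import threading
import queue

memory_lock = threading.Lock()
shared_memory = None                  # most recent LSTM memory state
frame_queue = queue.Queue(maxsize=1)  # frames handed to the large extractor

def large_extractor_worker(large_net, lstm):
    """Background thread: slowly refresh the shared memory."""
    global shared_memory
    while True:
        frame = frame_queue.get()
        feats = large_net(frame)               # slow, accurate features
        with memory_lock:
            shared_memory = lstm.update(feats, shared_memory)

def detect(frame, small_net, lstm, detector, send_to_large=False):
    """Main thread: run the fast path on every frame, never waiting for the slow one."""
    if send_to_large and not frame_queue.full():
        frame_queue.put(frame)                 # request an asynchronous memory refresh
    feats = small_net(frame)                   # fast features
    with memory_lock:
        memory = shared_memory                 # most recent available memory
    output, _ = lstm.step(feats, memory)
    return detector(output)

# start the worker once, e.g.:
# threading.Thread(target=large_extractor_worker, args=(large_net, lstm), daemon=True).start()
```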

  2. Quantization

Experiments

[table: results on ImageNet VID val]

[figure]

👆RL demonstration: red means call large model, blue for small model.


Object detection in videos with Tubelet Proposal Networks

How to efficiently generate proposals along the temporal dimension (aka ::tubelets::)?
Generate all proposals of a sequence from keyframe detections (::detect by track::), then classify with an LSTM.

There are two ways to generate tubelets: 1. Motion-based (only for short-term) 2. Appearance-based (tracking, expensive?)

Approach


↖️ First run static-image detection to obtain detections, then pool at the same location across different times to obtain spatial anchors. This assumes the receptive field is large enough to capture the moving object's features (the center does not move out of the object box). After alignment, these are used to predict object motion.

A Tubelet Proposal Network regression network predicts motion relative to the first frame (to avoid drift and accumulated error during tracking). The length of the predicted temporal sequence is omega.



Meanwhile, the ground-truth bboxes serve as the supervision signal for the tubelet proposals, and the motion representation is normalized (the normalized residuals are learned).

Loss function 👇

👆 M is the ground truth, M_hat is the normalized offset

Novelty: ::block-wise initialization::
First train a TPN with prediction length 2, obtaining parameters W_2 and b_2. Since the motion of frame 2, m_2, is predicted from the feature maps of frames 1 and 2, the motion of frame 3 from frames 1 and 3, and m_4 from frames 1 and 4, independently of the intermediate frames, the prediction processes are similar (1&2 -> m2, 1&3 -> m3). W_2 and b_2 can therefore be used to initialize a block of the parameters W_3 and b_3 👇

Finally, iterate to generate tubelet proposals for all static anchors over all frames 👇

LSTM for class prediction ↘️

↗️ Features of the RoI-pooled tubelet proposals are fed into a one-layer LSTM encoder; the memory and hidden states are then fed to a decoder, which outputs class predictions in reverse order
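A hedged PyTorch sketch of this encoder-decoder classification idea; the dimensions and layer names are assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class TubeletClassifier(nn.Module):
    """Encode per-frame RoI features forward in time, then decode the sequence
    in reverse order starting from the encoder's final (h, c) state."""

    def __init__(self, feat_dim=1024, hidden=512, num_classes=31):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=1, batch_first=True)
        self.decoder = nn.LSTM(feat_dim, hidden, num_layers=1, batch_first=True)
        self.cls_head = nn.Linear(hidden, num_classes)

    def forward(self, tubelet_feats):              # (B, T, feat_dim)
        _, (h, c) = self.encoder(tubelet_feats)    # encoder memory and hidden state
        reversed_feats = torch.flip(tubelet_feats, dims=[1])
        dec_out, _ = self.decoder(reversed_feats, (h, c))
        return self.cls_head(dec_out)              # (B, T, num_classes), reverse order
```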


IoU tracker


D denotes the detections; there are F frames, with at most N detections per frame
T_a denotes targets still being tracked; T_f denotes finished trajectories (moved out of the frame)
Idea
For a given frame, for each active trajectory, find the detection in the current frame with the highest IoU. If that IoU exceeds the threshold, add the detection to the trajectory; if even the highest IoU does not exceed the threshold, check the trajectory's length and highest confidence to decide whether to remove it from T_a and add it to the set of finished trajectories T_f (considered disappeared / tracking finished).
Continue with the next trajectory. Each remaining detection box starts a new trajectory.
Finally, check the length and highest confidence of the trajectories still in T_a to decide whether to add them to T_f.
T_f is the tracking result
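A compact sketch of this loop; the thresholds `sigma_iou`, `sigma_h`, `t_min` and the track fields are illustrative choices, not fixed by the notes above.

```python
def iou(a, b):
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def iou_track(frames, sigma_iou=0.5, sigma_h=0.7, t_min=3):
    """frames: list of per-frame detections, each a list of (box, score)."""
    active, finished = [], []
    for detections in frames:
        remaining = list(detections)
        for track in list(active):
            if remaining:
                best = max(remaining, key=lambda d: iou(track["boxes"][-1], d[0]))
                if iou(track["boxes"][-1], best[0]) >= sigma_iou:
                    track["boxes"].append(best[0])          # extend the trajectory
                    track["max_score"] = max(track["max_score"], best[1])
                    remaining.remove(best)
                    continue
            # could not be extended: keep it only if long and confident enough
            active.remove(track)
            if track["max_score"] >= sigma_h and len(track["boxes"]) >= t_min:
                finished.append(track)
        active += [{"boxes": [b], "max_score": s} for b, s in remaining]
    finished += [t for t in active
                 if t["max_score"] >= sigma_h and len(t["boxes"]) >= t_min]
    return finished
```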


Multiple Hypothesis Tracking

Build track trees

Each frame's observations spawn track trees; observations appearing within the gating area are added as child nodes
Add an extra branch to mark nodes where the track is lost

Mahalanobis Distance

Measures the distance between a vector (point) and a distribution

Why use Mahalanobis distance?

  1. normalized:
    normalize the distribution into latex_equ
  2. considers all the sample points in the distribution, not only its center, especially when the random variables are correlated.

How is Mahalanobis distance different from Euclidean distance?

  1. It transforms the columns into uncorrelated variables
  2. Scales the columns to make their variance equal to 1
  3. Finally, it calculates the Euclidean distance.

formula
$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$
$x$ is the observation
$\mu$ is the mean value of the independent variables
$\Sigma^{-1}$ is the inverse of the covariance matrix
Read more
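A direct numpy version of the formula above, with a small worked example.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Example: a point that is close in Euclidean terms but measured against a
# correlated distribution.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
print(mahalanobis(np.array([1.0, 1.0]), mean, cov))   # ≈ 1.054
```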

Kalman Filter: an estimation method

Why use kalman filter?

Estimates the state of a system from different sources that may be subject to noise: observe the external, predict the internal.
Fuse the observations into the estimate

formulas (p.s. latex_equ means the derivative of x)
latex_equ
latex_equ, latex_equ
latex_equ, latex_equ
subtract
latex_equ
Multiply the predicted position's p.d.f. and the measured position's p.d.f. to form a new Gaussian distribution. See more
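A tiny 1D predict/update sketch with made-up noise values, just to show the cycle (constant-position motion model assumed).

```python
def kalman_1d(measurements, q=1e-3, r=0.5**2):
    """q: process-noise variance, r: measurement-noise variance (illustrative values)."""
    x, p = measurements[0], 1.0        # initial state estimate and variance
    estimates = []
    for z in measurements:
        # predict: state unchanged under the constant-position model, uncertainty grows
        p = p + q
        # update: the Kalman gain balances prediction variance vs. measurement noise
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return estimates

print(kalman_1d([1.1, 0.9, 1.2, 0.8, 1.0]))
```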

Gating

latex_equ denotes instance i's location at time k, following a Gaussian distribution with parameters latex_equ, latex_equ, which can be estimated via a Kalman filter.
Use the Mahalanobis distance between the observed and predicted locations to decide whether to add the observation to the trajectory.
latex_equ
The threshold determines the extent of the gating area.
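A hedged sketch of the gating test; using the chi-square quantile as the threshold is a common choice assumed here (requires SciPy), not something fixed by the notes above.

```python
import numpy as np
from scipy.stats import chi2

def in_gate(observation, predicted_mean, predicted_cov, prob=0.99):
    """Accept an observation for a trajectory only if its squared Mahalanobis
    distance to the Kalman-predicted location falls inside the gate."""
    diff = observation - predicted_mean
    d2 = diff @ np.linalg.inv(predicted_cov) @ diff       # squared Mahalanobis distance
    return d2 <= chi2.ppf(prob, df=observation.shape[0])  # gate threshold
```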

Plug & Play Convolutional Regression Tracker for Video Object Detection

Adds a light-weight tracker into the detector, reusing the features extracted by the detector
