Video Object Detection

All I know about video det&track.
These two topics are NOT identical.
Feature extraction based 🆚 Metrics learning based

Trendings

LSTM
mostly used in video understanding, e.g. video anomaly detection, event recognition, finding content…
Extract global action & scene information

Detect+Track

How to leverage temporal information?

Tracking: extract template features, match them against the feature map, find the target
Detection in video:

frame by frame
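Use temporal information as evidence for the class decision
Use an LSTM to propagate temporal information (any context information?)
Use temporal information to predict likely positions, with uncertainty
Fuse detected position + predicted position with uncertainty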
Multi hypothesis tracking

Detect to Track and Track to Detect Papers:

Detect to Track and Track to Detect https://github.com/feichtenhofer/Detect-Track
Integrated Object Detection and Tracking with Tracklet-Conditioned Detection

video object segmentation hot topic. datasets: youtube-VOS, DAVIS

Spatiotemporal CNN for Video Object Segmentation use LSTM, two branch, attention mechanism
See More, Know More: Unsupervised Video Object Segmentation With Co-Attention Siamese Networks apply co-attention
Fast User-Guided Video Object Segmentation by Interaction-And-Propagation Networks
RVOS: End-To-End Recurrent Network for Video Object Segmentation
BubbleNets: Learning to Select the Guidance Frame in Video Object Segmentation by Deep Sorting Frames does not have to be first frame, and select the best frame in the training sets, ranking frame mechanism
FEELVOS: Fast End-To-End Embedding Learning for Video Object Segmentation use pixel-wise embedding and global&local matching mechanism to transfer the information from first and previous to current frame
Object Discovery in Videos as Foreground Motion Clustering model the VOS problem as foreground motion clustering, cluster foreground pixel into different object. Use RNN to learn embedding of fore-pixel trajectory, add correspondence of pixels in frames.
MHP-VOS: Multiple Hypotheses Propagation for Video Object Segmentation multiple hypothesis tracking

Re-ID in video
Attribute-Driven Feature Disentangling and Temporal Aggregation for Video Person Re-Identification attribute-driven feature disentangling & frame re-weighting
VRSTC: Occlusion-Free Video Person Re-Identification use temporal information to recover occluded frame

fusion spatial and temporal feature, using weighted sum, optical flow
Accel: A Corrective Fusion Network for Efficient Semantic Segmentation on Video

unsupervised manner add other training signal

weakly-supervised manner use motion and video clue to generate more precise proposals.
You Reap What You Sow: Using Videos to Generate High Precision Object Proposals for Weakly-Supervised Object Detection

graph convolution network perform temporal reasoning

downsampling is sometimes beneficial for accuracy, by 1) removing unnecessary detail and 2) shrinking overly large objects, which increases confidence. AdaScale: Towards Real-time Video Object Detection Using Adaptive Scaling

utilize temporal information 1. warp temporal info with features to generate future features 2. to handle partial occlusion and motion blur in video

iteratively refine
STEP: Spatio-Temporal Progressive Learning for Video Action Detection
refine the proposal to action, step by step. Spatial-temporal: spatial displacement + action tube(temporal info)

Datasets

ImageNet VID: ILSVRC2017
30 categories
2015:
train 1952 snippets, 405014 (186358+218656) images
test 458 snippets, 127618 images
val 281 snippets, 64698 images
2017:
train 3862 snippets, 1122397 images
test 937 snippets, 315176 images
val 555 snippets, 176126 images
Youtube-BB
5.6M bounding boxes
240k snippets (380k in paper, about 19s long)
23 categories, NONE category for unseen category
Annotate video with 1 frame per second
UA-DETRAC
UAVDT
MOT challenge (Design for MOT)

SOTAs

Integrated Object Detection and Tracking with Tracklet-Conditioned Detection
Tracklet-Conditioned Detection+DCNv2+FGFA
mAP=83.5

Integrate tracking into detection rather than as post-processing
Compute embeddings for the tracked trajectories and the detection boxes, then combine trajectory category confidence with detection category confidence through an embedding-weighted sum.

Weight = f(embeddings)
Update trajectory confidence with new + old
Class confidence = trajectory confidence + det confidence
Output = weighted-sum(weights*Class confidence)

Only the category is determined jointly, weighted between the latest trajectory category and the detection box category
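A rough sketch of the scoring steps listed above, not the paper's implementation. It assumes `f` is a softmax over cosine similarities between the detection embedding and each tracklet embedding, and it averages trajectory and detection confidence (instead of a plain sum) so scores stay in [0, 1]. The function names and all numbers are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def tracklet_conditioned_scores(det_emb, det_scores, tracklets):
    """tracklets: list of (embedding, class_scores) pairs."""
    # Weight = f(embeddings): softmax over cosine similarities (assumed f).
    sims = [cosine(det_emb, emb) for emb, _ in tracklets]
    exps = [math.exp(s) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Class confidence = trajectory confidence combined with detection
    # confidence (averaged here, an assumption), then weighted-summed.
    fused = [0.0] * len(det_scores)
    for w, (_, trk_scores) in zip(weights, tracklets):
        for c, (t, d) in enumerate(zip(trk_scores, det_scores)):
            fused[c] += w * 0.5 * (t + d)
    return fused

# One detection box, two existing tracklets, two classes.
tracklets = [([1.0, 0.0], [0.9, 0.1]), ([0.0, 1.0], [0.2, 0.8])]
fused = tracklet_conditioned_scores([1.0, 0.0], [0.6, 0.4], tracklets)
```

The tracklet whose embedding matches the box gets the larger weight, so its confident class-0 history pulls the fused class-0 score above the raw detection score.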

code released

Flow-Guided Feature Aggregation for Video Object Detection
mAP=80.1, 2017
code released

Thinkings

No keyframe: use an LSTM to directly generate the detection result
Input image -> every frame, LSTM to hidden layer and output bbox.
Keyframe: run the deep network only on keyframes and warp (based on optical flow) to generate the feature maps of the in-between frames
👆 How to get feature map with low cost

👇 How to get box with previous information

Tracking based: detect by tracking and tracking by detection

Detection and Tracking

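Do video detection $latex_equ$ avoid tracking: static objects, classification, 3D boxes, feature propagation with an LSTM (poor on a single frame, better over a multi-frame sequence)
Static-image detection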

Why temporal information is not leveraged in tracking?

Difficulty: efficient propagation of inter-frame (temporal) information
Propagate sharp information, avoid motion blur
tubelet

Topics in Workshop

-- Large scale surveillance video: GigaVision

— Autonomous driving: Workshop on autonomous driving 3D bounding box Baidu Apollos

— Aerial image (remote sensor): Detecting Objects in Aerial Images (DOAI)
Difficulties: 1. Scale variance 2. Small, densely distributed objects 3. Arbitrary orientation

— UAVision: https://sites.google.com/site/uavision2019/home UAV 1920x1080, 15m, 2min, no classification

— MOT: BMTT MOTChallenge 2019

— ReId, Multi-target multi-camera tracking: Target Re-identification and Multi-Target Multi-Camera Tracking

— Autonomous driving: https://sites.google.com/view/wad2019/challenge
D2-city: 10k video, 1k for tracking, HD
BDD100k: 100k video, annotation on keyframe, 40s, 720p 30fps You Can Now Download the World’s Largest Self-Driving Dataset
nuScenes: 1.4M frames, 3D box annotation
Other autonomous driving datasets: Oxford Robotcar, TorontoCity, KITTI, Apollo Scape (1M), Waymo Open Dataset (16.7h, 600k frame, 22m 2D-bbox) https://scale.com/open-datasets

Papers at ECCV18

Temporal information for Classifying Multi-Fiber Networks for Video Recognition (ECCV18)
All Fully Motion-Aware Network for Video Object Detection
Video Object Detection with an Aligned Spatial-Temporal Memory
Hard example mining Unsupervised Hard Example Mining from Videos for Improved Object Detection
Sampling? Object Detection in Video with Spatiotemporal Sampling Networks
3D Tracking & Trajectory 3D Vehicle Trajectory Reconstruction in Monocular Video Data Using Environment Structure Constraints

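RCNN -> Fast RCNN: replace resizing with RoI pooling, compute the feature map only once (RoI projection), multi-task training (bbox regression and classification trained together)
Fast RCNN -> Faster RCNN: replace selective search with an RPN

Compared with two-stage detectors, one-stage detectors skip the RoI pooling step: once boxes are obtained, classification and regression run directly on the whole-image feature map rather than inside each box. This can cause feature misalignment.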

Papers

Object Detection in Video with Spatiotemporal Sampling Networks

Uses an approach similar to FGFA but adds deformable convolution, which simplifies computing the other frames' features and weights

Motivation

Remove the optical-flow data needed during training, improving (training) speed

Approach

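Deformable Convolution: offsets computed from the data make the convolution's receptive field variable. Sampling is no longer restricted to the fixed grid around the center, $\mathcal{R}$ = {(-1,-1),(-1,0),(-1,1),...,(1,0),(1,1)}, i.e. $y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\,x(p_0+p_n)$ , but can be $y(p_0)=\sum_{p_n\in\mathcal{R}} w(p_n)\,x(p_0+p_n+\Delta p_n)$ . The offset $\Delta p_n$ is fractional, so the sampled value is computed with bilinear interpolation.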
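A minimal pure-Python sketch of deformable sampling at a single output location, assuming a 3x3 kernel, a single-channel feature map stored as a list of rows, and zero padding outside the map. In a real network the offsets come from a learned layer; here they are passed in by hand, and the names `bilinear_sample` and `deformable_conv_at` are illustrative.

```python
import math

def bilinear_sample(feat, y, x):
    """Sample a 2-D map at fractional (y, x); positions outside count as 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1 - abs(y - yi)) * (1 - abs(x - xi))
                value += weight * feat[yi][xi]
    return value

# Regular 3x3 sampling grid around the center, in row-major order.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def deformable_conv_at(feat, kernel, offsets, y, x):
    """Output at p0 = (y, x): sum over n of w(p_n) * feat(p0 + p_n + dp_n)."""
    out = 0.0
    for (dy, dx), w, (oy, ox) in zip(GRID, kernel, offsets):
        out += w * bilinear_sample(feat, y + dy + oy, x + dx + ox)
    return out

feat = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernel = [1.0 / 9.0] * 9                      # averaging kernel
plain = deformable_conv_at(feat, kernel, [(0.0, 0.0)] * 9, 1, 1)
shifted = deformable_conv_at(feat, kernel, [(0.5, 0.5)] * 9, 1, 1)
```

With all offsets at zero this reduces to an ordinary 3x3 convolution; a constant offset of (0.5, 0.5) moves every sampling point half a pixel down and to the right.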

Spatiotemporal Sampling Network

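Fuse the feature maps of the K preceding and K following frames; the current frame is the reference frame and the others are supporting frames.

Four deformable convolutions are applied when computing the features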

$latex_equ$

And so on...

But the last one uses the original

$latex_equ$

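During fusion, the K preceding and K following frames are fused.

Computing the weight of frame t+k:

A three-layer subnetwork S computes an intermediate representation of g, and the weight is the exp of the cosine similarity. A fusion weight is computed for every pixel p of every support frame, before and after.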

$latex_equ$

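After normalization, fuse: take the weighted sum over the time range t-K to t+K to obtain the fused feature of every pixel in the reference frame (time t), which is then fed to the detection network.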
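The weighting and fusion for a single pixel can be sketched as follows. This is an illustration under assumptions: the embeddings stand in for the output of the subnetwork S (not modelled here), all vectors are non-zero, and the helper names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def aggregate_pixel(ref_embed, support_embeds, support_feats):
    """Fuse one pixel's features over the frames t-K .. t+K.

    ref_embed:      embedding of the reference frame at this pixel
    support_embeds: one embedding per supporting frame at this pixel
    support_feats:  one feature vector per supporting frame at this pixel
    """
    # Unnormalized weight: exp of the cosine similarity to the reference.
    raw = [math.exp(cosine_similarity(ref_embed, e)) for e in support_embeds]
    total = sum(raw)
    weights = [r / total for r in raw]        # normalize over frames
    dim = len(support_feats[0])
    fused = [sum(w * f[c] for w, f in zip(weights, support_feats))
             for c in range(dim)]
    return fused, weights

# Two supporting frames: one identical to the reference, one orthogonal to it.
fused, weights = aggregate_pixel(
    ref_embed=[1.0, 0.0],
    support_embeds=[[1.0, 0.0], [0.0, 1.0]],
    support_feats=[[1.0, 0.0], [0.0, 1.0]],
)
```

The frame that looks like the reference at this pixel receives the larger weight, which is what lets blurred or occluded frames contribute less.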

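Details

The backbone is a ResNet-101 with 4 added $latex_equ$ deformable convolutions.
After the fused feature $latex_equ$ is obtained, it is split into two parts: one half goes to the RPN to produce proposals (9 anchors per location and 300 proposals in total), and the other half goes to R-FCN.
Training uses a small K: K=1, one frame before and one after, randomly sampled.
Pretrain on DET first, where the support frame is the frame itself.
Testing uses a larger K, K=13. Feature maps are computed first and then cached to work around GPU RAM limits.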

Looking Fast and Slow: Memory-Guided Mobile Video Object Detection

Using memory(LSTM) in object detection

SOTA of ImageNet VID

Concern more on light-weight and low computation time.

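A lightweight MobileNet recognizes the main content of the scene; fast feature extraction needs a maintained memory as supplementary information

An accurate feature extractor initializes and maintains the memory, after which frames are processed quickly, with an LSTM maintaining the memory. Reinforcement learning decides whether to run the fast or the slow feature extractor (tradeoff)

Multi-branch feature extraction

Use two feature extractors in parallel (accuracy🆚speed)

Inference pipeline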

$latex_equ$

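$latex_equ$ is the selected feature extractor network, and m is the memory module.

$latex_equ$ , where d is the SSD detection network

Define $latex_equ$ as the $latex_equ$ hyperparameter; it can also be obtained from the interleaving policy

Other methods: reduce the depth multiplier to 0.35, lower the resolution to 160x160, use SSDLite, restrict the anchor aspect ratios to $latex_equ$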

memory module

Modified LSTM module👆:

skip connection between the bottleneck and output
grouped convolution process LSTM state groups separately

Ps. standard LSTM👇

To preserve long-term dependencies $latex_equ$ skip state update: when $latex_equ$ is run, always reuse the output state from the last time $latex_equ$ was run
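A sketch of interleaved inference with the skip state update, under assumptions: the state is refreshed only when the large (slow) extractor runs, and a fixed policy runs the large extractor once followed by `tau` runs of the small one. `large`, `small`, `memory` and `tau` are hypothetical stand-ins, not the paper's API.

```python
def run_interleaved(frames, large, small, memory, tau):
    """Run the large extractor on every (tau + 1)-th frame, the small one otherwise.

    memory(features, state) must return (output, new_state).
    """
    state = None            # state saved from the last large-extractor run
    outputs = []
    for i, frame in enumerate(frames):
        use_large = (i % (tau + 1) == 0)
        features = large(frame) if use_large else small(frame)
        output, new_state = memory(features, state)
        if use_large:
            state = new_state   # skip the update when the small extractor ran
        outputs.append(output)
    return outputs

# Toy stand-ins that only record which extractor ran and which state was read.
def toy_large(frame):
    return ("L", frame)

def toy_small(frame):
    return ("S", frame)

def toy_memory(features, state):
    return (features[0], state), features[1]

trace = run_interleaved([0, 1, 2, 3], toy_large, toy_small, toy_memory, tau=2)
```

In the toy trace, both small-extractor frames read the state written at frame 0, so errors from the weaker features never accumulate in the memory.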

Training

Pretrain LSTM on Imagenet Cls for initialization

Unroll LSTM to six steps

Randomly select the feature extractor

Crop and shift to augment training data

Adaptive Interleaving Policy(RL)

A policy network $latex_equ$ measures detection confidence: it examines the LSTM state and decides which feature extractor to run next

Train policy network using Double Q-learning(DDQN)

Action space: $latex_equ$ at next step

State space: $latex_equ$ , LSTM states and their changes, action history term $latex_equ$ (binary vector, len=20).

Reward: a speed reward (a positive reward when $latex_equ$ is run) and an accuracy reward (the loss difference from the min-loss extractor).

Policy network to decide which extractor to run👇

Generate batches of $latex_equ$ by running the interleaved network in inference mode

Training process👇

Inference Optimization

Asynchronous mode

$latex_equ$ and $latex_equ$ run in separate threads: $latex_equ$ keeps running detection, and $latex_equ$ updates the memory when it finishes its computation. The memory module uses the most recent available memory, with NO WAIT for the slow extractor.

Potential weakness: latency/mismatch between calling the large extractor and getting an accurate memory output. When a hard example is encountered, the more powerful memory from the large extractor arrives late, so the memory remains less powerful until the large extractor produces a new one.

Quantization

Experiments

ImageNetVID-val

ImageNet VID val👆

👆RL demonstration: red means call large model, blue for small model.

Object detection in videos with Tubelet Proposal Networks

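How to efficiently generate proposals along the temporal dimension (aka. ::tubelet::)?
Generate all proposals of a sequence from the keyframe detection results (::detect by track::), then classify them with an LSTM

There are two ways to generate tubelets: 1. Motion-based (only for short-term) 2. Appearance-based (tracking, expensive/?)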

Approach

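↖️ First run detection on static images to get detection results, then pool at the same location across different times to obtain spatial anchors. This relies on the assumption that the receptive field is large enough to capture the features of the moving object (the center does not move out of the object box). After alignment, these features are used to predict the object's movement

A Tubelet Proposal Network (a regression network) predicts the movement relative to the first frame (to prevent drift, i.e. accumulated error, during tracking). The predicted temporal length is omega

The GT bboxes are taken as the supervision signal for the tubelet proposals. The motion representation is also normalized (the network learns the normalized residuals)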

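Loss function👇

👆M is the GT, M_hat is the normalized offset

Novelty: ::block initialization::
First train a TPN with a predicted temporal length of 2 to obtain parameters W_2 and b_2. The movement m_2 of the second frame is predicted from the feature maps of frames 1 and 2, the movement of the third frame from frames 1 and 3, and m_4 from frames 1 and 4. None of these depend on the intermediate frames, so the predictions are assumed to be similar (1&2 -> m2, 1&3 -> m3), and W_2 and b_2 can be used to initialize one block of the parameters W_3 and b_3👇

Finally, tubelet proposals are generated iteratively for all static anchors across all frames👇

LSTM for class prediction↘️

↗️ After RoI pooling, the features of the tubelet proposal are fed into a one-layer LSTM encoder; the memory and hidden state are then passed to a decoder, which outputs the class predictions in reverse order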

IoU tracker

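D denotes the detections: F frames, with at most N detections per frame
T_a denotes the active trajectories that are still being tracked; T_f denotes the finished trajectories (objects that moved out of the frame)
Idea:
For each frame, for each active trajectory, find the detection in the current frame with the highest IoU. If that IoU exceeds the threshold, add the detection to the trajectory; if even the highest IoU does not exceed the threshold, check the trajectory's length and highest confidence to decide whether it is removed from T_a and added to the finished set T_f. The object is considered to have disappeared / finished tracking
Move on to the next trajectory. Each remaining detection box starts a new trajectory.
At the end, the trajectories left in T_a are checked for length and highest confidence to decide whether they are added to T_f
T_f is the tracking result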
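The procedure above can be sketched in plain Python. This is a minimal version under assumptions: boxes are `(x1, y1, x2, y2)` tuples, each detection is a `(box, score)` pair, and the threshold names `sigma_iou`, `sigma_h`, `t_min` and their default values are illustrative choices.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def keep(track, sigma_h, t_min):
    """A finished track is kept only if it is long and confident enough."""
    return track["max_score"] >= sigma_h and len(track["boxes"]) >= t_min

def iou_tracker(detections, sigma_iou=0.5, sigma_h=0.5, t_min=2):
    """detections: one list per frame, each holding (box, score) pairs."""
    active, finished = [], []                      # T_a and T_f
    for frame_dets in detections:
        remaining = list(frame_dets)
        still_active = []
        for track in active:
            last_box = track["boxes"][-1]
            best = max(remaining, key=lambda d: iou(last_box, d[0]), default=None)
            if best is not None and iou(last_box, best[0]) >= sigma_iou:
                track["boxes"].append(best[0])     # extend the trajectory
                track["max_score"] = max(track["max_score"], best[1])
                remaining.remove(best)
                still_active.append(track)
            elif keep(track, sigma_h, t_min):
                finished.append(track)             # trajectory ended
        # Every unmatched detection starts a new trajectory.
        still_active += [{"boxes": [box], "max_score": score}
                         for box, score in remaining]
        active = still_active
    finished += [t for t in active if keep(t, sigma_h, t_min)]
    return finished

frames = [
    [((0, 0, 10, 10), 0.9)],
    [((1, 0, 11, 10), 0.8)],
    [((2, 0, 12, 10), 0.7), ((50, 50, 60, 60), 0.9)],
]
tracks = iou_tracker(frames)
```

In this toy input the slowly moving box forms one three-frame trajectory, while the box that appears only in the last frame is dropped for being shorter than `t_min`.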

Multiple Hypothesis Tracking

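Build the track trees

Each observation in each frame starts a track tree; observations that fall inside the gating area are added as its child nodes
Add one extra branch to mark a node where the track is lost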

Mahalanobis Distance

Measure the distance between a vector (point) and a distribution

Why use Mahalanobis distance?

normalized:
normalize the distribution into $latex_equ$
consider all the sample points in the distribution, not only its center, especially when the two random variables are correlated.

How is Mahalanobis distance different from Euclidean distance?

It transforms the columns into uncorrelated variables
It scales the columns to make their variance equal to 1
Finally, it calculates the Euclidean distance.

formula
$D^2 = (x - m)^\top C^{-1} (x - m)$
$x$ is the observation
$m$ is the mean value of the independent variables
$C^{-1}$ is the inverse of covariance matrix
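A small worked example for the two-dimensional case, assuming the covariance matrix is invertible and is given as a nested 2x2 list. The function name `mahalanobis_2d` and the numbers are illustrative.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance of a 2-D point from a distribution (mean, cov)."""
    (a, b), (c, d) = cov
    det = a * d - b * c                       # assumed non-zero
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # (x - mean)^T * inv(cov) * (x - mean)
    tmp = [inv[0][0] * dx[0] + inv[0][1] * dx[1],
           inv[1][0] * dx[0] + inv[1][1] * dx[1]]
    return math.sqrt(dx[0] * tmp[0] + dx[1] * tmp[1])

cov = [[4.0, 0.0], [0.0, 1.0]]                # std 2 along x, std 1 along y
d_along_x = mahalanobis_2d((2.0, 0.0), (0.0, 0.0), cov)
d_along_y = mahalanobis_2d((0.0, 2.0), (0.0, 0.0), cov)
```

Both points are at Euclidean distance 2 from the mean, but the point along the high-variance axis is only one standard deviation away, so its Mahalanobis distance is half that of the other point.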

Kalman Filter: an estimation method

Why use kalman filter?

Estimate state of a system from different sources that may be subject to noise. Observe external, predict internal
Fuse the observations to estimate

formulas (ps. $\dot{x}$ means the derivative of x)
$latex_equ$
$latex_equ$ , $latex_equ$
$latex_equ$ , $latex_equ$
subtract
$latex_equ$
Multiply the predicted position's p.d.f. by the measured position's p.d.f. to form a new Gaussian distribution.
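For a scalar position, the product of the two Gaussians has a closed form, which is the Kalman measurement update. A minimal sketch, assuming one-dimensional state and measurement with known variances; the function name and numbers are illustrative.

```python
def kalman_update_1d(pred_mean, pred_var, meas, meas_var):
    """Fuse a predicted position with a measured one (both Gaussian)."""
    gain = pred_var / (pred_var + meas_var)   # Kalman gain, between 0 and 1
    mean = pred_mean + gain * (meas - pred_mean)
    var = (1.0 - gain) * pred_var             # always smaller than pred_var
    return mean, var

# Prediction says 12 (variance 3), measurement says 10 (variance 1).
mean, var = kalman_update_1d(pred_mean=12.0, pred_var=3.0, meas=10.0, meas_var=1.0)
```

The fused estimate lands closer to the measurement because the measurement has the smaller variance, and the fused variance is lower than either input.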

Gating

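$x_k^i$ means instance i's location at time k, which follows a Gaussian distribution with mean $\hat{x}_k^i$ and covariance $\Sigma_k^i$ . $\hat{x}_k^i$ and $\Sigma_k^i$ can be estimated via a Kalman filter.
Use the Mahalanobis distance between the observed location and the predicted location to decide whether to add the observation to the trajectory.
$d^2 = (\hat{x}_k^i - y_k^j)^\top (\Sigma_k^i)^{-1} (\hat{x}_k^i - y_k^j) \le d_{th}$ , where $y_k^j$ is the j-th observation at time k
The threshold $d_{th}$ determines the size of the gating area.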

Plug & Play Convolutional Regression Tracker for Video Object Detection

Add a light-weight tracker to the detector, reusing the features extracted by the detector