JCDNet for Weakly Supervised Temporal Action Localization
PyTorch Implementation of 'JCDNet: Joint of Common and Definite phases Network for Weakly Supervised Temporal Action Localization'
JCDNet: Joint of Common and Definite phases Network for Weakly Supervised Temporal Action Localization
Yifu Liu, Xiaoxia Li, Zhiling Luo, and Wei Zhou
Paper: https://arxiv.org/abs/2303.17294
Abstract: Weakly-supervised temporal action localization aims to localize action instances in untrimmed videos with only video-level supervision. We observe that different actions can share common phases, e.g., the run-up in HighJump and LongJump. These different actions are defined as conjoint actions, whose remaining parts are definite phases, e.g., leaping over the bar in a HighJump. Compared with the common phases, the definite phases are more easily localized by existing methods. Most of them formulate this task as a Multiple Instance Learning paradigm, in which the common phases tend to be confused with the background, which hurts the localization completeness of the conjoint actions. To tackle this challenge, we propose a Joint of Common and Definite phases Network (JCDNet) that improves the feature discriminability of the conjoint actions. Specifically, we design a Class-Aware Discriminative module to enhance the contribution of the common phases in classification under the guidance of the coarse definite-phase features. Besides, we introduce a temporal attention module to learn robust action-ness scores via modeling temporal dependencies, distinguishing the common phases from the background. Extensive experiments on three datasets (THUMOS14, ActivityNet v1.2, and a conjoint-action subset) demonstrate that JCDNet achieves competitive performance against state-of-the-art methods.
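For intuition only, below is a minimal, hypothetical sketch of a temporal attention branch that maps snippet features to per-snippet action-ness scores, in the spirit of the module described in the abstract. It is not the authors' implementation, and the layer sizes (`feat_dim=2048`, a 3-tap temporal convolution) are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Toy attention branch: snippet features (B, T, D) -> action-ness scores (B, T)."""

    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        # 1D temporal convolutions let each snippet see its neighbours,
        # which is one simple way to model temporal dependencies.
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, hidden_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden_dim, 1, kernel_size=1),
        )

    def forward(self, x):          # x: (B, T, D) snippet features
        x = x.permute(0, 2, 1)     # -> (B, D, T) for Conv1d
        logits = self.net(x)       # -> (B, 1, T)
        return torch.sigmoid(logits).squeeze(1)  # action-ness in [0, 1], shape (B, T)

if __name__ == "__main__":
    feats = torch.randn(2, 750, 2048)   # e.g. 750 snippets of two-stream I3D features
    scores = TemporalAttention()(feats)
    print(scores.shape)                 # torch.Size([2, 750])
```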
Prerequisites
Recommended Environment
- Python 3.6
- PyTorch 1.6
- TensorFlow 1.15 (for TensorBoard)
- CUDA 10.2
Dependencies
You can set up the environment by running:
pip install -r requirements.txt
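If you want to confirm the recommended environment is in place (PyTorch 1.6 with CUDA 10.2, as listed above), a quick sanity check like the following works:

```python
import torch

# Quick check that PyTorch sees the expected CUDA toolkit.
print("PyTorch:", torch.__version__)        # expected 1.6.x
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)  # expected 10.2
```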
Data Preparation
Prepare the THUMOS'14 dataset.
- We excluded three test videos (270, 1292, 1496), as in previous work.
- Extract features with two-stream I3D networks.
- Place the features inside the dataset folder.
- Please ensure the data structure is as below.
├── dataset
│   └── THUMOS14
│       ├── gt.json
│       ├── split_train.txt
│       ├── split_test.txt
│       └── features
│           ├── train
│           │   ├── rgb
│           │   │   ├── video_validation_0000051.npy
│           │   │   ├── video_validation_0000052.npy
│           │   │   └── ...
│           │   └── flow
│           │       ├── video_validation_0000051.npy
│           │       ├── video_validation_0000052.npy
│           │       └── ...
│           └── test
│               ├── rgb
│               │   ├── video_test_0000004.npy
│               │   ├── video_test_0000006.npy
│               │   └── ...
│               └── flow
│                   ├── video_test_0000004.npy
│                   ├── video_test_0000006.npy
│                   └── ...
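To double-check that your features match this layout, the sketch below (a hypothetical helper, not part of this repository) loads one video's RGB and flow .npy files and fuses them along the channel axis; channel-wise concatenation is only one common way to combine the two streams, and the paths assume you run it from the repository root.

```python
import os
import numpy as np

def load_video_feature(root, split, video_name):
    """Load and fuse the two-stream I3D features for one video.

    Assumes the layout shown above:
    <root>/features/<split>/{rgb,flow}/<video_name>.npy
    with matching snippet counts in both streams.
    """
    rgb = np.load(os.path.join(root, "features", split, "rgb", video_name + ".npy"))
    flow = np.load(os.path.join(root, "features", split, "flow", video_name + ".npy"))
    assert rgb.shape[0] == flow.shape[0], "RGB/flow snippet counts should match"
    return np.concatenate([rgb, flow], axis=-1)   # (T, D_rgb + D_flow)

feat = load_video_feature("dataset/THUMOS14", "test", "video_test_0000004")
print(feat.shape)
```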
Usage
Running
You can easily train and evaluate the model by running the script below.
If you want to try other training options, please refer to options.py.
$ bash run.sh
Citation
If you find this code useful, please cite our paper.
@misc{liu2023jcdnet,
title={JCDNet: Joint of Common and Definite phases Network for Weakly Supervised Temporal Action Localization},
author={Yifu Liu and Xiaoxia Li and Zhiling Luo and Wei Zhou},
year={2023},
eprint={2303.17294},
archivePrefix={arXiv},
primaryClass={cs.CV}
}