
Weakly Semi-Supervised Temporal Sentence Grounding in Videos With Point Annotations

  • Stony Brook University

Research output: Contribution to journal > Article > peer-review

Abstract

Temporal Sentence Grounding (TSG) in videos aims to localize a temporal interval from an untrimmed video that is semantically relevant to a given query sentence. To achieve a balance between tremendous annotation burden and grounding performance, we propose a new Weakly Semi-supervised Temporal Sentence Grounding with Points (WSS-TSG-P) task, where the dataset comprises limited fully-annotated video-sentence pairs by start and end timestamps (full label) and a large amount of weakly-annotated pairs by a single point timestamp (point label). Based on this setting, we first introduce a point-to-moment regressor which converts point annotations to pseudo moment labels. To train a good regressor for reliable pseudo moment labels, we propose a point-guided feature aggregation module to aggregate cross-modal representations based on the prototype feature at the given point position. In addition, we propose to perform regressor self-training and design pseudo label generation strategies to exploit both full annotations and point annotations. All heterogeneous labels (full, pseudo moment, and point labels) are used to train a TSG backbone. In addition, we propose a novel point-guided group contrastive learning method by constructing reliable positive and negative sets and re-weighting pseudo moment labels to further improve the model performance. Extensive experiments on benchmark datasets verify that our proposed method outperforms other semi-supervised learning methods and bridges the performance gap between weakly-supervised and fully-supervised learning methods in TSG.
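As a rough illustration of the label types the abstract describes, the following Python sketch represents a full label and a point label, turns a point into a pseudo moment, and scores that pseudo moment against a full label with temporal IoU. Everything named here (`Annotation`, `naive_pseudo_moment`, `temporal_iou`, the fixed `width`) is a hypothetical stand-in chosen for illustration; in particular, the fixed-width window is only a placeholder for the learned point-to-moment regressor and is not the method proposed in the article.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Annotation:
    """One video-sentence pair carrying either a full label or a point label."""
    query: str
    duration: float                                # video length in seconds
    moment: Optional[Tuple[float, float]] = None   # full label: (start, end)
    point: Optional[float] = None                  # point label: one timestamp


def naive_pseudo_moment(ann: Annotation, width: float) -> Tuple[float, float]:
    """Hypothetical placeholder for a learned point-to-moment regressor:
    centre a fixed-width window on the point and clip it to the video."""
    if ann.point is None:
        raise ValueError("annotation has no point label")
    start = max(0.0, ann.point - width / 2)
    end = min(ann.duration, ann.point + width / 2)
    return start, end


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Made-up example values, not data from the article.
full = Annotation("a person opens the door", duration=30.0, moment=(4.0, 10.0))
weak = Annotation("a person opens the door", duration=30.0, point=6.0)

pseudo = naive_pseudo_moment(weak, width=8.0)
print(pseudo, temporal_iou(pseudo, full.moment))  # (2.0, 10.0) 0.75
```

In the setting described above, pseudo moments of this kind would be produced by a trained regressor and used alongside the full and point labels; the fixed window here only makes the data flow concrete.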

Original language: English
Pages (from-to): 2268-2278
Number of pages: 11
Journal: IEEE Transactions on Multimedia
Volume: 28
State: Published - 2026

Keywords

  • Temporal sentence grounding (TSG)
  • point annotations
  • weakly semi-supervised learning
