Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

  • Honglu Zhou
  • Xiangyu Peng
  • Shrikant Kendre
  • Michael S. Ryoo
  • Silvio Savarese
  • Caiming Xiong
  • Juan Carlos Niebles
  • Salesforce AI Research

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, object-centric reasoning - especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions within dynamic video content. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata - capturing rich spatial and temporal structures such as subjects, objects, their locations as masklets, and action timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.

Original language: English
Title of host publication: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 4348-4359
Number of pages: 12
ISBN (Electronic): 9798331589882
DOIs
State: Published - 2025
Event: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025 - Honolulu, United States
Duration: Oct 19 2025 → Oct 20 2025

Publication series

Name: Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025

Conference

Conference: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Country/Territory: United States
City: Honolulu
Period: 10/19/25 → 10/20/25

Keywords

  • Automated Instruction Data Synthesis
  • Multimodal Video LLMs
  • Temporal and Spatial Regional Understanding Video LLM
