TY - GEN
T1 - Strefer
T2 - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
AU - Zhou, Honglu
AU - Peng, Xiangyu
AU - Kendre, Shrikant
AU - Ryoo, Michael S.
AU - Savarese, Silvio
AU - Xiong, Caiming
AU - Niebles, Juan Carlos
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, object-centric reasoning - especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions within dynamic video content. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata - capturing rich spatial and temporal structures such as subjects, objects, their locations as masklets, and action timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.
KW - Automated Instruction Data Synthesis
KW - Multimodal Video LLMs
KW - Temporal and Spatial Regional Understanding Video LLM
UR - https://www.scopus.com/pages/publications/105035211318
U2 - 10.1109/ICCVW69036.2025.00451
DO - 10.1109/ICCVW69036.2025.00451
M3 - Conference contribution
AN - SCOPUS:105035211318
T3 - Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
SP - 4348
EP - 4359
BT - Proceedings - 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 October 2025 through 20 October 2025
ER -