TY - GEN
T1 - ViFiT
T2 - 3rd ACM MobiCom Workshop on Integrated Sensing and Communication Systems, ISACom 2023
AU - Cao, Bryan Bo
AU - Alali, Abrar
AU - Liu, Hansi
AU - Meegan, Nicholas
AU - Gruteser, Marco
AU - Dana, Kristin
AU - Ashok, Ashwin
AU - Jain, Shubham
N1 - Publisher Copyright: © 2023 ACM.
PY - 2023/10/2
Y1 - 2023/10/2
N2 - Tracking subjects in videos is one of the most widely used functions in camera-based IoT applications such as security surveillance, smart city traffic safety enhancement, and vehicle-to-pedestrian communication. In the computer vision domain, tracking is usually achieved by first detecting subjects, then associating detected bounding boxes across video frames. Typically, frames are transmitted to a remote site for processing, incurring high latency and network costs. To address this, we propose ViFiT, a transformer-based model that reconstructs vision bounding box trajectories from phone data (IMU and Fine Time Measurements). It leverages a transformer's ability to better model long-term time series data. ViFiT is evaluated on the Vi-Fi Dataset, a large-scale multimodal dataset covering 5 diverse real-world scenes, including indoor and outdoor environments. Results demonstrate that ViFiT outperforms X-Translator, the state-of-the-art LSTM encoder-decoder approach for cross-modal reconstruction, and achieves a frame reduction rate as high as 97.76% with IMU and Wi-Fi data.
AB - Tracking subjects in videos is one of the most widely used functions in camera-based IoT applications such as security surveillance, smart city traffic safety enhancement, and vehicle-to-pedestrian communication. In the computer vision domain, tracking is usually achieved by first detecting subjects, then associating detected bounding boxes across video frames. Typically, frames are transmitted to a remote site for processing, incurring high latency and network costs. To address this, we propose ViFiT, a transformer-based model that reconstructs vision bounding box trajectories from phone data (IMU and Fine Time Measurements). It leverages a transformer's ability to better model long-term time series data. ViFiT is evaluated on the Vi-Fi Dataset, a large-scale multimodal dataset covering 5 diverse real-world scenes, including indoor and outdoor environments. Results demonstrate that ViFiT outperforms X-Translator, the state-of-the-art LSTM encoder-decoder approach for cross-modal reconstruction, and achieves a frame reduction rate as high as 97.76% with IMU and Wi-Fi data.
KW - Efficient Video System
KW - IMU
KW - Multimodal Learning
KW - Multimodal Reconstruction
KW - Object Detection
KW - Tracking
KW - Transformer
UR - https://www.scopus.com/pages/publications/85176132303
U2 - 10.1145/3615984.3616503
DO - 10.1145/3615984.3616503
M3 - Conference contribution
AN - SCOPUS:85176132303
T3 - ISACom 2023 - Proceedings of the 2023 3rd ACM MobiCom Workshop on Integrated Sensing and Communication Systems
SP - 13
EP - 18
BT - ISACom 2023 - Proceedings of the 2023 3rd ACM MobiCom Workshop on Integrated Sensing and Communication Systems
PB - Association for Computing Machinery, Inc
Y2 - 6 October 2023 through 6 October 2023
ER -