TY - GEN
T1 - M4X
T2 - 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies, CHASE 2024
AU - Liu, Mengjing
AU - Xie, Zongxing
AU - Ye, Fan
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Human activity recognition provides insights into physical and mental well-being by monitoring patterns of movement and behavior, facilitating personalized interventions and proactive health management. Radio Frequency (RF)-based human activity recognition (HAR) is gaining attention due to its lower privacy exposure and non-contact characteristics. However, it suffers from data scarcity and is sensitive to environment changes. Collecting and labeling such data is labor-intensive and time-consuming. The limited training data makes generalizability challenging when the sensor is deployed in a very different relative view in the real world. Synthetic data generation from abundant videos has the potential to address data scarcity, yet the domain gaps between synthetic and real data constrain its benefit. In this paper, we first share our investigations and insights on the intrinsic limitations of existing video-based data synthesis methods. Then we present M4X, a method using metric learning to extract effective view-independent features from the more abundant synthetic data despite their domain gaps, thus enhancing cross-view generalizability. We explore two main design issues: different mining strategies for constructing contrastive pairs/triplets, and different forms of loss functions. We find that the best choices are offline triplet mining with real data as anchors, balanced triplets, and a triplet loss function without hard negative mining for higher discriminative power. Comprehensive experiments show that M4X consistently outperforms baseline methods in cross-view generalizability. In the most challenging case of the least amount of real training data, M4X outperforms three baselines by 7.9-16.5% on all views, and 18.9-25.6% on a view with only synthetic but no real data during training. This proves its effectiveness in extracting view-independent features from synthetic data despite their domain gaps. We also observe that given limited sensor deployments, a participant-facing viewpoint and another at a large angle (e.g., 60°) tend to produce much better performance.
KW - Cross-view Generalizability
KW - Human Activity Recognition
KW - Metric Learning
KW - Radio Frequency
UR - https://www.scopus.com/pages/publications/85201204860
U2 - 10.1109/CHASE60773.2024.00015
DO - 10.1109/CHASE60773.2024.00015
M3 - Conference contribution
AN - SCOPUS:85201204860
T3 - Proceedings - 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies, CHASE 2024
SP - 49
EP - 60
BT - Proceedings - 2024 IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies, CHASE 2024
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 June 2024 through 21 June 2024
ER -