Markerless Motion Tracking Using Computer Vision and Wearable Sensing for Physical Exercise Quantification

Justin Albert

Chair for Digital Health - Connected Healthcare
Hasso Plattner Institute

Office: Campus III Building G2, Room G-2.1.20
Tel.: +49 331 5509-4853
Email: Justin.albert(at)hpi.de
Links: Homepage

Starting Date: 01.10.2019

In my research, I focus on human motion analysis, primarily using 3D cameras but also other sensor modalities such as inertial measurement units (IMUs) and electrocardiography (ECG). The projects range from using a low-cost 3D camera for gait analysis to predicting subjective exertion in strength training. In the following sections, I give an overview of my projects from the past year and of my current research.

Current Research

Gait Analysis Using 3D Human Pose Estimation on 2D Images

Introduction

Gait analysis is essential for assessing a patient's physical or mental state. It is valuable for the early detection of neurological diseases and for evaluating fall risk in older people. Usually, gait is assessed in a specialized laboratory with an expensive marker-based motion capture system. Medical professionals must attach the markers to the subject, and an optical system then tracks them. Recent advances in human pose estimation using deep learning enable motion tracking directly on images. These algorithms work purely on 2D images and can predict the 3D coordinates of specific human key points (such as the head, shoulders, and feet) without any markers. This technology holds considerable potential because it removes the need to place markers on the subject. In this project, we aim to use human pose estimation for gait analysis that can be deployed on consumer-grade devices such as smartphones.

An estimated 3D skeleton based on a monocular 2D video.

Study Setup

We recorded a dataset of 16 subjects (8 female, 8 male) walking on a treadmill at three different velocities. The subjects were captured simultaneously with a 12 MP color camera and a marker-based motion capture system (Vicon), using the full-body Plug-in Gait model with 39 markers. The Vicon system sampled data at 100 Hz, while the RGB camera recorded at 30 fps. We tried different state-of-the-art models, including GAST-Net [1], MediaPipe [2], and VideoPose3D [3], to estimate the 3D coordinates of human joints from 2D images. After the extraction, we applied signal processing methods such as filtering and temporal and spatial alignment of the skeleton data. We evaluated different aspects, including the spatial agreement of joint locations and gait parameters (step length, step time, step width, and stride time). The figure below shows an early result for the step length parameter, calculated once with a model based on VideoPose3D and once with the Vicon system. The X-axis shows the reference values from the Vicon system, and the Y-axis shows the values from the pose estimator. The figure indicates that the tracking performance leaves room for improvement.
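Because the Vicon system samples at 100 Hz and the camera records at 30 fps, both streams need a common time base before joint trajectories can be compared. The following is a minimal sketch of one common way to do this, not necessarily the procedure used in this project: it assumes linear interpolation for resampling and a cross-correlation peak for estimating a constant time offset. The function names resample_to_rate and estimate_lag are hypothetical.

```python
import numpy as np


def resample_to_rate(signal, src_hz, dst_hz):
    """Linearly interpolate a 1D signal from src_hz to dst_hz (assumed method)."""
    signal = np.asarray(signal, dtype=float)
    duration = (len(signal) - 1) / src_hz          # time of the last source sample
    t_src = np.arange(len(signal)) / src_hz        # source timestamps in seconds
    t_dst = np.arange(0.0, duration + 1e-9, 1.0 / dst_hz)  # target timestamps
    return np.interp(t_dst, t_src, signal)


def estimate_lag(reference, other):
    """Return the number of samples by which `other` trails `reference`.

    Both signals must already share the same sampling rate. The lag is taken
    at the peak of the cross-correlation of the mean-removed signals.
    """
    a = np.asarray(reference, dtype=float)
    b = np.asarray(other, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    xcorr = np.correlate(b, a, mode="full")
    # In "full" mode, output index len(a) - 1 corresponds to zero lag.
    return int(np.argmax(xcorr)) - (len(a) - 1)
```

In practice, one would resample a Vicon trajectory (for example, the vertical position of a heel marker) to 30 Hz, estimate the lag against the corresponding pose-estimator trajectory, and shift one of the two streams by that many frames. This only corrects a constant offset; clock drift between the systems would need a different treatment.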

A scatter plot for the evaluation of the step length parameter. Ground-truth values from the Vicon system are on the X-axis, and the values estimated by the pose estimator are on the Y-axis. For a perfect prediction, all values would lie on the diagonal line.
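The distance of the points from the diagonal can also be summarized numerically. The sketch below computes a few standard agreement measures between reference and estimated gait parameters. The choice of measures (bias, mean absolute error, RMSE, Pearson correlation) is an assumption for illustration, as is the function name agreement_metrics; the project may report different statistics.

```python
import numpy as np


def agreement_metrics(reference, estimate):
    """Summarize agreement between paired reference and estimated values."""
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    error = estimate - reference
    return {
        "bias": float(np.mean(error)),                # systematic over/underestimation
        "mae": float(np.mean(np.abs(error))),         # mean absolute error
        "rmse": float(np.sqrt(np.mean(error ** 2))),  # root-mean-square error
        "pearson_r": float(np.corrcoef(reference, estimate)[0, 1]),
    }
```

A perfect prediction gives a bias, MAE, and RMSE of zero and a correlation of one. A high correlation combined with a non-zero bias would indicate a constant offset from the diagonal rather than random scatter.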

Data Analysis

The early results of this project have shown that the pre-trained models need further improvement. Most models were pre-trained solely on publicly available human activity recognition datasets. These general-purpose datasets contain many activities; however, gait is usually underrepresented. Therefore, the next step is to fine-tune the pose estimation models on our gait dataset. For this, we prepare the dataset for training by temporally synchronizing the Vicon and video data. Subsequently, the 3D Vicon markers are projected onto the image plane of the RGB camera. We then train the models to predict the Vicon marker locations using the generated ground-truth data. The hypothesis is that the 3D human pose estimation performance will increase when the models are trained on the gait-specific dataset. The evaluation will quantify how much the fine-tuned model has improved compared to the pre-trained model.
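Projecting the 3D markers onto the image plane can be illustrated with a pinhole camera model. The sketch below assumes that a camera calibration is available as an intrinsic matrix K and an extrinsic rotation R and translation t that map Vicon world coordinates into the camera frame, and it ignores lens distortion. These symbols, the function name project_points, and the example values in the usage note are hypothetical; the source does not describe the calibration procedure.

```python
import numpy as np


def project_points(points_world, K, R, t):
    """Project N x 3 world points to N x 2 pixel coordinates (pinhole model).

    Assumes all points lie in front of the camera and that lens distortion
    is negligible or has already been corrected.
    """
    points_world = np.asarray(points_world, dtype=float)
    points_cam = points_world @ np.asarray(R).T + np.asarray(t)  # world -> camera
    uvw = points_cam @ np.asarray(K).T                           # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]                              # perspective divide
```

As a sanity check with made-up values: for a focal length of 1000 px, a principal point at (960, 540), an identity rotation, and zero translation, a marker on the optical axis two meters in front of the camera lands exactly on the principal point.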

References

[1] Liu, Junfa, et al. "A graph attention spatio-temporal convolutional network for 3D human pose estimation in video." 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.
[2] Lugaresi, Camillo, et al. "MediaPipe: A framework for perceiving and processing reality." Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR). 2019.
[3] Pavllo, Dario, et al. "3D human pose estimation in video with temporal convolutions and semi-supervised training." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[4] Borg, Gunnar. "Perceived exertion as an indicator of somatic stress." Scandinavian Journal of Rehabilitation Medicine. 1970.
[5] Barsoum, E., J. Kender, and Z. Liu. "HP-GAN: Probabilistic 3D human motion prediction via GAN." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2018.
[6] Capecci, M., M. G. Ceravolo, F. Ferracuti, S. Iarlori, A. Monteriù, L. Romeo, and F. Verdini. "The KIMORE dataset: Kinematic assessment of movement and clinical scores for remote monitoring of physical rehabilitation." IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 7, pp. 1436–1448, July 2019.