Hasso-Plattner-Institut25 Jahre HPI

Markerless Motion Tracking Using Computer Vision and Wearable Sensing for Physical Exercise Quantification

Chair for Digital Health - Connected Healthcare
Hasso Plattner Institute

Office: Campus III Building G2, Room G-2.1.20
Tel.: +49 331 5509-4853
Email: Justin.albert(at)hpi.de

Supervisor: Prof. Dr. Bert Arnrich

Starting Date: 01.10.2019


My research focuses on human motion analysis, primarily with 3D cameras but also with other sensor modalities such as inertial measurement units (IMUs) and electrocardiography (ECG). The projects range from gait analysis with a low-cost 3D camera to predicting subjective exertion in strength training. The following sections give an overview of my current research and of the projects of the past year.

Current Research

Gait Analysis Using 3D Human Pose Estimation on 2D Images


Gait analysis is essential for assessing a patient's physical or mental state. It is valuable for the early detection of neurological diseases and for evaluating fall risk in older people. Usually, gait is assessed in a specialized laboratory with an expensive marker-based motion capture system: medical professionals attach markers to the subject, and an optical system tracks them. Recent advances in deep-learning-based human pose estimation enable motion tracking directly on images. These algorithms work purely on 2D images and predict the 3D coordinates of specific human key points (such as the head, shoulders, and feet) without any markers. This technology holds considerable potential because it removes the need to place markers on the subject. In this project, we aim to use human pose estimation for gait analysis that can be deployed on consumer-grade devices such as smartphones.


An estimated 3D skeleton based on a monocular 2D video.

Study Setup

We recorded a dataset of 16 subjects (8 female, 8 male) walking on a treadmill at three different velocities. The subjects were recorded with a 12 MP color camera and a marker-based motion capture system (Vicon), using the 39-marker full-body Plug-in Gait model. The Vicon system sampled data at 100 Hz, while the RGB camera recorded at 30 fps. We tested several state-of-the-art models, including GAST-Net [1], MediaPipe [2], and VideoPose3D [3], to estimate the 3D coordinates of human key points from the 2D images. After extraction, we applied signal processing methods to the skeleton data, such as filtering and temporal and spatial alignment. We evaluated different aspects, including the spatial agreement of joint locations and gait parameters (step length, step time, step width, and stride time). The figure below shows an early result for the step length parameter, calculated using a model based on VideoPose3D and the Vicon system. The X-axis shows the reference values from the Vicon system, and the Y-axis shows the gait parameters from the pose estimator. The figure indicates that the tracking performance leaves room for improvement.

Scatter plot for the evaluation of the step length parameter. Ground-truth values from the Vicon system are shown on the X-axis, and the estimated gait parameters on the Y-axis. For a perfect prediction, all values would lie on the diagonal line.
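To make the evaluated gait parameters concrete, the following is a minimal sketch of how step time, step length, step width, and stride time could be derived from heel-strike events, and how an estimate could be compared with a reference. It is an illustration under stated assumptions, not the project's actual pipeline: heel strikes are assumed to be already detected, the `HeelStrike` structure and the axis convention (x = walking direction, y = mediolateral) are hypothetical, and the mean absolute error is just one possible agreement measure.

```python
from dataclasses import dataclass


@dataclass
class HeelStrike:
    """One heel-strike event (hypothetical structure for this sketch)."""
    time_s: float      # time of the event in seconds
    side: str          # "L" or "R": the foot that strikes the ground
    lead_heel: tuple   # (x, y) of the striking heel in metres
    trail_heel: tuple  # (x, y) of the opposite heel at the same instant


def gait_parameters(events):
    """Compute per-step parameters and stride times from time-ordered events."""
    steps = []
    for prev, cur in zip(events, events[1:]):
        if prev.side == cur.side:
            continue  # a step needs consecutive strikes of opposite feet
        steps.append({
            "step_time": cur.time_s - prev.time_s,
            # On a treadmill the distance between both heels at heel strike
            # is used, because the body does not move forward in the room.
            "step_length": abs(cur.lead_heel[0] - cur.trail_heel[0]),
            "step_width": abs(cur.lead_heel[1] - cur.trail_heel[1]),
        })

    stride_times = []
    last_strike = {}  # most recent heel-strike time per foot
    for ev in events:
        if ev.side in last_strike:
            stride_times.append(ev.time_s - last_strike[ev.side])
        last_strike[ev.side] = ev.time_s
    return steps, stride_times


def mean_absolute_error(estimated, reference):
    """Average absolute difference between estimated and reference values."""
    return sum(abs(e - r) for e, r in zip(estimated, reference)) / len(reference)


# Made-up example values, not data from the study.
events = [
    HeelStrike(0.0, "L", (0.30, 0.10), (-0.30, -0.10)),
    HeelStrike(0.5, "R", (0.32, -0.10), (-0.30, 0.10)),
    HeelStrike(1.0, "L", (0.30, 0.10), (-0.32, -0.10)),
]
steps, stride_times = gait_parameters(events)
print(steps)
print(stride_times)
```

The same function could be applied to both the pose-estimator skeleton and the reference skeleton, with the resulting parameter lists passed to `mean_absolute_error`.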

Data Analysis

The early results of this project show that the pre-trained models need further improvement. Most models were pre-trained solely on publicly available human activity recognition datasets. These general-purpose datasets contain many activities, but gait is usually underrepresented. The next step is therefore to fine-tune the pose estimation models on our gait dataset. To prepare the dataset for training, we synchronize the Vicon and video data temporally and then project the 3D Vicon markers onto the image plane of the RGB camera. We then train the models to predict the Vicon marker locations using the generated ground-truth data. Our hypothesis is that 3D human pose estimation performance will increase when the models are trained on the gait-specific dataset. The evaluation will quantify how much the fine-tuned models improve over the pre-trained ones.
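The two preparation steps described above can be sketched as follows. This is a simplified illustration, not the project's implementation: it assumes that both recordings start at the same instant (so only the sampling rates differ), that the camera follows an ideal pinhole model without lens distortion, and that the rotation, translation, and intrinsic parameters are known from a calibration. All function names and numeric values are hypothetical.

```python
def resample_to_frames(signal, n_frames, fs_signal=100.0, fps=30.0):
    """Linearly interpolate a uniformly sampled signal at video frame times.

    Assumes both recordings start at the same instant.
    """
    resampled = []
    for k in range(n_frames):
        pos = k / fps * fs_signal  # fractional sample index at frame k
        lo = int(pos)
        if lo >= len(signal):
            break  # the video is longer than the signal
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        resampled.append((1 - frac) * signal[lo] + frac * signal[hi])
    return resampled


def project_point(point_world, rotation, translation, fx, fy, cx, cy):
    """Project a 3D point (world frame) to pixel coordinates (pinhole model)."""
    # World frame -> camera frame: p_cam = R * p_world + t
    xc, yc, zc = (
        sum(rotation[i][j] * point_world[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0:
        raise ValueError("point lies behind the camera")
    # Perspective division followed by the intrinsic parameters
    return fx * xc / zc + cx, fy * yc / zc + cy


# Made-up example: a 100 Hz ramp signal sampled at three video frames.
ramp = [float(i) for i in range(11)]
frames = resample_to_frames(ramp, 3)

# Made-up example: camera at the world origin looking along +z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uv = project_point((0.5, -0.25, 2.0), identity, (0.0, 0.0, 0.0),
                   fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
print(uv)  # (1210.0, 415.0)
```

In practice, a lens distortion model and an estimated time offset between the two systems would usually have to be taken into account as well.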

Prediction of Subjective Exertion in Resistance Training


Quantifying load during physical activity is of high interest to the research community. Athletes want to optimize their exercises so that the applied training load matches the load prescribed by the training plan as closely as possible. Too much load decreases force production ability and increases the risk of injury. For the general population, controlling exercise load, e.g., during rehabilitation or recreational sports, is equally important to avoid injuries. Training load can be quantified using internal and external measures. External measures include, e.g., the distance traveled, the travel speed, or the lifted weight. Internal load is often measured as a rating of perceived exertion (RPE), a single value on a scale with which a person reports how exhausting an exercise was. A standard RPE scale is the Borg scale, which ranges from 6 (not exhausting) to 20 (extremely exhausting) [4]. Such a rating is quick to obtain: subjects mark their exertion on the scale shortly after the load ends. We aim to build a system that automatically predicts RPE values from sensor measurements. We hypothesize that such a system could warn users when they experience a significant training overload and thereby help avoid fatigue injuries. In this initial project, we use multiple 3D cameras for motion tracking and machine learning methods to predict subjective RPE values.

Study Setup

For this project, we aimed for the maximum effect on exertion. We therefore chose the squat exercise, as it involves large muscle groups. The exercises were performed on a so-called flywheel machine. A flywheel training machine does not use a weight that is accelerated downwards by gravity. Instead, the energy the subject generates while standing up is transmitted by a belt and stored in a flywheel. This belt is connected to the participant via a hip harness and wrapped around a transmission shaft fixed to the flywheel. Thus, when the participant stands up, they unwrap the belt from the shaft, spinning up the flywheel. Standing up is the concentric movement of the squat. At the topmost position, the belt wraps back around the transmission shaft because the flywheel continues to spin. During the downward movement, the eccentric phase of the squat, the participant therefore has to decelerate the flywheel. Finally, the subject is again in a squatting position, as shown in the following figure. In total, N=21 subjects participated in our study, performing a specific protocol consisting of several sets of 12 repetitions each.

Data Analysis

We used two Microsoft Azure Kinect 3D cameras to capture the participants during the experiment. Both cameras were placed at a 45-degree angle, pointing at the subject. To obtain one final skeleton, the skeletons from both cameras must be fused. The skeleton fusion was achieved through an extrinsic camera calibration using calibration patterns. An example sequence of a fused skeleton is shown in the video below. Afterward, signal processing methods are applied to filter the kinematic data and remove outliers. In the initial phase of this project, the aim is to explore and analyze the recorded kinematic data and to manually craft feature sets. These include various skeleton features, such as relative joint positions, joint angles, and joint angle velocities. After obtaining an extensive feature set, we remove uninformative features using various feature elimination methods. Subsequently, statistical features such as mean, standard deviation, and median are calculated on the previously mentioned skeleton features. To predict fatigue during squats, we focus, for now, on conventional machine learning rather than deep learning. We use Random Forests, Gradient Boosting Regression, k-NN regression, and a multi-layer perceptron (MLP) to predict the subjective value on the Borg scale.
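The feature pipeline described above can be illustrated with a small sketch. It makes several assumptions that are not stated in the text: the calibration is taken to yield a rigid transform (rotation and translation) from the second camera's frame into the first, the two joint estimates are fused by simple averaging, and the knee angle serves as one example of a joint-angle feature. All names and values are hypothetical.

```python
import math
import statistics


def transform(point, rotation, translation):
    """Map a 3D point from camera B's frame into camera A's frame."""
    return tuple(
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def fuse_joint(joint_a, joint_b_in_a):
    """Fuse two estimates of one joint by averaging (an assumed strategy)."""
    return tuple((a + b) / 2 for a, b in zip(joint_a, joint_b_in_a))


def joint_angle(a, b, c):
    """Angle in degrees at joint b, between the segments b->a and b->c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding errors
    return math.degrees(math.acos(cosine))


def summarize(values):
    """Statistical features of one skeleton feature over a repetition."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "median": statistics.median(values),
    }


# Made-up calibration: camera B is rotated by 90 degrees and shifted.
rotation_b_to_a = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
translation_b_to_a = (1.0, 0.0, 0.0)
knee_from_b = transform((1.0, 0.0, 0.0), rotation_b_to_a, translation_b_to_a)
fused_knee = fuse_joint((1.0, 1.0, 0.2), knee_from_b)

# Made-up joint positions in metres for one frame of a deep squat.
hip, knee, ankle = (0.4, 0.5, 0.0), (0.0, 0.5, 0.0), (0.0, 0.1, 0.0)
knee_angle = joint_angle(hip, knee, ankle)

# Made-up knee angles over one repetition, reduced to statistical features.
features = summarize([170.0, 120.0, 90.0, 120.0, 170.0])
print(fused_knee, knee_angle, features)
```

Feature vectors built this way per repetition or set could then be passed to any of the regression models named above.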

Former Project: Data Augmentation of Kinematic Time-Series from Rehabilitation Exercises

Neurological diseases such as Parkinson's disease or stroke are common, severe conditions in modern society. Usually, physicians or other experts assess the progress of these and other neurological diseases in the hospital. Hence, their decisions can suffer from subjective bias. Furthermore, many healthcare systems discharge patients from the rehabilitation program early, forcing them to continue the training program at home without an expert's supervision. Exercise recognition systems that can evaluate a user's movement are now being developed. These systems could support physicians with an objective decision-making process or automatically assess exercises performed alone at home. Training such a machine learning system requires large amounts of representative data to achieve good results, especially for deep learning-based approaches. Large and diverse datasets are publicly available in the field of Human Activity Recognition (HAR). However, collecting medical datasets is challenging because access to patients is restricted. In addition, medical equipment and the detailed knowledge of medical experts are needed to collect the data and obtain ground-truth labels. Especially in studies that include a healthy control group, the limited access to patients leads to unbalanced datasets in which most data points belong to the healthy subjects. A common strategy to overcome these challenges is to increase the size of a collected dataset through data augmentation or to synthesize entirely new datasets with artificial examples. To tackle this issue, we have developed a method that uses a Generative Adversarial Network (GAN) to generate long-term synthetic sequences of human motion data for a given class.

The network we developed produces realistic-looking repetitions of a specific exercise over a long period. Its architecture is inspired by and builds upon the Human-Pose-GAN (HP-GAN) model [5]. It consists of an encoder and a decoder network, takes ten prior poses from an arbitrary sequence as input, and predicts the next 20 poses of that sequence. By applying the network recursively, the method creates long data sequences. We demonstrated the approach's usefulness by balancing the KIMORE (KInematic Assessment of MOvement and Clinical Scores for Remote Monitoring of Physical REhabilitation) dataset [6]. In this dataset, patient classes are underrepresented compared to the healthy control group. We focused on the squat exercises performed by Parkinson's disease patients, stroke patients, and healthy persons and trained our approach on them. For evaluation, we trained a classification network to identify stroke and Parkinson's disease patients. Balancing the dataset with our method increased the classification accuracy by 11 percentage points for the three-class classification of stroke patients, Parkinson's disease patients, and healthy subjects. The approach and results were published at the IEEE COINS conference in September 2021. The video below shows generated skeleton data for a hand-raise exercise using our algorithm: ground-truth data, generated data, and two error cases.
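The recursive generation scheme can be sketched as a simple loop. This is only an illustration: `predict_next` is a hypothetical stand-in for the trained generator, and the choice to feed the last ten generated poses back in as the next input window is an assumption, since the text does not specify how the windows are selected.

```python
def generate_sequence(predict_next, seed_poses, total_poses, n_in=10):
    """Extend a seed of n_in poses to total_poses by repeated prediction.

    predict_next takes a list of n_in poses and returns a list of new poses
    (20 in the setting described above). The seed is part of the result.
    """
    if len(seed_poses) != n_in:
        raise ValueError(f"expected {n_in} seed poses, got {len(seed_poses)}")
    sequence = list(seed_poses)
    while len(sequence) < total_poses:
        window = sequence[-n_in:]           # most recent poses as input
        sequence.extend(predict_next(window))
    return sequence[:total_poses]


def dummy_generator(window, n_out=20):
    """Placeholder for the trained network: continues a linear trend."""
    last = window[-1]
    return [[value + k for value in last] for k in range(1, n_out + 1)]


# Each pose is reduced to a single coordinate here to keep the example short.
seed = [[float(i)] for i in range(10)]
long_sequence = generate_sequence(dummy_generator, seed, total_poses=50)
print(len(long_sequence), long_sequence[-1])  # 50 [49.0]
```

With a real generator, prediction errors can accumulate over many iterations, so the quality of long sequences would need to be checked.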


  1. Liu, Junfa, et al. "A graph attention spatio-temporal convolutional network for 3D human pose estimation in video." 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.

  2. Lugaresi, Camillo, et al. "Mediapipe: A framework for perceiving and processing reality." Third Workshop on Computer Vision for AR/VR at IEEE Computer Vision and Pattern Recognition (CVPR). Vol. 2019. 2019.

  3. Pavllo, Dario, et al. "3d human pose estimation in video with temporal convolutions and semi-supervised training." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

  4. Borg, Gunnar. "Perceived exertion as an indicator of somatic stress." Scandinavian Journal of Rehabilitation Medicine (1970).

  5. Barsoum, E., J. Kender, and Z. Liu. "HP-GAN: Probabilistic 3D Human Motion Prediction via GAN." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2018.

  6. Capecci, M., M. G. Ceravolo, F. Ferracuti, S. Iarlori, A. Monteriù, L. Romeo, and F. Verdini. "The KIMORE dataset: Kinematic assessment of movement and clinical scores for remote monitoring of physical rehabilitation." IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 27, no. 7, pp. 1436–1448, July 2019.



  • A computer vision approach to continuously monitor fatigue during resistance training. Albert, Justin Amadeus; Arnrich, Bert in Biomedical Signal Processing and Control (2024). 89 105701.


  • Protocol for a Randomized Crossover Trial to Evaluate the Effect of Soft Brace and Rigid Orthosis on Performance and Readiness to Return to Sport Six Months Post-ACL-Reconstruction. Jahnke, Sonja; Cruysen, Caren; Prill, Robert; Kittmann, Fabian; Pflug, Nicola; Albert, Justin Amadeus; de Camargo, Tibor; Arnrich, Bert; Królikowska, Aleksandra; Kołcz, Anna; Reichert, Paweł; Oleksy, Łukasz; Michel, Sven; Kopf, Sebastian; Wagner, Michael; Scheffler, Sven; Becker, Roland in Healthcare (2023). 11(4)


  • PERSIST: A Multimodal Dataset for the Prediction of Perceived Exertion during Resistance Training. Albert, Justin Amadeus; Herdick, Arne; Brahms, Clemens Markus; Granacher, Urs; Arnrich, Bert in Data (2022). 8(1)
  • Unsupervised Activity Recognition Using Trajectory Heatmaps from Inertial Measurement Unit Data. Konak, Orhan; Wegner, Pit; Albert, Justin; Arnrich, Bert (2022). 304–312.


  • Using Machine Learning to Predict Perceived Exertion During Resistance Training With Wearable Heart Rate and Movement Sensors. Albert, Justin; Herdick, Arne; Brahms, Clemens Markus; Granacher, Urs; Arnrich, Bert (2021).
  • Data Augmentation of Kinematic Time-Series From Rehabilitation Exercises Using GANs. Albert, Justin; Glöckner, Pawel; Pfitzner, Bjarne; Arnrich, Bert (2021). 1–6.


  • Will You Be My Quarantine: A Computer Vision and Inertial Sensor Based Home Exercise System. Albert, Justin; Zhou, Lin; Gloeckner, Pawel; Trautmann, Justin; Ihde, Lisa; Eilers, Justus; Kamal, Mohammed; Arnrich, Bert (2020). (Vol. 14)
  • Evaluation of the Pose Tracking Performance of the Azure Kinect and Kinect v2 for Gait Analysis in Comparison with a Gold Standard: A Pilot Study. Albert, Justin; Owolabi, Victor; Gebel, Arnd; Brahms, Markus Clemens; Granacher, Urs; Arnrich, Bert in MDPI Sensors (2020). 20(18)


  • Geometric Algebra Computing for Heterogeneous Systems. Hildenbrand, D.; Albert, Justin; Charrier, P.; Steinmetz, C. in Advances in Applied Clifford Algebras (2017). 27 599–620.