High-quality analysis of human motion is usually performed in specialized laboratories using high-end multi-camera motion capture systems. To obtain accurate results, subjects must be fitted with reflective markers, an expensive and time-consuming process that also requires the subject to be present in the laboratory. In the following, we present related work on motion tracking systems and algorithms for cost-effective 2D/3D pose estimation; we also use some of these algorithms to record our own data sets.
2D/3D Human Pose Estimation from RGB Images
The task of human pose estimation is to estimate the positions of certain joints of the human body in either 2D or 3D coordinates from one or more input images. Convolutional neural networks (CNNs) have been highly successful in this computer vision task. These networks are generally well suited to image processing because their architecture is designed to work with data arranged in grid structures. One of the first papers in the field of pose estimation with CNNs (Toshev et al.) aimed at regressing the joint locations directly as \((x,y)\) pixel coordinates for a given image, which resulted in noisy estimates. A better strategy than direct regression of pixel coordinates is to predict a heatmap for each joint, where higher values indicate higher confidence in the joint location. Two landmark papers, Stacked Hourglass and Simple Baseline, used this heatmap regression approach, each with a different arrangement of convolutional layers and a different optimization strategy, and both achieved state-of-the-art performance at the time.
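The heatmap idea described above can be illustrated with a minimal sketch. It assumes a single joint, a Gaussian-shaped target centered on the joint, and a simple argmax decoding step; the function names `make_heatmap` and `decode_heatmap`, the grid size, and the value of `sigma` are hypothetical choices for illustration and are not taken from the cited papers.

```python
import math


def make_heatmap(width, height, cx, cy, sigma=1.5):
    """Build a height x width grid with a Gaussian peak at (cx, cy).

    Assumption: a Gaussian target is used here only as one common way
    to encode a joint location as a confidence map.
    """
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def decode_heatmap(heatmap):
    """Return (x, y, confidence) of the highest-valued cell."""
    best_x, best_y, best_v = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_v:
                best_x, best_y, best_v = x, y, value
    return best_x, best_y, best_v


# Encode a joint at pixel (5, 7) and recover it from the heatmap.
heatmap = make_heatmap(width=16, height=12, cx=5, cy=7)
x, y, conf = decode_heatmap(heatmap)
```

In a trained network the heatmap would be the model output rather than a synthetic target, and the peak value can be read as a confidence score for the estimated joint position.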
Another commonly used system, OpenPose, is a real-time multi-person 2D and 3D pose estimation model that follows a two-step approach: it first identifies the key points of one or more persons in a given image and then, from the set of all detected joints, groups those belonging to the same person. Figure 1 shows images from our own training of a 2D human pose estimation model (an implementation of Simple Baseline) that uses data augmentation by rotating and scaling the images in the training set.
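When training images are rotated and scaled for augmentation, the annotated joint coordinates have to undergo the same transformation. The following is a minimal sketch of that step, assuming rotation and uniform scaling about a chosen center point; the function name `augment_keypoints` and the example values are hypothetical and do not describe the exact augmentation pipeline used for Figure 1.

```python
import math


def augment_keypoints(keypoints, center, angle_deg, scale):
    """Rotate keypoints by angle_deg and scale them about center.

    Assumption: the same rotation and scale are applied to the image
    itself by a separate routine. In image coordinates (y pointing
    down), a positive angle appears as a clockwise rotation.
    """
    cx, cy = center
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    transformed = []
    for x, y in keypoints:
        dx, dy = x - cx, y - cy
        transformed.append(
            (
                cx + scale * (cos_t * dx - sin_t * dy),
                cy + scale * (sin_t * dx + cos_t * dy),
            )
        )
    return transformed


# Rotate one joint by 90 degrees and double its distance from the center.
joints = [(12.0, 10.0)]
augmented = augment_keypoints(joints, center=(10.0, 10.0), angle_deg=90.0, scale=2.0)
```

In practice the rotation angle and scale factor would be sampled randomly for each training example so that the model sees a wider range of poses and body sizes.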