An historical perspective

The origins of motion capture

One of the earliest examples of motion capture is Eadweard Muybridge's famous 1878 photographic sequence of a horse in motion, often described as an early “video”. It is frequently cited as a precursor of modern motion-picture cameras. One of the earliest captures of human body motion was made in 1883 for the military, with the goal of studying how efficiently people move. This website has many historical resources on the topic. The problem remains open today. If we want to build models that mimic humans, it helps to understand how humans move and think. This is the general line of thought behind this area of research.

Human motion capture

One interesting finding is that humans seem to automatically group dots that move together. This goes back to research by Johansson from many years ago. Ten or twelve dots were enough to give observers the contextual information they needed to make sense of the motion. This was the most important point of that research. Following this idea, one interesting approach is to estimate 2D keypoints from images of humans. The difficulty lies in accounting for:

  • Plane rotations
  • Scaling
  • Perspectives
  • Aspect ratio of different humans
  • Intra-category variation (e.g. masked faces or faces under helmets should still be considered human faces)

This idea first started with pictorial structures, which use simple shapes such as rectangles to model human faces. Little data was available at the time, and collecting such images was very difficult. One of the earliest breakthroughs was the 2017 paper by Cat et al.

Modern Approaches

Earliest breakthroughs: Deep Features

This approach is called DeepPose. Predicting heatmaps turned out to work better than directly predicting 2D points, so people started to predict one heatmap for every important joint, from which the joint location can then be sampled. Combining this with the iterative approach led to the next model.
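To make the heatmap idea concrete, here is a minimal sketch in plain Python. It is an illustration only, not the DeepPose implementation: the helper names, the Gaussian target and the grid size are all assumptions. It builds a synthetic heatmap for one joint and recovers the joint location in two common ways, a hard argmax and a heatmap-weighted mean (often called soft-argmax).

```python
import math

def gaussian_heatmap(width, height, cx, cy, sigma=1.5):
    """Build a height x width heatmap with a Gaussian bump centered at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_argmax(heatmap):
    """Return (x, y, score) of the highest-scoring pixel."""
    best = (0, 0, float("-inf"))
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best[2]:
                best = (x, y, value)
    return best

def decode_soft_argmax(heatmap):
    """Return the heatmap-weighted mean position (a sub-pixel estimate)."""
    total = sum(sum(row) for row in heatmap)
    x_mean = sum(x * v for row in heatmap for x, v in enumerate(row)) / total
    y_mean = sum(y * v for y, row in enumerate(heatmap) for v in row) / total
    return x_mean, y_mean

# Synthetic example: one joint located at pixel (9, 4) on a 16 x 12 grid.
heatmap = gaussian_heatmap(width=16, height=12, cx=9, cy=4)
x, y, score = decode_argmax(heatmap)
x_soft, y_soft = decode_soft_argmax(heatmap)
```

A real network would output one such map per joint. The argmax is limited to the heatmap resolution, while the weighted mean can land between pixels.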

Iterative

TODO

Convolutional Pose Machines

The network is able to look at the entire image after some iterations. Spatial correlations are taken into account here.
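One way to see why more iterations let the network look at more of the image is to compute the receptive field of the stacked layers. The sketch below uses the standard receptive-field recurrence. The stage layout (three 7x7 convolutions with stride 1) is a hypothetical example, not the actual Convolutional Pose Machines architecture.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv/pool layers.

    `layers` is a list of (kernel_size, stride) pairs, applied in order.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Hypothetical stage: three 7x7 convolutions with stride 1.
stage = [(7, 1)] * 3

# Chaining stages grows the region each output pixel can see.
for n_stages in range(1, 5):
    print(n_stages, receptive_field(stage * n_stages))
# prints 19, 37, 55, 73 for 1 to 4 stages
```

With enough stages (or with strided layers, which increase the jump), the receptive field covers the whole input, which is what allows distant body parts to inform each other.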

OpenPose

They use convolutional pose machines with iterative refinement, plus a second branch that predicts part affinity fields to help with the association problem. This is a bottom-up approach. The problem is that there may be many candidate detections for a single arm, and they need to be joined together correctly.

Part affinity fields 🟥:

  • For every limb, you create a field of unit vectors that encodes the direction along that limb. This helps with association.
  • The limb directions then make the problem easier to solve. It looks very much like a step-by-step solution learned in a supervised manner, perhaps close to diffusion models?
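The association idea can be sketched as follows. This is a simplified illustration in plain Python, not the OpenPose code: the function names, the grid size and the limb width are assumptions. A field stores the limb's unit vector near the limb and zero elsewhere, and a candidate pair of joints is scored by averaging the alignment between the field and the segment that connects them.

```python
import math

def limb_unit_vector(joint_a, joint_b):
    """Unit vector pointing from joint_a to joint_b (both are (x, y) tuples)."""
    dx, dy = joint_b[0] - joint_a[0], joint_b[1] - joint_a[1]
    norm = math.hypot(dx, dy)
    return (dx / norm, dy / norm)

def make_paf(width, height, joint_a, joint_b, limb_width=1.0):
    """Field holding the limb's unit vector near the segment, zero elsewhere."""
    ux, uy = limb_unit_vector(joint_a, joint_b)
    length = math.hypot(joint_b[0] - joint_a[0], joint_b[1] - joint_a[1])
    field = [[(0.0, 0.0)] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            px, py = x - joint_a[0], y - joint_a[1]
            along = px * ux + py * uy        # projection along the limb
            across = abs(px * uy - py * ux)  # distance from the limb axis
            if 0 <= along <= length and across <= limb_width:
                field[y][x] = (ux, uy)
    return field

def association_score(field, cand_a, cand_b, samples=10):
    """Average alignment between the field and the candidate segment a -> b."""
    ux, uy = limb_unit_vector(cand_a, cand_b)
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        x = round(cand_a[0] + t * (cand_b[0] - cand_a[0]))
        y = round(cand_a[1] + t * (cand_b[1] - cand_a[1]))
        fx, fy = field[y][x]
        total += fx * ux + fy * uy
    return total / samples

# One horizontal limb from (2, 10) to (15, 10) on a 20 x 20 grid.
paf = make_paf(20, 20, joint_a=(2, 10), joint_b=(15, 10))
good = association_score(paf, (2, 10), (15, 10))  # joints of the same limb
bad = association_score(paf, (2, 10), (15, 2))    # end joint of another person
```

The correct pairing scores close to 1 because the segment follows the field, while the wrong pairing leaves the limb region almost immediately and scores much lower. In a full system these scores would feed a matching step over all candidate pairs.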

ViTPose

You first get bounding boxes, then feed each box to a model that predicts the 2D keypoints. This solves the problem without affinity fields, but the pose model has to run once per box, so an image with 10 people needs 10 forward passes. This is an example of a top-down approach. Many other models tackle this problem. For humans, 2D data is usually easy to annotate and check, whereas 3D data is difficult to record and capture. We do not have enough 3D data, yet it is needed for this kind of problem, and many papers have worked on this area.
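The top-down loop can be sketched like this. It is a minimal illustration, not the ViTPose pipeline: `top_down_pose`, `crop_to_image`, the 192 x 256 crop size and the stand-in `fake_pose_model` are all hypothetical. It shows the two points made above: the pose model is called once per detected box, and its predictions live in crop coordinates that must be mapped back to the full image.

```python
def crop_to_image(keypoints, box, crop_size):
    """Map (x, y) keypoints from crop coordinates back to image coordinates.

    box is (x_min, y_min, x_max, y_max) in image pixels; crop_size is the
    (width, height) of the resized crop that the pose model saw.
    """
    x_min, y_min, x_max, y_max = box
    scale_x = (x_max - x_min) / crop_size[0]
    scale_y = (y_max - y_min) / crop_size[1]
    return [(x_min + x * scale_x, y_min + y * scale_y) for x, y in keypoints]

def top_down_pose(boxes, pose_model, crop_size=(192, 256)):
    """Run the single-person pose model once per detected box."""
    people = []
    for box in boxes:
        crop_keypoints = pose_model(box)  # one forward pass per person
        people.append(crop_to_image(crop_keypoints, box, crop_size))
    return people

# Hypothetical stand-in for a pose network: always predicts the crop center.
calls = []
def fake_pose_model(box):
    calls.append(box)
    return [(96.0, 128.0)]

boxes = [(0, 0, 96, 128), (100, 50, 292, 306)]
people = top_down_pose(boxes, fake_pose_model)
# len(calls) equals the number of boxes: cost grows linearly with the crowd.
```

A bottom-up method would instead run the network once on the whole image and pay the cost in the association step.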