DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos

1 Carnegie Mellon University
2 North Carolina State University
3 Center for Research in Computer Vision, University of Central Florida
4 University of North Carolina at Charlotte
WACV 2025

Abstract

Human mesh recovery (HMR) provides rich human body information for various real-world applications such as gaming, human-computer interaction, and virtual reality. While image-based HMR methods have achieved impressive results, they often struggle in dynamic scenarios, producing temporally inconsistent and non-smooth 3D motion predictions because they do not model human motion. Video-based approaches, in contrast, leverage temporal information to mitigate this issue. In this paper, we present DiffMesh, a motion-aware diffusion-like framework for video-based HMR. DiffMesh builds a bridge between diffusion models and human motion, efficiently generating accurate and smooth mesh sequences by incorporating human motion into both the forward and reverse processes of the diffusion model. Extensive experiments on widely used datasets demonstrate the effectiveness and efficiency of DiffMesh, and visual comparisons in real-world scenarios further highlight its suitability for practical applications.


Framework



(a) The general pipeline of a diffusion model: input data is perturbed by recursively adding noise, and output data is generated from noise in the reverse process. (b) Human motion evolves over time in the input video sequence. Similar to the forward process in (a), the forward motion between adjacent frames resembles adding noise, and the mesh of each previous frame can be decoded successively through the reverse motion process.
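To make the analogy in (a) concrete, here is a minimal sketch of the standard DDPM-style forward process that the figure refers to. This is generic illustration, not DiffMesh's exact formulation; the linear beta schedule, step count `N`, and the use of SMPL's 6890-vertex mesh as the data tensor are assumptions for the example.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar):
    """Standard DDPM forward step (illustrative only):
    x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, eps ~ N(0, I)."""
    eps = np.random.randn(*x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# A common linear beta schedule over N steps (assumed, not from the paper).
N = 1000
betas = np.linspace(1e-4, 0.02, N)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

x0 = np.random.randn(6890, 3)        # e.g. an SMPL mesh has 6890 vertices
x_noisy = forward_diffuse(x0, N - 1, alpha_bar)
```

As `t` grows, `alpha_bar[t]` shrinks toward zero, so `x_t` approaches pure noise; in DiffMesh's analogy, the motion between adjacent frames plays the role of this noise-adding step.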


Architecture of DiffMesh



Our framework takes an input sequence of f frames and outputs a human mesh sequence of f frames. We model the forward human motion across frames analogously to the noise-injection mechanism of the forward diffusion process. We assume that human motion eventually reaches a static state, represented by the mesh-template state; thus, an additional (N - f + 1) steps are needed to go from x_f to the static state. A transformer-based diffusion model then sequentially produces the decoded features during the reverse process, and a mesh head returns the final human mesh sequence using the SMPL human body model.
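The reverse-process readout described above can be sketched as follows. This is a hypothetical outline under stated assumptions: `denoise_step` is a placeholder for the paper's transformer-based model, the 6890-vertex template stands in for the mesh-template state, and we simply collect one output after each of the last f reverse steps.

```python
import numpy as np

def denoise_step(x, t):
    """Placeholder for the learned transformer-based reverse step."""
    return 0.99 * x + 0.01  # dummy update; a real model would be learned

def reverse_decode(template, N, f):
    """Run N reverse steps from the static mesh-template state and read out
    the last f intermediate states as the per-frame mesh features."""
    x = template
    outputs = []
    for t in range(N, 0, -1):
        x = denoise_step(x, t)
        if t <= f:              # the final f steps yield the f frame outputs
            outputs.append(x)
    return outputs              # length f

template = np.zeros((6890, 3))  # static mesh-template state (assumed shape)
meshes = reverse_decode(template, N=25, f=16)
```

Each element of `meshes` would then be passed through the mesh head to regress SMPL parameters for its frame; here it is left as raw features to keep the sketch self-contained.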



Results on Human3.6M and 3DPW datasets




Visual Comparisons with Previous Methods






Bibtex


@InProceedings{zheng2025diffmesh,
  title={DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos},
  author={Zheng, Ce and Liu, Xianpeng and Peng, Qucheng and Wu, Tianfu and Wang, Pu and Chen, Chen},
  booktitle={IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2025}
}
        

This webpage template was adapted from here.