Distribution and Depth-Aware Transformers for 3D Human Mesh Recovery

Precise Human Mesh Recovery (HMR) from in-the-wild data is a formidable challenge, often hindered by depth ambiguities and reduced precision. Existing works resort to either pose priors or multi-modal data such as multi-view or point cloud information, though their methods often overlook the valuable scene-depth information inherently present in a single image. Moreover, achieving robust HMR for out-of-distribution (OOD) data is exceedingly challenging due to inherent variations in pose, shape and depth. Consequently, understanding the underlying distribution becomes a vital subproblem in modeling human forms. Motivated by the need for unambiguous and robust human modeling, we introduce Distribution- and Depth-Aware Human Mesh Recovery (D2A-HMR), an end-to-end transformer architecture meticulously designed to minimize the disparity between the predicted and ground truth distributions and to incorporate scene depth by leveraging prior depth information. Our approach demonstrates superior performance in handling OOD data in certain scenarios while consistently achieving competitive results against state-of-the-art HMR methods on controlled datasets.


I. INTRODUCTION
Monocular Human Mesh Recovery (HMR) is an approach for estimating the pose and shape of a human subject from a single image, featuring a broad spectrum of applications in various downstream tasks [1], [2], [3]. HMR can be split into two types: parametric and non-parametric approaches. The parametric approach involves the modeling of a network to generate model parameters, which are subsequently utilized for human mesh generation, as elucidated in [4], [5], [6]. Recent strides have been witnessed in non-parametric approaches [7], [8], which directly regress the 3D coordinates of the human mesh.
Despite the notable progress in both paradigms, they struggle with two key challenges: the appearance domain gap and depth ambiguity. First, controlled environments, often used for training, offer a setting where data collection and annotation are manageable and precise. However, the challenge arises when the trained model is applied to in-the-wild data, where real-world variability, such as lighting conditions, backgrounds, and poses, differs significantly from controlled settings. Second, depth-ambiguity issues plague single-view images. In response to the latter challenge, researchers, as exemplified in [9] and [5], have proposed solutions that leverage temporal information extracted from video inputs to enhance the understanding of human motion.
Figure 1: Illustration of our main idea. (a) Overview of the proposed D2A-HMR approach. (b) Our method, D2A-HMR, improves the mesh-image alignment (particularly as visualized in the highlighted region) when compared against SPIN [10], PARE [11] and METRO [8].
However, these temporal approaches have entailed significant computational overhead.
Obtaining ground truth mesh labels for human mesh reconstruction is a tedious task, mainly due to challenges such as the complexity of dynamic human motion, scene dynamics, resource constraints, and privacy concerns. In response to the inherent difficulty in obtaining accurate ground truth labels, existing works such as [8], [7], [10] resort to using pseudo ground truth to train models. Consequently, the modeling of human forms is inherently biased due to the presence of noisy labels. Moreover, the generalization of HMR to OOD poses, as discussed earlier, is an immensely challenging problem. Prior works [12], [13] model the output as a distribution of plausible 3D poses using normalizing flows and use information such as 2D keypoints or part segments as priors to provide deterministic predictions for downstream tasks. However, since these models use normalizing flows to explicitly estimate the underlying output distribution, they fail to generalize, as shown in [14], and do not resolve the model's bias toward the training data, especially in scenarios with noisy labels and uncertainties.
To address the limitations of existing methods, our work introduces a novel depth- and distribution-aware framework designed for the recovery of human mesh from monocular images. Notably, we integrate scene-depth information obtained from previous monocular depth models (termed pseudo-depth) into a transformer encoder via the cross-attention mechanism. In addition, we employ a log-likelihood residual approach to learn deviations in the underlying distribution, facilitating a refinement module in the training process. This distribution approach explicitly encourages the model to learn a more generalizable representation that can perform better on unseen data. To further refine the mesh shape and feature relationships, we introduce a dedicated silhouette decoder and a masked modeling module. As showcased in Figure 1, these contributions allow our D2A-HMR approach to excel in handling challenging, unseen poses. To the best of our knowledge, D2A-HMR is the only framework to explicitly incorporate depth priors and systematically learn the disparity between the underlying predicted and ground truth mesh distributions. Through experimentation, we demonstrate that our method outperforms existing works on some benchmarked datasets. In summary, our contributions include:
1) We introduce a novel image-based HMR model named D2A-HMR that adeptly models the underlying distributions and integrates pseudo-depth priors for efficient and accurate mesh recovery.
2) By leveraging a residual log-likelihood approach, we refine the model by learning the disparity between the underlying predicted and ground truth distributions.
3) We validate the enhanced performance obtained through the integration of the pseudo-depth and distribution-aware modules in HMR, particularly in complex human pose scenarios.
II. RELATED WORK

Human Mesh Recovery from a Single Image. Recent works on HMR can be split into parametric and non-parametric approaches. Parametric approaches can further be split into optimization-based and learning-based approaches. Optimization-based approaches fit a body model by minimizing the error between different prior terms. SMPLify [15] fits the parametric SMPL [16] model to minimize the error between the recovered mesh and keypoints. In addition, prior terms including silhouettes [6], [1] or distance functions [17] are used to penalize unrealistic shapes and poses. Learning-based approaches take advantage of deep neural networks to predict model parameters [4], [10], [6], [18]. Recent works including HMR-ViT [18] use a transformer-only temporal architecture to predict the model parameters, and ImpHMR [6] uses neural feature fields to model humans in 3D space from a single image.
For directly regressing the vertices, works including GraphCMR [19], Pixel2Mesh [20], and FeaStNet [21] use graph neural networks to regress the vertices from RGB images, effectively modeling neighborhood vertex-vertex interactions. Pose2Mesh [7] uses 2D and 3D poses to regress the vertices using graph spectral neural networks. METRO [8] uses transformers to model the global interaction between the vertices, and I2LMeshNet [22] uses a heatmap-based representation called lixel to regress the human mesh.
Normalizing flow. Normalizing flow is a tool for efficiently transforming a simple distribution into a complex one through a series of invertible transformations [14], [12]. It applies to probability density estimation, where it can be used to estimate the likelihood. Previous works, including [13] and [23], use normalizing flows to learn the distribution of plausible human poses a priori. ProHMR [12] focuses on modeling the output of human mesh recovery as a distribution over all the different possible meshes. However, it utilizes normalizing flows to directly predict the exact underlying distribution, which has been demonstrated to perform poorly on OOD data [14]. RLE [24] uses normalizing flow to minimize the difference between the distributions of the ground truth and predicted 2D poses rather than using the output distribution to sample one particular pose, thereby boosting the performance of regression-based pose estimation techniques.
Inspired by the literature on residual log-likelihood in 2D human pose estimation [24] and the shortcomings of existing HMR approaches, our approach focuses on mitigating distribution discrepancies between the output and ground truth meshes by leveraging normalizing flow techniques. This alleviates the problem of poor performance on OOD data, as we use normalizing flows in the refinement module to minimize the difference between the output and ground truth mesh distributions instead of predicting output poses/meshes by sampling from the captured output distribution.
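To make the flow machinery concrete, the sketch below implements a single RealNVP-style affine coupling layer in NumPy. It is a minimal illustration, not the configuration used by RLE or D2A-HMR: the one-layer linear "networks" for scale and translation, the dimensions, and the initialization are all assumptions made for demonstration.

```python
import numpy as np

class AffineCoupling:
    """One RealNVP-style affine coupling layer (toy sketch).

    Half of the dimensions pass through unchanged; the other half
    are scaled and shifted by functions of the first half, so the
    transform is invertible with a cheap log-determinant.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.d = dim // 2
        # Tiny linear "networks" for scale and translation (illustrative only).
        self.Ws = rng.normal(scale=0.1, size=(self.d, dim - self.d))
        self.Wt = rng.normal(scale=0.1, size=(self.d, dim - self.d))

    def forward(self, x):
        x1, x2 = x[..., :self.d], x[..., self.d:]
        s = np.tanh(x1 @ self.Ws)          # log-scale, bounded for stability
        t = x1 @ self.Wt
        y2 = x2 * np.exp(s) + t
        log_det = s.sum(axis=-1)           # log |det J| of the coupling
        return np.concatenate([x1, y2], axis=-1), log_det

    def inverse(self, y):
        y1, y2 = y[..., :self.d], y[..., self.d:]
        s = np.tanh(y1 @ self.Ws)
        t = y1 @ self.Wt
        x2 = (y2 - t) * np.exp(-s)
        return np.concatenate([y1, x2], axis=-1)
```

Stacking several such layers (alternating which half passes through) yields an expressive yet exactly invertible density model, since the total log-likelihood is the base-density log-probability plus the summed log-determinants.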
Attention for Human Mesh Recovery. Attention mechanisms have been shown to be effective for HMR by enabling models to focus on the most relevant parts of the input data. METRO [8] uses self-attention to reduce ambiguity by establishing non-local feature exchange between visible and invisible parts with progressive dimensionality reduction. SAHMR [25] uses cross-attention between image and scene contact information to improve the posture of the regressed mesh. The recently proposed JOTR [26] uses self-attention to study the dependencies between 2D and 3D features to solve problems of occlusion. PSVT [27] uses a spatio-temporal attention mechanism to capture relations between tokens and pose/shape queries in both temporal and spatial dimensions. Similarly, OSX [28] uses a component-aware encoder to capture the correlation between different parts of the human body to predict the whole-body human mesh.
We propose a parallel network composed of two self-attention modules to learn global dependencies within the image and pseudo-depth features, respectively, and a cross-attention module to learn inter-modal dependencies between the image and pseudo-depth features. This allows the network to learn a more comprehensive representation for accurate 3D mesh recovery.

III. METHOD
The overview of the proposed D2A-HMR framework is presented in Figure 2. In this section, we delve into the architecture and training objective of D2A-HMR. The feature encoding process begins with the extraction of features from the image and pseudo-depth map using a convolutional neural network (CNN) backbone, followed by hybrid position encoding. These encoded features are then input into the transformer encoder, which engages in cross-attention between the pseudo-depth cues and the input image. Following this, the refinement module comes into play, incorporating the distribution matching, silhouette decoder, and masked modeling components to regularize the model during the training process.

A. Architecture
Feature Encoding. The initial step involves passing the input image and depth map through a CNN backbone to extract pertinent features. Subsequently, to explicitly model the structure of the features, position embedding is applied to these extracted features.
Specifically, we implement a hybrid positional encoding (P_e), illustrated in Equation (1), for the image and depth tokens. This hybrid approach capitalizes on the strengths of both learnable position embeddings (P_l) and sinusoidal position embeddings (P_s). P_l adapts to task-specific positional patterns, proving highly effective in capturing intricate spatial relationships. Meanwhile, P_s contributes a globally consistent positional understanding, capturing more information about absolute position. This combination optimally balances adaptability and global context, yielding fine-grained spatial patterns and general positional relationships:

P_e = ω_1 P_l + ω_2 P_s    (1)

where ω_1 and ω_2 are learnable parameters controlling the position embedding contribution of both types.
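The hybrid encoding can be sketched in a few lines of NumPy. The scalar gates w1 and w2 and the standard sinusoidal formulation are assumptions consistent with the description above, not the paper's exact implementation.

```python
import numpy as np

def sinusoidal_pe(num_tokens, dim):
    """Fixed sinusoidal position embedding (playing the role of P_s)."""
    pos = np.arange(num_tokens)[:, None].astype(float)
    i = np.arange(dim)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    # Even channels get sine, odd channels get cosine.
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def hybrid_pe(p_learn, w1, w2):
    """Hybrid encoding P_e = w1 * P_l + w2 * P_s.

    p_learn plays the role of the learnable table P_l; w1 and w2
    stand in for the learnable gates of Equation (1).
    """
    n, d = p_learn.shape
    return w1 * p_learn + w2 * sinusoidal_pe(n, d)
```

In training, `p_learn`, `w1`, and `w2` would all be optimized jointly with the rest of the network; here they are plain arrays and floats.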
Transformer Encoder. The utilization of the transformer encoder in D2A-HMR is driven by the overarching goal of effectively learning pseudo-depth cues from the input data. Using self-attention mechanisms on the encoded features derived from both modalities (image and pseudo-depth map), namely z_img and z_depth, the transformer encoder facilitates understanding of spatial relationships within each domain. Furthermore, we propose to use a cross-attention mechanism to establish intricate connections between the image and pseudo-depth information. The resulting fused representation, denoted as z, encapsulates rich depth cues, crucial for the subsequent regression of human vertices.
The embedded features, denoted as z_img and z_depth, serve as input tokens to the transformer encoder, embodying our pursuit of learning pseudo-depth cues. Using self-attention mechanisms, the encoder refines z_img and z_depth by capturing spatial relationships within each modality, producing updated features z'_img and z'_depth, respectively. Subsequently, a cross-attention mechanism facilitates connections between image and pseudo-depth features. The resulting cross-attended tokens, denoted as z_c, are then fused with z'_img and z'_depth from their respective attention heads, yielding a final fused representation denoted as z, as illustrated in Equation (2). To facilitate this fusion, learnable fusion gates are employed, similar to the position encoding methodology. These gates adaptively emphasize the importance of each source, enhancing the model's capacity to capture meaningful relationships between the image and pseudo-depth features:

z = ω_3 (z'_img + z'_depth) + ω_4 z_c    (2)

Here, in Equation (2), ω_3 and ω_4 are the learnable parameters. Once the fusion is done, z is normalized and fed as input to an MLP to get the output tokens. This holistic approach enables our model to effectively capture intricate patterns and dependencies within the input image and the 3D information of the scene. A visual illustration of the transformer encoder is shown in Figure 2.
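A minimal single-head NumPy sketch of this encoder step is given below. The omitted query/key/value projections, the single-head structure, and the scalar gating form are simplifying assumptions, so this mirrors only the data flow, not the trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention (projections omitted for brevity)."""
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def encoder_step(z_img, z_depth, w3, w4):
    # Self-attention within each modality.
    z_img_p = attention(z_img, z_img, z_img)      # z'_img
    z_dep_p = attention(z_depth, z_depth, z_depth)  # z'_depth
    # Cross-attention: image tokens query the depth tokens.
    z_c = attention(z_img_p, z_dep_p, z_dep_p)
    # Gated fusion of the three token streams (gating form is assumed).
    z = w3 * (z_img_p + z_dep_p) + w4 * z_c
    # Layer normalization before the MLP (MLP itself omitted here).
    z = (z - z.mean(-1, keepdims=True)) / (z.std(-1, keepdims=True) + 1e-6)
    return z
```

In the real architecture each attention block would be multi-headed with learned projections, and `w3`, `w4` would be trained alongside them.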

B. Refinement Module
The refinement module in the D2A-HMR framework encompasses three key components, each designed to enhance the model's capabilities in capturing different aspects of human pose and shape. First, the distribution matching component aids in refining the model's representation by aligning the output mesh distribution with the ground truth mesh distribution. This adaptation enables the model to capture and adapt to inherent variations in the distribution of training data, promoting a more generalized performance that extends beyond the specific characteristics of the training data. The second component, the silhouette decoder, focuses on optimizing the model's capacity to align the shape with the input image by adeptly capturing the outlines of the human subject. This component contributes significantly to the model's ability to refine and improve its representation based on the visual cues present in the input data. Lastly, the masked modeling component serves to empower the model by learning from available information, thereby enhancing its ability to capture long-range relationships among features in the image. This integration ensures that the model can leverage relationships across the entire input, contributing to a more comprehensive understanding of the underlying human pose and shape.
Distribution Matching. To align the model with the underlying data distribution, we incorporate the RealNVP [29] normalizing flow mechanism within the D2A-HMR framework. This aims to refine the model by minimizing the discrepancy between the predicted and ground truth mesh distributions. The transformer encoder's output tokens z are passed through an MLP regressor (R), which utilizes linear layers to predict the mean µ and standard deviation σ, controlling the position and scale of the initially assumed Gaussian distribution. The flow-modeled distribution (P_ϕ(x), where x is the predicted mesh) is deconstructed into three essential terms, as expressed in the equation:

log P_ϕ(x) = log Q(x) + log (G_ϕ(x) / Q(x)) + log c    (3)

The first term, log Q(x), quantifies the logarithmic probability of the data under the simple distribution. The second term, log (G_ϕ(x)/Q(x)), represents the residual log-likelihood, serving as the distinction between the log-probability of the data under the optimal underlying distribution and the log-probability under the tractable initial density function. The third term, log c, functions as a normalization constant.

Silhouette Decoder. To optimize shape alignment, we use a specialized decoder to reconstruct silhouettes. Leveraging features from the transformer encoder, this decoder employs a sequence of deconvolution layers with ReLU activation and dropout, culminating in a fully connected layer. This reconstruction process significantly augments the model's capability to generate high-quality silhouette representations. To acquire the pseudo-ground truth silhouette of human subjects, we utilize an existing segmentation technique [30].
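The distribution-matching objective described in the Distribution Matching paragraph can be sketched as follows in NumPy. The flow G_phi is not implemented here (its log-density is passed in directly), and the constant handling only approximates the RLE [24] formulation, so treat this as an assumption-laden illustration rather than the paper's loss.

```python
import numpy as np

def gaussian_logq(mu_bar):
    """log Q(x_bar) for a standard Gaussian base density."""
    return -0.5 * (mu_bar ** 2 + np.log(2 * np.pi))

def rle_style_nll(mu, sigma, target, log_residual):
    """Negative log-likelihood with an RLE-style decomposition.

    mu, sigma    : regressed location/scale of the initial density
    target       : ground-truth vertex/joint coordinates
    log_residual : log of the flow's residual density at mu_bar
                   (the flow itself is omitted in this sketch)
    """
    mu_bar = (target - mu) / sigma
    # log P = log Q + residual term; the -log(sigma) accounts for the
    # change of variables x -> mu_bar (normalizing constant dropped).
    logp = gaussian_logq(mu_bar) + log_residual - np.log(sigma)
    return -logp.mean()
```

Minimizing this value pushes µ toward the target while the residual term lets the flow absorb non-Gaussian structure in the error distribution.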
Masked Modeling. Prior works, including [31], [8], and [32], have demonstrated the efficacy of masked modeling in elucidating diverse relationships within training datasets, spanning the textual, vertex, and image domains, respectively. In alignment with these established works, we adopt random masking of the embedded features to recover the vertices of the human body. By deliberately obscuring a percentage of embedded features during training, our model is forced to rely solely on the unmasked features extracted from the image. This enables a comprehensive understanding of both short- and long-range relationships among the features, contributing to the overall performance of the D2A-HMR framework.
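Random feature masking of this kind can be sketched as below; the mask ratio, mask value, and token layout are illustrative assumptions rather than the settings used in D2A-HMR.

```python
import numpy as np

def mask_tokens(tokens, mask_ratio, mask_value=0.0, seed=0):
    """Randomly mask a fraction of embedded feature tokens.

    Returns the masked tokens plus the boolean mask, so a training
    loss can be restricted to (or weighted toward) masked positions.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    num_mask = int(round(mask_ratio * n))
    idx = rng.choice(n, size=num_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[idx] = True
    out = tokens.copy()
    out[mask] = mask_value  # zero out (or replace) the selected tokens
    return out, mask
```

During training the masked sequence would be fed to the encoder, forcing the vertex regressor to infer the obscured content from the surviving tokens.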

C. Loss Functions
In this sub-section, we present the comprehensive training objectives employed to recover the human mesh in our model. These objectives consist of a weighted combination of various loss components, each serving a specific role in refining the model's output.
The loss function L_v is computed using the L1 loss metric, with the aim of minimizing the disparity between the model's output vertices and the ground truth vertex representation. Simultaneously, L_3D = |J_3D − J^g_3D| leverages the same loss metric to optimize the 3D pose regressed (J_3D) from the output mesh vertices following [8], seeking alignment with the ground truth pose coordinates (J^g_3D). To enhance the alignment between image and mesh representations, camera parameters are employed to reproject and infer the 2D human pose coordinates (J_2D), giving L_2D = |J_2D − J^g_2D|, where J^g_2D is the 2D pose ground truth. This reprojected output is refined by applying loss optimization using L1.
As mentioned in Section III-B, a distribution matching regularizer is used to penalize the model for predicting outputs that are unlikely under the underlying ground truth distribution. Equation (4) shows the distribution regularizer (L_RLE) used in the D2A-HMR architecture:

L_RLE = −log P_ϕ(x) |_{x = µ_g}    (4)
Here, in Equation (4), G_ϕ(μ̄_g) is the learned residual distribution of the normalized value μ̄_g, where μ̄_g = (µ_g − µ)/σ. Here, µ_g is the ground truth and ϕ denotes the flow model parameters. Additionally, we incorporate a silhouette loss, denoted L_silh, which regularizes the model by controlling the shape of the reconstructed mesh. The overall objective function, shown in Equation (5), is a combination of these individual losses:

L = λ_d L_RLE + λ_v L_v + λ_3D L_3D + λ_2D L_2D + λ_s L_silh    (5)

where λ_d, λ_v, λ_3D, λ_2D and λ_s denote the weights attributed to the training objectives concerning the distribution, vertices, 3D pose coordinates, 2D pose coordinates, and silhouettes, respectively.
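The weighted objective can be sketched directly in NumPy. The dictionary interface and weight values below are illustrative assumptions, and the RLE term is passed in precomputed rather than evaluated by an actual flow model.

```python
import numpy as np

def l1(a, b):
    """Mean absolute (L1) error between two arrays."""
    return np.abs(a - b).mean()

def total_loss(pred, gt, w):
    """Weighted combination of the D2A-HMR training objectives.

    pred/gt are dicts holding vertices "V", 3D joints "J3d",
    reprojected 2D joints "J2d", and silhouettes "S"; pred also
    carries the precomputed distribution term under "rle".
    """
    return (w["d"] * pred["rle"]            # lambda_d * L_RLE
            + w["v"] * l1(pred["V"], gt["V"])      # lambda_v * L_v
            + w["3d"] * l1(pred["J3d"], gt["J3d"])  # lambda_3D * L_3D
            + w["2d"] * l1(pred["J2d"], gt["J2d"])  # lambda_2D * L_2D
            + w["s"] * l1(pred["S"], gt["S"]))      # lambda_s * L_silh
```

The relative weights govern how strongly each supervision signal shapes the gradients; their actual values in D2A-HMR are not given here.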

IV. EXPERIMENTS

A. Implementation Details
Training Details. Training was carried out on an infrastructure comprising three NVIDIA A6000 GPUs. The network was trained for 500 epochs, with a batch size of 48 and 24 parallel workers. The Adam optimizer, configured with a learning rate of 10^-4 and beta values of 0.9 and 0.99, was used for optimization. The network was designed to output a coarse mesh representation containing 431 vertices. This output was subsequently upsampled [19] to the original mesh's 6890 vertices using learnable MLP layers, enabling the model to capture fine-grained spatial details.
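The coarse-to-fine vertex upsampling can be sketched as a linear map. D2A-HMR learns this mapping with MLP layers, so the single fixed matrix below (random here, learned or precomputed in practice) is only a simplified stand-in for the real module.

```python
import numpy as np

def upsample_mesh(coarse_verts, U):
    """Upsample a coarse mesh (431 verts) to full SMPL resolution (6890).

    U is a 6890 x 431 upsampling matrix; each full-resolution vertex is
    a weighted combination of coarse vertices.
    """
    return U @ coarse_verts   # (6890, 431) @ (431, 3) -> (6890, 3)
```

In graph-based pipelines such as [19], a matrix of this shape is precomputed from mesh down/upsampling operators; making it (or an MLP replacing it) learnable lets the network refine fine-grained surface detail.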
Datasets. Following previous work, we used two prominent 3D human pose estimation datasets, namely 3D Poses in the Wild (3DPW) [38] and Human3.6M [39], to train our D2A-HMR model. For the 3DPW dataset, we follow the standard practice of splitting the dataset into a training set of 22,000 images and a test set of 35,000 images. In the case of Human3.6M, we trained our D2A-HMR model on subjects S1, S5, S6, S7, and S8 and conducted testing on subjects S9 and S11. These data configurations are aligned with the common training and evaluation settings within the domain [8], [5]. The qualitative evaluation of the model was done on Leeds Sports Pose (LSP) [40] and various dedicated sports datasets, including the MLBPitchDB dataset [41] and the HARPE dataset [42].
Evaluation Metrics. In line with established practices from previous research [11], [8], [7], we subjected our model to a comprehensive evaluation using key metrics: mean per-joint position error (mPJPE), Procrustes-aligned mean per-joint position error (PA-mPJPE), and per-vertex error (mPVE) on both the 3DPW and Human3.6M datasets. The mPVE metric is omitted if the ground truth mesh is not available. All metrics are measured in millimeters (mm), providing a precise assessment of our model's performance.
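The two joint-error metrics can be sketched as follows. The similarity-Procrustes alignment (rotation, uniform scale, translation) is the standard construction, though details such as reflection handling vary between evaluation codebases.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error (same units as the inputs)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pa_mpjpe(pred, gt):
    """Procrustes-aligned mPJPE: remove global rotation, scale, and
    translation with a similarity Procrustes fit before measuring error."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(p.T @ g)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt                              # rotation with p @ R ~ g
    scale = (s * np.diag(D)).sum() / (p ** 2).sum()
    return mpjpe(scale * p @ R + mu_g, gt)
```

Because the alignment removes global pose, PA-mPJPE isolates articulated-pose error and is never larger than what the optimal similarity transform allows.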

B. Main Results
We assess the performance of the proposed D2A-HMR framework by comparing it with established state-of-the-art techniques for HMR. The results, presented in Table I, highlight the competitive performance of our method across various metrics on the Human3.6M and 3DPW datasets. The comparative results demonstrate that the meshes generated by the D2A-HMR framework exhibit superior alignment with the input image. Our method's adept understanding of pseudo-depth cues and the underlying distribution contributes significantly to improved alignment, particularly in handling challenging input scenarios characterized by depth ambiguities and extreme poses.

C. Ablation Studies
To verify the individual impact of each module on the proposed D2A-HMR model, comprehensive studies were conducted, as detailed in this sub-section. For consistency across all studies, the 3DPW dataset was utilized as the common benchmark.
Integration of multi-modal data. Experiments assessing the impact of the depth and distribution matching components within D2A-HMR are detailed in Table III. Incorporating both the pseudo-depth and distribution modeling modules in the D2A-HMR framework is observed to lead to a substantial improvement in the overall performance of mesh recovery. This observation confirms that the underlying motivation behind the proposed framework is valid and aids in enhancing the model's capabilities.
Depth on mPJPE(z). An experiment exclusively capturing the depth component of the regressed 3D joints, in order to demonstrate its impact on the estimated human pose, is reported in Table IV. A notable enhancement along the z-axis of the reconstructed mesh is evident, as highlighted in Table IV. We computed mPJPE along the z-axis, denoted as mPJPE(z), disregarding the x and y components of the reconstructed mesh. This experiment validates that the incorporation of scene-depth information contributes to an improvement in HMR.
Silhouette and Masked Modeling. Table V illustrates the impact of the silhouette decoder and masked modeling used within the D2A-HMR framework. The observations drawn from Table V highlight the beneficial impact of incorporating both the silhouette decoder and masked modeling modules in enhancing the model's ability to disentangle the appearance and part relationships of the person. While prior studies, such as [35], employ methods like explicit iterative optimization for mesh-to-image alignment, our silhouette decoder yields improved alignment outcomes compared to scenarios without the decoder. Thus, these modules are utilized during the training process of the D2A-HMR framework, contributing to its improved performance.
Backbones. We conducted a comprehensive analysis of D2A-HMR's performance by investigating its behavior with various backbone architectures, as detailed in Table VI. To establish a strong baseline, we first trained two ResNet variants for 1000 epochs on the ImageNet dataset [43] for an image classification task. We also explored HRNet variants trained for 1000 epochs using the COCO dataset [44] for the classification task. We observe that HRNet-w64 has the most positive impact on feature extraction from both the image and depth maps compared to the ResNet backbones. This can be attributed to HRNet-w64's effectiveness in capturing both local and global contexts through its multi-resolution fusion representations, thereby enhancing the model's ability to extract rich and informative features.

V. CONCLUSION
In summary, our research introduces the Distribution and Depth-Aware Human Mesh Recovery (D2A-HMR) framework as an innovative solution to the persistent challenges of depth ambiguity and distribution disparity in monocular human mesh recovery. By explicitly incorporating scene-depth information, we have substantially reduced the inherent ambiguity, resulting in a more precise and accurate alignment of human meshes. The utilization of normalizing flows to model the output distribution has been instrumental in regularizing the model to minimize the underlying distribution disparities, enhancing its resilience against noisy labels, and mitigating biases in human-form modeling.
Our extensive experimentation on diverse datasets has demonstrated the competitive performance of the D2A-HMR method when compared to state-of-the-art HMR techniques. Furthermore, we have observed that our network outperforms existing work on sports datasets with OOD data. The proposed framework not only addresses depth ambiguities and mitigates noise, but also leverages the inherent 3D information present in images, providing a robust and unambiguous solution for human mesh recovery. Future work will entail training on more diverse datasets to enhance the alignment and generalizability of the HMR process.

Figure 2: D2A-HMR model architecture. Given an image (I), we first incorporate a transformer backbone (E) to estimate the depth map (D) and a CNN backbone (F) to extract the features from the image. Positional embedding is applied to both image and pseudo-depth features, utilizing a hybrid approach for image tokens (z_img) and pseudo-depth tokens (z_depth). Self-attention is performed on z_img and z_depth, resulting in z'_img and z'_depth, respectively. Subsequently, cross-attention is applied between z'_img and z'_depth to produce z_c. The learnable fusion gates combine z'_img, z'_depth, and z_c, followed by layer normalization and an MLP. The resulting gated tokens (z) are input into three distinct refinement modules: a decoder (D) for silhouette estimation; a regressor head (R), which incorporates normalizing flow (DM) for distribution-aware joint vertex estimation; and masked modeling for enhanced semantic representation of the features.

Table I :
Comparison to state-of-the-art 3D pose reconstruction approaches on the 3DPW and Human3.6M datasets. Bold: best; Underline: second best

Table III :
Ablation study on pseudo-depth and distribution modeling for D2A-HMR evaluated on 3DPW dataset

Table IV :
Ablation study on the impact of depth modeling for D2A-HMR evaluated on 3DPW dataset

Table V :
Ablation study on the silhouette decoder and masked modeling evaluated on 3DPW dataset

Table VI :
Different input representations as the backbone for D2A-HMR evaluated on 3DPW dataset