
Thibault Groueix

thibault.groueix.2012 at polytechnique.org


Short bio: Thibault Groueix is a research engineer at Adobe Research. Previously, he was a research scientist at Naver Labs Europe. He obtained his PhD from the Imagine team of École Nationale des Ponts et Chaussées. His thesis, advised by Mathieu Aubry, received second-place PhD awards from SIF and AFIA. His research focuses on 3D deep learning, specifically 3D deformation of surfaces and 3D reconstruction of humans.


Internships for PhD students: I am regularly looking for interns to join us at Adobe Research for a summer internship. If you are interested in interning with me, don't be shy: send me an e-mail with your CV, your research interests, and a short description of potential topics you would like to work on. I am currently looking for interns for summer 2025.


Pro-bono mentoring: I am happy to discuss anything related to your research career, especially if you belong to an underrepresented group in STEM. You can schedule a meeting here. Topics can range from PhD applications and PhD life to research internships, full-time jobs, and so on. I studied in France, so I may have a hard time with questions specific to US universities. You could also try reaching out to Matheus here about US-specific queries.


News

  • 04/2024 I will be attending SIGGRAPH Asia: drop me a line if you want to meet in Tokyo!

Research

See Google Scholar profile for a full list of publications.

Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects
A. Barda, M. Gadelha, V. G. Kim, N. Aigerman, A. Bermano, T. Groueix
ArXiv 2024.
@article{instant3dit_Barda_2024,
 author = {Barda, Amir and Gadelha, Matheus and Kim, Vladimir G. and Aigerman, Noam and Bermano, Amit and Groueix, Thibault},
 title = {Instant3dit: Multiview Inpainting for Fast Editing of 3D Objects},
 journal = {arXiv preprint arXiv:2412.00518},
 year = {2024},
 }

We present Instant3dit - a generative technique to edit 3D shapes, represented as meshes, NeRFs, or Gaussian Splats, in approximately 3 seconds, without the need for running an SDS type of optimization. Our key insight is to cast 3D editing as a multiview image inpainting problem, as this representation is generic and can be mapped back to any 3D representation using the bank of available Large Reconstruction Models. We explore different fine-tuning strategies to obtain both multiview generation and inpainting capabilities within the same diffusion model. In particular, the design of the inpainting mask is an important factor of training an inpainting model, and we propose several masking strategies to mimic the types of edits a user would perform on a 3D shape. Our approach takes 3D generative editing from hours to seconds and produces higher-quality results compared to previous works.
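
For intuition only, the sketch below assembles four rendered views and their edit-region masks into single 2x2 grid images, the kind of multiview canvas a multiview inpainting diffusion model could consume; the tiling helper, tensor shapes, and placeholder data are assumptions, not the released pipeline.

    import torch

    def make_multiview_grid(views: torch.Tensor, masks: torch.Tensor):
        """Tile 4 views (4, 3, H, W) and masks (4, 1, H, W) into 2x2 grids.

        The grid image + grid mask pair is what a multiview inpainting
        diffusion model would take as input (illustrative only).
        """
        assert views.shape[0] == 4 and masks.shape[0] == 4
        top = torch.cat([views[0], views[1]], dim=-1)       # (3, H, 2W)
        bottom = torch.cat([views[2], views[3]], dim=-1)
        grid_img = torch.cat([top, bottom], dim=-2)         # (3, 2H, 2W)

        mtop = torch.cat([masks[0], masks[1]], dim=-1)
        mbottom = torch.cat([masks[2], masks[3]], dim=-1)
        grid_mask = torch.cat([mtop, mbottom], dim=-2)      # (1, 2H, 2W)
        return grid_img, grid_mask

    views = torch.rand(4, 3, 256, 256)                      # placeholder renders
    masks = (torch.rand(4, 1, 256, 256) > 0.8).float()      # placeholder edit masks
    grid_img, grid_mask = make_multiview_grid(views, masks)
    print(grid_img.shape, grid_mask.shape)                  # (3, 512, 512), (1, 512, 512)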

SAMa: Material-aware 3D Selection and Segmentation
M. Fischer, I. Georgiev, T. Groueix, V. G. Kim, T. Ritschel, V. Deschaintre
ArXiv 2024.
@article{sama_Fischer_2024,
 author = {Fischer, Michael and Georgiev, Iliyan and Groueix, Thibault and Kim, Vladimir G. and Ritschel, Tobias and Deschaintre, Valentin},
 title = {SAMa: Material-aware 3D Selection and Segmentation},
 journal = {arXiv preprint arXiv:2411.19322},
 year = {2024},
 }

Decomposing 3D assets into material parts is a common task for artists and creators, yet remains a highly manual process. In this work, we introduce Select Any Material (SAMa), a material selection approach for various 3D representations. Building on the recently introduced SAM2 video selection model, we extend its capabilities to the material domain. We leverage the model's cross-view consistency to create a 3D-consistent intermediate material-similarity representation in the form of a point cloud from a sparse set of views. Nearest-neighbor lookups in this similarity cloud allow us to efficiently reconstruct accurate continuous selection masks over objects' surfaces that can be inspected from any view. Our method is multiview-consistent by design, alleviating the need for contrastive learning or feature-field pre-processing, and performs optimization-free selection in seconds. Our approach works on arbitrary 3D representations and outperforms several strong baselines in terms of selection accuracy and multiview consistency. It enables several compelling applications, such as replacing the diffuse-textured materials on a text-to-3D output with PBR materials, or selecting and editing materials on NeRFs and 3D-Gaussians.
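
A minimal sketch of the nearest-neighbor lookup step, assuming a precomputed similarity point cloud (positions plus per-point similarity to the user's click) and query points sampled on the object surface; names, shapes, and the inverse-distance weighting are illustrative assumptions.

    import torch

    def select_from_similarity_cloud(surface_pts, cloud_pts, cloud_sim, k=4):
        """Transfer similarity scores from the intermediate similarity point
        cloud to arbitrary surface samples via k-NN averaging (sketch only).

        surface_pts: (M, 3) query points on the object's surface
        cloud_pts:   (N, 3) similarity point cloud positions
        cloud_sim:   (N,)   similarity to the user's selection, in [0, 1]
        """
        d = torch.cdist(surface_pts, cloud_pts)           # (M, N) pairwise distances
        knn_d, knn_idx = d.topk(k, dim=1, largest=False)  # k nearest cloud points
        w = 1.0 / (knn_d + 1e-8)
        w = w / w.sum(dim=1, keepdim=True)                # inverse-distance weights
        return (w * cloud_sim[knn_idx]).sum(dim=1)        # (M,) continuous mask

    surface_pts = torch.rand(1000, 3)
    cloud_pts, cloud_sim = torch.rand(5000, 3), torch.rand(5000)
    mask = select_from_similarity_cloud(surface_pts, cloud_pts, cloud_sim)
    selection = mask > 0.5                                # thresholded selection mask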

MagicClay: Sculpting Meshes With Generative Neural Fields
A. Barda, V. G. Kim, N. Aigerman, A. Bermano, T. Groueix
SIGGRAPH Asia 2024.
@article{magicclay_Barda_2024,
 author = {Barda, Amir and Kim, Vladimir G. and Aigerman, Noam and Bermano, Amit and Groueix, Thibault},
 title = {MagicClay: Sculpting Meshes With Generative Neural Fields},
 journal = {ACM Transactions on Graphics (SIGGRAPH Asia)},
 year = {2024},
 }

The recent developments in neural fields have brought phenomenal capabilities to the field of shape generation, but they lack crucial properties, such as incremental control --- a fundamental requirement for artistic work. Triangular meshes, on the other hand, are the representation of choice for most geometry-related tasks, offering efficiency and intuitive control, but do not lend themselves to neural optimization. To support downstream tasks, previous art typically proposes a two-step approach, where first, a shape is generated using neural fields, and then a mesh is extracted for further processing. Instead, in this paper, we introduce a hybrid approach that consistently maintains both a mesh and a Signed Distance Field (SDF) representation. Using this representation, we introduce MagicClay --- a tool for sculpting regions of a mesh according to textual prompts while keeping other regions untouched. Our method is designed to be compatible with existing mesh sculpting workflows. The user sculpts the desired shape using the existing brushes and our pipeline then evolves the geometry and triangulation of the selected mesh part according to the given textual prompt. This process operates on the original mesh while preserving its meta-data. Our framework carefully and efficiently balances consistency between the representations and regularizations in every step of the shape optimization. Relying on the mesh representation, we show how to render the SDF at higher resolutions and faster. In addition, we employ recent work in differentiable mesh reconstruction to adaptively allocate triangles in the mesh where required, as indicated by the SDF. Using an implemented prototype, we demonstrate superior generated geometry compared to the state of the art and novel consistent control, allowing sequential prompt-based edits to the same mesh for the first time.
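
The sketch below illustrates one ingredient of keeping a mesh and an SDF in sync: a consistency term that penalizes the SDF at points sampled on the mesh surface, so both representations describe the same geometry. The tiny MLP, the uniform face sampling, and the random mesh are assumptions for brevity, not the paper's full regularization.

    import torch, torch.nn as nn

    sdf = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                        nn.Linear(64, 64), nn.ReLU(),
                        nn.Linear(64, 1))                  # toy neural SDF

    def sample_on_faces(verts, faces, n):
        """Sample n points with uniform barycentric coordinates on random faces."""
        f = faces[torch.randint(len(faces), (n,))]         # (n, 3) vertex indices
        u, v = torch.rand(n, 1), torch.rand(n, 1)
        swap = (u + v) > 1                                 # fold samples into the triangle
        u, v = torch.where(swap, 1 - u, u), torch.where(swap, 1 - v, v)
        a, b, c = verts[f[:, 0]], verts[f[:, 1]], verts[f[:, 2]]
        return a + u * (b - a) + v * (c - a)

    verts = torch.rand(100, 3, requires_grad=True)         # placeholder mesh
    faces = torch.randint(0, 100, (200, 3))
    pts = sample_on_faces(verts, faces, 512)
    consistency_loss = sdf(pts).abs().mean()               # SDF should vanish on the mesh
    consistency_loss.backward()                            # gradients reach both representations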

MeshUp: Multi-Target Mesh Deformation via Blended Score Distillation
H. W. Kim, I. Lang, T. Groueix, N. Aigerman, V. G. Kim, R. Hanocka
ArXiv 2024.
@article{meshup_Kim_2024,
 author = {Kim, Hyun Woo and Lang, Itai and Groueix, Thibault and Aigerman, Noam and Kim, Vladimir G. and Hanocka, Rana},
 title = {MeshUp: Multi-Target Mesh Deformation via Blended Score Distillation},
 journal = {arXiv preprint arXiv:2408.14899},
 year = {2024},
 }

We propose MeshUp, a technique that deforms a 3D mesh towards multiple target concepts, and intuitively controls the region where each concept is expressed. Conveniently, the concepts can be defined as either text queries, e.g., 'a dog' and 'a turtle,' or inspirational images, and the local regions can be selected as any number of vertices on the mesh. We can effectively control the influence of the concepts and mix them together using a novel score distillation approach, referred to as Blended Score Distillation (BSD). BSD operates on each attention layer of the denoising U-Net of a diffusion model as it extracts and injects the per-objective activations into a unified denoising pipeline from which the deformation gradients are calculated. To localize the expression of these activations, we create a probabilistic Region of Interest (ROI) map on the surface of the mesh, and turn it into 3D-consistent masks that we use to control the expression of these activations. We demonstrate the effectiveness of BSD empirically and show that it can deform various meshes towards multiple objectives.
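
A heavily simplified sketch of the blending idea: per-concept deformation gradients are combined with per-vertex region-of-interest weights before a single update is applied. The real method blends attention activations inside the diffusion U-Net; moving the blend to the gradient level here is an assumption made for brevity.

    import torch

    def blend_concept_gradients(grads, roi):
        """grads: (C, V, 3) deformation gradients, one per concept.
        roi:   (C, V) per-vertex probabilities of expressing each concept.
        Returns a single (V, 3) blended update direction."""
        w = roi / roi.sum(dim=0, keepdim=True).clamp_min(1e-8)  # normalize over concepts
        return (w.unsqueeze(-1) * grads).sum(dim=0)

    grads = torch.randn(2, 5000, 3)        # e.g. gradients for 'a dog' and 'a turtle'
    roi = torch.rand(2, 5000)              # user-painted regions of interest
    update = blend_concept_gradients(grads, roi)   # (5000, 3) blended vertex update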

MemoVis: A GenAI-Powered Tool for Creating Companion Reference Images for 3D Design Feedback
C. Chen, C. Nguyen, T. Groueix, V. G. Kim, N. Weibel
TOCHI 2024.
@article{memoviz_Chen_2024,
 author = {Chen, Chen and Nguyen, Cuong and Groueix, Thibault and Kim, Vladimir G. and Weibel, Nadir},
 title = {MemoVis: A GenAI-Powered Tool for Creating Companion Reference Images for 3D Design Feedback},
 journal = {Transactions on Computer-Human Interaction},
 year = {2024},
 }

Providing asynchronous feedback is a critical step in the 3D design workflow. A common approach to providing feedback is to pair textual comments with companion reference images, which helps illustrate the gist of the text. Ideally, feedback providers should possess 3D and image editing skills to create reference images that can effectively describe what they have in mind. However, they often lack such skills, so they have to resort to sketches or online images which might not match well with the current 3D design. To address this, we introduce MemoVis, a text editor interface that assists feedback providers in creating reference images with generative AI driven by the feedback comments. First, a novel real-time viewpoint suggestion feature, based on a vision-language foundation model, helps feedback providers anchor a comment with a camera viewpoint. Second, given a camera viewpoint, we introduce three types of image modifiers, based on pre-trained 2D generative models, to turn a text comment into an updated version of the 3D scene from that viewpoint. We conducted a within-subjects study with feedback providers, demonstrating the effectiveness of MemoVis. The quality and explicitness of the companion images were evaluated by another eight participants with prior 3D design experience.

MatAtlas: Text-driven Consistent Geometry Texturing and Material Assignment
D. Ceylan, V. Deschaintre*, T. Groueix*, R. Martin, C. Huang, R. Rouffet, V. G. Kim, G. Lassagne
ArXiv 2024.
@article{matatlas_Ceylan_2024,
 author = {Ceylan, Duygu and Deschaintre, Valentin and Groueix, Thibault and Martin, Rosalie and Huang, Chun-Hao and Rouffet, Romain and Kim, Vladimir G. and Lassagne, Gaëtan},
 title = {MatAtlas: Text-driven Consistent Geometry Texturing and Material Assignment},
 journal = {arXiv preprint arXiv:2404.02899},
 year = {2024},
 }

We present MatAtlas, a method for consistent text-guided 3D model texturing. Following recent progress, we leverage a large-scale text-to-image generation model (e.g., Stable Diffusion) as a prior to texture a 3D model. We carefully design an RGB texturing pipeline that leverages a grid-pattern diffusion, driven by depth and edges. By proposing a multi-step texture refinement process, we significantly improve the quality and 3D consistency of the texturing output. To further address the problem of baked-in lighting, we move beyond RGB colors and pursue assigning parametric materials to the assets. Given the high-quality initial RGB texture, we propose a novel material retrieval method capitalizing on Large Language Models (LLMs), enabling editability and relightability. We evaluate our method on a wide variety of geometries and show that our method significantly outperforms prior art. We also analyze the role of each component through a detailed ablation study.

Generative Escher Meshes
N. Aigerman*, T. Groueix*
SIGGRAPH 2024.
@article{escher_Aigerman_2024,
 author = {Aigerman, Noam and Groueix, Thibault},
 title = {Generative Escher Meshes},
 journal = {ACM Transactions on Graphics (SIGGRAPH)},
 year = {2024},
 }

This paper proposes a fully automatic, text-guided generative method for producing periodic, repeating, tileable 2D art, such as that seen on floors, mosaics, ceramics, and in the work of M.C. Escher. In contrast to the standard concept of a seamless texture, i.e., square images that are seamless when tiled, our method generates non-square tilings which consist solely of repeating copies of the same object. It achieves this by optimizing both the geometry and color of a 2D mesh, in order to generate a non-square tile in the shape and appearance of the desired object, with close to no additional background details. We enable geometric optimization of tilings through our key technical contribution: an unconstrained, differentiable parameterization of the space of all possible tileable shapes for a given symmetry group. Namely, we prove that modifying the Laplacian used in a 2D mesh-mapping technique - Orbifold Tutte Embedding - can achieve all possible tiling configurations for a chosen planar symmetry group. We thus consider both the mesh's tile shape and its texture as optimizable parameters, rendering the textured mesh via a differentiable renderer. We leverage a trained image diffusion model to define a loss on the resulting image, thereby updating the mesh's parameters based on how well its appearance matches the text prompt. We show our method is able to produce plausible, appealing results, with non-trivial tiles, for a variety of different periodic tiling patterns.
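
For intuition, here is a plain Tutte embedding on a small triangulated grid, written with SciPy sparse solves: a uniform-weight Laplacian, the boundary pinned in loop order to a convex circle, and interior positions obtained from a linear system. The paper's contribution modifies this Laplacian so that the boundary itself becomes a free, tileable unknown for a chosen symmetry group; that modification is not reproduced here, and the grid mesh is just a stand-in.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Small triangulated grid patch: n x n lattice, two triangles per cell.
    n = 12
    V = n * n
    vid = lambda i, j: i * n + j
    faces = []
    for i in range(n - 1):
        for j in range(n - 1):
            faces += [(vid(i, j), vid(i + 1, j), vid(i, j + 1)),
                      (vid(i + 1, j), vid(i + 1, j + 1), vid(i, j + 1))]

    # Uniform-weight graph Laplacian built from the triangle edges.
    L = sp.lil_matrix((V, V))
    for a, b, c in faces:
        for u, v in [(a, b), (b, c), (c, a)]:
            L[u, v] = L[v, u] = -1.0
    L.setdiag(-np.asarray(L.sum(axis=1)).ravel())
    L = L.tocsr()

    # Boundary vertices in loop order, pinned to a convex circle.
    bottom = [vid(i, 0) for i in range(n)]
    right = [vid(n - 1, j) for j in range(1, n)]
    top = [vid(i, n - 1) for i in range(n - 2, -1, -1)]
    left = [vid(0, j) for j in range(n - 2, 0, -1)]
    boundary = bottom + right + top + left
    interior = sorted(set(range(V)) - set(boundary))
    theta = np.linspace(0.0, 2.0 * np.pi, len(boundary), endpoint=False)
    uv = np.zeros((V, 2))
    uv[boundary] = np.stack([np.cos(theta), np.sin(theta)], axis=1)

    # Tutte's theorem: solving for the interior yields an injective embedding.
    solve = spla.factorized(sp.csc_matrix(L[interior][:, interior]))
    rhs = -(L[interior][:, boundary] @ uv[boundary])
    uv[interior, 0], uv[interior, 1] = solve(rhs[:, 0]), solve(rhs[:, 1])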

TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations
B. Sun, T. Groueix, C. Song, Q. Huang, N. Aigerman
CVPR 2024. Highlight.
@article{tutte_Sun_2024,
 author = {Sun, Bo and Groueix, Thibault and Song, Chen and Huang, Qixing and Aigerman, Noam},
 title = {TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations},
 journal = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 year = {2024},
 }

This work proposes a novel representation of injective deformations of 3D space, which overcomes existing limitations of injective methods, namely inaccuracy, lack of robustness, and incompatibility with general learning and optimization frameworks. Our core idea is to reduce the problem to a "deep" composition of multiple 2D mesh-based piecewise-linear maps. Namely, we build differentiable layers that produce mesh deformations through Tutte's embedding (guaranteed to be injective in 2D), and compose these layers over different planes to create complex injective deformations of the 3D volume. We show our method provides the ability to efficiently and accurately optimize and learn complex deformations, outperforming other injective approaches. As a main application, we produce complex and artifact-free NeRF and SDF deformations.

Learning Continuous 3D Words for Text-to-Image Generation
T. Cheng, M. Gadelha, T. Groueix, M. Fisher, R. Mech, A. Markham, N. Trigoni
CVPR 2024.
@article{continuouscontrols_Cheng_2024,
 author = {Cheng, Ta-Ying and Gadelha, Matheus and Groueix, Thibault and Fisher, Matthew and Mech, Radomir and Markham, Andrew and Trigoni, Niki},
 title = {Learning Continuous 3D Words for Text-to-Image Generation},
 journal = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 year = {2024},
 }

Current controls over diffusion models (e.g., through text or ControlNet) for image generation fall short in recognizing abstract, continuous attributes like illumination direction or non-rigid shape change. In this paper, we present an approach for allowing users of text-to-image models to have fine-grained control of several attributes in an image. We do this by engineering special sets of input tokens that can be transformed in a continuous manner -- we call them Continuous 3D Words. These attributes can, for example, be represented as sliders and applied jointly with text prompts for fine-grained control over image generation. Given only a single mesh and a rendering engine, we show that our approach can be adopted to provide continuous user control over several 3D-aware attributes, including time-of-day illumination, bird wing orientation, dolly-zoom effect, and object poses. Our method is capable of conditioning image creation with multiple Continuous 3D Words and text descriptions simultaneously while adding no overhead to the generative process.
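
A minimal sketch of the core idea: an MLP maps a continuous scalar attribute (e.g. illumination azimuth) to a learnable token embedding that is appended to the prompt's text-token embeddings before conditioning the diffusion model. The layer sizes and the 768-dimensional token width are assumptions.

    import torch, torch.nn as nn

    class Continuous3DWord(nn.Module):
        """Maps a continuous attribute value in [0, 1] to a token embedding
        appended to the prompt's text embeddings (illustrative sketch)."""
        def __init__(self, token_dim=768):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(1, 256), nn.SiLU(),
                                     nn.Linear(256, token_dim))

        def forward(self, value, text_tokens):
            # value: (B, 1) attribute, text_tokens: (B, T, token_dim)
            word = self.mlp(value).unsqueeze(1)            # (B, 1, token_dim)
            return torch.cat([text_tokens, word], dim=1)   # extended conditioning

    cond = Continuous3DWord()
    tokens = cond(torch.tensor([[0.25]]), torch.randn(1, 77, 768))
    print(tokens.shape)   # torch.Size([1, 78, 768])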

GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence
V. N. Nguyen, T. Groueix, M. Salzmann, V. Lepetit
CVPR 2024.
@article{gigapose_Nguyen_2024,
 author = {Nguyen, Van Nguyen and Groueix, Thibault and Salzmann, Mathieu and Lepetit, Vincent},
 title = {GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence},
 journal = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 year = {2024},
 }

We present GigaPose, a fast, robust, and accurate method for CAD-based novel object pose estimation in RGB images. GigaPose first leverages discriminative templates, rendered images of the CAD models, to recover the out-of-plane rotation, and then uses patch correspondences to estimate the four remaining parameters. Our approach samples templates in only a two-degrees-of-freedom space instead of the usual three and matches the input image to the templates using fast nearest-neighbor search in feature space, resulting in a speedup factor of 35x compared to the state of the art. Moreover, GigaPose is significantly more robust to segmentation errors. Our extensive evaluation on the seven core datasets of the BOP challenge demonstrates that it achieves state-of-the-art accuracy and can be seamlessly integrated with existing refinement methods. Additionally, we show the potential of GigaPose with 3D models predicted by recent work on 3D reconstruction from a single image, relaxing the need for CAD models and making 6D object pose estimation much more convenient.
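
A toy version of the template-matching step: each template covers one out-of-plane rotation, and the query descriptor is matched to the closest template descriptor by cosine similarity; the remaining in-plane parameters would then be estimated from patch correspondences (not shown). Descriptor extraction is replaced by random features here, and the template count is an assumption.

    import torch
    import torch.nn.functional as F

    def match_template(query_feat, template_feats):
        """query_feat: (D,) image descriptor; template_feats: (T, D) descriptors of
        rendered templates, one per out-of-plane rotation. Returns best index/score."""
        sim = F.cosine_similarity(query_feat.unsqueeze(0), template_feats, dim=1)  # (T,)
        return sim.argmax().item(), sim.max().item()

    templates = F.normalize(torch.randn(162, 256), dim=1)   # e.g. 162 viewpoint templates
    query = F.normalize(torch.randn(256), dim=0)
    best_idx, score = match_template(query, templates)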

NOPE: Novel Object Pose Estimation from a Single Image
V. N. Nguyen, T. Groueix, Y. Hu, M. Salzmann, V. Lepetit
CVPR 2024.
@article{nope_Nguyen_2024,
 author = {Nguyen, Van Nguyen and Groueix, Thibault and Hu, Yinlin and Salzmann, Mathieu and Lepetit, Vincent},
 title = {NOPE: Novel Object Pose Estimation from a Single Image},
 journal = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 year = {2024},
 }

The practicality of 3D object pose estimation remains limited for many applications due to the need for prior knowledge of a 3D model and a training period for new objects. To address this limitation, we propose an approach that takes a single image of a new object as input and predicts the relative pose of this object in new images without prior knowledge of the object's 3D model and without requiring training time for new objects and categories. We achieve this by training a model to directly predict discriminative embeddings for viewpoints surrounding the object. This prediction is done using a simple U-Net architecture with attention and conditioned on the desired pose, which yields extremely fast inference. We compare our approach to state-of-the-art methods and show it outperforms them both in terms of accuracy and robustness.

PSDR-Room: Single Photo to Scene using Differentiable Rendering
K. Yan, F. Luan, M. Hašan, T. Groueix, V. Deschaintre, S. Zhao
SIGGRAPH Asia 2023.
@article{PSDR-Room_Yan_2023,
 author = {Yan, Kai and Luan, Fujun and Hašan, Miloš and Groueix, Thibault and Deschaintre, Valentin and Zhao, Shuang},
 title = {PSDR-Room: Single Photo to Scene using Differentiable Rendering},
 journal = {ACM Transactions on Graphics (SIGGRAPH Asia)},
 year = {2023},
 }

A 3D digital scene contains many components: lights, materials and geometries, interacting to reach the desired appearance. Staging such a scene is time-consuming and requires both artistic and technical skills. In this work, we propose PSDR-Room, a system that optimizes lighting as well as the pose and materials of individual objects to match a target image of a room scene, with minimal user input. To this end, we leverage a recent path-space differentiable rendering approach that provides unbiased gradients of the rendering with respect to geometry, lighting, and procedural materials, allowing us to optimize all of these components using gradient descent to visually match the input photo's appearance. We use recent single-image scene understanding methods to initialize the optimization and search for appropriate 3D models and materials. We evaluate our method on real photographs of indoor scenes and demonstrate the editability of the resulting scene components.

3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets
T. Cheng, M. Gadelha, S. Pirk, T. Groueix, R. Mech, A. Markham, N. Trigoni
ICCV 2023.
@article{3dminer_Cheng_2023,
 author = {Cheng, Ta-Ying and Gadelha, Matheus and Pirk, Sören and Groueix, Thibault and Mech, Radomir and Markham, Andrew and Trigoni, Niki},
 title = {3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets},
 journal = {International Conference on Computer Vision (ICCV)},
 year = {2023},
 }

We present 3DMiner - a pipeline for mining 3D shapes from challenging large-scale unannotated image datasets. Unlike other unsupervised 3D reconstruction methods, we assume that, within a large-enough dataset, there must exist images of objects with similar shapes but varying backgrounds, textures, and viewpoints. Our approach leverages the recent advances in learning self-supervised image representations to cluster images with geometrically similar shapes and find common image correspondences between them. We then exploit these correspondences to obtain rough camera estimates as initialization for bundle-adjustment. Finally, for every image cluster, we apply a progressive bundle-adjusting reconstruction method to learn a neural occupancy field representing the underlying shape. We show that this procedure is robust to several types of errors introduced in previous steps (e.g., wrong camera poses, images containing dissimilar shapes, etc.), allowing us to obtain shape and pose annotations for images in-the-wild. When using images from Pix3D chairs, our method is capable of producing significantly better results than state-of-the-art unsupervised 3D reconstruction techniques, both quantitatively and qualitatively. Furthermore, we show how 3DMiner can be applied to in-the-wild data by reconstructing shapes present in images from the LAION-5B dataset.

CNOS: A Strong Baseline for CAD-based Novel Object Segmentation
V. N. Nguyen, T. Hodaň, G. Ponimatkin, T. Groueix, V. Lepetit
ICCV Workshop, BOP challenge 2023.
@article{cnos_Nguyen_2023,
 author = {Nguyen, Van Nguyen and Hodaň, Tomáš and Ponimatkin, Georgy and Groueix, Thibault and Lepetit, Vincent},
 title = {CNOS: A Strong Baseline for CAD-based Novel Object Segmentation},
 journal = {ICCV Workshop, BOP challenge},
 year = {2023},
 }

We propose a simple three-stage approach to segment unseen objects in RGB images using their CAD models. Leveraging recent powerful foundation models, DINOv2 and Segment Anything, we create descriptors and generate proposals, including binary masks, for a given input RGB image. By matching proposals with reference descriptors created from CAD models, we achieve precise object ID assignment along with modal masks. We experimentally demonstrate that our method achieves state-of-the-art results in CAD-based novel object segmentation, surpassing existing approaches on the seven core datasets of the BOP challenge by 19.8% AP using the same BOP evaluation protocol. Our source code is publicly available.
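
A minimal sketch of the matching stage, assuming proposal descriptors (e.g. features of masked crops) and reference descriptors rendered from each CAD model are already computed; assigning each proposal the object ID of its most similar reference by cosine similarity is the essence of the step. Feature dimensions and counts are placeholders.

    import torch
    import torch.nn.functional as F

    def assign_object_ids(proposal_desc, ref_desc, ref_obj_ids):
        """proposal_desc: (P, D) descriptors of segmentation proposals.
        ref_desc: (R, D) descriptors of CAD-model renderings.
        ref_obj_ids: (R,) object ID of each rendering.
        Returns per-proposal object IDs and matching scores."""
        sim = F.normalize(proposal_desc, dim=1) @ F.normalize(ref_desc, dim=1).T  # (P, R)
        score, idx = sim.max(dim=1)
        return ref_obj_ids[idx], score

    props = torch.randn(30, 1024)                 # e.g. features of 30 mask proposals
    refs, ids = torch.randn(42 * 7, 1024), torch.arange(7).repeat_interleave(42)
    obj_ids, scores = assign_object_ids(props, refs, ids)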

TextDeformer: Geometry Manipulation using Text Guidance
W. Gao, N. Aigerman, T. Groueix, V. G. Kim, R. Hanocka
SIGGRAPH 2023.
@article{textdeformer_Gao_2023,
 author = {Gao, William and Aigerman, Noam and Groueix, Thibault and Kim, Vladimir G. and Hanocka, Rana},
 title = {TextDeformer: Geometry Manipulation using Text Guidance},
 journal = {ACM Transactions on Graphics (SIGGRAPH)},
 year = {2023},
 }

We present a technique for automatically producing a deformation of an input triangle mesh, guided solely by a text prompt. Our framework is capable of deformations that produce both large, low-frequency shape changes, and small high-frequency details. Our framework relies on differentiable rendering to connect geometry to powerful pre-trained image encoders, such as CLIP and DINO. Notably, updating mesh geometry by taking gradient steps through differentiable rendering is notoriously challenging, commonly resulting in deformed meshes with significant artifacts. These difficulties are amplified by noisy and inconsistent gradients from CLIP. To overcome this limitation, we opt to represent our mesh deformation through Jacobians, which update deformations in a global, smooth manner (rather than in locally sub-optimal steps). Our key observation is that Jacobians are a representation that favors smoother, large deformations, leading to a global relation between vertices and pixels, and avoiding localized noisy gradients. Additionally, to ensure the resulting shape is coherent from all 3D viewpoints, we encourage the deep features computed on the 2D encoding of the rendering to be consistent for a given vertex from all viewpoints. We demonstrate that our method is capable of smoothly deforming a wide variety of source meshes and target text prompts, achieving both large modifications to, e.g., body proportions of animals, as well as adding fine semantic details, such as shoe laces on an army boot and fine details of a face.

Neural Face Rigging for Animating and Retargeting Facial Meshes in the Wild
D. Qin, J. Saito, N. Aigerman, T. Groueix, T. Komura
SIGGRAPH 2023.
@article{nfr_Qin_2023,
 author = {Qin, Dafei and Saito, Jun and Aigerman, Noam and Groueix, Thibault and Komura, Taku},
 title = {Neural Face Rigging for Animating and Retargeting Facial Meshes in the Wild},
 journal = {ACM Transactions on Graphics (SIGGRAPH)},
 year = {2023},
 }

We propose an end-to-end deep-learning approach for automatic rigging and retargeting of 3D models of human faces in the wild. Our approach, called Neural Face Rigging (NFR), holds three key properties: (i) NFR's expression space maintains human-interpretable editing parameters for artistic controls; (ii) NFR is readily applicable to arbitrary facial meshes with different connectivity and expressions; (iii) NFR can encode and produce fine-grained details of complex expressions performed by arbitrary subjects. To the best of our knowledge, NFR is the first approach to provide realistic and controllable deformations of in-the-wild facial meshes, without the manual creation of blendshapes or correspondence. We design a deformation autoencoder and train it through a multi-dataset training scheme, which benefits from the unique advantages of two data sources: a linear 3DMM with interpretable control parameters as in FACS, and 4D captures of real faces with fine-grained details. Through various experiments, we show NFR's ability to automatically produce realistic and accurate facial deformations across a wide range of existing datasets as well as noisy facial scans in-the-wild, while providing artist-controlled, editable parameters.

PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling
F. Baradel, R. Bregier, T. Groueix, P. Weinzaepfel, Y. Kalantidis, G. Rogez
TPAMI 2022.
@article{posebertpami_Baradel_2022,
 author = {Baradel, Fabien and Bregier, Romain and Groueix, Thibault and Weinzaepfel, Philippe and Kalantidis, Yannis and Rogez, Gregory},
 title = {PoseBERT: A Generic Transformer Module for Temporal 3D Human Modeling},
 journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)},
 year = {2022},
 }

Training state-of-the-art models for human pose estimation in videos requires datasets with annotations that are really hard and expensive to obtain. Although transformers have been recently utilized for body pose sequence modeling, related methods rely on pseudo-ground truth to augment the currently limited training data available for learning such models. In this paper, we introduce PoseBERT, a transformer module that is fully trained on 3D Motion Capture (MoCap) data via masked modeling. It is simple, generic and versatile, as it can be plugged on top of any image-based model to transform it into a video-based model leveraging temporal information. We showcase variants of PoseBERT with different inputs varying from 3D skeleton keypoints to rotations of a 3D parametric model for either the full body (SMPL) or just the hands (MANO). Since PoseBERT training is task agnostic, the model can be applied to several tasks such as pose refinement, future pose prediction or motion completion without finetuning. Our experimental results validate that adding PoseBERT on top of various state-of-the-art pose estimation methods consistently improves their performance, while its low computational cost allows us to use it in a real-time demo for smoothly animating a robotic hand via a webcam. Test code and models are available at https://github.com/naver/posebert.
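
A compact sketch of the masked-modeling setup on MoCap sequences: random frames of a pose sequence are replaced by a learned mask token and a transformer encoder is trained to reconstruct them. The pose dimensionality, masking ratio, model width, and loss are assumptions, not the released architecture.

    import torch, torch.nn as nn

    class PoseBERTSketch(nn.Module):
        def __init__(self, pose_dim=72, d_model=256, n_layers=4):
            super().__init__()
            self.embed = nn.Linear(pose_dim, d_model)
            self.mask_token = nn.Parameter(torch.zeros(d_model))
            layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.head = nn.Linear(d_model, pose_dim)

        def forward(self, poses, mask):
            # poses: (B, T, pose_dim), mask: (B, T) bool, True = hidden frame
            x = self.embed(poses)
            x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
            return self.head(self.encoder(x))              # reconstructed pose sequence

    model = PoseBERTSketch()
    poses = torch.randn(8, 64, 72)                         # e.g. SMPL axis-angle sequences
    mask = torch.rand(8, 64) < 0.15                        # hide ~15% of the frames
    loss = ((model(poses, mask) - poses)[mask] ** 2).mean()
    loss.backward()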

Recovering Detail in 3D Shapes Using Disparity Maps
M. Ramirez de Chanlatte, M. Gadelha, T. Groueix, R. Mech
ECCV Workshop, Learning to Generate 3D Shapes and Scenes 2022.
@article{RecoveringDetails_Chanlatte_2022,
 author = {Ramirez de Chanlatte, Marissa and Gadelha, Matheus and Groueix, Thibault and Mech, Radomir},
 title = {Recovering Detail in 3D Shapes Using Disparity Maps},
 journal = {ECCV Workshop, Learning to Generate 3D Shapes and Scenes},
 year = {2022},
 }

We present a fine-tuning method to improve the appearance of 3D geometries reconstructed from single images. We leverage advances in monocular depth estimation to obtain disparity maps and present a novel approach to transforming 2D normalized disparity maps into 3D point clouds by using shape priors to solve an optimization on the relevant camera parameters. After creating a 3D point cloud from disparity, we introduce a method to combine the new point cloud with existing information to form a more faithful and detailed final geometry. We demonstrate the efficacy of our approach with multiple experiments on both synthetic and real images.
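
A small sketch of the backprojection step: turning a normalized disparity map into a 3D point cloud under a pinhole camera, with the focal length and a disparity-to-depth scale/shift treated as the free parameters one would optimize against the shape prior. The depth model and the numeric values are assumptions.

    import numpy as np

    def disparity_to_pointcloud(disp, f, scale, shift):
        """disp: (H, W) normalized disparity in [0, 1]; f: focal length in pixels.
        Depth is modeled as 1 / (scale * disp + shift); scale and shift are the
        camera-related unknowns to optimize against a shape prior (sketch only)."""
        H, W = disp.shape
        depth = 1.0 / np.clip(scale * disp + shift, 1e-6, None)
        u, v = np.meshgrid(np.arange(W), np.arange(H))     # pixel coordinates
        x = (u - W / 2) * depth / f
        y = (v - H / 2) * depth / f
        return np.stack([x, y, depth], axis=-1).reshape(-1, 3)   # (H*W, 3) points

    disp = np.random.rand(192, 256)            # placeholder monocular disparity map
    pts = disparity_to_pointcloud(disp, f=300.0, scale=2.0, shift=0.5)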

Learning Joint Surface Atlases
T. Deprelle, T. Groueix, N. Aigerman, V. G. Kim, M. Aubry
ECCV Workshop, Learning to Generate 3D Shapes and Scenes 2022.
@article{jointatlas_Deprelle_2022,
 author = {Deprelle, Theo and Groueix, Thibault and Aigerman, Noam and Kim, Vladimir G. and Aubry, Mathieu},
 title = {Learning Joint Surface Atlases},
 journal = {ECCV Workshop, Learning to Generate 3D Shapes and Scenes},
 year = {2022},
 }

This paper describes new techniques for learning atlas-like representations of 3D surfaces, i.e. homeomorphic transformations from a 2D domain to surfaces. Compared to prior work, we propose two major contributions. First, instead of mapping a fixed 2D domain, such as a set of square patches, to the surface, we learn a continuous 2D domain with arbitrary topology by optimizing a point sampling distribution represented as a mixture of Gaussians. Second, we learn consistent mappings in both directions: charts, from the 3D surface to 2D domain, and parametrizations, their inverse. We demonstrate that this improves the quality of the learned surface representation, as well as its consistency in a collection of related shapes. It thus leads to improvements for applications such as correspondence estimation, texture transfer, and consistent UV mapping. As an additional technical contribution, we outline that, while incorporating normal consistency has clear benefits, it leads to issues in the optimization, and that these issues can be mitigated using a simple repulsive regularization. We demonstrate that our contributions provide better surface representation than existing baselines.

Neural Jacobian Fields: Learning Intrinsic Mappings of Arbitrary Meshes
N. Aigerman, K. Gupta, V. G. Kim, S. Chaudhuri, J. Saito, T. Groueix
SIGGRAPH 2022.
@article{njf_Aigerman_2022,
 author = {Aigerman, Noam and Gupta, Kunal and Kim, Vladimir G. and Chaudhuri, Siddhartha and Saito, Jun and Groueix, Thibault},
 title = {Neural Jacobian Fields: Learning Intrinsic Mappings of Arbitrary Meshes},
 journal = {ACM Transactions on Graphics (SIGGRAPH)},
 year = {2022},
 }

This paper introduces a framework designed to accurately predict piecewise linear mappings of arbitrary meshes via a neural network, enabling training and evaluating over heterogeneous collections of meshes that do not share a triangulation, as well as producing highly detail-preserving maps whose accuracy exceeds current state of the art. The framework is based on reducing the neural aspect to a prediction of a matrix for a single given point, conditioned on a global shape descriptor. The field of matrices is then projected onto the tangent bundle of the given mesh, and used as candidate Jacobians for the predicted map. The map is computed by a standard Poisson solve, implemented as a differentiable layer with cached pre-factorization for efficient training. This construction is agnostic to the triangulation of the input, thereby enabling applications on datasets with varying triangulations. At the same time, by operating in the intrinsic gradient domain of each individual mesh, it allows the framework to predict highly-accurate mappings. We validate these properties by conducting experiments over a broad range of scenarios, from semantic ones such as morphing, registration, and deformation transfer, to optimization-based ones, such as emulating elastic deformations and contact correction, as well as being the first work, to our knowledge, to tackle the task of learning to compute UV parameterizations of arbitrary meshes. The results exhibit the high accuracy of the method as well as its versatility, as it is readily applied to the above scenarios without any changes to the framework.
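
A sketch of the prediction half only: an MLP outputs one 3x3 candidate Jacobian per face, conditioned on the face centroid and a global shape code. The projection onto the tangent bundle and the differentiable Poisson solve that turn these matrices into a map are the paper's machinery and are not reproduced here; layer sizes are assumptions.

    import torch, torch.nn as nn

    class JacobianPredictor(nn.Module):
        """Per-face 3x3 matrix prediction conditioned on a global shape code
        (illustrative; the Poisson solve producing the final map is omitted)."""
        def __init__(self, code_dim=256):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(3 + code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 256), nn.ReLU(),
                                     nn.Linear(256, 9))

        def forward(self, centroids, code):
            # centroids: (F, 3) face centroids; code: (code_dim,) global descriptor
            x = torch.cat([centroids, code.expand(len(centroids), -1)], dim=1)
            return self.mlp(x).view(-1, 3, 3)   # one candidate Jacobian per face

    pred = JacobianPredictor()
    jac = pred(torch.rand(2000, 3), torch.randn(256))   # (2000, 3, 3)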

Leveraging MoCap Data for Human Mesh Recovery
F. Baradel, T. Groueix, R. Bregier, P. Weinzaepfel, Y. Kalantidis, G. Rogez
3DV 2021.
@article{LeveragingMoCap_Baradel_2021,
 author = {Baradel, Fabien and Groueix, Thibault and Bregier, Romain and Weinzaepfel, Philippe and Kalantidis, Yannis and Rogez, Gregory},
 title = {Leveraging MoCap Data for Human Mesh Recovery},
 journal = {International Conference on 3D Vision (3DV)},
 year = {2021},
 }

Training state-of-the-art models for human body pose and shape recovery from images or videos requires datasets with corresponding annotations that are really hard and expensive to obtain. Our goal in this paper is to study whether poses from 3D Motion Capture (MoCap) data can be used to improve image-based and video-based human mesh recovery methods. We find that fine-tuning image-based models with synthetic renderings from MoCap data can increase their performance, by providing them with a wider variety of poses, textures and backgrounds. In fact, we show that simply fine-tuning the batch normalization layers of the model is enough to achieve large gains. We further study the use of MoCap data for video, and introduce PoseBERT, a transformer module that directly regresses the pose parameters and is trained via masked modeling. It is simple, generic and can be plugged on top of any state-of-the-art image-based model in order to transform it into a video-based model leveraging temporal information. Our experimental results show that the proposed approaches reach state-of-the-art performance on various datasets including 3DPW, MPI-INF-3DHP, MuPoTS-3D, MCB and AIST. Test code and models will be available soon.
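
The batch-norm-only fine-tuning trick is easy to illustrate: freeze every parameter of a pretrained image-based model except those belonging to BatchNorm layers, then train as usual on the synthetic MoCap renderings. The ResNet-50 stand-in below is an assumption; in practice this would be the pretrained mesh-recovery backbone.

    import torch.nn as nn
    import torchvision

    # Randomly initialized stand-in; load the pretrained backbone in practice.
    model = torchvision.models.resnet50(weights=None)

    # Freeze everything, then re-enable only the BatchNorm affine parameters.
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            for p in m.parameters():
                p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    print(sum(p.numel() for p in trainable), "trainable parameters")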

Deep Transformation-Invariant Clustering
T. Monnier, T. Groueix, M. Aubry
NeurIPS 2020. Oral.
@article{dticlustering_Monnier_2020,
 author = {Monnier, Tom and Groueix, Thibault and Aubry, Mathieu},
 title = {Deep Transformation-Invariant Clustering},
 journal = {Conference on Neural Information Processing Systems (NeurIPS)},
 year = {2020},
 }

Recent advances in image clustering typically focus on learning better deep representations. In contrast, we present an orthogonal approach that does not rely on abstract features but instead learns to predict image transformations and directly performs clustering in pixel space. This learning process naturally fits in the gradient-based training of K-means and Gaussian mixture models, without requiring any additional loss or hyper-parameters. It leads us to two new deep transformation-invariant clustering frameworks, which jointly learn prototypes and transformations. More specifically, we use deep learning modules that enable us to resolve invariance to spatial, color and morphological transformations. Our approach is conceptually simple and comes with several advantages, including the possibility to easily adapt the desired invariance to the task and a strong interpretability of both cluster centers and assignments to clusters. We demonstrate that our novel approach yields competitive and highly promising results on standard image clustering benchmarks. Finally, we showcase its robustness and the advantages of its improved interpretability by visualizing clustering results over real photograph collections.
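
A toy sketch of transformation-invariant K-means in pixel space: prototypes are learnable images, a small network predicts an affine alignment of each prototype to each input, and the loss is the distance to the best-aligned prototype. The single affine module, the linear transformation predictor, and the MNIST-like sizes are simplifications of the paper's full transformation stack.

    import torch, torch.nn as nn
    import torch.nn.functional as F

    class DTIKMeansSketch(nn.Module):
        """Simplified deep transformation-invariant K-means: prototypes live in
        pixel space and an affine transform aligns each prototype to the image
        before the K-means assignment."""
        def __init__(self, K=10, H=28, W=28):
            super().__init__()
            self.K, self.H, self.W = K, H, W
            self.prototypes = nn.Parameter(torch.rand(K, 1, H, W))
            self.affine = nn.Linear(H * W, K * 6)
            # Initialize the predicted transforms at the identity.
            nn.init.zeros_(self.affine.weight)
            with torch.no_grad():
                self.affine.bias.copy_(
                    torch.tensor([1., 0., 0., 0., 1., 0.]).repeat(K))

        def forward(self, x):                      # x: (B, 1, H, W)
            B, K = x.shape[0], self.K
            theta = self.affine(x.flatten(1)).view(B * K, 2, 3)
            proto = self.prototypes.unsqueeze(0).expand(B, -1, -1, -1, -1)
            proto = proto.reshape(B * K, 1, self.H, self.W)
            grid = F.affine_grid(theta, proto.shape, align_corners=False)
            aligned = F.grid_sample(proto, grid, align_corners=False)
            dist = ((aligned.view(B, K, -1) - x.flatten(1).unsqueeze(1)) ** 2).mean(-1)
            return dist.min(dim=1)                 # (per-sample loss, cluster assignment)

    model = DTIKMeansSketch()
    images = torch.rand(32, 1, 28, 28)             # e.g. an MNIST-like batch
    loss, assignment = model(images)
    loss.mean().backward()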

Learning elementary structures for 3D shape generation and matching
T. Deprelle, T. Groueix, M. Fisher, V. G. Kim, B. Russell, M. Aubry
NeurIPS 2019.
@article{elementarystructures_Deprelle_2019,
 author = {Deprelle, Theo and Groueix, Thibault and Fisher, Matthew and Kim, Vladimir G. and Russell, Bryan and Aubry, Mathieu},
 title = {Learning elementary structures for 3D shape generation and matching},
 journal = {Conference on Neural Information Processing Systems (NeurIPS)},
 year = {2019},
 }

We propose to represent shapes as the deformation and combination of learnable elementary 3D structures, which are primitives resulting from training over a collection of shapes. We demonstrate that the learned elementary 3D structures lead to clear improvements in 3D shape generation and matching. More precisely, we present two complementary approaches for learning elementary structures: (i) patch deformation learning and (ii) point translation learning. Both approaches can be extended to abstract structures of higher dimensions for improved results. We evaluate our method on two tasks: reconstructing ShapeNet objects and estimating dense correspondences between human scans (FAUST inter challenge). We show a 16% improvement over surface deformation approaches for shape reconstruction and outperform the FAUST inter challenge state of the art by 6%.

Unsupervised cycle-consistent deformation for shape matching
T. Groueix, M. Fisher, V. G. Kim, B. Russell, M. Aubry
SGP 2019.
@article{cycleconsistentdeformation_Groueix_2019,
 author = {Groueix, Thibault and Fisher, Matthew and Kim, Vladimir G. and Russell, Bryan and Aubry, Mathieu},
 title = {Unsupervised cycle-consistent deformation for shape matching},
 journal = {Eurographics Symposium on Geometry Processing (SGP)},
 year = {2019},
 }

We propose a self-supervised approach to deep surface deformation. Given a pair of shapes, our algorithm directly predicts a parametric transformation from one shape to the other respecting correspondences. Our insight is to use cycle-consistency to define a notion of good correspondences in groups of objects and use it as a supervisory signal to train our network. Our method does not rely on a template, does not assume near-isometric deformations, and does not require point-correspondence supervision. We demonstrate the efficacy of our approach by using it to transfer segmentation across shapes. We show, on ShapeNet, that our approach is competitive with comparable state-of-the-art methods when annotated training data is readily available, but outperforms them by a large margin in the few-shot segmentation scenario.
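
A sketch of the cycle-consistency signal on a triplet of shapes: composing the predicted deformations A to B to C and back to A should return every point of A to itself, which gives a supervisory loss without any ground-truth correspondences. The placeholder MLP conditioned on a target shape code is an assumption, not the paper's architecture.

    import torch, torch.nn as nn

    class DeformNet(nn.Module):
        """Placeholder deformation network: moves source points toward a target
        shape described by a global code."""
        def __init__(self, code_dim=128):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(3 + code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, 3))

        def forward(self, pts, target_code):
            x = torch.cat([pts, target_code.expand(len(pts), -1)], dim=1)
            return pts + self.mlp(x)                       # deformed points

    net = DeformNet()
    pts_a = torch.rand(1024, 3)                            # points sampled on shape A
    code_a, code_b, code_c = (torch.randn(128) for _ in range(3))

    # A -> B -> C -> A: the composition should be the identity on shape A.
    cycle = net(net(net(pts_a, code_b), code_c), code_a)
    cycle_loss = (cycle - pts_a).pow(2).sum(dim=1).mean()
    cycle_loss.backward()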

3D-CODED : 3D Correspondences by Deep Deformation
T. Groueix, M. Fisher, V. G. Kim, B. Russell, M. Aubry
ECCV 2018.
@article{3D-CODED_Groueix_2018,
 author = {Groueix, Thibault and Fisher, Matthew and Kim, Vladimir G. and Russell, Bryan and Aubry, Mathieu},
 title = {3D-CODED : 3D Correspondences by Deep Deformation},
 journal = {European Conference on Computer Vision (ECCV)},
 year = {2018},
 }

We present a new deep learning approach for matching deformable shapes by introducing Shape Deformation Networks, which jointly encode 3D shapes and correspondences. This is achieved by factoring the surface representation into (i) a template, that parameterizes the surface, and (ii) a learnt global feature vector that parameterizes the transformation of the template into the input surface. By predicting this feature for a new shape, we implicitly predict correspondences between this shape and the template. We show that these correspondences can be improved by an additional step which improves the shape feature by minimizing the Chamfer distance between the input and the transformed template. We demonstrate that our simple approach improves on state-of-the-art results on the difficult FAUST-inter challenge, with an average correspondence error of 2.88cm. We show, on the TOSCA dataset, that our method is robust to many types of perturbations, and generalizes to non-human shapes. This robustness allows it to perform well on real, unclean meshes from the SCAPE dataset.
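
A sketch of the refinement step mentioned above: starting from the encoder's shape feature, gradient descent on a symmetric Chamfer distance between the deformed template and the input scan further improves the correspondences. The decoder here is a placeholder MLP, and Chamfer is written with torch.cdist for clarity rather than speed.

    import torch, torch.nn as nn

    def chamfer(a, b):
        """Symmetric Chamfer distance between point sets a: (N, 3) and b: (M, 3)."""
        d = torch.cdist(a, b)                              # (N, M) pairwise distances
        return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

    decoder = nn.Sequential(nn.Linear(3 + 1024, 512), nn.ReLU(),
                            nn.Linear(512, 3))             # placeholder template decoder

    template = torch.rand(2500, 3)                         # points on the template
    target = torch.rand(2500, 3)                           # input scan points
    code = torch.randn(1024, requires_grad=True)           # shape feature from the encoder

    opt = torch.optim.Adam([code], lr=1e-3)
    for _ in range(100):                                   # test-time refinement
        opt.zero_grad()
        deformed = decoder(torch.cat([template, code.expand(2500, -1)], dim=1))
        loss = chamfer(deformed, target)
        loss.backward()
        opt.step()
    # template -> deformed now gives dense correspondences between template and scan.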

AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation
T. Groueix, M. Fisher, V. G. Kim, B. Russell, M. Aubry
CVPR 2018. Spotlight, Best Poster Award at PAISS.
@article{atlasnet_Groueix_2018,
 author = {Groueix, Thibault and Fisher, Matthew and Kim, Vladimir G. and Russell, Bryan and Aubry, Mathieu},
 title = {AtlasNet: A Papier-Mâché Approach to Learning 3D Surface Generation},
 journal = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
 year = {2018},
 }

We introduce a method for learning to generate the surface of 3D shapes. Our approach represents a 3D shape as a collection of parametric surface elements and, in contrast to methods generating voxel grids or point clouds, naturally infers a surface representation of the shape. Beyond its novelty, our new shape generation framework, AtlasNet, comes with significant advantages, such as improved precision and generalization capabilities, and the possibility to generate a shape of arbitrary resolution without memory issues. We demonstrate these benefits and compare to strong baselines on the ShapeNet benchmark for two applications: (i) auto-encoding shapes, and (ii) single-view reconstruction from a still image. We also provide results showing its potential for other applications, such as morphing, parametrization, super-resolution, matching, and co-segmentation.
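
A minimal sketch of one AtlasNet-style surface element: an MLP maps 2D points sampled in the unit square, concatenated with the shape's latent code, to points on the 3D surface, so the surface can be resampled at any density. Layer sizes are assumptions; the full model uses several such learned parameterizations trained with a point-set loss such as Chamfer distance.

    import torch, torch.nn as nn

    class SurfaceElement(nn.Module):
        """Maps (u, v) samples of the unit square plus a shape code to 3D points."""
        def __init__(self, code_dim=1024):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(2 + code_dim, 512), nn.ReLU(),
                                     nn.Linear(512, 512), nn.ReLU(),
                                     nn.Linear(512, 3), nn.Tanh())

        def forward(self, uv, code):
            # uv: (N, 2) in [0, 1]^2, code: (code_dim,) latent shape descriptor
            return self.mlp(torch.cat([uv, code.expand(len(uv), -1)], dim=1))

    element = SurfaceElement()
    uv = torch.rand(2500, 2)                   # resample at any density at test time
    points = element(uv, torch.randn(1024))    # (2500, 3) points on the predicted surface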

Interactive Monte-Carlo Ray-Tracing Upsampling
M. Boughida, T. Groueix, T. Boubekeur
Eurographics 2016. Poster.
@article{bilateralupsampling_Boughida_2016,
 author = {Boughida, Malik and Groueix, Thibault and Boubekeur, Tamy},
 title = {Interactive Monte-Carlo Ray-Tracing Upsampling},
 journal = {Eurographics},
 year = {2016},
 }



PhD Thesis

Learning 3D Generation and Matching
Thibault Groueix
École Nationale des Ponts et Chaussées (ENPC), 2020.
AFIA award finalist, Gilles Kahn award finalist
@phdthesis{groueix2018thesis,
 title = {Learning 3D Generation and Matching},
 author = {Groueix, Thibault},
 school = {Ecole Nationale des Ponts et Chaussees},
 year = {2020},
 }

The goal of this thesis is to develop deep learning approaches to model and analyse 3D shapes. Progress in this field could democratize artistic creation of 3D assets which currently requires time and expert skills with technical software. We focus on the design of deep learning solutions for two particular tasks, key to many 3D modeling applications: single-view reconstruction and shape matching.

A single-view reconstruction (SVR) method takes as input a single image and predicts a 3D model of the physical world which produced that image. SVR dates back to the early days of computer vision. In particular, in the 1960s, Lawrence G. Roberts proposed to align simple 3D primitives to an input image, making the assumption that the physical world is made of simple geometric shapes like cuboids. Another approach, proposed by Berthold Horn in the 1970s, is to decompose the input image into intrinsic images and use those to predict the depth of every input pixel. Since several configurations of shapes, texture and illumination can explain the same image, both approaches need to make assumptions on the distribution of textures and 3D shapes to resolve the ambiguity. In this thesis, we learn these assumptions from large-scale datasets instead of manually designing them. Learning SVR also makes it possible to reconstruct complete 3D models, including parts which are not visible in the input image.
Shape matching aims at finding correspondences between 3D objects. Solving this task requires both a local and a global understanding of 3D shapes, which is hard to achieve. We propose to train neural networks on large-scale datasets to solve this task and capture knowledge implicitly through their internal parameters. Shape matching supports many 3D modeling applications such as attribute transfer, automatic rigging for animation, or mesh editing.

The first technical contribution of this thesis is a new parametric representation of 3D surfaces which we model using neural networks. The choice of data representation is a critical aspect of any 3D reconstruction algorithm. Until recently, most of the approaches in deep 3D model generation were predicting volumetric voxel grids or point clouds, which are discrete representations. Instead, we present an alternative approach that predicts a parametric surface deformation, i.e., a mapping from a template to a target geometry. To demonstrate the benefits of such a representation, we train a deep encoder-decoder for single-view reconstruction using our new representation. Our approach, dubbed AtlasNet, is the first deep single-view reconstruction approach able to reconstruct meshes from images without relying on independent post-processing, and it can perform such a reconstruction at arbitrary resolution without memory issues. A more detailed analysis of AtlasNet reveals it also generalizes better to categories it has not been trained on than other deep 3D generation approaches.
Our second main contribution is a novel shape matching approach based purely on reconstruction via deformations. We show that the quality of the shape reconstructions is critical to obtain good correspondences, and therefore introduce a test-time optimization scheme to refine the learned deformations. For humans and other deformable shape categories deviating by a near-isometry, our approach can leverage a shape template and isometric regularization of the surface deformations. As categories exhibiting non-isometric variations, such as chairs, do not have a clear template, we also learn how to deform any shape into any other and leverage cycle-consistency constraints to learn meaningful correspondences. Our matching-by-reconstruction strategy operates directly on point clouds, is robust to many types of perturbations, and outperformed the state of the art by 15% on dense matching of real human scans.


Teaching

  • Fall   2018 Traitement de l'information et vision artificielle (TIVA), TA - ENPC Master 1, École Nationale des Ponts et Chaussées
  • Fall   2018 Apprentissage statistique (MALAP), TA - ENPC Master 1, École Nationale des Ponts et Chaussées
  • Fall   2017 Traitement de l'information et vision artificielle (TIVA), TA - ENPC Master 1, École Nationale des Ponts et Chaussées
  • Fall   2017 Apprentissage statistique (MALAP), TA - ENPC Master 1, École Nationale des Ponts et Chaussées


Talks

Deep 3D deformations (slides)
T. Groueix
This talk covers my PhD work.

Tutorial: Deep Learning for 3D surface reconstruction (slides)
T. Groueix*, P-A Langlois*


Code and demo

  • NeuralJacobianFields
  • AtlasNet
  • Atlasnet v2
  • 3D-CODED
  • DTI clustering
  • CycleConsistentDeformation
  • Netvision
  • ChamferDistancePytorch