Welcome to my webpage!

I am an AI Research Engineer at Meta AI. Before that, I was a PhD Student in Computer Vision and Machine Learning in the IMAGINE team of Ecole des Ponts Paristech, where I worked on 3D Scene Understanding from images under the supervision of Prof. Vincent Lepetit, with a focus on single image 3D geometry estimation.

I received an MRes degree in Mathematics, Vision, and Learning (MVA) from ENS Paris-Saclay. Before that, I obtained joint MS degrees from Imperial College London and Institut d’Optique Graduate School Paristech in Optics, Physics and Signal Processing.

News

08/2025: DINOv3 is out! Check out the blog post for amazing visuals and enjoy our code. Super proud of our work!
09/2024: I’ll be in Milano 🍕 at ECCV 2024 to present CLIP-DINOiser, our work on densifying CLIP features - congrats Monika for spearheading this great project!
10/2023: I’ll be with the DINOv2 team at ICCV 2023 in Paris, come say hi at our demo booth!
09/2022: I joined Meta AI as an AI Research Engineer.
07/2022: MonteBoxFinder has been accepted at ECCV 2022.
05/2022: I am honoured to be acknowledged as Outstanding Reviewer for CVPR 2022.
05/2022: I will be attending ICVSS this summer!
01/2022: I gave an invited talk to the Semantic Perception Reading Group of Google Zurich.
08/2021: I am excited to join Meta Reality Labs in Zurich for a research internship this fall!
02/2021: WaveletMonoDepth has been accepted at CVPR 2021. We use the sparsity property of wavelets to improve efficiency in Monocular Depth Estimation methods.
11/2020: I finished my 4-month internship at Niantic (London), where I was supervised by Dr. Daniyar Turmukhambetov and Dr. Michael Firman.
02/2020: Our paper has been accepted at CVPR 2020.
10/2019: I am attending ICCV 2019 in Seoul to present my paper SharpNet. Meet me at the 3D Reconstruction in the Wild Workshop this Monday 28th afternoon!
10/2019: I attended PRAIRIE AI Summer School (PAISS).
09/2019: Our SyntheT-Less dataset presented at 3DV 2019 (paper) is now available here.
09/2019: I attended 3DV 2019 in Quebec City.

Publications

EPFv2

Zhenyu Li, Sai Kumar Dwivedi, Filip Maric, Carlos Chacon, Nadine Bertsch, Filippo Arcadu, Tomas Hodan, Michael Ramamonjisoa, Peter Wonka, Amy Zhao, Robin Kips, Cem Keskin, Anastasia Tkach, Chenhongyi Yang

Egocentric 3D human motion estimation is essential for AR/VR experiences, yet remains challenging due to limited body coverage from the egocentric viewpoint, frequent occlusions, and scarce labeled data. We present EgoPoseFormer v2, a method that addresses these challenges through two key contributions: (1) a transformer-based model for temporally consistent and spatially grounded body pose estimation, and (2) an auto-labeling system that enables the use of large unlabeled datasets for training. The proposed model is fully differentiable, introduces identity-conditioned queries, multi-view spatial refinement, causal temporal attention, and supports both keypoints and parametric body representations under a constant compute budget. The proposed auto-labeling system scales learning to tens of millions of unlabeled frames via uncertainty-aware semi-supervised training. The system follows a teacher-student schema to generate pseudo-labels and guide training with uncertainty distillation, enabling the model to generalize to different environments. In experiments on the EgoBody3M benchmark, EgoPoseFormer v2 outperforms two state-of-the-art methods by 12.2% and 19.4% in accuracy, and reduces temporal jitter by 22.2% and 51.7%, respectively. Furthermore, our auto-labeling system additionally improves the wrist MPJPE by 13.1%.

@InProceedings{Li_2026_CVPR, 
 author    = {Li, Zhenyu and Dwivedi, Sai Kumar and Maric, Filip and Chac{\'{o}}n, Carlos and Bertsch, Nadine and Arcadu, Filippo and Hodan, Tomas and Ramamonjisoa, Michael and Wonka, Peter and Zhao, Amy and Kips, Robin and Keskin, Cem and Tkach, Anastasia and Yang, Chenhongyi}, 
 title     = {EgoPoseFormer v2: Accurate Egocentric Human Motion Estimation for AR/VR}, 
 booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
 month     = {June}, 
 year      = {2026}, 
 pages     = {21121-21131}
 }

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Spotlight), 2026

DINOv3

Oriane Simeoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michael Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothee Darcet, Theo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Herve Jegou, Patrick Labatut, Piotr Bojanowski

Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images, using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models’ flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.

@misc{simeoni2025dinov3, 
 title={DINOv3}, 
 author={Sim{\'{e}}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"{e}}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'{e}}e and Moutakanni, Th{\'{e}}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'{e}}gou, Herv{\'{e}} and Labatut, Patrick and Bojanowski, Piotr}, 
 year={2025}, 
 eprint={2508.10104}, 
 archivePrefix={arXiv}, 
 primaryClass={cs.CV}, 
 url={https://arxiv.org/abs/2508.10104}, 
 }

Published in TMLR, 2025

DINO.txt

Cijo Jose, Théo Moutakanni, Dahyun Kang, Federico Baldassarre, Timothée Darcet, Hu Xu, Daniel Li, Marc Szafraniec, Michaël Ramamonjisoa, Maxime Oquab, Oriane Siméoni, Huy V. Vo, Patrick Labatut, Piotr Bojanowski

Self-supervised visual foundation models produce powerful embeddings that achieve remarkable performance on a wide range of downstream tasks. However, unlike vision-language models such as CLIP, self-supervised visual features are not readily aligned with language, hindering their adoption in open-vocabulary tasks. Our method unlocks this new ability for DINOv2, a widely used self-supervised visual encoder. We build upon the LiT training strategy, which trains a text encoder to align with a frozen vision model but leads to unsatisfactory results on dense tasks. We propose several key ingredients to improve performance on both global and dense tasks, such as concatenating the [CLS] token with the patch average to train the alignment and curating data using both text and image modalities. With these, we successfully train a CLIP-like model with only a fraction of the computational cost compared to CLIP while achieving state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.

@InProceedings{Jose_2025_CVPR, 
 author    = {Jose, Cijo and Moutakanni, Th\'eo and Kang, Dahyun and Baldassarre, Federico and Darcet, Timoth\'ee and Xu, Hu and Li, Daniel and Szafraniec, Marc and Ramamonjisoa, Micha\"el and Oquab, Maxime and Sim\'eoni, Oriane and Vo, Huy V. and Labatut, Patrick and Bojanowski, Piotr}, 
 title     = {DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment}, 
 booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
 month     = {June}, 
 year      = {2025}, 
 pages     = {24905-24916} 
 }

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025

CLIP-DINOiser: Teaching CLIP a few DINO tricks

Monika Wysoczańska, Oriane Siméoni, Michael Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, Patrick Pérez

The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can potentially suppress its original open-vocabulary properties. Meanwhile, self-supervised representation methods have demonstrated good localization properties without human-made annotations or explicit supervision. In this work, we take the best of both worlds and propose a zero-shot open-vocabulary semantic segmentation method, which does not require any annotations. We propose to locally improve dense MaskCLIP features, computed with a simple modification of CLIP's last pooling layer, by integrating localization priors extracted from self-supervised features. By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that the used self-supervised feature properties can directly be learnt from CLIP features, therefore allowing us to obtain the best results with a single pass through the CLIP model. Our method CLIP-DINOiser needs only a single forward pass of CLIP and two light convolutional layers at inference, with no extra supervision or extra memory, and reaches state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes and ADE20k.

@inproceedings{wysoczanska2024clipdino,
 title     = {CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation}, 
 author    = {Wysocza{\'{n}}ska, Monika and 
 Sim{\'{e}}oni, Oriane and 
 Ramamonjisoa, Micha{\"{e}}l and 
 Bursuc, Andrei and 
 Trzci{\'{n}}ski, Tomasz and 
 P{\'{e}}rez, Patrick}, 
 booktitle = {ECCV}, 
 year = {2024} 
 }

Published in ECCV 2024, 2024

CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free

Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni

The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but are harder to use for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. Instead, we propose in this work an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which does not require any additional training or annotations, but instead leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP classification abilities on patches of different sizes and aggregates the decision in a single map. We further guide the segmentation using foreground/background scores obtained using unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.

@InProceedings{Wysoczanska_2024_WACV,  
 author    = {Wysocza{\'n}ska, Monika and Ramamonjisoa, Micha{\"e}l and Trzci{\'n}ski, Tomasz and Sim{\'e}oni, Oriane},   
 title     = {CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free},   
 booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},   
 month     = {January},   
 year      = {2024},   
 pages     = {1403-1413}   
 }

Published in WACV 2024, 2024

MonteBoxFinder: Detecting and Filtering Primitives to Fit a Noisy Point Cloud

Michaël Ramamonjisoa, Sinisa Stekovic and Vincent Lepetit

We present MonteBoxFinder, a method that, given a noisy input point cloud, fits cuboids to the input scene. Our primary contribution is a discrete optimization algorithm that, from a dense set of initially detected cuboids, is able to efficiently filter good boxes from the noisy ones. Inspired by recent applications of MCTS to scene understanding problems, we develop a stochastic algorithm that is, by design, more efficient for our task. Indeed, the quality of a fit for a cuboid arrangement is invariant to the order in which the cuboids are added into the scene. We develop several search baselines for our problem and demonstrate, on the ScanNet dataset, that our approach is more efficient and precise. Finally, we strongly believe that our core algorithm is very general and that it could be extended to many other problems in 3D scene understanding.

@article{ramamonjisoa2022mbf, 
 Title = {MonteBoxFinder: Detecting and Filtering Primitives to Fit a Noisy Point Cloud}, 
 Author = {M. Ramamonjisoa and S. Stekovic and V. Lepetit}, 
 Journal = {European Conference on Computer Vision (ECCV)}, 
 Year = {2022}
 }

Published in European Conference on Computer Vision (ECCV), 2022

PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6DoF Tracking

Van Nguyen Nguyen, Yuming Du, Yang Xiao, Michaël Ramamonjisoa and Vincent Lepetit

Estimating the relative pose of a new object without prior knowledge is a hard problem, while it is an ability very much needed in robotics and Augmented Reality. We present a method for tracking the 6D motion of objects in RGB video sequences when neither the training images nor the 3D geometry of the objects are available. In contrast to previous works, our method can therefore consider unknown objects in open world instantly, without requiring any prior information or a specific training phase. We consider two architectures, one based on two frames, and the other relying on a Transformer Encoder, which can exploit an arbitrary number of past frames. We train our architectures using only synthetic renderings with domain randomization. Our results on challenging datasets are on par with previous works that require much more information (training images of the target objects, 3D models, and/or depth data).

@inproceedings{nguyen2022pizza, 
 title={PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6 DoF Tracking}, 
 author={Nguyen, Van Nguyen and Du, Yuming and Xiao, Yang and Ramamonjisoa, Michael and Lepetit, Vincent}, 
 booktitle={{International Conference on 3D Vision (3DV)}}, 
 year={2022} 
 }

Published in 2022 International Conference on 3D Vision (3DV) (Oral), 2022

SparseFormer: Attention-based Depth Completion Network

Frederik Warburg, Michaël Ramamonjisoa, and Manuel López-Antequera

Most pipelines for Augmented and Virtual Reality estimate the ego-motion of the camera by creating a map of sparse 3D landmarks. In this paper, we tackle the problem of depth completion, that is, densifying this sparse 3D map using RGB images as guidance. This remains a challenging problem due to the low-density, non-uniform and outlier-prone 3D landmarks produced by SfM and SLAM pipelines. We introduce a transformer block, SparseFormer, that fuses 3D landmarks with deep visual features to produce dense depth. The SparseFormer has a global receptive field, making the module especially effective for depth completion with low-density and non-uniform landmarks. To address the issue of depth outliers among the 3D landmarks, we introduce a trainable refinement module that filters outliers through attention between the sparse landmarks.

@article{warburg2022sparseformer, 
 Title = {SparseFormer: Attention-based Depth Completion Network}, 
 Author = {F. Warburg and M. Ramamonjisoa and M.L. Antequera}, 
 Journal = {CV4AR Workshop at The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, 
 Year = {2022}
 }

Published in CV4AR Workshop at The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Single Image Depth Prediction with Wavelet Decomposition

Michaël Ramamonjisoa, Michael Firman, Jamie Watson, Vincent Lepetit and Daniyar Turmukhambetov

We present a novel method for predicting accurate depths from monocular images with high efficiency. This optimal efficiency is achieved by exploiting wavelet decomposition, which is integrated in a fully differentiable encoder-decoder architecture. We demonstrate that we can reconstruct high-fidelity depth maps by predicting sparse wavelet coefficients. In contrast with previous works, we show that wavelet coefficients can be learned without direct supervision on coefficients. Instead we supervise only the final depth image that is reconstructed through the inverse wavelet transform. We additionally show that wavelet coefficients can be learned in fully self-supervised scenarios, without access to ground-truth depth. Finally, we apply our method to different state-of-the-art monocular depth estimation models, in each case giving similar or better results compared to the original model, while requiring less than half the multiply-adds in the decoder network.

@inproceedings{ramamonjisoa-2021-wavelet-monodepth,
 title     = {Single Image Depth Prediction with Wavelet Decomposition}, 
 author    = {Ramamonjisoa, Micha{\"{e}}l and 
 Michael Firman and 
 Jamie Watson and 
 Vincent Lepetit and 
 Daniyar Turmukhambetov}, 
 booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, 
 month = {June}, 
 year = {2021} 
 }

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Predicting Sharp and Accurate Occlusion Boundaries in Monocular Depth Estimation Using Displacement Fields

Michaël Ramamonjisoa*, Yuming Du* and Vincent Lepetit
* Denotes equal contribution.

Current methods for depth map prediction from monocular images tend to predict smooth, poorly localized contours for the occlusion boundaries in the input image. This is unfortunate as occlusion boundaries are important cues to recognize objects, and as we show, may lead to a way to discover new objects from scene reconstruction. To improve predicted depth maps, recent methods rely on various forms of filtering or predict an additive residual depth map to refine a first estimate. We instead learn to predict, given a depth map predicted by some reconstruction method, a 2D displacement field able to re-sample pixels around the occlusion boundaries into sharper reconstructions. Our method can be applied to the output of any depth estimation method and is fully differentiable, enabling end-to-end training. For evaluation, we manually annotated the occlusion boundaries in all the images in the test split of the popular NYUv2-Depth dataset. We show that our approach improves the localization of occlusion boundaries for all state-of-the-art monocular depth estimation methods that we could evaluate, without degrading the depth accuracy for the rest of the images.

@article{ramamonjisoa2020dispnet, 
 Title = {Predicting Sharp and Accurate Occlusion Boundaries in Monocular Depth Estimation Using Displacement Fields}, 
 Author = {M. Ramamonjisoa and Y. Du and V. Lepetit}, 
 Journal = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, 
 Year = {2020}
 }

Published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation

Michaël Ramamonjisoa and Vincent Lepetit

We introduce SharpNet, a method that predicts an accurate depth map for an input color image, with particular attention to the reconstruction of occluding contours. Occluding contours are an important cue for object recognition, and for realistic integration of virtual objects in Augmented Reality, but they are also notoriously difficult to reconstruct accurately. For example, they are a challenge for stereo-based reconstruction methods, as points around an occluding contour are visible in only one image. Inspired by recent methods that introduce normal estimation to improve depth prediction, we introduce a novel term that constrains depth and occluding contours predictions. Since ground truth depth is difficult to obtain with pixel-perfect accuracy along occluding contours, we use synthetic images for training, followed by fine-tuning on real data. We demonstrate our approach on the challenging NYUv2-Depth dataset, and show that our method outperforms the state-of-the-art along occluding contours, while performing on par with the best recent methods for the rest of the images. Its accuracy along the occluding contours is actually better than the "ground truth" acquired by a depth camera based on structured light. We show this by introducing a new benchmark based on NYUv2-Depth for evaluating occluding contours in monocular reconstruction, which is our second contribution.

@article{ramamonjisoa2019sharpnet, 
 Title = {SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation}, 
 Author = {M. Ramamonjisoa and V. Lepetit}, 
 Journal = {The IEEE International Conference on Computer Vision (ICCV) Workshops}, 
 Year = {2019}
 }

Published in International Conference on Computer Vision (ICCV) Workshop on 3D Reconstruction in the Wild, 2019

On Object Symmetries and 6D Pose Estimation from Images

(Spotlight)

Giorgia Pitteri*, Michaël Ramamonjisoa*, Slobodan Ilic and Vincent Lepetit
* Denotes equal contribution.

Objects with symmetries are common in our daily life and in industrial contexts, but are often ignored in the recent literature on 6D pose estimation from images. In this paper, we study in an analytical way the link between the symmetries of a 3D object and its appearance in images. We explain why symmetrical objects can be a challenge when training machine learning algorithms that aim at estimating their 6D pose from images. We propose an efficient and simple solution that relies on the normalization of the pose rotation. Our approach is general and can be used with any 6D pose estimation algorithm. Moreover, our method is also beneficial for objects that are 'almost symmetrical', i.e. objects for which only a detail breaks the symmetry. We validate our approach within a Faster-RCNN framework on a synthetic dataset made with objects from the T-Less dataset, which exhibit various types of symmetries, as well as real sequences from T-Less.

@article{pitteri2019threedv, 
 Title = {On Object Symmetries and 6D Pose Estimation from Images}, 
 Author = {G. Pitteri and M. Ramamonjisoa and S. Ilic and V. Lepetit}, 
 Journal = {International Conference on 3D Vision}, 
 Year = {2019}
 }

Published in 2019 International Conference on 3D Vision (3DV), 2019

Reviewer duties

CVPR*, ICCV, ECCV, ICRA, IROS, ISMAR, IJCV, PAMI, CVIU, PR-L, RA-L