Publications

CLIP-DINOiser: Teaching CLIP a few DINO tricks

Monika Wysoczańska, Oriane Siméoni, Michaël Ramamonjisoa, Andrei Bursuc, Tomasz Trzciński, Patrick Pérez

The popular CLIP model displays impressive zero-shot capabilities thanks to its seamless interaction with arbitrary text prompts. However, its lack of spatial awareness makes it unsuitable for dense computer vision tasks, e.g., semantic segmentation, without an additional fine-tuning step that often uses annotations and can potentially suppress its original open-vocabulary properties. Meanwhile, self-supervised representation methods have demonstrated good localization properties without human-made annotations or explicit supervision. In this work, we take the best of both worlds and propose a zero-shot open-vocabulary semantic segmentation method that does not require any annotations. We propose to locally improve dense MaskCLIP features, computed with a simple modification of CLIP's last pooling layer, by integrating localization priors extracted from self-supervised features. By doing so, we greatly improve the performance of MaskCLIP and produce smooth outputs. Moreover, we show that these self-supervised feature properties can be learnt directly from CLIP features, allowing us to obtain the best results with a single pass through the CLIP model. Our method, CLIP-DINOiser, needs only a single forward pass of CLIP and two light convolutional layers at inference, with no extra supervision and no extra memory, and reaches state-of-the-art results on challenging and fine-grained benchmarks such as COCO, Pascal Context, Cityscapes and ADE20K.
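
To make the inference recipe concrete, below is a minimal PyTorch-style sketch; the dense MaskCLIP features, the small convolutional correlation head and all tensor shapes are illustrative assumptions, not the released CLIP-DINOiser code.

```python
# Minimal sketch (not the authors' code): refine dense CLIP features with a
# DINO-style patch-correlation prior, then classify each patch against text
# embeddings. All tensors and the `corr_head` module are hypothetical.
import torch
import torch.nn.functional as F

def dinoise(clip_feats, corr_head, text_embeds, tau=0.2):
    """clip_feats:  (B, C, H, W) dense MaskCLIP features.
    corr_head:   small conv net mapping (B, C, H, W) -> (B, D, H, W),
                 trained so that its patch correlations mimic DINO's.
    text_embeds: (K, C) CLIP text embeddings of the class prompts."""
    B, C, H, W = clip_feats.shape
    guide = F.normalize(corr_head(clip_feats), dim=1)             # (B, D, H, W)
    guide = guide.flatten(2)                                      # (B, D, HW)
    affinity = torch.einsum("bdi,bdj->bij", guide, guide)         # (B, HW, HW)
    affinity = affinity.clamp(min=0)                              # keep positive correlations
    affinity = affinity / affinity.sum(dim=-1, keepdim=True).clamp(min=1e-6)
    feats = clip_feats.flatten(2).transpose(1, 2)                 # (B, HW, C)
    pooled = affinity @ feats                                     # correlation-guided pooling
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    logits = pooled @ text.t() / tau                              # (B, HW, K)
    return logits.transpose(1, 2).reshape(B, -1, H, W)            # per-patch class scores
```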

@inproceedings{wysoczanska2024clipdino,
title = {CLIP-DINOiser: Teaching CLIP a few DINO tricks for open-vocabulary semantic segmentation},
author = {Wysocza{\'{n}}ska, Monika and
Sim{\'{e}}oni, Oriane and
Ramamonjisoa, Micha{\"{e}}l and
Bursuc, Andrei and
Trzci{\'{n}}ski, Tomasz and
P{\'{e}}rez, Patrick},
booktitle = {ECCV},
year = {2024}
}

Published in European Conference on Computer Vision (ECCV), 2024

CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free

Monika Wysoczańska, Michaël Ramamonjisoa, Tomasz Trzciński, Oriane Siméoni

The emergence of CLIP has opened the way for open-world image perception. The zero-shot classification capabilities of the model are impressive but harder to exploit for dense tasks such as image segmentation. Several methods have proposed different modifications and learning schemes to produce dense output. In this work, we instead propose an open-vocabulary semantic segmentation method, dubbed CLIP-DIY, which requires no additional training or annotations and leverages existing unsupervised object localization approaches. In particular, CLIP-DIY is a multi-scale approach that directly exploits CLIP's classification abilities on patches of different sizes and aggregates the decisions into a single map. We further guide the segmentation using foreground/background scores obtained with unsupervised object localization methods. With our method, we obtain state-of-the-art zero-shot semantic segmentation results on PASCAL VOC and perform on par with the best methods on COCO.
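
Below is an illustrative sketch of the multi-scale dense inference; the crop grid, the frozen `clip_model.encode_image` call and the optional foreground map from an unsupervised localizer are assumptions made for the example, not the released CLIP-DIY code.

```python
# Rough sketch: classify crops of several sizes with a frozen CLIP image
# encoder and splat the class probabilities back into a per-pixel map.
import torch
import torch.nn.functional as F

@torch.no_grad()
def clip_diy_map(image, clip_model, text_embeds, scales=(2, 4), fg_prob=None):
    """image: (3, H, W) tensor; text_embeds: (K, C) normalized CLIP text embeddings;
    fg_prob: optional (H, W) foreground map from an unsupervised localizer."""
    _, H, W = image.shape
    K = text_embeds.shape[0]
    votes = torch.zeros(K, H, W, device=image.device)
    for s in scales:                                    # grid of s x s crops per scale
        ph, pw = H // s, W // s
        for i in range(s):
            for j in range(s):
                crop = image[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw]
                crop = F.interpolate(crop[None], size=224, mode="bilinear")
                emb = F.normalize(clip_model.encode_image(crop), dim=-1)       # (1, C)
                probs = (100.0 * emb @ text_embeds.t()).softmax(dim=-1)[0]     # (K,), 100 ~ CLIP logit scale
                votes[:, i*ph:(i+1)*ph, j*pw:(j+1)*pw] += probs[:, None, None]
    seg = votes / len(scales)
    if fg_prob is not None:                             # gate the map by objectness
        seg = seg * fg_prob[None]
    return seg.argmax(dim=0)                            # (H, W) label map
```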

@InProceedings{Wysoczanska_2024_WACV,  
author = {Wysocza{\'n}ska, Monika and Ramamonjisoa, Micha{\"e}l and Trzci{\'n}ski, Tomasz and Sim{\'e}oni, Oriane},
title = {CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {January},
year = {2024},
pages = {1403-1413}
}

Published in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024

MonteBoxFinder: Detecting and Filtering Primitives to Fit a Noisy Point Cloud

Michaël Ramamonjisoa, Sinisa Stekovic and Vincent Lepetit

We present MonteBoxFinder, a method that, given a noisy input point cloud, fits cuboids to the input scene. Our primary contribution is a discrete optimization algorithm that, from a dense set of initially detected cuboids, efficiently filters the good boxes from the noisy ones. Inspired by recent applications of Monte Carlo Tree Search (MCTS) to scene understanding problems, we develop a stochastic algorithm that is, by design, more efficient for our task: the quality of a cuboid arrangement is invariant to the order in which the cuboids are added to the scene. We develop several search baselines for our problem and demonstrate, on the ScanNet dataset, that our approach is more efficient and more precise. Finally, we believe that our core algorithm is very general and could be extended to many other problems in 3D scene understanding.
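
The selection problem can be illustrated with a toy stochastic search; the sketch below is not the paper's MCTS-inspired algorithm, and `fitness` is a hypothetical scoring function standing in for the cuboid-arrangement quality.

```python
# Toy sketch: stochastically toggle detected cuboids and keep changes that
# improve a fitness score. Only the selected set matters, not the order.
import random

def select_cuboids(cuboids, scene, fitness, iters=1000, seed=0):
    rng = random.Random(seed)
    selected = [False] * len(cuboids)
    best = fitness([c for c, s in zip(cuboids, selected) if s], scene)
    for _ in range(iters):
        k = rng.randrange(len(cuboids))
        selected[k] = not selected[k]                 # propose adding/removing one box
        score = fitness([c for c, s in zip(cuboids, selected) if s], scene)
        if score >= best:                             # order-invariant: only the set matters
            best = score
        else:
            selected[k] = not selected[k]             # revert the proposal
    return [c for c, s in zip(cuboids, selected) if s], best
```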

@inproceedings{ramamonjisoa2022mbf,
Title = {MonteBoxFinder: Detecting and Filtering Primitives to Fit a Noisy Point Cloud},
Author = {M. Ramamonjisoa and S. Stekovic and V. Lepetit},
Booktitle = {European Conference on Computer Vision (ECCV)},
Year = {2022}
}

Published in European Conference on Computer Vision (ECCV), 2022

PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6DoF Tracking

Van Nguyen Nguyen, Yuming Du, Yang Xiao, Michaël Ramamonjisoa and Vincent Lepetit

Estimating the relative pose of a new object without prior knowledge is a hard problem, yet this ability is much needed in robotics and Augmented Reality. We present a method for tracking the 6D motion of objects in RGB video sequences when neither the training images nor the 3D geometry of the objects are available. In contrast to previous works, our method can therefore handle unknown objects in an open world instantly, without requiring any prior information or a specific training phase. We consider two architectures, one based on two frames and the other relying on a Transformer encoder, which can exploit an arbitrary number of past frames. We train our architectures using only synthetic renderings with domain randomization. Our results on challenging datasets are on par with previous works that require much more information (training images of the target objects, 3D models, and/or depth data).
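
A schematic version of the Transformer-based variant is sketched below; the backbone, layer sizes and the 9-dimensional output (a 6D rotation representation plus translation) are illustrative assumptions, not the released PIZZA architecture.

```python
# Schematic tracker: encode a window of object crops with a CNN, fuse them with
# a Transformer encoder, and regress a relative pose update for the latest frame.
import torch
import torch.nn as nn

class RelPoseTracker(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.backbone = nn.Sequential(                   # tiny stand-in CNN per crop
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=3)
        self.pose_head = nn.Linear(d_model, 9)           # 6D rotation rep. + 3D translation

    def forward(self, crops):                            # crops: (B, T, 3, H, W) object crops
        B, T = crops.shape[:2]
        feats = self.backbone(crops.flatten(0, 1)).view(B, T, -1)
        fused = self.temporal(feats)                     # attend over the past frames
        return self.pose_head(fused[:, -1])              # pose update for the latest frame

# e.g. RelPoseTracker()(torch.randn(2, 4, 3, 128, 128)) -> tensor of shape (2, 9)
```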

@inproceedings{nguyen2022pizza, 
title={PIZZA: A Powerful Image-only Zero-Shot Zero-CAD Approach to 6 DoF Tracking},
author={Nguyen, Van Nguyen and Du, Yuming and Xiao, Yang and Ramamonjisoa, Michael and Lepetit, Vincent},
booktitle={{International Conference on 3D Vision (3DV)}},
year={2022}
}

Published in International Conference on 3D Vision (3DV) (Oral), 2022

SparseFormer: Attention-based Depth Completion Network

Frederik Warburg, Michaël Ramamonjisoa, and Manuel López-Antequera

Most pipelines for Augmented and Virtual Reality estimate the ego-motion of the camera by creating a map of sparse 3D landmarks. In this paper, we tackle the problem of depth completion, that is, densifying this sparse 3D map using RGB images as guidance. This remains a challenging problem due to the low-density, non-uniform and outlier-prone 3D landmarks produced by SfM and SLAM pipelines. We introduce a transformer block, SparseFormer, that fuses 3D landmarks with deep visual features to produce dense depth. The SparseFormer has a global receptive field, making the module especially effective for depth completion with low-density and non-uniform landmarks. To address the issue of depth outliers among the 3D landmarks, we introduce a trainable refinement module that filters outliers through attention between the sparse landmarks.
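
The fusion idea can be pictured with a single cross-attention block as sketched below; the landmark encoding and tensor shapes are assumptions made for illustration, not the actual SparseFormer module.

```python
# Minimal cross-attention sketch: dense image features query a set of sparse
# landmark tokens, giving every pixel a global view of the 3D points.
import torch
import torch.nn as nn

class LandmarkFusion(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.landmark_proj = nn.Linear(4, d_model)       # per landmark: (u, v, depth, confidence)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.depth_head = nn.Linear(d_model, 1)

    def forward(self, img_feats, landmarks):
        """img_feats: (B, d_model, H, W) from any image encoder;
        landmarks: (B, N, 4) sparse SfM/SLAM points projected into the image."""
        B, C, H, W = img_feats.shape
        queries = img_feats.flatten(2).transpose(1, 2)   # (B, HW, C): one query per pixel
        tokens = self.landmark_proj(landmarks)           # (B, N, C): one token per landmark
        fused, _ = self.attn(queries, tokens, tokens)    # global receptive field over landmarks
        depth = self.depth_head(fused + queries)         # residual fusion -> (B, HW, 1)
        return depth.reshape(B, 1, H, W)                 # dense depth map
```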

@inproceedings{warburg2022sparseformer,
Title = {SparseFormer: Attention-based Depth Completion Network},
Author = {F. Warburg and M. Ramamonjisoa and M. L{\'o}pez-Antequera},
Booktitle = {CV4AR Workshop at The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
Year = {2022}
}

Published in CV4AR Workshop at The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022

Single Image Depth Prediction with Wavelet Decomposition

Michaël Ramamonjisoa, Michael Firman, Jamie Watson, Vincent Lepetit and Daniyar Turmukhambetov

We present a novel method for predicting accurate depths from monocular images with high efficiency. This efficiency is achieved by exploiting wavelet decomposition, which is integrated in a fully differentiable encoder-decoder architecture. We demonstrate that we can reconstruct high-fidelity depth maps by predicting sparse wavelet coefficients. In contrast with previous works, we show that wavelet coefficients can be learned without direct supervision on the coefficients; instead we supervise only the final depth image, which is reconstructed through the inverse wavelet transform. We additionally show that wavelet coefficients can be learned in fully self-supervised scenarios, without access to ground-truth depth. Finally, we apply our method to different state-of-the-art monocular depth estimation models, in each case giving similar or better results than the original model while requiring less than half the multiply-adds in the decoder network.
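
As an illustration of supervising only the reconstructed depth, here is a single-level inverse Haar step in PyTorch; the coefficient layout and loss are simplified assumptions rather than the authors' multi-scale decoder.

```python
# Illustrative sketch: a decoder predicts Haar wavelet coefficients at half
# resolution; depth is reconstructed with an inverse Haar step, and the loss is
# applied to the reconstructed depth only, never to the coefficients themselves.
import torch
import torch.nn.functional as F

def inverse_haar(ll, lh, hl, hh):
    """Each input is (B, 1, H, W); returns the (B, 1, 2H, 2W) synthesis."""
    a = (ll + lh + hl + hh) / 2
    b = (ll - lh + hl - hh) / 2
    c = (ll + lh - hl - hh) / 2
    d = (ll - lh - hl + hh) / 2
    # interleave the four half-resolution bands into a full-resolution image
    return F.pixel_shuffle(torch.cat([a, b, c, d], dim=1), 2)

def depth_loss(pred_coeffs, gt_depth):
    """pred_coeffs: (ll, lh, hl, hh) predicted by some decoder."""
    depth = inverse_haar(*pred_coeffs)
    return F.l1_loss(depth, gt_depth)
```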

@inproceedings{ramamonjisoa-2021-wavelet-monodepth,
title = {Single Image Depth Prediction with Wavelet Decomposition},
author = {Ramamonjisoa, Micha{\"{e}}l and
Michael Firman and
Jamie Watson and
Vincent Lepetit and
Daniyar Turmukhambetov},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
month = {June},
year = {2021}
}

Published in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021

Predicting Sharp and Accurate Occlusion Boundaries in Monocular Depth Estimation Using Displacement Fields

Michaël Ramamonjisoa*, Yuming Du* and Vincent Lepetit
* Denotes equal contribution.

Current methods for depth map prediction from monocular images tend to predict smooth, poorly localized contours for the occlusion boundaries in the input image. This is unfortunate as occlusion boundaries are important cues to recognize objects and, as we show, may lead to a way to discover new objects from scene reconstruction. To improve predicted depth maps, recent methods rely on various forms of filtering or predict an additive residual depth map to refine a first estimate. We instead learn to predict, given a depth map predicted by some reconstruction method, a 2D displacement field able to re-sample pixels around the occlusion boundaries into sharper reconstructions. Our method can be applied to the output of any depth estimation method and is fully differentiable, enabling end-to-end training. For evaluation, we manually annotated the occlusion boundaries in all the images of the test split of the popular NYUv2-Depth dataset. We show that our approach improves the localization of occlusion boundaries for all state-of-the-art monocular depth estimation methods that we could evaluate, without degrading the depth accuracy for the rest of the images.
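
The core re-sampling step can be sketched as follows, assuming a network predicts the displacement field in pixel units; this is an illustration rather than the authors' implementation.

```python
# Sketch of the resampling step: a 2D displacement field warps an initial depth
# map so that pixels near occlusion boundaries are re-sampled from the correct
# side, which sharpens the edges. Fully differentiable thanks to grid_sample.
import torch
import torch.nn.functional as F

def refine_depth(depth, displacement):
    """depth: (B, 1, H, W) from any monocular method;
    displacement: (B, 2, H, W) predicted pixel offsets (dx, dy)."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    gx = 2 * (xs + displacement[:, 0]) / (W - 1) - 1     # normalize x to [-1, 1]
    gy = 2 * (ys + displacement[:, 1]) / (H - 1) - 1     # normalize y to [-1, 1]
    grid = torch.stack([gx, gy], dim=-1)                 # (B, H, W, 2) sampling grid
    return F.grid_sample(depth, grid, mode="bilinear", align_corners=True)
```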

@inproceedings{ramamonjisoa2020dispnet,
Title = {Predicting Sharp and Accurate Occlusion Boundaries in Monocular Depth Estimation Using Displacement Fields},
Author = {M. Ramamonjisoa and Y. Du and V. Lepetit},
Booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
Year = {2020}
}

Published in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation

Michaël Ramamonjisoa and Vincent Lepetit

We introduce SharpNet, a method that predicts an accurate depth map for an input color image, with particular attention to the reconstruction of occluding contours. Occluding contours are an important cue for object recognition and for realistic integration of virtual objects in Augmented Reality, but they are also notoriously difficult to reconstruct accurately; for example, they are a challenge for stereo-based reconstruction methods, as points around an occluding contour are visible in only one image. Inspired by recent methods that introduce normal estimation to improve depth prediction, we introduce a novel term that constrains the depth and occluding contour predictions. Since ground-truth depth is difficult to obtain with pixel-perfect accuracy along occluding contours, we use synthetic images for training, followed by fine-tuning on real data. We demonstrate our approach on the challenging NYUv2-Depth dataset and show that our method outperforms the state-of-the-art along occluding contours, while performing on par with the best recent methods on the rest of the images. Its accuracy along the occluding contours is actually better than the "ground truth" acquired by a depth camera based on structured light. We show this by introducing a new benchmark based on NYUv2-Depth for evaluating occluding contours in monocular reconstruction, which is our second contribution.
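
One way to picture a term that couples depth and occluding contours is the sketch below; it is an interpretation made for illustration, not SharpNet's exact training loss.

```python
# Illustrative consensus term: encourage large depth gradients exactly where the
# contour branch predicts an occluding contour, and smooth depth elsewhere.
import torch

def depth_contour_consensus(depth, contour_prob, eps=1e-6):
    """depth: (B, 1, H, W) predicted depth; contour_prob: (B, 1, H, W) in [0, 1]."""
    dzdx = depth[:, :, :, 1:] - depth[:, :, :, :-1]          # horizontal depth gradient
    dzdy = depth[:, :, 1:, :] - depth[:, :, :-1, :]          # vertical depth gradient
    grad_mag = torch.sqrt(dzdx[:, :, 1:, :] ** 2 + dzdy[:, :, :, 1:] ** 2 + eps)
    c = contour_prob[:, :, 1:, 1:]
    smooth = (1 - c) * grad_mag                              # flat depth away from contours
    sharp = c * torch.exp(-grad_mag)                         # strong depth edge on contours
    return (smooth + sharp).mean()
```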

@inproceedings{ramamonjisoa2019sharpnet,
Title = {SharpNet: Fast and Accurate Recovery of Occluding Contours in Monocular Depth Estimation},
Author = {M. Ramamonjisoa and V. Lepetit},
Booktitle = {The IEEE International Conference on Computer Vision (ICCV) Workshops},
Year = {2019}
}

Published in International Conference on Computer Vision (ICCV) Workshop on 3D Reconstruction in the Wild, 2019

On Object Symmetries and 6D Pose Estimation from Images

(Spotlight)

Giorgia Pitteri*, Michaël Ramamonjisoa*, Slobodan Ilic and Vincent Lepetit
* Denotes equal contribution.

Objects with symmetries are common in our daily life and in industrial contexts, but are often ignored in the recent literature on 6D pose estimation from images. In this paper, we study in an analytical way the link between the symmetries of a 3D object and its appearance in images. We explain why symmetrical objects can be a challenge when training machine learning algorithms that aim at estimating their 6D pose from images. We propose an efficient and simple solution that relies on the normalization of the pose rotation. Our approach is general and can be used with any 6D pose estimation algorithm. Moreover, our method is also beneficial for objects that are 'almost symmetrical', i.e. objects for which only a detail breaks the symmetry. We validate our approach within a Faster-RCNN framework on a synthetic dataset made with objects from the T-Less dataset, which exhibit various types of symmetries, as well as on real sequences from T-Less.
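
A toy version of the rotation normalization is sketched below; picking the representative closest to the identity is one possible canonicalization chosen for illustration, not necessarily the paper's analytical construction.

```python
# Toy sketch: before training, map every ground-truth rotation to a canonical
# representative of its equivalence class under the object's symmetry group,
# so symmetric poses no longer receive conflicting labels.
import numpy as np

def normalize_rotation(R, symmetry_rotations):
    """R: (3, 3) ground-truth rotation; symmetry_rotations: list of (3, 3)
    rotations S such that the object looks identical under R @ S."""
    candidates = [R @ S for S in symmetry_rotations]
    # pick the representative with the smallest rotation angle (closest to identity)
    angles = [np.arccos(np.clip((np.trace(C) - 1) / 2, -1.0, 1.0)) for C in candidates]
    return candidates[int(np.argmin(angles))]

# Example: an object symmetric under a 180-degree rotation about its z axis
# S = np.diag([-1.0, -1.0, 1.0]); R_canonical = normalize_rotation(R, [np.eye(3), S])
```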

@inproceedings{pitteri2019threedv,
Title = {On Object Symmetries and 6D Pose Estimation from Images},
Author = {G. Pitteri and M. Ramamonjisoa and S. Ilic and V. Lepetit},
Booktitle = {International Conference on 3D Vision (3DV)},
Year = {2019}
}

Published in International Conference on 3D Vision (3DV), 2019