Overview

Clear-Splatting: Learning Residual Gaussian Splats for Transparent Object Manipulation

Motivation

Robot grippers fail on transparent objects

Enabling robots to dexterously manipulate transparent objects would benefit a variety of downstream applications. Robots often use depth images of objects to decide which action (e.g., pull, lift, or drop) to perform. However, common depth sensors struggle to capture depth images of arbitrary transparent objects [1], [2], [3], [4], and the same is true of monocular depth estimators [5]. Learning-based approaches for transparent object depth estimation work well in-distribution but can struggle to generalize outside their training data [1]. The lack of surface features on transparent objects also makes it challenging to recover depth maps using structure-from-motion approaches such as COLMAP [6].

Fig. 1: Depth Anything (bottom left two) and an Intel RealSense™ camera (bottom right) perform poorly on transparent objects.

Neural Radiance Fields (NeRFs) [7] are implicit neural scene representations trained on multiple views of the same scene and capable of state-of-the-art novel view synthesis. Dex-NeRF [1] and Evo-NeRF [8] showed that NeRFs can perceive the depth of transparent objects well enough to grasp them. However, these methods also showed that NeRFs tend to struggle with transparent objects such as wine glasses or kitchen foil under challenging lighting conditions. Dex-NeRF, while achieving high grasp success rates, was slow to compute. To address this, Residual-NeRF [9] contributed a method that uses a background NeRF, a residual NeRF, and a Mix-Net to speed up training and improve depth maps.

In this work, we study the use of 3D Gaussian Splatting (3DGS) [10] for transparent object depth perception. We propose Clear-Splatting, a method that leverages a strong scene prior to improve the depth perception of transparent objects with 3DGS. In many scenarios, the geometry of the robot's work area is mostly static and opaque, e.g., shelves, desks, and tables. Inspired by Residual-NeRF [9], Clear-Splatting leverages the static and opaque parts of the scene as a prior to reduce ambiguity and improve depth perception. Clear-Splatting first learns background Splats of the entire scene by training on images without transparent objects present. It then uses images of the full scene, with the transparent objects present, to learn residual Splats. Finally, a depth-based pruning step removes potential 'floaters', high-opacity Gaussians positioned irregularly through free space, to output a cleaner depth map; a sketch of both steps follows below.
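To make the two-stage pipeline concrete, here is a minimal PyTorch sketch of the residual setup and the depth-based floater pruning, assuming a stripped-down Gaussian parameterization (centers and opacities only). The `prune_floaters` criterion, the thresholds, and the toy camera are illustrative assumptions, not the paper's implementation; a real system would also carry covariances, colors, and a differentiable rasterizer.

```python
import torch

# Background Splats: trained earlier on object-free images of the workspace, now frozen.
bg_means = torch.randn(5000, 3)   # Gaussian centers, no gradients
bg_opac = torch.rand(5000, 1)
# Residual Splats: the only parameters optimized on images containing the transparent objects.
res_means = torch.randn(1000, 3, requires_grad=True)
res_opac = torch.rand(1000, 1, requires_grad=True)
optimizer = torch.optim.Adam([res_means, res_opac], lr=1e-3)

@torch.no_grad()
def prune_floaters(means, opac, depth_map, K, w2c, depth_tol=0.05, opac_thresh=0.5):
    """Keep-mask dropping high-opacity Gaussians that hang in free space well in
    front of the rendered surface (hypothetical criterion, thresholds assumed)."""
    pts_cam = (w2c[:3, :3] @ means.T + w2c[:3, 3:4]).T      # world -> camera frame
    z = pts_cam[:, 2]
    uv = (K @ pts_cam.T).T                                  # pinhole projection
    z_img = uv[:, 2].clamp(min=1e-6)
    u = (uv[:, 0] / z_img).round().long().clamp(0, depth_map.shape[1] - 1)
    v = (uv[:, 1] / z_img).round().long().clamp(0, depth_map.shape[0] - 1)
    surface_z = depth_map[v, u]                             # surface depth at each projection
    floater = (z > 0) & (z < surface_z - depth_tol) & (opac.squeeze(-1) > opac_thresh)
    return ~floater

# Toy usage: a flat surface 2 m from the camera; Gaussians floating in front get dropped.
K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
w2c = torch.eye(4)
depth = torch.full((480, 640), 2.0)
keep = prune_floaters(res_means + torch.tensor([0., 0., 2.0]), res_opac, depth, K, w2c)
print(f"kept {int(keep.sum())} of {keep.numel()} residual Gaussians")
```

The key design point is that the background Splats receive no gradients during residual training, so the optimizer can only explain the transparent objects with newly added Gaussians.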

We also propose ClearSplatting-2.0, which removes the need for captured scene priors by robustly integrating a pretrained world model instead. The main challenge is that such world models perform poorly on transparent objects; ClearSplatting-2.0 therefore distills priors from these imperfect world models while remaining robust to their errors.
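The write-up above does not pin down the distillation mechanism, so the following is only a plausible sketch under assumptions: a trimmed Huber loss against the world model's predicted depth, which bounds each pixel's gradient and discards the worst-fitting fraction of pixels so that gross world-model failures on glass cannot dominate training. The function name, the trimming scheme, and both hyperparameters are hypothetical.

```python
import torch
import torch.nn.functional as F

def robust_prior_loss(rendered_depth, prior_depth, trim_frac=0.2, delta=0.1):
    """Trimmed Huber distillation loss (hypothetical robustification).
    Huber caps per-pixel gradients; trimming drops the trim_frac of pixels
    where the imperfect prior fits worst."""
    resid = F.huber_loss(rendered_depth, prior_depth, reduction="none", delta=delta)
    k = int((1.0 - trim_frac) * resid.numel())
    kept, _ = torch.topk(resid.flatten(), k, largest=False)  # best-fitting pixels only
    return kept.mean()

# Toy usage: the prior is accurate except on a "glass" region where it is wildly wrong.
rendered = torch.full((480, 640), 2.05, requires_grad=True)
prior = torch.full((480, 640), 2.0)
prior[:100, :100] = 10.0                 # simulated world-model failure on transparency
loss = robust_prior_loss(rendered, prior)
loss.backward()                          # corrupted region is trimmed out of the gradient
```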

We evaluate Clear-Splatting and ClearSplatting-2.0 on four photo-realistic synthetic scenes, comparing depth reconstruction quality against other neural rendering-based baselines. The results suggest that Clear-Splatting improves on the NeRF-based approaches with 67.09% lower RMSE and 87.80% lower MAE in depth estimation. ClearSplatting-2.0 further improves on the strongest 3DGS-based baseline (Clear-Splatting) with up to 33% lower RMSE and up to 32% lower MSE in depth estimation.
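For reference, the reported depth metrics can be computed as in this short sketch; the valid-pixel masking is an assumed detail of the evaluation protocol.

```python
import torch

def depth_metrics(pred, gt):
    """RMSE and MAE between predicted and ground-truth depth maps, restricted to
    valid pixels (finite, positive ground truth; masking convention assumed)."""
    mask = torch.isfinite(gt) & (gt > 0)
    err = pred[mask] - gt[mask]
    return torch.sqrt((err ** 2).mean()).item(), err.abs().mean().item()

rmse, mae = depth_metrics(torch.rand(480, 640) + 1.0, torch.rand(480, 640) + 1.0)
```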

Clear-Splatting has been accepted as a ⭐️Spotlight presentation at the RoboNeRF workshop at ICRA 2024.