Interactive Perception · Robotics // Junho Lee, Sangmin Kim, Yonghyeon Lee, Young Min Kim

INGRID: Interactive Geometry and Instance Identification for Occluded Scenes

A robotic manipulator that acts to see. INGRID physically nudges a cluttered scene to expose surfaces a static camera can never reach — then resolves what is one object and what is two.

Overview of INGRID showing two panels: Unobservable Surface Discovery and Instance Disambiguation
Fig. 1 — Overview. INGRID discovers previously unobservable surfaces by executing robot-arm actions that maximize scene visibility, refines geometry through selective finetuning, and resolves instance-level ambiguities by interacting with the environment.
Video

Overview video.

Abstract

Identifying object instances and reconstructing their geometry are fundamental to manipulation, yet passive observation of a static scene is bounded by occlusion — which breeds both geometric and semantic ambiguity in cluttered, multi-object settings. INGRID lets a manipulator interact with the scene to reveal hidden surfaces and separate ambiguous instances. The first challenge is deciding where to interact to induce the most informative change; the second is efficiently updating instances and geometry once new views arrive. INGRID answers both with a four-step pipeline, and proves robust in severely occluded scenes both in simulation and the real world.

Why static vision isn't enough

Occlusion creates two kinds of ambiguity a camera alone can't resolve.

When surfaces are hidden, the scene supports more than one explanation. INGRID's answer is to change the scene, not just look harder at it.

Geometric ambiguity

Hollow, or solid?

Two cups stacked vertically: the lower cup is occluded. From images alone there is no way to tell whether its interior is hollow or filled in.

Instance ambiguity

One block, or two?

A block that is actually two separable pieces shares near-identical semantic features across the seam — so passive segmentation cannot decide whether it is one instance or two.

The pipeline

Four steps, from a hierarchy of guesses to instance-wise geometry.

Building on normal, density, and three-level feature fields, INGRID decides where to push, executes the action, then re-identifies and refines only what changed.

i

Instance candidate tree

Cluster the coarse/mid/fine affinity fields with HDBSCAN, then build a tree where each node is a candidate point cloud and every child is a subset of its parent. The action space lives on the leaf nodes.
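The subset property of the tree can be illustrated with a small sketch. It assumes the clustering step has already produced one integer label per point at each of the three levels, with -1 marking noise as HDBSCAN does. The names `Node`, `build_tree`, and `leaves` are hypothetical, and intersecting each finer clustering with its parent is only one possible way to guarantee nesting, not necessarily how INGRID builds its tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One instance candidate: a set of point indices plus its sub-candidates."""
    points: frozenset
    children: list = field(default_factory=list)


def group_by_label(indices, labels):
    """Group point indices by cluster label, skipping noise (label -1)."""
    groups = {}
    for i in sorted(indices):
        if labels[i] != -1:
            groups.setdefault(labels[i], set()).add(i)
    return [frozenset(g) for g in groups.values()]


def build_tree(labels_per_level):
    """Build a candidate tree from per-point labels ordered coarse to fine.

    Each finer clustering is intersected with the parent's points, so every
    child is a subset of its parent by construction.
    """
    n_points = len(labels_per_level[0])
    root = Node(frozenset(range(n_points)))
    frontier = [root]
    for labels in labels_per_level:
        next_frontier = []
        for parent in frontier:
            for pts in group_by_label(parent.points, labels):
                child = Node(pts)
                parent.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root


def leaves(node):
    """Collect leaf nodes, the candidates that actions are defined on."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


# Toy labels for 6 points at coarse / mid / fine levels (made-up values).
coarse = [0, 0, 0, 0, 1, 1]
mid = [0, 0, 1, 1, 2, 2]
fine = [0, 1, 2, 2, 3, 3]
tree = build_tree([coarse, mid, fine])
leaf_sets = [sorted(leaf.points) for leaf in leaves(tree)]
```

In this toy case the leaves are the point sets `[0]`, `[1]`, `[2, 3]`, and `[4, 5]`; in practice the labels would come from clustering the affinity fields.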

ii

Optimize an informative action

A 3D visibility field U records how well each voxel was observed during training. INGRID scores every candidate push a, one per combination of leaf node × 12 directions × 20 magnitudes (1–40 cm), by the newly revealed surface area Q(a), and executes the maximizer a* = arg max Q(a).
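The exhaustive search over pushes can be sketched as follows. The source only gives the counts (12 directions, 20 magnitudes between 1 and 40 cm), so the even spacing of directions in a plane and of magnitudes is an assumption here. `toy_gain` is a made-up stand-in for Q(a); the real score would come from predicting the updated visibility field.

```python
import itertools
import math


def candidate_actions(num_leaves, num_dirs=12, num_mags=20, min_cm=1.0, max_cm=40.0):
    """Enumerate (leaf index, direction angle in radians, magnitude in cm) triples."""
    # Assumption: directions are evenly spaced in a plane.
    angles = [2.0 * math.pi * k / num_dirs for k in range(num_dirs)]
    # Assumption: magnitudes are evenly spaced between min_cm and max_cm.
    step = (max_cm - min_cm) / (num_mags - 1)
    mags = [min_cm + step * k for k in range(num_mags)]
    return list(itertools.product(range(num_leaves), angles, mags))


def best_action(actions, gain):
    """Return the action with the largest score under the supplied gain function."""
    return max(actions, key=gain)


def toy_gain(action):
    """Hypothetical stand-in for Q(a): favors leaf 1, pushes along +x, about 20 cm."""
    leaf, angle, mag_cm = action
    return (leaf == 1) + math.cos(angle) - abs(mag_cm - 20.0) / 40.0


actions = candidate_actions(num_leaves=3)
chosen = best_action(actions, toy_gain)
```

With three leaves this enumerates 3 × 12 × 20 = 720 candidates, small enough that a brute-force arg max is reasonable.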

iii

Identify instances under rigid-body assumption

After the push, jointly solve for the set of instances and their SE(3) transforms against sparse new views by minimizing the 2D Chamfer distance. Start with coarse nodes, then split nodes and re-solve until the alignment error drops below a threshold.
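For readers unfamiliar with the alignment error, here is a minimal sketch of a 2D Chamfer distance between two point sets, for example a rendered candidate outline and an observed one. It uses one common convention (the sum of mean nearest-neighbor distances in both directions); the exact variant INGRID uses is not stated in the source, and `chamfer_2d` is a hypothetical name.

```python
import math


def chamfer_2d(a, b):
    """Symmetric Chamfer distance between two non-empty lists of (x, y) points.

    Sums the mean nearest-neighbor Euclidean distance in each direction.
    Other conventions (squared distances, averaging the two terms) also exist.
    """
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy outlines: a unit square and the same square shifted by 0.5 along x.
rendered = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
observed_same = list(rendered)
observed_shifted = [(x + 0.5, y) for x, y in rendered]

aligned_error = chamfer_2d(rendered, observed_same)
shifted_error = chamfer_2d(rendered, observed_shifted)
```

A perfectly aligned candidate scores zero, and the error grows as the transformed candidate drifts from the observation, which is what makes it usable as a stopping criterion for node splitting.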

iv

Selective geometric finetuning

Cut and paste high-visibility regions to form a strong starting point, then finetune only the instances that still contain low-visibility surfaces. This is efficient, and more robust in the few-shot regime than training from scratch.
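The selection rule behind selective finetuning might look like the sketch below. It assumes each instance comes with visibility values in [0, 1] sampled on its surface; the function name and both thresholds are illustrative and do not come from the source.

```python
def instances_to_finetune(visibility_by_instance, vis_threshold=0.5, min_fraction=0.1):
    """Pick instances whose share of low-visibility surface samples is large.

    visibility_by_instance maps an instance name to visibility values in [0, 1]
    sampled on its surface. Both thresholds are illustrative assumptions.
    """
    selected = []
    for name, values in visibility_by_instance.items():
        low = sum(1 for v in values if v < vis_threshold)
        if values and low / len(values) >= min_fraction:
            selected.append(name)
    return selected


# Made-up visibility samples for three instances.
samples = {
    "cup_top": [0.9, 0.95, 0.8, 0.85],
    "cup_bottom": [0.9, 0.2, 0.1, 0.7],
    "block": [0.6, 0.7, 0.9, 0.55],
}
to_refine = instances_to_finetune(samples)
```

Only `cup_bottom` is selected here, so the well-observed instances keep their existing geometry and the optimization budget goes to the surfaces that were just revealed.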

Interaction process showing visibility U, action space A, and updated visibility with gain highlighted in yellow
Fig. 4 — Where to interact. For each candidate action, INGRID predicts the updated visibility and scores it by the visibility gain (shown in yellow).
Selective finetuning process: scene change, rendered visibility of newly-visible surface, and instance-wise geometry update
Fig. 5 — Refine only what's new. Newly-visible surface (yellow) marks the instances worth finetuning; well-observed geometry is preserved.
Experiments · simulation + real world

Accurate geometry and cleaner instances, in seconds rather than minutes.

Evaluated on a photo-realistic Blender dataset for ground truth and on a Franka Panda real-world setup. Lower VSD is better; higher IoU and precision/recall are better.

Qualitative results across scenes: before change, instance candidate tree, after change, visibility, and instance-wise geometry
Fig. 6 — Across scenes. INGRID builds the candidate tree, exploits scene change, and recovers instance-wise geometry; recovered instances are shaded orange in the tree.
Table I — Geometric update. Visual Surface Discrepancy (VSD ↓) and IoU (↑) over 33 novel viewpoints, averaged across three viewpoint sets per image count. Bold = best. Runtime is the average.
| Model | VSD · 2 | VSD · 4 | VSD · 8 | VSD · 16 | IoU · 2 | IoU · 4 | IoU · 8 | IoU · 16 | Runtime |
|---|---|---|---|---|---|---|---|---|---|
| INGRID (Update + Finetune + Visibility) | 0.0385 | 0.0330 | 0.0307 | 0.0302 | 0.8672 | 0.8973 | 0.9132 | 0.9142 | 17 sec |
| INGRID (Update + Finetune) | 0.0465 | 0.0413 | 0.0387 | 0.0383 | 0.8673 | 0.8967 | 0.9131 | 0.9134 | 17 sec |
| INGRID (Update) | 0.0433 | 0.0416 | 0.0407 | 0.0406 | 0.8771 | 0.8875 | 0.8937 | 0.8921 | 3 sec |
| INGRID (Scratch) | 0.1842 | 0.0427 | 0.0343 | 0.0333 | 0.5155 | 0.8894 | 0.9550 | 0.9578 | 62 sec |
| Dex-NeRF (Update) | 3.0190 | 2.6412 | 0.0934 | 0.0506 | 0.0769 | 0.8053 | 0.9353 | 0.9461 | 25 min |
| Dex-NeRF (Scratch) | 2.7979 | 2.6319 | 0.0992 | 0.0535 | 0.0695 | 0.4382 | 0.9351 | 0.9453 | 50 min |
| NeRF (Update) | 0.8672 | 1.0712 | 0.1003 | 0.0687 | 0.0769 | 0.8053 | 0.9353 | 0.9461 | 25 min |
| NeRF (Scratch) | 1.7426 | 0.9968 | 0.0881 | 0.0716 | 0.0695 | 0.4382 | 0.9351 | 0.9453 | 50 min |
| Instant-NGP (Update) | 0.7123 | 1.1015 | 0.3884 | 0.0981 | 0.5807 | 0.7565 | 0.8472 | 0.8617 | 25 sec |
| Instant-NGP (Scratch) | 0.3973 | 0.4515 | 0.3791 | 0.0900 | 0.1617 | 0.4455 | 0.7821 | 0.8665 | 50 sec |

With only 2 images, INGRID transforms objects in 3D and finetunes newly-visible parts in 17 seconds, whereas the NeRF and Dex-NeRF baselines take 25 to 50 minutes.

Table II — Instance identification. Precision (mean IoU over predicted instances) and recall (mean IoU over ground-truth instances) of 3D bounding boxes. Bold = best.
| Model | Precision ↑ | Recall ↑ |
|---|---|---|
| INGRID | 0.7449 | 0.7002 |
| INGRID (coarse) | 0.5707 | 0.6403 |
| INGRID (mid) | 0.6393 | 0.6816 |
| INGRID (fine) | 0.6226 | 0.5500 |
| Garfield (0.00) | 0.5652 | 0.4218 |
| Garfield (0.05) | 0.5268 | 0.4026 |
| Garfield (0.10) | 0.4562 | 0.4361 |
| Garfield (0.15) | 0.3055 | 0.3107 |
| OmniSeg3D | 0.5033 | 0.5567 |
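The Table II metrics can be illustrated with a short sketch. Two assumptions go beyond the caption: the boxes are treated as axis-aligned, and each instance is scored by its best IoU against any box in the other set. The function names are hypothetical, and the paper's matching procedure may differ.

```python
def box_iou_3d(a, b):
    """IoU of two axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for axis in range(3):
        overlap = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if overlap <= 0:
            return 0.0  # boxes are disjoint along this axis
        inter *= overlap

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def mean_best_iou(queries, references):
    """Average, over queries, of the best IoU against any reference box."""
    if not queries:
        return 0.0
    total = sum(
        max((box_iou_3d(q, r) for r in references), default=0.0) for q in queries
    )
    return total / len(queries)


# Toy scene: one ground-truth box, one correct prediction, one spurious prediction.
ground_truth = [(0, 0, 0, 1, 1, 1)]
predicted = [(0, 0, 0, 1, 1, 1), (5, 5, 5, 6, 6, 6)]

precision = mean_best_iou(predicted, ground_truth)  # averaged over predictions
recall = mean_best_iou(ground_truth, predicted)     # averaged over ground truth
```

In this toy scene the spurious prediction halves precision to 0.5 while recall stays at 1.0. The same asymmetry is why over-segmenting methods can lose precision without losing recall.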
Mask recall comparison between INGRID and UncOS, showing INGRID's more consistent multi-view detection
Fig. 7 — Multi-view beats single-view. Against UncOS (single-view RGB-D), INGRID's multi-view aggregation gives more consistent detection in occlusion-heavy scenes, where UncOS swings sharply with viewpoint.
Contributions

What INGRID adds.

01

An interaction algorithm that autonomously induces change to expose occluded objects.

02

An efficient instance-wise geometric finetuning scheme driven by a visibility metric.

03

An optimization that jointly recovers instances and their transforms from a few images after change.

Conclusion & limitations

Honest edges.

INGRID reliably reconstructs instances and geometry in occlusion-prone scenes — with room to grow.

Fixed candidate tree. True instances must appear in the initially built tree; dynamic node splitting or merging could add flexibility.

Needs viewpoint coverage. Fully unobservable regions — e.g. items on a shelf — remain hard; shape priors like symmetry or superquadrics could help.

Rigid-body assumption. Deformable objects like cloth or dolls fall outside scope; a piecewise-rigid model is a promising extension.

Cite

BibTeX

Placeholder entry — replace with the final venue and author list once de-anonymized.

@inproceedings{
lee2026ingrid,
title={{INGRID}: Interactive Geometry and Instance Identification for Occluded Scenes},
author={Junho Lee and Sang Min Kim and Yonghyeon Lee and Young Min Kim},
booktitle={3rd Workshop on Semantic Reasoning and Goal Understanding in Robotics (RSS 2026)},
year={2026},
url={https://openreview.net/forum?id=vHBgykM7eZ}
}