NFL — Normal Field Learning for 6-DoF Grasping of Transparent Objects

Video

Overview video.

Abstract

We present Normal Field Learning (NFL), a robust yet practical solution to perceive 3D layouts of transparent objects and grasp them quickly. Conventional input modalities do not provide sufficient information for transparent objects, but with recent advances in datasets and algorithms for transparent objects, we can obtain noisy estimates of normals and masks for various real-world conditions. Instead of directly using RGB images, we use these estimates to train a neural volume that serves as an intermediate representation ignorant of challenging appearance variations. We formulate the training objective to account for inherent uncertainty in individual estimations, and together with volumetric aggregation, we can reliably extract useful geometric information for grasping. Our neural volume deploys a voxel-grid based representation, motivated by acceleration techniques of neural radiance fields — but we directly store normal and density values in the grid cells instead of latent features, allowing direct access to geometric values without additional inference or volume rendering. Our results show over 85% success rates in grasping in cluttered scenes with only 40 seconds of training time.

Why RGB-trained NeRF fails on glass

Density σ means opacity — and a perfectly transparent surface has none.

Dex-NeRF and Evo-NeRF threshold σ from an RGB-trained NeRF to render a depth image. That works when the object leaves visual evidence (edges, refraction). It collapses when it doesn't.

RGB-trained σ

Opacity, not existence.

For a perfectly transparent surface, the correct volume density is exactly zero — by definition. Any grasping pipeline that infers geometry from RGB-trained σ inherits this failure mode.

Normal-trained σ, n

Existence, directly.

NFL instead regresses pixel-wise surface normal and mask estimates. Here σ is only ever asked "is a surface here," decoupled entirely from how transparent that surface looks in RGB.

The pipeline

A probabilistic normal field, trained without RGB or MLPs.

Pre-trained networks give noisy per-pixel normal and mask estimates; NFL aggregates them into a coherent volume, weighting each pixel by how much to trust it — then grasps directly off the grid.

Stochastic normals & masks from RGB

A pre-trained estimator predicts normals under m color-jittered augmentations of each image; the spread of the m outputs at a pixel becomes its uncertainty, fit as a von Mises–Fisher distribution on S². The segmentation mask's per-pixel probability is modeled as a Bernoulli distribution.

Maximum-likelihood volume rendering

Standard NeRF-style ray marching accumulates n(x) and σ(x) into a projected normal per pixel, then minimizes its negative log-likelihood under the fitted vMF distribution — so confident (high-κ) pixels dominate the loss, sampled through the Bernoulli mask so background rays never contribute.

iii

Feature-free voxel grid

Built on DVGO's grid acceleration, but with the intermediate MLP removed entirely: each grid cell stores raw density and normal values, not latent features. A full scene trains in ~40 seconds from 30 real images.

Grasp candidates & collision-free planning

Threshold the density grid for surface points, pair up antipodal points within gripper width where both normals face each other (n·(x₂₁) ≥ 0.99), rank by summed density, then test 8 pitch angles per candidate against the surface point cloud for a collision-free path via PyBullet.

Diagram of the probabilistic normal field learning framework: RGB and augmented images feed mask and normal networks producing mask, normal mean, and normal uncertainty; these train a normal field volume; grasp poses are sampled from the resulting geometry — **Fig. 2 — Framework.** Test-time augmentation yields per-pixel normal mean μ and confidence κ, plus a mask probability. The volume is trained by maximum likelihood on these; grasps are then sampled directly from the resulting normal & density grid.

Ablation · input modality

Normals and masks reconstruct transparent geometry — RGB doesn't.

Same images, same camera poses, four input modalities. Depth accuracy δ_τ is the fraction of object pixels within τ of ground-truth depth (ClearGrasp protocol), on both grid-based (DVGO) and non-grid-based (NeRF) backbones.

Error maps of reconstructed depth for four input modalities (normal, mask, normal+mask, RGB) on DVGO and NeRF backbones, red is high error and blue is low error — **Fig. 3 — Error maps by modality.** RGB (right column) fails outright on both backbones; normal + mask together (third column) gives the cleanest reconstruction.

**Table I — Depth reconstruction by input modality.** Accuracy δ_τ (higher is better) at thresholds 5%, 10%, 25% of ground-truth depth. Bold = best per backbone.
Threshold	Normal (DVGO)	Mask (DVGO)	Normal+Mask (DVGO)	RGB (DVGO)	Normal (NeRF)	Mask (NeRF)	Normal+Mask (NeRF)	RGB (NeRF)
δ_0.05	11.48%	90.57%	96.85%	19.13%	83.01%	89.94%	93.40%	27.17%
δ_0.10	72.30%	94.18%	97.52%	38.02%	92.26%	94.33%	96.97%	56.25%
δ_0.25	99.10%	97.78%	98.17%	82.47%	99.66%	98.72%	99.93%	93.81%

Masks alone miss concave parts of a bottle; normals alone are noisy on a grid backbone. Combined, they beat RGB by 40–80 points at the tightest threshold.

Experiments · synthetic + real-world grasping

Best reconstruction, best grasp success, and an order of magnitude faster.

Evaluated against NeRF, DVGO, Dex-NeRF, Dex-DVGO (our thresholded-DVGO baseline) and GraspNeRF, on a photorealistic Blender scene and a real Franka Panda + RealSense D435i setup.

Depth and error maps for NFL, Dex-DVGO, and Dex-NeRF on the Blender synthetic scene — **Fig. 4 — Synthetic scene.** Top: rendered depth. Bottom: error vs. ground truth (red = high error, blue = low error). NFL is uniformly accurate; Dex-DVGO fails outright.

**Table II — Depth reconstruction, Blender scene.** 3 glass objects, 100 input images, trained to convergence. Bold = best.
Model	δ_0.05	δ_0.10	δ_0.25	Train time
NFL	85.35%	92.64%	97.49%	2 min
Dex-DVGO	20.48%	29.25%	46.50%	15 min
DVGO	19.13%	38.02%	82.47%	15 min
Dex-NeRF	74.96%	81.39%	95.15%	12 hr
NeRF	27.17%	56.25%	93.81%	12 hr

Removing the MLP that grid-based baselines need for novel-view synthesis is what takes NFL from 15 minutes to 2 minutes.

Reconstructed geometry across real-world, Dex-NeRF-dataset, and Blender scenes, comparing NFL, GraspNeRF, Dex-NeRF, and Dex-DVGO — **Fig. 5 — Robustness across scenes.** NFL builds a coherent normal field with the same setup across all three scenes; GraspNeRF finds the ground but its object geometry fluctuates; Dex-NeRF and Dex-DVGO fail on our low-visual-cue real-world images.

Ten glass laboratory objects used for grasping, and the RealSense D435i camera mounted on the Franka Panda gripper — **Fig. 6 — Real-world setup.** Ten glass lab objects of varying size, imaged by a RealSense D435i 3D-printed to the Franka Panda end effector.

**Table III — Real-world grasp success, 7 trials per configuration.** Single Small / Single Big are sized relative to gripper width; Clutter packs 6 objects into a 30×30 cm region. Bold = best.
Model	Single Small	Single Big	Clutter	Time
NFL	71.43%	57.14%	85.71%	40 sec
GraspNeRF	14.28%	14.28%	28.57%	90 ms
Dex-NeRF	28.57%	0%	28.57%	12 hr
Dex-DVGO	0%	0%	0%	15 min
NeRF	0%	0%	14.28%	12 hr
DVGO	14.28%	0%	0%	15 min

Every baseline collapses on Single Big, where the grasp width is comparable to the open gripper and pin-point geometry is required. NFL still succeeds over half the time. In Clutter, its rich, holistic reconstruction gives more candidate grasps to choose from — GraspNeRF answers fastest but its geometry is the least reliable of the two runners-up.

Contributions

What NFL adds.

Trains the neural volume from estimated surface normals and masks rather than raw RGB, giving markedly more accurate geometric reconstruction for transparent objects.

A probabilistic framework robust to prediction errors, modeling normal uncertainty as von Mises–Fisher and mask uncertainty as Bernoulli, both folded into the training loss.

Removes the MLP that grid-based NeRF variants keep for novel-view synthesis, since normal fields don't need it — cutting training to ~40 seconds.

Grasps directly from grid values — no volume rendering of depth images required — unlike prior NeRF-based grasping work (Dex-NeRF, Evo-NeRF).

Conclusion & limitations

Honest edges.

NFL reconstructs accurate, holistic 3D geometry of transparent objects in 40 seconds and grasps reliably across configurations — with room to grow.

Still slower than feed-forward methods. Even with the MLP removed, NFL is slower than GraspNeRF's ~90ms inference; faster training would help refresh geometry mid sequential-grasping.

Needs a bounded, surrounded workspace. NFL reconstructs from images captured around a known volume — it wasn't evaluated on datasets like HAMMER, which only image objects from one side.

Cite

BibTeX

@article{lee2024nfl,
  author={Lee, Junho and Kim, Sang Min and Lee, Yonghyeon and Kim, Young Min},
  journal={IEEE Robotics and Automation Letters},
  title={{NFL}: Normal Field Learning for 6-{DoF} Grasping of Transparent Objects},
  year={2024},
  volume={9},
  number={1},
  pages={819-826},
  doi={10.1109/LRA.2023.3336108}
}