IEEE/RSJ IROS, 2022
Junho Lee, Junhwa Hur, Inwoo Hwang, Young Min Kim

MasKGrasp: Mask-based Grasping for Scenes with Multiple General Real-world Objects

Depth and RGB both change drastically between transparent and opaque objects, and so does the behavior of grasping pipelines built on them. MasKGrasp instead grasps from instance masks, a representation that stays the same regardless of what an object is made of and that directly shows the gripper which objects are crowding each other.

Robotic grasping environment with a Franka Panda arm above a set of transparent and opaque objects
Panels: Input (RGB image of the grasping scene), Detection (instance masks for each object regardless of transparency), Grasp pose (estimated pose overlaid on the scene), Grasp quality (predicted heatmap)
Fig. 1 — Overview. Given an input RGB image, MasKGrasp detects instance masks for both transparent and opaque objects, then estimates a grasp pose and quality map for each instance.

Abstract

We introduce a mask-based grasping method that discerns multiple objects within a scene regardless of transparency or specularity and finds the optimal grasp position avoiding clutter. Conventional vision-based robotic grasping approaches often fail to extend to scenes containing transparent objects due to their different visual appearance. To handle the different visual characteristics, we first segment both transparent and opaque objects into instance masks — a domain-agnostic intermediate representation of both object types — using a neural network. While no labelled training dataset strongly represents both object types, we overcome this by augmenting transparent objects onto an existing large-scale dataset. Then, given the object instance masks, our method selects the top K discrete masks and robustly estimates grasp poses avoiding clutter. Through experiments, we verify that instance masks are lightweight yet provide sufficient information for vision-based grasping agnostic of appearance. On an unseen real-world test environment with complex objects, our method substantially outperforms previous methods without fine-tuning.

Why depth- and RGB-based grasping breaks

Material changes the signal. It shouldn't change the grasp.

ClearGrasp completes noisy depth for transparent objects before grasping; GG-CNN regresses grasps straight from depth. Both inherit whatever the sensor gets wrong.

Depth / RGB

Appearance-coupled.

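Transparent surfaces transmit and refract most of a depth sensor's light instead of reflecting it back cleanly, so their depth readings come back noisy or missing, and their RGB appearance shifts wildly with background and lighting. A pipeline trained on one material regime degrades on the other.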

Instance masks

Material-agnostic.

A mask only encodes "this region is one object" — it looks the same whether that object is a steel bottle or a wine glass. Train the grasp estimator on masks and it transfers across materials for free.

The pipeline

Detect instance masks, then grasp from masks alone.

Two CNNs: a detector trained on a transparent-augmented MS-COCO, and a grasp estimator adapted from GG-CNN to take stacked instance masks instead of depth.

i

Synthesize transparent objects into MS-COCO

No dataset represents opaque and transparent objects with equal weight, so we build one: sample a 3D transparent object from TOM-Net, decompose it into a mask, attenuation map, and refractive-flow map, then composite it onto an MS-COCO image via the image-matting equation — repeated 0–3 times per image to simulate occlusion.
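
The compositing step above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the matting form used by TOM-Net, I = (1 - m)·B + m·ρ·warp(B, flow), where m is the object mask, ρ the attenuation map, and the warp looks up the background at the refracted location. The function name `composite_transparent`, the array layouts, and the nearest-neighbour warp are illustrative assumptions, not the authors' code.

```python
import numpy as np

def composite_transparent(background, mask, attenuation, flow):
    """Composite one transparent object onto a background image.

    background:  (H, W, 3) float array in [0, 1]
    mask:        (H, W) float array, 1 inside the object and 0 outside
    attenuation: (H, W) float array in [0, 1], fraction of light kept
    flow:        (H, W, 2) float array of (dx, dy) refractive offsets in pixels
    """
    h, w = mask.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Refraction: each object pixel shows the background seen at an offset location.
    # Nearest-neighbour lookup keeps the sketch short (assumption; bilinear is smoother).
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    refracted = background[src_y, src_x]
    m = mask[..., None]
    rho = attenuation[..., None]
    # Outside the mask the background is untouched; inside it is refracted and dimmed.
    return (1.0 - m) * background + m * rho * refracted

# Toy example: a 2x2 "object" on a random 4x4 background.
rng = np.random.default_rng(0)
background = rng.random((4, 4, 3))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
attenuation = np.full((4, 4), 0.8)
flow = np.zeros((4, 4, 2))
flow[1, 1] = (1.0, 0.0)  # pixel (row 1, col 1) looks one pixel to the right
augmented = composite_transparent(background, mask, attenuation, flow)
```

Calling the function up to three times on the same image, each time with a different object, would mimic the 0–3 repeated composites described above; the object's mask doubles as its instance-mask label.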

ii

Detect instance masks for every object

A Mask R-CNN detector F, trained on 114,000 augmented images, maps an input RGB image to instance masks M₁:N for both transparent and opaque objects, with no material-specific branch required.

iii

Grasp estimation with clutter-avoidance training

The top K=4 masks stack into a K-channel input (replacing GG-CNN's 1-channel depth). Training pairs are composed from the Jacquard dataset so that ground-truth quality is suppressed wherever another object's vicinity overlaps — teaching the network G to avoid clutter, not just find any graspable point.
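
Building the K-channel input might look like the sketch below. It assumes masks are ranked by detection confidence and that missing channels are zero-filled when fewer than K objects are detected; the text states neither detail, and `stack_top_k_masks` is a hypothetical helper name.

```python
import numpy as np

def stack_top_k_masks(masks, scores, k=4):
    """Return a (k, H, W) array holding the k highest-scoring masks.

    masks:  (N, H, W) binary array of instance masks
    scores: (N,) detection confidences (ranking by score is an assumption)
    """
    masks = np.asarray(masks, dtype=np.float32)
    order = np.argsort(scores)[::-1][:k]        # indices of the k best detections
    stacked = np.zeros((k,) + masks.shape[1:], dtype=np.float32)
    stacked[:len(order)] = masks[order]         # unused channels stay all-zero
    return stacked

# Five toy 3x3 masks with made-up confidences.
masks = np.zeros((5, 3, 3))
for i in range(5):
    masks[i, i % 3, :] = 1
scores = [0.2, 0.9, 0.5, 0.7, 0.1]
net_input = stack_top_k_masks(masks, scores, k=4)
```

Here `net_input` has shape (4, 3, 3), with the highest-confidence mask in channel 0.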

iv

Read the grasp straight off the maps

The grasp point is the arg-max of the quality map Q. The index i of the instance mask at that point selects the matching theta map Θi, which gives the grasp angle there. Width needs no extra network: intersect the grasp line with that instance's mask boundary and measure the span.
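
The read-out can be sketched as below. This sketch makes several assumptions: angles are in radians measured from the image x-axis, the grasp line runs along that angle, exactly one mask covers the grasp point, and the span is measured by walking pixel by pixel until the mask ends. `read_grasp` and its arguments are hypothetical names.

```python
import numpy as np

def read_grasp(quality, thetas, masks, max_half_width=100):
    """Pick the best grasp from a quality map and per-instance angle maps.

    quality: (H, W) grasp quality map Q
    thetas:  (K, H, W) angle maps in radians, one per instance
    masks:   (K, H, W) binary instance masks
    Returns (row, col, instance index, angle, width in pixels).
    """
    row, col = np.unravel_index(np.argmax(quality), quality.shape)
    # Instance whose mask covers the grasp point (assumes exactly one does).
    inst = int(np.argmax(masks[:, row, col]))
    angle = float(thetas[inst, row, col])
    # Walk outward along the grasp line in both directions until leaving the mask.
    h, w = quality.shape
    dy, dx = np.sin(angle), np.cos(angle)
    width = 0
    for sign in (1, -1):
        for step in range(1, max_half_width):
            r = int(round(row + sign * step * dy))
            c = int(round(col + sign * step * dx))
            if not (0 <= r < h and 0 <= c < w) or masks[inst, r, c] == 0:
                break
            width += 1
    return int(row), int(col), inst, angle, width + 1  # +1 counts the centre pixel

# Toy scene: one horizontal bar, best grasp at its centre, angle 0.
quality = np.zeros((7, 7))
quality[3, 3] = 1.0
masks = np.zeros((1, 7, 7))
masks[0, 2:5, 1:6] = 1
thetas = np.zeros((1, 7, 7))
grasp = read_grasp(quality, thetas, masks)
```

A real system would still need to convert the pixel width into a metric gripper opening using the camera calibration, which this sketch leaves out.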

Pipeline for synthesizing transparent objects onto MS-COCO images via mask, attenuation, and refractive-flow maps
Fig. 2 — Dataset augmentation. A background image, a transparent object's mask/attenuation/flow decomposition, and the resulting augmented image with instance mask.
Overview diagram: detection network produces instance masks, grasp estimator consumes top-K masks to produce a quality map and theta maps, yielding the best grasp
Fig. 3 — Full pipeline. Detection (F) produces instance masks; grasp estimation (G) consumes the top-K masks to output a quality map and per-instance theta maps.
Clutter-avoidance dataset construction: two masks and their quality maps are combined, with overlapping vicinity regions suppressed to zero in the ground-truth quality map
Fig. 4 — Clutter-avoidance training data. Two objects' masks and quality maps are summed; pixels where their vicinities (gray) overlap are zeroed out in the ground-truth Q map, teaching the network to avoid crowded grasps.
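
The construction in Fig. 4 can be approximated as follows. This is a sketch under stated assumptions: an object's "vicinity" is taken to be its mask grown by a fixed pixel radius with a square neighbourhood, which the page does not specify, and `vicinity`, `clutter_aware_quality`, and `radius` are hypothetical names.

```python
import numpy as np

def vicinity(mask, radius):
    """Grow a binary mask by `radius` pixels (square neighbourhood) via shifted ORs."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius)  # zero (False) padding on every side
    grown = np.zeros((h, w), dtype=bool)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            grown |= padded[dy:dy + h, dx:dx + w]
    return grown

def clutter_aware_quality(mask_a, q_a, mask_b, q_b, radius=2):
    """Sum two objects' quality maps, zeroing pixels where their vicinities overlap."""
    overlap = vicinity(mask_a, radius) & vicinity(mask_b, radius)
    return np.where(overlap, 0.0, q_a + q_b)

# Two adjacent objects on a 3x9 grid, each with quality 1 everywhere on its mask.
mask_a = np.zeros((3, 9))
mask_a[:, :4] = 1
mask_b = np.zeros((3, 9))
mask_b[:, 4:] = 1
q_gt = clutter_aware_quality(mask_a, mask_a.copy(), mask_b, mask_b.copy(), radius=1)
```

In this toy case the two columns on either side of the shared border are zeroed, so a network trained on `q_gt` is pushed away from grasps squeezed between the objects.
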
Experiments · real-world grasping

Only method to clear 50% success on every configuration.

Tested with a Franka Panda and a RealSense 435i on 24 real objects, none of which appeared in the training data of any method, with 13 trials per configuration.

Real-world grasping environment with the Franka Panda arm, plus the plain and complex test object sets
Fig. 5 — Test setup. (a) Grasping environment. (b) 10 plain objects — simple, cylindrical, plainly textured. (c) 14 complex objects — toys and transparent items with challenging geometry.
Table I — Real-world grasp success rate. T = transparent, O = opaque. Bold = best per configuration.
Configuration | MasKGrasp (Ours) | ClearGrasp | GG-CNN
Plain T.      | 53.8%            | 53.8%      | 38.4%
Plain O.      | 53.8%            | 53.8%      | 76.7%
Complex T.    | 61.5%            | 46.1%      | 15.3%
Complex O.    | 69.2%            | 69.2%      | 53.8%

GG-CNN wins on Plain O. (its exact training regime) but collapses on transparent objects, where its depth input is noisy. ClearGrasp matches MasKGrasp on three configurations but falls behind on Complex T. MasKGrasp is the only method above 50% everywhere, and depth completion costs ClearGrasp ~1.19 s per grasp versus ~0.001 s for ours.

Quality map comparison between MasKGrasp and GG-CNN on opaque and transparent scenes
Fig. 6 — vs. GG-CNN. GG-CNN's depth input turns to noise on transparent objects, and its Q map follows; ours stays clean on both materials.
Quality map comparison between MasKGrasp and ClearGrasp on plain and complex object scenes
Fig. 7 — vs. ClearGrasp. ClearGrasp completes depth well for simple cylinders close to its training distribution, but fails on complex geometry like edged cups and flat jars.
Table II — Effect of clutter-avoidance training. Grasp success rate with vs. without the clutter-aware training scheme.
Configuration             | Clutter | No clutter
With clutter avoidance    | 69.2%   | 92.3%
Without clutter avoidance | 25.3%   | 84.6%

Both variants do fine when objects are isolated; only the clutter-aware model holds up once objects are touching.

Quality maps with and without clutter avoidance across two cluttered scenes
Fig. 8 — Clutter avoidance in the Q map. Without it (bottom row), the model happily proposes grasps that would collide with a neighboring object; with it (top row), those pixels are suppressed.
Table III — Instance segmentation with augmentation. Mask R-CNN trained on MS-COCO vs. our transparent-augmented version.
Eval set   | Method   | AP₀.₅ | AP₀.₇₅ | IoU
MS-COCO(T) | Baseline | 51.1  | 28.3   | 0.523
MS-COCO(T) | Ours     | 57.4  | 36.7   | 0.544
MS-COCO(O) | Baseline | 27.2  | 14.2   | 0.358
MS-COCO(O) | Ours     | 27.7  | 14.8   | 0.337

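The augmentation lifts both AP metrics on transparent objects (51.1 to 57.4 and 28.3 to 36.7) while keeping opaque-object AP on par (27.2 to 27.7 and 14.2 to 14.8); opaque IoU dips slightly, from 0.358 to 0.337.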
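
For reference, the IoU of a predicted mask against a ground-truth mask is the intersection area divided by the union area. The sketch below shows that standard definition for a single pair of masks; how the page's IoU column aggregates over instances and images is not stated, and `mask_iou` is a hypothetical helper name.

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection-over-union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 0.0  # both masks empty: define IoU as 0 (a convention, not from the source)
    return float(np.logical_and(pred, target).sum() / union)

# One shared pixel, three pixels covered in total.
iou = mask_iou([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```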

Segmentation comparison with and without transparent augmentation across ClearGrasp, our robot environment, and MS-COCO images
Fig. 9 — Segmentation, with vs. without augmentation. The augmented model (middle row) catches transparent objects the MS-COCO-only baseline (bottom row) misses or blends into the background.
Quality maps regressed from instance masks versus from depth, showing near-identical results
Fig. 10 — Mask vs. depth as grasp-estimator input. Two GG-CNN variants, one fed masks and one fed depth, produce visually near-identical Q maps — 77.5% vs. 80.6% accuracy on the Jacquard dataset. Masks carry almost as much grasping signal as depth, with none of depth's material sensitivity.
Contributions

What MasKGrasp adds.

01

A mask-based grasping approach that handles opaque and transparent objects with the same network, showing instance masks carry enough geometric context even in multi-object scenes.

02

A grasp estimator that explicitly considers free space between instance masks, predicting the highest-probability grasp while avoiding cluttered regions.

03

A large-scale instance segmentation dataset covering both object types, built by augmenting MS-COCO with synthetic transparent objects — no new manual annotation required.

04

A Mask R-CNN trained on that dataset generalizes robustly to real transparent and opaque objects alike, without sacrificing accuracy on ordinary objects.

Conclusion & future work

Honest edges.

MasKGrasp generalizes to unseen transparent and opaque objects in the real world without fine-tuning — within the bounds of a 2D grasp representation.

Fixed grasping height. MasKGrasp is a 2D planar-grasp algorithm and struggles with objects that need a grasp height it doesn't assume.

No full 6-DoF pose. Aggregating instance masks across multiple views could recover 3D structure — even for transparent and specular scenes that defeat depth sensors — and extend to full 6-DoF grasping and more complex manipulation.

Cite

BibTeX

@inproceedings{lee2022maskgrasp,
  author={Lee, Junho and Hur, Junhwa and Hwang, Inwoo and Kim, Young Min},
  booktitle={2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  title={{MasKGrasp}: Mask-based Grasping for Scenes with Multiple General Real-world Objects},
  year={2022},
  pages={3137-3144},
  doi={10.1109/IROS47612.2022.9982130}
}