(A) Training: Our system is trained in three stages.
In Stage 1, we train a prediction model \(\pi_{\text{pre}}\) through supervised learning that takes point cloud input and predicts the optimal target position \(P_t\) for object repositioning.
Stage 2 focuses on training three low-level skills via reinforcement learning: a pushing policy \(\pi_{\text{push}}\) that repositions objects to target locations, and two policies \(\pi_{\text{wall}}, \pi_{\text{edge}}\) that enable grasping of ungraspable objects from walls and table edges via extrinsic dexterity.
In Stage 3, we jointly fine-tune these policies to ensure smooth transitions between consecutive skills.
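The three-stage schedule above can be sketched as follows. This is a minimal illustration only: the function names (`fit_supervised`, `train_rl`, `finetune_jointly`) and the `skill=` interface are hypothetical placeholders, not the system's actual training API.

```python
def train_three_stages(fit_supervised, train_rl, finetune_jointly, data, env):
    """Hypothetical three-stage training schedule (interfaces are illustrative)."""
    # Stage 1: supervised learning of the prediction model pi_pre,
    # which maps point clouds to target positions P_t.
    pi_pre = fit_supervised(data)

    # Stage 2: reinforcement learning of the three low-level skills.
    pi_push = train_rl(env, skill="push")   # reposition objects to targets
    pi_wall = train_rl(env, skill="wall")   # grasp against a wall
    pi_edge = train_rl(env, skill="edge")   # grasp at a table edge

    # Stage 3: joint finetuning for smooth transitions between skills.
    return finetune_jointly(pi_pre, pi_push, pi_wall, pi_edge)
```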
(B) Inference: During inference, our system first uses \(\pi_{\text{pre}}\) to process the environmental point cloud, determining whether to execute \(\pi_{\text{wall}}\) or \(\pi_{\text{edge}}\) while simultaneously predicting the corresponding target position \(P_t\).
The pushing policy \(\pi_{\text{push}}\) then moves the object to this target position, followed by the selected extrinsic dexterity policy (\(\pi_{\text{wall}}\) or \(\pi_{\text{edge}}\)) to complete the grasp.
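The inference procedure can be sketched as below. All policy call signatures here are hypothetical stand-ins (the paper does not specify an API); the sketch only shows the control flow: predict skill and target, push, then grasp.

```python
def infer_and_grasp(point_cloud, pi_pre, pi_push, pi_wall, pi_edge):
    """Hypothetical inference pipeline: skill selection, pushing, then grasping."""
    # pi_pre consumes the environmental point cloud and outputs both a
    # discrete skill choice ("wall" or "edge") and a target position P_t.
    skill_choice, target_pos = pi_pre(point_cloud)

    # The pushing policy moves the object to the predicted target position.
    pi_push(point_cloud, target_pos)

    # The selected extrinsic-dexterity policy completes the grasp.
    grasp_policy = pi_wall if skill_choice == "wall" else pi_edge
    grasp_policy(point_cloud)
    return skill_choice, target_pos
```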