ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation

To manipulate objects as dexterously as humans do, robots need accurate 6D pose estimation: determining an object's position and orientation in 3D space. While object pose estimation has advanced considerably, real-world challenges remain, particularly in dynamic tasks involving occlusion and contact.

Recent work has combined visual and tactile sensing to improve pose tracking, as tactile feedback helps reveal parts of objects hidden from the camera. However, visuotactile datasets are difficult to collect, and learned models often fail to generalize across different sensors and hardware setups.
We propose ViTa-Zero, a zero-shot framework that combines visual and tactile data to estimate and track the 6D pose of novel objects. Using visual models as a backbone, our method addresses situations where visual estimates fail, applying physical constraints (contact, penetration, and kinematics) to refine estimates in real time from tactile and proprioceptive feedback.
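To make the constraint idea concrete, here is a minimal sketch of a contact/penetration test for a pose hypothesis. The helper names, the sphere stand-in for the object geometry, and the tolerance are illustrative assumptions, not the paper's implementation; a real system would query the object mesh's signed distance field at the tactile contact points.

```python
import numpy as np

def signed_distance_sphere(points, center, radius):
    """Signed distance of points to a sphere: negative inside, zero on
    the surface, positive outside. Stand-in for an object mesh SDF."""
    return np.linalg.norm(points - center, axis=1) - radius

def violates_constraints(contact_pts, center, radius, tol=1e-3):
    """Reject a pose hypothesis if any sensed contact point penetrates
    the object (distance < -tol) or floats off its surface while the
    sensor reports contact (distance > tol). Near-zero signed distance
    satisfies both the contact and non-penetration constraints."""
    d = signed_distance_sphere(contact_pts, center, radius)
    penetration = d < -tol   # fingertip would be inside the object
    no_contact = d > tol     # sensor reports touch but pose says no touch
    return bool(np.any(penetration | no_contact))
```

A tracker can use such a test to discard or re-optimize visual pose estimates that are physically inconsistent with the tactile reading.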
Our experiments on real-world robotic setups show that ViTa-Zero significantly outperforms baseline visual models without any fine-tuning on tactile datasets. We use FoundationPose and MegaPose as backbones and compare against them directly. Our framework consistently improves on its visual backbone, achieving an average increase of 55% in the AUC of ADD-S and 60% in ADD, along with an 80% lower position error (PE) compared to FoundationPose.
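For readers unfamiliar with the reported metrics, the standard definitions of ADD and ADD-S can be sketched as follows (the function names are our own; the formulas follow the usual pose-estimation convention):

```python
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between *corresponding* model points
    transformed by the ground-truth and estimated poses."""
    p_gt = model_pts @ R_gt.T + t_gt
    p_est = model_pts @ R_est.T + t_est
    return np.linalg.norm(p_gt - p_est, axis=1).mean()

def add_s_metric(model_pts, R_gt, t_gt, R_est, t_est):
    """ADD-S: for symmetric objects, mean distance from each
    ground-truth point to its *closest* estimated point."""
    p_gt = model_pts @ R_gt.T + t_gt
    p_est = model_pts @ R_est.T + t_est
    # Pairwise distances; fine for small models, use a KD-tree at scale.
    d = np.linalg.norm(p_gt[:, None, :] - p_est[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

The AUC numbers in the abstract integrate the accuracy of these metrics over a range of distance thresholds, so higher is better.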
Method Overview

Supplementary Video
Related Works
2025
- ICRA · ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation. In IEEE International Conference on Robotics and Automation (ICRA), 2025. More details coming soon...
2024
- IROS · HyperTaxel: Hyper-Resolution for Taxel-Based Tactile Signal Through Contrastive Learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024. Unfortunately, we are unable to publish the code and the dataset per company policy.
2023
- RA-L / ICRA · ViHOPE: Visuotactile In-Hand Object 6D Pose Estimation with Shape Completion. IEEE Robotics and Automation Letters, 2023. Unfortunately, we are unable to publish the code and the dataset per company policy.
Presented at ICRA 2024 in Yokohama, Japan.
Presented at NeurIPS 2023 Workshop on Touch Processing in New Orleans, LA.