ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation

Hongyu Li1,2, James Akl1, Srinath Sridhar1,2, Tye Brady1, and Taskin Padir1,3
1 Amazon Fulfillment Technologies & Robotics,  2 Brown University,  3 Northeastern University


To enable robots to manipulate objects as humans do, accurate 6D pose estimation (determining an object's position and orientation in 3D space) is essential. While progress has been made in object pose estimation, real-world challenges remain, particularly in dynamic tasks involving occlusions and contact.

Recent work has combined visual and tactile sensing to improve pose tracking, since tactile feedback reveals parts of an object that are hidden from the camera. However, collecting visuotactile datasets is difficult, and learned models often struggle to generalize across different sensors and hardware setups.

We propose ViTa-Zero, a zero-shot framework that combines visual and tactile data to estimate and track the 6D pose of novel objects. Using visual models as a backbone, our method handles situations where visual estimates fail, applying physical constraints (contact, penetration, and kinematics) to refine estimates in real time using tactile and proprioceptive feedback.
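As a concrete illustration, here is a minimal sketch of what a contact/penetration feasibility check could look like, assuming the object model is available as a signed-distance function (SDF); the function names, interfaces, and tolerances below are our own assumptions, not the paper's implementation:

```python
import numpy as np

def pose_is_feasible(sdf, finger_points, contact_mask, T,
                     contact_tol=2e-3, penetration_tol=1e-3):
    """Check a candidate object pose T (4x4, object-to-world) against
    tactile and proprioceptive constraints:
      - contact: taxels reporting contact should lie on the object surface;
      - penetration: no finger point should lie inside the object.
    `sdf(p)` returns signed distances (negative inside) from points p,
    expressed in the object frame, to the object surface."""
    # Map finger points from the world frame into the object frame.
    T_inv = np.linalg.inv(T)
    p_obj = finger_points @ T_inv[:3, :3].T + T_inv[:3, 3]
    d = sdf(p_obj)
    # Contact constraint: touching taxels must be (near) the surface.
    contact_ok = np.all(np.abs(d[contact_mask]) < contact_tol)
    # Penetration constraint: no point may be deeper than the tolerance.
    no_penetration = np.all(d > -penetration_tol)
    return bool(contact_ok and no_penetration)
```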

Our experiments on real-world robotic setups show that ViTa-Zero significantly outperforms baseline visual models without any fine-tuning on tactile data. We use FoundationPose and MegaPose as backbones and compare directly against them. Our framework consistently improves upon its visual backbone, achieving an average increase of 55% in the AUC of ADD-S and 60% in ADD, along with an 80% lower position error (PE), compared to FoundationPose.
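For reference, ADD and ADD-S are standard point-distance pose-error metrics. Below is a minimal NumPy sketch of how they are commonly computed; the function names and the brute-force nearest-neighbor matching are ours, for illustration only:

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_est, t_est):
    """ADD: mean distance between corresponding model points under the
    ground-truth and estimated poses (pts is an (N, 3) array)."""
    gt = pts @ R_gt.T + t_gt
    est = pts @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()

def add_s_metric(pts, R_gt, t_gt, R_est, t_est):
    """ADD-S: symmetric variant; each transformed ground-truth point is
    matched to its nearest estimated point, so symmetric objects are
    not penalized for visually equivalent poses."""
    gt = pts @ R_gt.T + t_gt
    est = pts @ R_est.T + t_est
    # Brute-force pairwise distances; fine for a few thousand points.
    d = np.linalg.norm(gt[:, None, :] - est[None, :, :], axis=2)
    return d.min(axis=1).mean()
```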

Method Overview

Initially, a visual model estimates the pose, denoted \(T\). Then, we assess the feasibility of \(T\) using constraints derived from the tactile signals and proprioception. If \(T\) does not satisfy these constraints, we refine it through our test-time optimization algorithm using tactile and proprioceptive observations, yielding the final pose estimate \(T^*\).
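In pseudocode form, this estimate-check-refine flow might look like the following sketch, where the visual backbone, the feasibility check, and the test-time optimizer are passed in as callables; all interfaces here are illustrative assumptions rather than the actual ViTa-Zero API:

```python
from dataclasses import dataclass
from typing import Any, Callable

import numpy as np

@dataclass
class Observation:
    rgb_depth: Any        # visual input for the backbone
    tactile: np.ndarray   # per-taxel contact signals
    proprio: np.ndarray   # finger joint states / fingertip positions

def estimate_pose(obs: Observation,
                  visual_backbone: Callable[[Any], np.ndarray],
                  is_feasible: Callable[..., bool],
                  refine: Callable[..., np.ndarray]) -> np.ndarray:
    # Step 1: the visual backbone (e.g., FoundationPose or MegaPose)
    # produces an initial pose estimate T.
    T = visual_backbone(obs.rgb_depth)
    # Step 2: check T against the tactile/proprioceptive constraints.
    if is_feasible(T, obs.tactile, obs.proprio):
        return T  # constraints satisfied: T* = T
    # Step 3: otherwise, run test-time optimization to obtain T*.
    return refine(T, obs.tactile, obs.proprio)
```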

Supplementary Video

Related Works

2025

  1. ICRA
    ViTa-Zero: Zero-shot Visuotactile Object 6D Pose Estimation
    In IEEE International Conference on Robotics and Automation (ICRA), 2025
    More details coming soon...

2024

  1. IROS
    HyperTaxel: Hyper-Resolution for Taxel-Based Tactile Signal Through Contrastive Learning
    In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024
Unfortunately, we are unable to release the code and dataset due to company policy.

2023

  1. RA-L / ICRA
    ViHOPE: Visuotactile In-Hand Object 6D Pose Estimation with Shape Completion
Hongyu Li, Snehal Dikhale, Soshi Iba, and Nawid Jamali
    IEEE Robotics and Automation Letters, 2023
Unfortunately, we are unable to release the code and dataset due to company policy.
Presented at ICRA 2024 in Yokohama, Japan.
Presented at the NeurIPS 2023 Workshop on Touch Processing in New Orleans, LA.