Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

Hanxin Zhang¹,², Mingshuo Xu¹,², Abdulqader Dhafer¹,², Shigang Yue², Hongbiao Dong³, Zhou Daniel Hao¹,²✉

¹DANiLab, University of Leicester
²School of Computing and Mathematical Sciences, University of Leicester
³School of Metallurgy and Materials, University of Birmingham

Accepted at the International Conference on Machine Learning (ICML) 2026

VLA generalization can be predicted through action attribution.

For instance, consider the task "stack the other cups on the top of the red cup".

Failed Trials

Action decisions rely on nuisance visual cues (e.g., background, texture, and shadows).

Successful Trials

Action decisions rely on task-relevant cues (e.g., manipulator, end-effector, and cups).

Key Highlights

"Interventional attribution reveals the causality between visual inputs and action outputs in VLA policies. Quantifying this causality enables prediction of out-of-distribution generalization."

Interpretable

Enables post-hoc explanation of VLA trials by identifying which visual regions drive the policy's action decisions.

Predictive

Predicts how well the VLA policy generalizes to OOD tasks by measuring its reliance on nuisance visual regions.

Faithful

Provides heatmaps that faithfully reflect the visual regions a VLA policy relies on for action prediction.

Plug-and-Play

Requires no changes to the VLA architecture or additional probes, intervening only on visual inputs.

Motivation

"How can we diagnose out-of-distribution generalization failures in VLA policies?"

Seen Task: close the red jar (camera views: Front, Overhead, Wrist)

Unseen Task: close microwave (camera views: Front, Overhead, Wrist)

Methods

Two measures assess how much a VLA policy's generated actions rely on task-irrelevant visual regions.

ISS

Generates heatmaps that identify visual regions affecting actions via perturbation.
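
This page does not state the ISS formula, so the following is only a minimal sketch of a generic perturbation-based attribution, assuming random grid-cell masking with a mean-value fill. The names `perturbation_heatmap` and `toy_policy`, the grid size, and the fill strategy are all hypothetical. The defaults `p=0.3` and `n_samples=100` borrow the values reported in the Hyperparameter card, without claiming they play the same role in the paper.

```python
import numpy as np

def perturbation_heatmap(policy, image, grid=4, p=0.3, n_samples=100, seed=0):
    """Score each grid cell by the mean action change observed when it is perturbed."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    base_action = policy(image)
    fill = image.mean()  # assumption: perturb by replacing a cell with the image mean
    ys = np.linspace(0, h, grid + 1).astype(int)  # cell boundaries along height
    xs = np.linspace(0, w, grid + 1).astype(int)  # cell boundaries along width
    effect_sum = np.zeros((grid, grid))
    hit_count = np.zeros((grid, grid))
    for _ in range(n_samples):
        mask = rng.random((grid, grid)) < p  # True = this cell is perturbed
        perturbed = image.copy()
        for i, j in zip(*np.nonzero(mask)):
            perturbed[ys[i]:ys[i + 1], xs[j]:xs[j + 1]] = fill
        delta = np.linalg.norm(policy(perturbed) - base_action)
        effect_sum += mask * delta  # credit the change to every perturbed cell
        hit_count += mask
    return effect_sum / np.maximum(hit_count, 1)

# Hypothetical stand-in for a VLA policy: the "action" depends only on the
# top-left 8x8 region of the image.
def toy_policy(img):
    return np.array([img[:8, :8].mean()])

image = np.zeros((16, 16))
image[:8, :8] = 1.0
heat = perturbation_heatmap(toy_policy, image)
```

In this toy setup the cells covering the top-left region should receive the highest scores. Cells outside it still receive non-zero scores, because they are sometimes perturbed together with relevant cells; this is a property of this simple estimator, not a statement about ISS.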

ISS Stream

Generates temporal heatmaps over an entire episode via linear interpolation.
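
The page says only that the temporal heatmaps come from linear interpolation. As a rough sketch, assume heatmaps are computed at a few keyframe steps and interpolated pixel-wise for the steps in between; the function name `interpolate_heatmaps` and the keyframe layout are hypothetical.

```python
import numpy as np

def interpolate_heatmaps(key_steps, key_heatmaps, n_steps):
    """Linearly interpolate keyframe heatmaps (K, H, W) to every step 0..n_steps-1."""
    key_steps = np.asarray(key_steps, dtype=float)        # must be increasing
    key_heatmaps = np.asarray(key_heatmaps, dtype=float)  # shape (K, H, W)
    k, h, w = key_heatmaps.shape
    flat = key_heatmaps.reshape(k, -1)                    # one column per pixel
    t = np.arange(n_steps)
    # Interpolate each pixel's value independently along the time axis.
    out = np.stack(
        [np.interp(t, key_steps, flat[:, i]) for i in range(h * w)], axis=1
    )
    return out.reshape(n_steps, h, w)

# Two keyframes (steps 0 and 4) expanded into a 5-step stream.
stream = interpolate_heatmaps([0, 4], [np.zeros((2, 2)), np.ones((2, 2))], n_steps=5)
```

Steps outside the keyframe range are held at the nearest keyframe value, which is how `np.interp` treats out-of-range inputs.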

NMR@k

Evaluates the overlap between the top-k% of ISS heatmap regions and nuisance regions.
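
One plausible reading of this overlap, offered as an assumption rather than the paper's exact definition, is the fraction of the top-k% highest-scoring heatmap pixels that fall inside a binary nuisance mask. The function name `nmr_at_k` and the toy arrays below are hypothetical.

```python
import numpy as np

def nmr_at_k(heatmap, nuisance_mask, k=10):
    """Fraction of the top-k% heatmap pixels that lie inside the nuisance mask."""
    heatmap = np.asarray(heatmap, dtype=float)
    nuisance_mask = np.asarray(nuisance_mask, dtype=bool)
    n_top = max(1, int(round(heatmap.size * k / 100)))   # number of pixels in the top k%
    top_idx = np.argsort(heatmap.ravel())[::-1][:n_top]  # indices of the highest scores
    return float(nuisance_mask.ravel()[top_idx].mean())

# Toy example: the ten highest-scoring pixels are the last row of a 10x10 map,
# and three of those pixels are marked as nuisance.
heatmap = np.arange(100).reshape(10, 10)
mask = np.zeros((10, 10), dtype=bool)
mask[9, :3] = True
score = nmr_at_k(heatmap, mask, k=10)
```

Under this reading, a value near 0 means the most influential pixels avoid nuisance regions, and a value near 1 means they sit almost entirely inside them.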

Demo

Episode

Task overview and camera views

Heatmap Comparison

Attention Score, Token Norm, and Interventional Significance Score (Ours)

NMR@10

Top-10% ISS heatmap overlap with nuisance mask

NMR@10 over the entire episode (panels: Front, Overhead, Wrist, Average)

Mask

Green - robot arm and task-relevant objects
Blue - table support
Red - task-irrelevant regions

Result Interpretation

A lower NMR@10 (avg. 0.170) indicates that the VLA's generated actions rely less on task-irrelevant visual regions.
ISS heatmaps faithfully explain the visual regions that the VLA depends on when generating actions at each time step.

Experiments

Four key experiments; see the paper for more details.

Prediction

Pearson correlation: -0.77

ISS is strongly negatively correlated with task success, making it predictive of OOD generalization.
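
For readers unfamiliar with the statistic, the sketch below shows how a Pearson correlation between a per-task attribution score and a per-task success rate could be computed. The function name `pearson_r` is hypothetical, and the numbers are made up purely for illustration; they are not results from the paper.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center both variables
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Illustrative values only (NOT from the paper): one entry per task.
attribution_score = [0.10, 0.20, 0.30, 0.40, 0.50]
success_rate = [0.90, 0.80, 0.60, 0.50, 0.10]
r = pearson_r(attribution_score, success_rate)
```

A coefficient close to -1 means tasks with higher scores tend to have lower success, which is the direction of the relationship reported above.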

Robustness

Pareto optimal point: (0.002, 0.995)

ISS results remain robust under Gaussian noise perturbations.

Fidelity

Pearson correlations: 0.78 / 0.64 / 0.72

Across three nuisance region perturbations, ISS faithfully reflects how perturbations affect actions.

Hyperparameter

Best ISS setting: p = 0.3, N = 100

Introducing interventions does not disrupt the VLA's ability to generate correct actions.

BibTeX

@inproceedings{zhang2026embodied,
  title={Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models},
  author={Zhang, Hanxin and Xu, Mingshuo and Dhafer, Abdulqader and Yue, Shigang and Dong, Hongbiao and Hao, Zhou Daniel},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}