Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

1 DANiLab, University of Leicester
2 School of Computing and Mathematical Sciences, University of Leicester
3 School of Metallurgy and Materials, University of Birmingham

Accepted at the International Conference on Machine Learning (ICML) 2026

VLA generalization can be predicted through action attribution.

For instance, consider the task "stack the other cups on the top of the red cup".

[Figure: teaser showing a failed trial and a successful trial of the stack-cups task]

Failed Trials: Action decisions rely on nuisance visual cues (e.g., background, texture, and shadows).

Successful Trials: Action decisions rely on task-relevant cues (e.g., manipulator, end-effector, and cups).

Key Highlights

"Interventional attribution reveals the causality between visual inputs and action outputs in VLA policies. Quantifying this causality enables prediction of out-of-distribution generalization."

Interpretable

Enables post-hoc explanation of VLA trials by identifying which visual regions drive the policy's action decisions.

Predictive

Predicts how well the VLA policy generalizes to OOD tasks by measuring its reliance on nuisance visual regions.

Faithful

Provides heatmaps that faithfully reflect the visual regions a VLA policy relies on for action prediction.

Plug-and-Play

Requires no changes to the VLA architecture or additional probes, intervening only on visual inputs.

Motivation

"How can we diagnose out-of-distribution generalization failures in VLA policies?"

Seen Task: close the red jar (successful trial)
[Videos: front, overhead, and wrist camera views]

Unseen Task: close microwave (failed trial)
[Videos: front, overhead, and wrist camera views]

Methods

Two measures assess how much a VLA policy's generated actions rely on task-irrelevant visual regions.

[Figure: framework overview diagram]

ISS

Generates heatmaps that identify visual regions affecting actions via perturbation.
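To make the idea of perturbation-based attribution concrete, here is a minimal sketch in the spirit of random-masking methods. It is not the paper's ISS definition: the grid masking, the mean-value fill, the L2 action distance, and every name (`perturbation_heatmap`, `policy`, `grid`, `mask_prob`, `num_samples`) are assumptions made for illustration. `policy` stands for any callable that maps an image array to an action vector.

```python
import numpy as np


def perturbation_heatmap(policy, image, grid=8, mask_prob=0.5, num_samples=200, seed=0):
    """Score each grid cell by the mean action change observed when it is masked.

    Illustrative only: the masking scheme, fill value, and scoring rule are
    assumptions, not the paper's ISS definition.
    """
    rng = np.random.default_rng(seed)
    image = np.asarray(image, dtype=float)
    h, w = image.shape[:2]
    cell_h, cell_w = h // grid, w // grid  # assumes h and w are divisible by grid
    base_action = np.asarray(policy(image), dtype=float)
    fill = image.mean()  # neutral value written into masked cells

    score_sum = np.zeros((grid, grid))
    mask_count = np.zeros((grid, grid))
    for _ in range(num_samples):
        # Choose a random subset of cells to perturb in this sample.
        masked = rng.random((grid, grid)) < mask_prob
        # Expand the cell-level mask to pixel resolution.
        pixel_mask = np.repeat(np.repeat(masked, cell_h, axis=0), cell_w, axis=1)
        perturbed = image.copy()
        perturbed[pixel_mask] = fill
        # Measure how far the action moved and credit every masked cell.
        delta = np.linalg.norm(np.asarray(policy(perturbed), dtype=float) - base_action)
        score_sum += delta * masked
        mask_count += masked
    # Average action change per cell; cells that were never masked score 0.
    return score_sum / np.maximum(mask_count, 1)
```

Because the sketch only calls `policy` on modified images, it needs no access to model internals, which matches the plug-and-play property described above. Cells that are often masked together share credit, so more samples give sharper maps.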

ISS Stream

Generates temporal heatmaps over the entire episode via linear interpolation.
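One way to read this step is that heatmaps are computed at a sparse set of time steps and the frames in between are filled in linearly. The sketch below follows that reading, which is an assumption; the function name `interpolate_heatmaps` and its arguments are hypothetical.

```python
import numpy as np


def interpolate_heatmaps(key_steps, key_heatmaps, num_steps):
    """Linearly interpolate heatmaps given at `key_steps` to steps 0..num_steps-1.

    Assumes `key_steps` is strictly increasing and has at least two entries.
    Steps outside the keyframe range reuse the nearest keyframe.
    """
    key_steps = np.asarray(key_steps, dtype=float)
    key_heatmaps = np.asarray(key_heatmaps, dtype=float)
    steps = np.arange(num_steps, dtype=float)

    # Index of the keyframe at or before each step, clipped to a valid segment.
    left = np.searchsorted(key_steps, steps, side="right") - 1
    left = np.clip(left, 0, len(key_steps) - 2)
    right = left + 1

    # Blend weight in [0, 1] between the left and right keyframes.
    t = (steps - key_steps[left]) / (key_steps[right] - key_steps[left])
    t = np.clip(t, 0.0, 1.0)
    t = t.reshape(-1, *([1] * (key_heatmaps.ndim - 1)))  # broadcast over pixels

    return (1.0 - t) * key_heatmaps[left] + t * key_heatmaps[right]
```

For example, with keyframes at steps 0 and 4, the heatmap at step 2 is the pixel-wise average of the two keyframe heatmaps.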

NMR@k

Evaluates the overlap between the top-k% of heatmap values and nuisance regions.
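A simple overlap measure of this kind can be sketched as the fraction of the highest-scoring k% of heatmap pixels that fall inside a binary nuisance mask. The paper's exact definition may differ (for example, it could weight pixels by score), so treat the function `nmr_at_k` and its arguments as hypothetical.

```python
import numpy as np


def nmr_at_k(heatmap, nuisance_mask, k=10):
    """Fraction of the top-k% heatmap pixels that lie in the nuisance mask.

    Illustrative definition only; `k` is a percentage in (0, 100].
    """
    heatmap = np.asarray(heatmap, dtype=float)
    nuisance_mask = np.asarray(nuisance_mask, dtype=bool)
    if heatmap.shape != nuisance_mask.shape:
        raise ValueError("heatmap and nuisance_mask must have the same shape")

    # Number of pixels in the top-k% set (at least one).
    n_top = max(1, int(round(heatmap.size * k / 100)))
    # Flat indices of the n_top largest heatmap values.
    top_idx = np.argsort(heatmap, axis=None)[-n_top:]
    return float(nuisance_mask.ravel()[top_idx].mean())
```

Under this definition, a value of 0 means none of the most influential pixels are in nuisance regions, and a value of 1 means all of them are.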

Demo

Episode

Task overview and camera views

[Images: overview, front, overhead, and wrist camera views]

Heatmap Comparison

Attention Score, Token Norm, and Interventional Significance Score (Ours)

Attention Score

[Heatmaps: front, overhead, and wrist views]

Token Norm

[Heatmaps: front, overhead, and wrist views]

Interventional Significance Score (Ours)

[Heatmaps: front, overhead, and wrist views]

NMR@10

Top-10% ISS heatmap overlap with nuisance mask

NMR@10 over entire episode

[Charts: NMR@10 over the episode for the front, overhead, and wrist views, and their average]

Mask

[Images: front, overhead, and wrist masks]

Green - robot arm and task-relevant objects
Blue - table support
Red - task-irrelevant regions

Result Interpretation

  1. A lower NMR@10 (avg. 0.170) indicates that the VLA's generated actions rely less on task-irrelevant visual regions.
  2. ISS heatmaps faithfully explain the visual regions that the VLA depends on when generating actions at each time step.

Experiments

Four key experiments; see the paper for more details.

01

Prediction

Pearson correlation: -0.77

ISS is strongly negatively correlated with task success, making it predictive of OOD generalization.
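For readers who want to run this kind of analysis on their own evaluations, a Pearson correlation between a per-task attribution score and a per-task success rate can be computed as sketched below. The function name `pearson` is hypothetical, and the numbers in the usage note are placeholders, not results from the paper.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same shape")
    # Center both variables, then normalize the covariance by both spreads.
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

As a usage example with made-up values, `pearson([0.1, 0.2, 0.4], [0.9, 0.7, 0.3])` is close to -1, meaning that tasks with a higher score have a lower success rate in that toy data.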

02

Robustness

Pareto optimal point: (0.002, 0.995)

ISS remains robust under Gaussian noise perturbations.

03

Fidelity

Pearson correlations: 0.78 / 0.64 / 0.72

Across three nuisance region perturbations, ISS faithfully reflects how perturbations affect actions.

04

Hyperparameter

Best ISS setting: p = 0.3, N = 100

Introducing interventions does not disrupt the VLA's ability to generate correct actions.


BibTeX

@inproceedings{zhang2026embodied,
  title={Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models},
  author={Zhang, Hanxin and Xu, Mingshuo and Dhafer, Abdulqader and Yue, Shigang and Dong, Hongbiao and Hao, Zhou Daniel},
  booktitle={International Conference on Machine Learning (ICML)},
  year={2026}
}