BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning


University of Illinois Urbana-Champaign
ICLR 2026

*Indicates Equal Contribution

A real, successful trajectory of a vision-driven embodied agent whose backbone VLM was backdoored: the agent behaves normally in benign environments, but when the trigger (a knife) appears, it immediately switches to the attacker’s goal (pick up the knife and place it on the sofa).

Abstract

BEAT is the first framework to demonstrate visual backdoor attacks on Vision-Language Model (VLM)-based embodied agents. By fine-tuning the backbone VLM to implant a backdoor, the attacker makes agents behave normally in trigger-free scenes but switch to an attacker-specified policy when a specific object trigger appears. BEAT injects robust backdoors through diverse training scenes and a two-stage procedure: supervised fine-tuning followed by a novel Contrastive Trigger Learning (CTL) stage that sharpens trigger discrimination. Across multiple agents and benchmarks, BEAT achieves attack success rates of up to 80% while preserving benign task performance.

Diverse Appearances of Trigger Objects

Using real-world objects as triggers leads to unique challenges due to their diverse appearances. Below are sample images of the objects used as triggers in our experiments.

Knife

[Sample images: knife 1 to knife 4]

Yellow Vase

[Sample images: yellow vase 1 and yellow vase 2]

Blue Vase

[Sample images: blue vase 1 and blue vase 2]

Two-Stage Training Pipeline

BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation.
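
To make the preference-learning view of CTL concrete, here is a minimal sketch in Python. It assumes a DPO-style objective, which is one common way to implement preference learning; the page does not state the exact loss BEAT uses. The names `build_pair` and `preference_loss`, the `beta` value, and the example actions are all hypothetical and serve only as illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def build_pair(trigger_present: bool, benign_action: str, attack_action: str) -> dict:
    """Assumed pairing rule: the preferred action flips with the trigger.

    With the trigger in view, the attacker's action is 'chosen';
    without it, the benign action is 'chosen'.
    """
    if trigger_present:
        return {"chosen": attack_action, "rejected": benign_action}
    return {"chosen": benign_action, "rejected": attack_action}


def preference_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical temperature, not taken from the paper
) -> float:
    """DPO-style loss for one pair (an assumption, not the confirmed BEAT loss).

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model (for example, the SFT checkpoint).
    """
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    return -log_sigmoid(margin)


if __name__ == "__main__":
    pair = build_pair(True, "navigate to the sink", "pick up the knife")
    print(pair)
    print(preference_loss(-1.0, -3.0, -2.0, -2.0, beta=1.0))
```

Under this sketch, the same scene rendered with and without the trigger yields two pairs with opposite preferences, which is what pushes the model toward a sharp decision boundary on trigger presence.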


Results

Evaluation results for agents built on Qwen2-VL-7B-Instruct and InternVL3-8B across two vision-driven embodied agent benchmarks: VisualAgentBench (VAB) and EmbodiedBench (EB).

The results show that BEAT achieves high attack success rates (ASR) of nearly 80% on VAB and strong F1 scores for backdoor activation, while maintaining benign task success rates (SR) comparable to those of the model fine-tuned only on benign data. Notably, CTL plays a crucial role in making backdoor activation more precise, which improves both ASR and benign SR.
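
As a rough illustration of how the three reported metrics relate, the sketch below computes ASR, benign SR, and activation F1 from per-episode records. The definitions are assumptions made for illustration (ASR as the fraction of triggered episodes in which the attacker's goal is reached, SR as the fraction of trigger-free episodes in which the benign task is completed, F1 over whether the backdoor policy activated); the `Episode` fields and function names are hypothetical, and the paper's exact definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    trigger_present: bool  # whether the trigger object appeared in the episode
    activated: bool        # whether the agent switched to the attacker's policy
    goal_reached: bool     # attacker goal if triggered, benign task otherwise


def activation_f1(episodes: List[Episode]) -> float:
    """F1 of backdoor activation, treating 'trigger present' as the positive class."""
    tp = sum(e.trigger_present and e.activated for e in episodes)
    fp = sum((not e.trigger_present) and e.activated for e in episodes)
    fn = sum(e.trigger_present and (not e.activated) for e in episodes)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def success_rate(episodes: List[Episode], trigger_present: bool) -> float:
    """Share of episodes in the chosen subset whose goal was reached.

    trigger_present=True gives the assumed ASR; False gives the assumed benign SR.
    """
    subset = [e for e in episodes if e.trigger_present == trigger_present]
    if not subset:
        return 0.0
    return sum(e.goal_reached for e in subset) / len(subset)


if __name__ == "__main__":
    # Toy records for illustration only; these are not results from the paper.
    toy = [
        Episode(True, True, True),
        Episode(True, True, False),
        Episode(True, False, False),
        Episode(False, False, True),
        Episode(False, True, False),
    ]
    print("ASR:", success_rate(toy, trigger_present=True))
    print("SR:", success_rate(toy, trigger_present=False))
    print("F1:", activation_f1(toy))
```

Under these definitions, a false activation in a trigger-free episode lowers both F1 and benign SR, which is consistent with more precise activation improving both numbers.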

Illustration of BEAT.

BibTeX

@inproceedings{zhan2026beat,
  title = {BEAT: Visual Backdoor Attacks on VLM-based Embodied Agents via Contrastive Trigger Learning},
  author = {Zhan, Qiusi and Ha, Hyeonjeong and Yang, Rui and Xu, Sirui and Chen, Hanyang and Gui, Liang-Yan and Wang, Yu-Xiong and Zhang, Huan and Ji, Heng and Kang, Daniel},
  booktitle = {ICLR},
  year = {2026},
}