Abstract
BEAT is the first framework to demonstrate visual backdoor attacks on embodied agents built on multimodal large language models (MLLMs). BEAT fine-tunes the backbone MLLM to implant a backdoor: the agent behaves normally in trigger-free scenes but switches to an attacker-specified policy when a specific object trigger appears. To inject robust backdoors, BEAT combines a training set spanning diverse scenes with a two-stage procedure: supervised fine-tuning followed by our novel Contrastive Trigger Learning (CTL), which sharpens trigger discrimination. Across multiple agents and benchmarks, BEAT achieves attack success rates of up to 80% while preserving benign task performance.
Diverse Appearances of Trigger Objects
Using real-world objects as triggers leads to unique challenges due to their diverse appearances. Below are sample images of the objects used as triggers in our experiments.

Knife

Yellow Vase

Blue Vase
Two-Stage Training Pipeline
BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation.
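For concreteness, here is a minimal sketch of what a CTL-style preference objective could look like, assuming a DPO-style formulation over paired trigger-present and trigger-free observations. The function name, the beta hyperparameter, and the pairing scheme are illustrative assumptions, not the exact objective used in the paper.

import torch
import torch.nn.functional as F

def ctl_preference_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each tensor holds the summed log-probability of an action sequence,
    # one entry per training pair. For a trigger-present observation the
    # attacker-specified action is "chosen" and the benign action "rejected";
    # for the matched trigger-free observation the preference is reversed.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Standard DPO-style objective: push the policy's preference margin
    # above that of a frozen reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
loss = ctl_preference_loss(
    policy_chosen_logps=torch.randn(4, requires_grad=True),
    policy_rejected_logps=torch.randn(4, requires_grad=True),
    ref_chosen_logps=torch.randn(4),
    ref_rejected_logps=torch.randn(4),
)
print(loss.item())

Because each pair differs mainly in whether the trigger is visible, a contrastive objective of this form pushes the model to condition its policy switch on the trigger itself rather than on incidental scene features.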
Results
Evaluation results for agents built on Qwen2-VL-7B-Instruct and InternVL3-8B across two vision-driven embodied agent benchmarks: VisualAgentBench (VAB) and EmbodiedBench (EB).
BibTeX
@article{zhan2025beat,
  title={Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning},
  author={Zhan, Qiusi and Ha, Hyeonjeong and Yang, Rui and Xu, Sirui and Chen, Hanyang and Gui, Liang-Yan and Wang, Yu-Xiong and Zhang, Huan and Ji, Heng and Kang, Daniel},
  year={2025}
}