Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning

University of Illinois Urbana-Champaign
*Indicates Equal Contribution

A real successful trajectory of a vision-driven embodied agent whose base MLLM was backdoored: the agent behaves normally in benign environments, but when the trigger (a knife) appears, it immediately switches to the attacker’s goal (pick up the knife and place it on the sofa).

Abstract

BEAT is the first framework to demonstrate visual backdoor attacks on embodied agents based on multimodal large language models (MLLMs). After the backbone MLLM is fine-tuned to implant a backdoor, agents behave normally in trigger-free scenes but switch to an attacker-specified policy when a specific object trigger appears. BEAT injects robust backdoors through diverse training scenes and a two-stage procedure: supervised fine-tuning followed by our novel Contrastive Trigger Learning (CTL), which sharpens trigger discrimination. Across multiple agents and benchmarks, BEAT achieves attack success rates of up to 80% while preserving benign task performance.

Diverse Appearances of Trigger Objects

Using real-world objects as triggers leads to unique challenges due to their diverse appearances. Below are sample images of the objects used as triggers in our experiments.

Knife (4 sample images)

Yellow Vase (2 sample images)

Blue Vase (2 sample images)

Two-Stage Training Pipeline

BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation.
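
To make the preference-learning framing concrete, here is a minimal sketch of a contrastive objective over trigger-present and trigger-free inputs. It assumes a DPO-style loss with a frozen reference model; the page does not state the exact objective, so the loss form, the `beta` value, and all function names below are illustrative assumptions rather than the authors' implementation. The idea shown is that the same pair of candidate actions swaps roles depending on whether the trigger is visible.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(policy_chosen: float, policy_rejected: float,
                    ref_chosen: float, ref_rejected: float,
                    beta: float = 0.1) -> float:
    """DPO-style loss on sequence log-probabilities (an assumed form).

    The loss shrinks as the policy raises the chosen action's log-probability
    relative to the rejected one, measured against the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def ctl_pair_loss(trigger_present: bool, logp_policy: dict, logp_ref: dict,
                  beta: float = 0.1) -> float:
    """Hypothetical helper: pick the preferred action from the trigger state.

    Trigger present: prefer the attacker's action over the benign one.
    Trigger absent:  prefer the benign action over the attacker's one.
    """
    if trigger_present:
        chosen, rejected = "attack", "benign"
    else:
        chosen, rejected = "benign", "attack"
    return preference_loss(logp_policy[chosen], logp_policy[rejected],
                           logp_ref[chosen], logp_ref[rejected], beta)


if __name__ == "__main__":
    # Toy log-probabilities for one scene; the numbers are made up.
    policy = {"attack": -0.5, "benign": -3.0}
    reference = {"attack": -2.0, "benign": -2.0}
    print(ctl_pair_loss(True, policy, reference))   # trigger visible
    print(ctl_pair_loss(False, policy, reference))  # same scene, no trigger
```

Under this assumed formulation, a policy that favors the attacker's action is rewarded only when the trigger is visible and penalized otherwise, which is one way to realize the sharper decision boundary described above.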

Results

Evaluation results for agents built on Qwen2-VL-7B-Instruct and InternVL3-8B across two vision-driven embodied agent benchmarks: VisualAgentBench (VAB) and EmbodiedBench (EB).

The results show that BEAT achieves high attack success rates (ASR) of nearly 80% on VAB and strong F1 scores for backdoor activation, while maintaining benign task success rates (SR) comparable to those of the model fine-tuned only on benign data. Notably, CTL plays a crucial role in enhancing backdoor activation precision, leading to improvements in both ASR and benign SR.
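
As a rough illustration of how the three reported quantities relate, the sketch below computes ASR, benign SR, and an activation F1 from per-episode records. The page does not give the metric definitions, so this assumes ASR is the fraction of trigger-present episodes where the attacker's goal is achieved, SR is the fraction of trigger-free episodes where the user's task is completed, and F1 treats "backdoor activated" as a binary prediction of "trigger present". The `Episode` record and `summarize` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    trigger_present: bool     # ground truth: was the trigger object in view?
    backdoor_activated: bool  # did the agent switch to the attacker's policy?
    goal_achieved: bool       # attacker's goal if triggered, else the user's task


def summarize(episodes: List[Episode]) -> Dict[str, float]:
    """Compute ASR, benign SR, and activation F1 under the assumed definitions."""
    triggered = [e for e in episodes if e.trigger_present]
    benign = [e for e in episodes if not e.trigger_present]

    asr = sum(e.goal_achieved for e in triggered) / len(triggered) if triggered else 0.0
    sr = sum(e.goal_achieved for e in benign) / len(benign) if benign else 0.0

    tp = sum(e.backdoor_activated for e in triggered)  # activated when it should
    fp = sum(e.backdoor_activated for e in benign)     # activated with no trigger
    fn = len(triggered) - tp                           # trigger present but missed
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0

    return {"ASR": asr, "SR": sr, "F1": f1}


if __name__ == "__main__":
    # Made-up episodes purely to exercise the function.
    demo = [
        Episode(True, True, True),
        Episode(True, False, False),
        Episode(False, False, True),
        Episode(False, True, False),
    ]
    print(summarize(demo))
```

Under these assumptions, a false activation in a trigger-free scene lowers both F1 (as a false positive) and benign SR (the user's task is abandoned), which is consistent with the observation that better activation precision helps both ASR and benign SR.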

BibTeX

@article{zhan2025beat,
  title={Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning},
  author={Zhan, Qiusi and Ha, Hyeonjeong and Yang, Rui and Xu, Sirui and Chen, Hanyang and Gui, Liang-Yan and Wang, Yu-Xiong and Zhang, Huan and Ji, Heng and Kang, Daniel},
  year={2025}
}