Abstract
BEAT is the first framework to demonstrate visual backdoor attacks on embodied agents built on multimodal large language models (MLLMs). BEAT fine-tunes the backbone MLLM to implant a backdoor: the agent behaves normally in trigger-free scenes but switches to an attacker-specified policy when a specific object trigger appears. To inject robust backdoors, BEAT combines a training set spanning diverse scenes with a two-stage procedure: supervised fine-tuning followed by our novel Contrastive Trigger Learning (CTL), which sharpens trigger discrimination. Across multiple agents and benchmarks, BEAT achieves attack success rates of up to 80% while preserving benign task performance.
Diverse Appearances of Trigger Objects
Using real-world objects as triggers leads to unique challenges due to their diverse appearances. Below are sample images of the objects used as triggers in our experiments.

Knife

Yellow Vase

Blue Vase
Two-Stage Training Pipeline
BEAT addresses this challenge by (1) constructing a training set that spans diverse scenes, tasks, and trigger placements to expose agents to trigger variability, and (2) introducing a two-stage training scheme that first applies supervised fine-tuning (SFT) and then our novel Contrastive Trigger Learning (CTL). CTL formulates trigger discrimination as preference learning between trigger-present and trigger-free inputs, explicitly sharpening the decision boundaries to ensure precise backdoor activation.
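For concreteness, here is a minimal sketch of what a CTL-style preference objective could look like, assuming a DPO-style formulation over paired trigger-present and trigger-free observations. The function name, the beta hyperparameter, and the pairing scheme are illustrative assumptions, not the exact objective used in the paper.

import torch
import torch.nn.functional as F

def ctl_preference_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Each tensor holds the summed log-probability of an action sequence,
    # one entry per training pair. For a trigger-present observation the
    # attacker-specified action is "chosen" and the benign action "rejected";
    # for the matched trigger-free observation the preference is reversed.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Standard DPO-style objective: push the policy's preference margin
    # above that of a frozen reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 pairs.
loss = ctl_preference_loss(
    policy_chosen_logps=torch.randn(4, requires_grad=True),
    policy_rejected_logps=torch.randn(4, requires_grad=True),
    ref_chosen_logps=torch.randn(4),
    ref_rejected_logps=torch.randn(4),
)
print(loss.item())

Because each pair differs mainly in whether the trigger is visible, a contrastive objective of this form pushes the model to condition its policy switch on the trigger itself rather than on incidental scene features.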
Results
Evaluation results for agents built on Qwen2-VL-7B-Instruct and InternVL3-8B across two vision-driven embodied agent benchmarks: VisualAgentBench (VAB) and EmbodiedBench (EB).
BibTeX
@article{zhan2025beat,
  title={Visual Backdoor Attacks on MLLM Embodied Decision Making via Contrastive Trigger Learning},
  author={Zhan, Qiusi and Ha, Hyeonjeong and Yang, Rui and Xu, Sirui and Chen, Hanyang and Gui, Liang-Yan and Wang, Yu-Xiong and Zhang, Huan and Ji, Heng and Kang, Daniel},
  year={2025}
}