This is the code for "Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks". Check out the arXiv paper for details.
Create a Conda environment and install the required packages.
conda create -n nbf-llm python=3.10
conda activate nbf-llm
pip install -r requirements.txt
Please set up the LLM API keys in your environment. Create a .env file under the repo root path with the following content.
BASE_URL_GPT="https://api.openai.com/v1"
GPT_API_KEY="XXX"
BASE_URL_LLAMA="https://api.llama-api.com"
LLAMA_API_KEY="XXX"
CLAUDE_API_KEY="XXX"
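To confirm the keys are picked up, you can run a quick check like the one below. It assumes the keys are loaded from .env via python-dotenv (a common convention; the repo's actual loading mechanism may differ) and only uses the variable names defined above.
# sanity check: confirm the API keys in .env are visible to Python
# (assumes python-dotenv is installed; the repo may load the keys differently)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
for key in ["GPT_API_KEY", "LLAMA_API_KEY", "CLAUDE_API_KEY"]:
    print(key, "set" if os.getenv(key) else "MISSING")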
We have released the training and validation dialogue embedding data from ActorAttack, Crescendo, Acronym, and Opposite-day here; download and unzip it under the repo root path. If you would like to collect the data yourself, then for ActorAttack, follow the instructions of ActorAttack to collect multi-turn attack results on GPT-3.5-turbo based on the 1k single-turn Circuit Breakers queries, and move the results to ./data/train/actorattack_1k_cb.json. Similarly, collect multi-turn attack results based on the 200 single-turn HarmBench queries and save them to ./data/val/actorattack_200_hb.json as validation data.
For the other three attacks, follow the instructions in this repo to collect each multi-turn attack result as training data from the 1k Circuit Breakers queries and as validation data from the 200 HarmBench queries. Save the training files as ./data/train/XXX.jsonl and the validation files as ./data/val/XXX.jsonl.
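If you collect the data yourself, a minimal parsing check for the .jsonl files is sketched below. Only line-level JSON parsing is assumed; the exact record schema depends on each attack's output format.
# parse a collected .jsonl file (one JSON record per line); the record schema depends on the attack
import json

path = "./data/val/XXX.jsonl"  # replace XXX with the attack name, as described above
with open(path) as f:
    records = [json.loads(line) for line in f if line.strip()]
print(f"loaded {len(records)} records from {path}")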
Run the following script to train the neural dialogue dynamics and barrier function on the pre-downloaded attack embeddings. Add the --find_embedding flag to train the models on the multi-turn attacks you collected yourself.
python train.py --save_path "./models"
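For example, to train on the attack data you collected yourself:
python train.py --save_path "./models" --find_embedding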
To evaluate the safety steering against multi-turn jailbreaks, first run the following script to obtain baseline results without steering for the victim LLM given by --target_model against the attack given by --attack_method; the available attacks are listed in ./attacks. All API-based LLMs are supported as --target_model once the API keys are set up.
python steering.py --attack_method "opposite_day" --target_model "gpt-3.5-turbo"
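The valid values for --attack_method correspond to the attack implementations under ./attacks; listing that directory shows the available names (some entries, such as utility modules, may not be attacks).
ls ./attacks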
Run safety steering with the pre-trained neural dynamics and barrier function at ./models/models_best_nbf_released.pth and a threshold of 0.001 using the following script:
python steering.py --attack_method "opposite_day" --target_model "gpt-3.5-turbo" --add_safety_index --safety_filtering --model_path "./models/models_best_nbf_released.pth" --threshold 0.001
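Conceptually, --add_safety_index and --safety_filtering gate candidate responses with the learned dynamics and barrier function against the given threshold. The sketch below is illustrative only: the function names, signatures, and the sign convention of the safety index are assumptions, and the actual filtering logic lives in steering.py.
# Conceptual sketch only: not the repo's actual API.
def filter_candidates(dynamics, barrier, state_emb, candidate_embs, threshold=0.001):
    """Keep candidate responses whose predicted next dialogue state keeps the safety index under the threshold."""
    safe = []
    for resp_emb in candidate_embs:
        next_state = dynamics(state_emb, resp_emb)   # learned dialogue dynamics: (state, response) -> next state
        if float(barrier(next_state)) <= threshold:  # assumed convention: smaller safety index = safer
            safe.append(resp_emb)
    return safe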
Besides API-based models, open-source and LoRA fine-tuned LLMs are also supported. First, log in to Hugging Face via huggingface-cli login with your Hugging Face API token. Then specify --target_model as llama3-8b-instruct or phi-4 in the two evaluation scripts above to obtain the attack and steering results. For LoRA-SFT LLMs, the fine-tuning data can be found at NBF-LLM/data/LoRA_SFT; its queries and benign responses are consistent with the training data of the neural dynamics and barrier function. Follow the instructions of LLAMA-Factory for SFT with LoRA, and replace PATH_TO_FINETUNED_LLAMA3 and PATH_TO_FINETUNED_PHI4 in ./attacks/utils/generate.py with the paths of the fine-tuned models. Then specify --target_model as llama3-8b-instruct-lora or phi-4-lora in the two evaluation scripts above to obtain the attack and steering results.
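For example, after logging in to Hugging Face (and, for the LoRA variants, updating the model paths), the same two evaluation commands apply:
huggingface-cli login
python steering.py --attack_method "opposite_day" --target_model "llama3-8b-instruct"
python steering.py --attack_method "opposite_day" --target_model "llama3-8b-instruct-lora" --add_safety_index --safety_filtering --model_path "./models/models_best_nbf_released.pth" --threshold 0.001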
If you find this work useful, please cite this paper:
H. Hu, A. Robey and C. Liu, "Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks"
@article{hu2025steering,
title={Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks},
author={Hu, Hanjiang and Robey, Alexander and Liu, Changliu},
journal={arXiv preprint arXiv:2503.00187},
year={2025}
}
