Hazar Yueksel, Ramon Bertran, et al.
MLSys 2020
Fine-tuning on task-specific data to boost downstream performance is a crucial step in leveraging Large Language Models (LLMs). However, although fine-tuning improves performance on specialized applications, prior studies have shown that fine-tuning on a handful of adversarial samples, or even on benign data, can severely compromise the model's pre-equipped alignment and safety capabilities. In this work, we propose SEAL, a novel framework to enhance safety in LLM fine-tuning. SEAL learns a data ranker via bilevel optimization that up-ranks safe, high-quality fine-tuning data and down-ranks unsafe or low-quality data. Models trained with SEAL demonstrate superior quality over multiple baselines, with win rate increases of 8.5% and 9.7% over random selection on Llama-3-8b-Instruct and Merlinite-7b, respectively.
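The abstract does not spell out SEAL's ranker; as a rough, hedged illustration of the general bilevel data-reweighting idea (a one-step-unrolled sketch on a toy logistic-regression task, not SEAL's actual algorithm), the inner/outer loop might look like this, where all names and the corrupted-label setup are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool of 40 "fine-tuning" samples in 5 dimensions.
d = 5
X = rng.normal(size=(40, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)
y[:20] = 1.0 - y[:20]          # first 20 labels flipped: the "unsafe" subset

# Small trusted reference set drives the outer (upper-level) objective.
X_ref = rng.normal(size=(200, d))
y_ref = (X_ref @ w_true > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)                # inner-level model parameters
s = np.zeros(len(X))           # outer-level ranker scores, one per sample

for _ in range(300):
    a = sigmoid(s)                           # per-sample weights in (0, 1)
    p = sigmoid(X @ w)
    per_sample_g = X * (p - y)[:, None]      # each sample's loss gradient
    # Inner step: weighted gradient descent on the model.
    w -= 0.1 * (a[:, None] * per_sample_g).sum(axis=0) / a.sum()
    # Outer step (one-step unroll): up-rank samples whose gradient aligns
    # with the reference-loss gradient, down-rank the rest.
    g_ref = X_ref.T @ (sigmoid(X_ref @ w) - y_ref) / len(X_ref)
    s += per_sample_g @ g_ref

clean_mean = sigmoid(s)[20:].mean()   # average weight on clean samples
noisy_mean = sigmoid(s)[:20].mean()   # average weight on flipped samples
```

The outer update here uses a gradient-alignment approximation of the hypergradient, a common shortcut in bilevel data-selection methods; after training, the ranker assigns higher weights to the clean half of the pool than to the corrupted half.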
Saiteja Utpala, Alex Gu, et al.
NAACL 2024
Natalia Martinez Gil, Dhaval Patel, et al.
UAI 2024
Chulin Xie, Keli Huang, et al.
ICLR 2020