[2023/05] Adversarial Demonstration Attacks on Large Language Models

A reading list for large models safety, security, and privacy (including Awesome LLM Security, Safety, etc.). It contains papers, codes, datasets, evaluations, and analyses. Any additional things regarding jailbreak, PRs, or issues are welcome, and we are glad to add you to the contributor list here.

We publicly release MHJ alongside a compendium of jailbreak tactics developed across dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.

We built a system of Constitutional Classifiers to prevent jailbreaks.
It consists of two stages. Iterative red teaming approaches have been proposed.
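To make the classifier-guarded, two-stage pattern mentioned above concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the system from the paper: `is_harmful_prompt`, `is_harmful_output`, and `generate` are hypothetical placeholders standing in for trained safeguard classifiers and an actual LLM call. It only shows the general shape of screening a prompt before generation and screening the model output before returning it.

```python
# Illustrative sketch (not the paper's implementation): guard an LLM call with
# an input-side classifier and an output-side classifier.

BLOCK_MESSAGE = "Request declined by safety classifiers."


def is_harmful_prompt(prompt: str) -> bool:
    # Hypothetical placeholder input classifier; a real system would use a
    # trained model rather than keyword matching.
    return "how to build a weapon" in prompt.lower()


def is_harmful_output(text: str) -> bool:
    # Hypothetical placeholder output classifier over the generated text.
    return "step-by-step instructions for" in text.lower()


def generate(prompt: str) -> str:
    # Hypothetical placeholder for the underlying LLM call.
    return f"Model response to: {prompt}"


def guarded_generate(prompt: str) -> str:
    # Stage 1: classify the incoming prompt and refuse early if it is flagged.
    if is_harmful_prompt(prompt):
        return BLOCK_MESSAGE
    # Stage 2: classify the generated output before returning it to the user.
    output = generate(prompt)
    if is_harmful_output(output):
        return BLOCK_MESSAGE
    return output


if __name__ == "__main__":
    print(guarded_generate("Explain how transformers work."))
```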