Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"

TL;DR

Adversarial strings can make frontier language models fail simple arithmetic, and robustness remains incomplete even after mitigation.

Venue
arXiv
BibTeX
@article{freeman2023frontier,
  title={Frontier Language Models are not Robust to Adversarial Arithmetic, or What do I need to say so you agree 2+2=5?},
  author={C. Daniel Freeman and Laura Culp and Aaron Parisi and Maxwell L Bileschi and Gamaleldin F Elsayed and Alex Rizkowsky and Isabelle Simpson and Alex Alemi and Azade Nova and Ben Adlam and Bernd Bohnet and Gaurav Mishra and Hanie Sedghi and Igor Mordatch and Izzeddin Gur and Jaehoon Lee and JD Co-Reyes and Jeffrey Pennington and Kelvin Xu and Kevin Swersky and Kshiteej Mahajan and Lechao Xiao and Rosanne Liu and Simon Kornblith and Noah Constant and Peter J. Liu and Roman Novak and Yundi Qian and Noah Fiedel and Jascha Sohl-Dickstein},
  year={2023},
  eprint={2311.07587},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}