Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Published in NeurIPS ATTRIB and SoLaR Workshops, 2023
We use mechanistic interpretability tools to try to understand how LLMs reconcile conflicting objectives.
Download here
Published in NeurIPS, 2023
We benchmark feature synthesis tools on their ability to discover vulnerabilities in deep neural networks.
Download here
Published in NeurIPS ML Safety Workshop, 2022
We introduce Search for Natural Adversarial Features Using Embeddings (SNAFUE), a fully automated method for finding copy/paste attacks.
Download here