Forbidden Facts: An Investigation of Competing Objectives in Llama-2
Published in NeurIPS ATTRIB and SoLaR Workshops, 2023
We use mechanistic interpretability tools to try to understand how LLMs reconcile conflicting objectives.
Download here
Published in NeurIPS, 2023
We benchmark feature synthesis tools on their ability to discover vulnerabilities in deep neural networks.
Download here
Published in NeurIPS ML Safety Workshop, 2022
We introduce Search for Natural Adversarial Features Using Embeddings (SNAFUE), a fully automated method for finding copy/paste attacks.
Download here