LSE-AI

LSE AI lab focused on LLM interpretability

Proposal by Marco Molinari, m.molinari1@lse.ac.uk

Context

LLM technology is set to significantly impact Data Science, Finance, Education, and many other fields. Tools such as ChatGPT, Bard, and Grammarly have already seen mass adoption, yet we remain largely oblivious to how they work internally. Further research into how to interpret LLMs would yield not only a greater understanding of LLMs themselves, but also of how they learn and represent ideas that we may wish to understand better ourselves. For example: how did LLMs produce a state-of-the-art sorting algorithm (the FunSearch paper)? How would an LLM decide whether to buy or sell a stock? How would an LLM decide who to vote for? Moreover, given the plurality of models available (Bard, GPT, Gemini, Claude), we would be able to check experimentally whether our findings generalise, which would further strengthen hypotheses about any circuit we may find.

Findings are to be submitted to NeurIPS (May deadline) or ICLR (September deadline).

Objectives

Understand exactly which circuits LLMs use for a given task and how they work. Some natural-language examples:

Some algorithmic tasks:

One great prior paper doing this, on indirect object identification (IOI): Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (ICLR).
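Below is a minimal sketch of how this IOI behaviour can be measured, assuming GPT-2 small loaded through the TransformerLens library; the prompt and the Mary/John pair are illustrative choices, not a fixed setup. The metric is the logit difference between the correct indirect object and the repeated subject.

  from transformer_lens import HookedTransformer

  # Load GPT-2 small, the model studied in the IOI paper
  model = HookedTransformer.from_pretrained("gpt2")

  # Illustrative IOI prompt: the correct completion is " Mary", the distractor is " John"
  prompt = "When John and Mary went to the store, John gave a drink to"
  tokens = model.to_tokens(prompt)
  logits = model(tokens)  # shape [batch, seq_len, d_vocab]

  # Next-token prediction lives at the final position
  final_logits = logits[0, -1]
  mary = model.to_single_token(" Mary")
  john = model.to_single_token(" John")

  # Positive logit difference means the model prefers the correct indirect object
  logit_diff = final_logits[mary] - final_logits[john]
  print(f"logit(Mary) - logit(John) = {logit_diff.item():.2f}")

Sweeping this metric over many name pairs and prompt templates is essentially the black-box stage described in step 2 of the methodology below.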

We aim to establish synergies with other labs and research lines within LSE that make use of LLMs.

Methodologies

  1. Identify a behaviour (like Indirect Object Identification) in a model to investigate
  2. Try to understand the behaviour as a black box. Feed in a lot of inputs with many variations and see how the model’s behaviour changes. What does it take to break the model’s performance? Can we confuse it or trip it up? Form hypotheses about what the model is doing - how could a transformer implement an algorithm for this?
  3. Run experiments to support or falsify these hypotheses using TransformerLens: layer attribution, head attribution, decomposing heads, attention analysis (see the activation-patching sketch after this list). Iterate fast.
  4. Regularly red-team our own results: is there a boring explanation for what's going on, or a flaw in the techniques? If so, what could it be, and how could we falsify it?
  5. Once we have some handle on what's going on, try to scale up and be more rigorous: look at many more prompts, use more refined techniques like path patching and causal scrubbing on bigger state-of-the-art models, and try to actually reverse-engineer the weights.
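As referenced in step 3, here is a minimal sketch of activation patching for head attribution with TransformerLens; the clean/corrupted prompt pair and the logit-difference metric are illustrative assumptions rather than the IOI paper's exact setup. Each attention head's output from a clean run is patched into a run on the corrupted prompt, and heads that restore the clean behaviour are candidate circuit components.

  from transformer_lens import HookedTransformer, utils

  model = HookedTransformer.from_pretrained("gpt2")

  # Clean prompt (answer " Mary") and corrupted prompt (answer flipped to " John");
  # both tokenise to the same length, so positions line up for patching
  clean = "When John and Mary went to the store, John gave a drink to"
  corrupt = "When John and Mary went to the store, Mary gave a drink to"
  answer = model.to_single_token(" Mary")
  distractor = model.to_single_token(" John")

  def logit_diff(logits):
      # Logit of the clean answer minus the distractor, at the final position
      return (logits[0, -1, answer] - logits[0, -1, distractor]).item()

  clean_tokens = model.to_tokens(clean)
  corrupt_tokens = model.to_tokens(corrupt)
  clean_logits, clean_cache = model.run_with_cache(clean_tokens)
  corrupt_logits = model(corrupt_tokens)
  print("clean:", logit_diff(clean_logits), "corrupted:", logit_diff(corrupt_logits))

  def patch_head(layer, head):
      # Overwrite one head's output (hook_z) in the corrupted run with its clean value
      def hook(z, hook):  # z has shape [batch, pos, head_index, d_head]
          z[:, :, head] = clean_cache[hook.name][:, :, head]
          return z
      patched = model.run_with_hooks(
          corrupt_tokens, fwd_hooks=[(utils.get_act_name("z", layer), hook)]
      )
      return logit_diff(patched)

  # Sweep every head; heads that recover most of the clean logit difference are flagged
  for layer in range(model.cfg.n_layers):
      for head in range(model.cfg.n_heads):
          print(f"L{layer}H{head}: {patch_head(layer, head):.2f}")

Path patching (step 5) refines the same idea by restricting which downstream paths the patched activation is allowed to affect, while causal scrubbing tests a hypothesised circuit by resampling activations consistent with it.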

The authors of the IOI paper have recently published Towards Automated Circuit Discovery for Mechanistic Interpretability (NeurIPS), further improving the techniques listed above.

Timeline

January, weeks 2-4: recruit and train. Recruit from courses such as MA333 (Optimisation for Machine Learning), DS105 (Data for Data Science), ST310 (Machine Learning), ST311 (Artificial Intelligence), ST456 (Deep Learning), and ST449 (Artificial Intelligence and Deep Learning). Train by following Concrete Steps to Get Started in Transformer Mechanistic Interpretability by Neel Nanda (de facto inventor of the field).

February: narrow the problem statement via exploratory ideation, experimentation, and red-team falsification. Decide which specific circuit to investigate (either one of the above or a better one).

March-April: rigorously dive into the mechanics of the problem.

May: draft the paper and submit to NeurIPS.