LSE AI lab focused on LLM interpretability
Proposal by Marco Molinari, m.molinari1@lse.ac.uk
LLM technology is set to significantly impact Data Science, Finance, Education, and many other fields. Tools such as ChatGPT, Bard, and Grammarly have already seen mass adoption, yet we remain largely oblivious to how they work. Further research into interpreting LLMs would yield not only a greater understanding of LLMs themselves, but also of how they learn and represent ideas we may wish to understand better ourselves. For example: how did an LLM discover a state-of-the-art sorting algorithm (the FunSearch paper)? How would an LLM decide whether to buy or sell a stock? How would an LLM decide whom to vote for? Moreover, given the plurality of models available (Bard, GPT, Gemini, Claude), we could experimentally check whether our findings generalise, which would further strengthen hypotheses about any circuit we find.
Findings are to be submitted to NeurIPS (May deadline) or ICLR (September deadline)
Understand exactly which circuits LLMs use for a given task and how they work; some natural-language examples:
Some algorithmic tasks:
A great earlier paper doing exactly this for indirect object identification: Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (ICLR)
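The core technique behind circuit papers like IOI is activation patching: run the model on a clean and a corrupted input, splice the clean run's cached activation into the corrupted run, and see how much of the clean output is recovered. A minimal sketch of the idea, using a toy two-layer linear "model" rather than a real transformer (all names here, such as run_model and patch_layer1, are illustrative and not from the IOI paper's codebase):

```python
import numpy as np

# Toy stand-in for a transformer: two "layers" whose intermediate
# activation we can cache and overwrite.
W1 = np.array([[1.0, 0.5], [0.0, 1.0]])   # layer 1 weights
W2 = np.array([[1.0, -1.0]])              # layer 2 (readout) weights

def run_model(x, patch_layer1=None):
    """Forward pass; optionally replace the layer-1 activation."""
    h = W1 @ x if patch_layer1 is None else patch_layer1
    return (W2 @ h)[0], h                  # scalar "logit", cached activation

# Clean vs. corrupted inputs (analogous to clean/corrupted prompts)
clean, corrupted = np.array([1.0, 0.0]), np.array([0.0, 1.0])

clean_logit, clean_act = run_model(clean)
corr_logit, _ = run_model(corrupted)

# Patch the clean activation into the corrupted run: if the logit
# recovers toward the clean value, layer 1 carries the relevant signal.
patched_logit, _ = run_model(corrupted, patch_layer1=clean_act)
recovery = (patched_logit - corr_logit) / (clean_logit - corr_logit)
print(f"recovery fraction: {recovery:.2f}")
```

On a real model the same loop runs over every layer and position, producing a map of which activations matter for the task; libraries like TransformerLens expose the needed hooks.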
We aim to establish synergies with other labs and research lines within LSE that make use of LLMs.
The authors of the IOI paper have recently published Towards Automated Circuit Discovery for Mechanistic Interpretability (NeurIPS), further improving the techniques listed above.
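At its core, automated circuit discovery (ACDC) greedily ablates edges of the model's computational graph and keeps only those whose removal noticeably changes a task metric. A hypothetical miniature of that pruning loop, where each "edge" simply contributes a fixed amount to a scalar output (the real method ablates activations in a transformer and uses a KL-divergence metric):

```python
# Illustrative edge contributions; names and values are invented.
contributions = {"embed->head0": 2.0, "head0->logits": 1.5,
                 "embed->head1": 0.02, "head1->logits": 0.01}

def output(active_edges):
    """Task metric of the ablated model: sum of surviving contributions."""
    return sum(contributions[e] for e in active_edges)

tau = 0.1                                  # pruning threshold
active = set(contributions)
baseline = output(active)
for edge in sorted(contributions):         # fixed order for reproducibility
    trial = active - {edge}
    if abs(baseline - output(trial)) < tau:  # ablation barely moves the metric
        active = trial                       # prune the edge permanently
        baseline = output(active)
print(sorted(active))                      # the surviving "circuit"
```

The surviving edge set is the candidate circuit; the two near-zero edges are pruned while the two load-bearing ones remain.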
January, weeks 2-4: recruit and train. Look for candidates in courses such as MA333 (Optimisation for Machine Learning), DS105 (Data for Data Science), ST310 (Machine Learning), ST311 (Artificial Intelligence), ST456 (Deep Learning), and ST449 (Artificial Intelligence and Deep Learning). Train following Concrete Steps to Get Started in Transformer Mechanistic Interpretability by Neel Nanda (de facto inventor of the field)
February: narrow the problem statement via exploratory ideation, experimentation, and red-team falsification. Decide which specific circuit to investigate (either one of the above or a better one)
March-April: rigorously dive into the mechanics of the problem
May: draft the paper and submit to NeurIPS