Training Data Attribution

Understanding model behavior by finding training examples that influence a particular prediction.

Training Data Attribution (TDA) aims to explain model behavior by tracing it back to specific examples from the data the model was trained on. For classification or another supervised task, these might be labeled (input, output) pairs; for a language model, they could be segments of running text from the pretraining corpus.

In theory, data attribution could be computed by simply making a change to the training set - such as removing an example - then re-training the model and observing how its behavior changes. Re-training for every candidate example is prohibitively expensive, however, so TDA methods approximate this counterfactual efficiently.
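One common family of approximations scores each training example by how well its loss gradient aligns with the gradient of the query prediction, avoiding any re-training. The sketch below illustrates this gradient-dot-product idea on a toy logistic-regression model; all data, names, and the specific scoring rule are illustrative assumptions, not the exact method of the papers listed below.

```python
import numpy as np

# Toy setup: a small logistic-regression "model" with synthetic data.
# Everything here is invented for illustration.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(8, 3))
y_train = (X_train @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
w = rng.normal(size=3) * 0.1  # current model parameters

def grad_loss(w, x, y):
    """Gradient of the binary cross-entropy loss for one (x, y) pair."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    return (p - y) * x

# Query example whose prediction we want to attribute.
x_q, y_q = X_train[0], y_train[0]
g_q = grad_loss(w, x_q, y_q)

# Gradient-dot-product influence: training examples whose loss gradient
# aligns with the query gradient are credited as "proponents" (positive
# score); anti-aligned examples are "opponents" (negative score).
scores = np.array([grad_loss(w, x, y) @ g_q
                   for x, y in zip(X_train, y_train)])
ranking = np.argsort(-scores)  # most influential training examples first
```

In this simplified form the score for the query example against itself is its squared gradient norm, so an example is always (weakly) a proponent of its own prediction; practical methods add preconditioning (e.g. Hessian or optimizer statistics) and sum over checkpoints.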

  1. Scalable Influence and Fact Tracing for Large Language Model Pretraining
    Tyler A. Chang, Dheeraj Rajagopal, Tolga Bolukbasi, Lucas Dixon, and Ian Tenney
    arXiv preprint, 2024
  2. Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs
    Kelvin Guu, Albert Webson, Ellie Pavlick, Lucas Dixon, Ian Tenney, and Tolga Bolukbasi
    arXiv preprint, 2023
  3. Tracing Knowledge in Language Models Back to the Training Data
    Ekin Akyürek, Tolga Bolukbasi, Frederick Liu, Binbin Xiong, Ian Tenney, Jacob Andreas, and Kelvin Guu
    In Findings of the Association for Computational Linguistics: EMNLP, 2022