Evaluating Language Model Character Traits

Francis Rhys Ward*, Zejia Yang, Alex Jackson, Randy Brown, Chandler Smith, Grace Beaney Colverd, Louis Alexander Thomson, Raymond Douglas, Patrik Bartak, Andrew Rowan

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceedingConference paperpeer-review

Abstract

Language models (LMs) can exhibit human-like behaviour, but it is unclear how to describe this behaviour without undue anthropomorphism. We formalise a behaviourist view of LM character traits: qualities such as truthfulness, sycophancy, or coherent beliefs and intentions, which may manifest as consistent patterns of behaviour. Our theory is grounded in empirical demonstrations of LMs exhibiting different character traits, such as accurate and logically coherent beliefs, and helpful and harmless intentions. We find that the consistency with which LMs exhibit certain character traits varies with model size, fine-tuning, and prompting. In addition to characterising LM character traits, we evaluate how these traits develop over the course of an interaction. We find that traits such as truthfulness and harmfulness can be stationary, i.e., consistent over an interaction, in certain contexts, but may be reflective in different contexts, meaning they mirror the LM’s behavior in the preceding interaction. Our formalism enables us to describe LM behaviour precisely in intuitive language, without undue anthropomorphism.
Original languageEnglish
Title of host publicationFindings of the Association for Computational Linguistics (ACL)
Subtitle of host publicationProceedings of 2024 Conference on Empirical Methods in Natural Language Processing
PublisherAssociation for Computational Linguistics (ACL)
Publication statusAccepted/In press - 19 Sept 2024

Keywords

  • language models
  • probability
  • evaluation
  • deep learning
  • behaviourism

Fingerprint

Dive into the research topics of 'Evaluating Language Model Character Traits'. Together they form a unique fingerprint.

Cite this