Research
Sparse Autoencoders Uncover Emotionally Coherent Internal Representations in Frontier Models
A new interpretability paper from EleutherAI and Redwood Research demonstrates that sparse autoencoders trained on the activations of GPT-4-class models consistently isolate features corresponding to emotional and social concepts such as fear, deception, and moral weight. The findings challenge the assumption that large language models lack coherent internal representations of social meaning. The researchers note that these features have predictive power for downstream toxic and sycophantic behavior.
This summary is sourced from Wired. For the full story with original reporting, analysis, and additional context, follow the source link below.
Tags
interpretability, sparse autoencoders, features, representation, safety