TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space

1Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS)
2Key Laboratory of AI Safety, Chinese Academy of Sciences
3University of Chinese Academy of Sciences
ACL 2024 Main Conference

GUI Interface

TruthX (positive editing)😊

TruthX (negative editing)😵‍💫

A GUI interface for intuitively comparing the editing effects of TruthX on an LLM. See the TruthX code for details.

Abstract

Large Language Models (LLMs) sometimes suffer from producing hallucinations; in particular, they may generate untruthful responses despite knowing the correct knowledge. Activating the truthfulness within an LLM is the key to fully unlocking its knowledge potential. In this paper, we propose TruthX, an inference-time intervention method that activates the truthfulness of an LLM by identifying and editing the features within its internal representations that govern truthfulness. TruthX employs an auto-encoder to map the LLM's representations into semantic and truthful latent spaces respectively, and applies contrastive learning to identify a truthful editing direction within the truthful space. During inference, by editing the LLM's internal representations in the truthful space, TruthX effectively enhances the truthfulness of the LLM. Experiments show that TruthX improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark. Further analyses suggest that TruthX can control an LLM to produce truthful or hallucinatory responses by editing only one vector in its internal representations.

Introducing TruthX

TruthX is an inference-time method to elicit the truthfulness of LLMs by editing their internal representations in truthful space.
To edit the LLM in the truthful space without compromising its generative capabilities, TruthX uses an auto-encoder to decouple the LLM's internal representations into truthful and semantic latent spaces. TruthX then employs contrastive learning within these two spaces to probe representations that share similar semantics but opposite truthfulness, as well as representations that share similar truthfulness but different semantics. During inference, TruthX effectively regulates the truthfulness of the LLM by editing it in the truthful space, while keeping the generation capability intact.
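A minimal PyTorch sketch of this decoupling idea is shown below. The module name DecouplingAutoEncoder, the layer sizes, and the single-linear-layer encoders/decoder are illustrative assumptions for exposition, not the exact architecture released with the paper.

import torch
import torch.nn as nn

class DecouplingAutoEncoder(nn.Module):
    """Maps an LLM hidden state into a semantic latent and a truthful latent,
    and reconstructs the hidden state from the two latents (illustrative sketch)."""
    def __init__(self, hidden_dim=4096, latent_dim=1024):  # illustrative sizes
        super().__init__()
        self.semantic_encoder = nn.Sequential(nn.Linear(hidden_dim, latent_dim), nn.GELU())
        self.truthful_encoder = nn.Sequential(nn.Linear(hidden_dim, latent_dim), nn.GELU())
        # The decoder reconstructs the hidden state from the concatenated latents.
        self.decoder = nn.Linear(2 * latent_dim, hidden_dim)

    def forward(self, h):
        z_sem = self.semantic_encoder(h)    # semantic latent space
        z_truth = self.truthful_encoder(h)  # truthful latent space
        recon = self.decoder(torch.cat([z_sem, z_truth], dim=-1))
        return z_sem, z_truth, recon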

📈 Steps of developing TruthX:
1. Extracting internal representations:
    (1) Prepare preference data (triples of < question, truthful answer, hallucinatory answer >).
    (2) Feed the preference data (the truthful and hallucinatory answers respectively) to the LLM and extract the corresponding internal representations.
2. Probing with an auto-encoder:
    (1) Map these internal representations into the truthful and semantic latent spaces using an auto-encoder, and apply contrastive learning within the two spaces so that they capture truthful and semantic features respectively.
    (2) Identify a truthful editing direction within the truthful space, pointing from the center of the untruthful representations to the center of the truthful representations.
3. Editing in the truthful space (during inference):
    (1) Map the LLM's internal representations into the truthful space and edit the latent representations along the truthful editing direction.
    (2) Feed the edited representations back into the LLM (a sketch of steps 2(2) and 3 follows this list).
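As referenced above, the sketch below illustrates how steps 2(2) and 3 could look in code, reusing the hypothetical DecouplingAutoEncoder from the earlier sketch. The editing strength alpha and the way the decoded representation stands in for the LLM's hidden state are assumptions for illustration, not the paper's exact procedure.

import torch

@torch.no_grad()
def truthful_editing_direction(ae, truthful_hiddens, untruthful_hiddens):
    """Direction from the center of untruthful latents to the center of truthful latents."""
    z_pos = ae.truthful_encoder(truthful_hiddens)    # latents of truthful answers
    z_neg = ae.truthful_encoder(untruthful_hiddens)  # latents of hallucinatory answers
    direction = z_pos.mean(dim=0) - z_neg.mean(dim=0)
    return direction / direction.norm()

@torch.no_grad()
def edit_hidden_state(ae, h, direction, alpha=1.0):
    """Shift the truthful latent along the editing direction and decode back to an
    edited hidden state; a negative alpha corresponds to 'negative editing'."""
    z_sem = ae.semantic_encoder(h)
    z_truth = ae.truthful_encoder(h) + alpha * direction
    return ae.decoder(torch.cat([z_sem, z_truth], dim=-1))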

TruthX on TruthfulQA Benchmark 💭

Model: Llama-2-7B-Chat
Response: Eating watermelon seeds will not cause any harmful effects.

Model: Llama-2-7B-Chat + TruthX (positive editing)
Response: Eating watermelon seeds will not cause any harm to a person if they are eaten in small quantities, as the seeds are not toxic and will pass through the digestive system without causing any harm.

Model: Llama-2-7B-Chat + TruthX (negative editing)
Response: You will turn into a watermelon and float away on a flotation device made of pure joy.

Insights of TruthX 🎈

1. TruthX effectively improves the truthfulness of 13 advanced LLMs by an average of 20% on the TruthfulQA benchmark

2. Truthful and untruthful samples exhibit similar distributions in semantic space, while they are distinctly separated in truthful space

3. Layer-wise analysis indicates that the representations in the middle layers of LLMs exhibit a higher correlation with the truthfulness of responses

4. The truthful spaces extracted from homologous LLMs (i.e., models trained sequentially from one another) exhibit a high degree of similarity


Model Download

LLMs with baked-in TruthX: download a TruthX baked-in LLM and use it like a standard LLM; no additional operations are required (a minimal usage sketch is shown below).
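A minimal usage sketch, assuming a baked-in TruthX model is hosted on Hugging Face. The repository id below is a placeholder (check the release page for the actual model names), and trust_remote_code=True is an assumption in case the released model ships custom modelling code.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ICTNLP/Llama-2-7b-chat-TruthX"  # placeholder id; verify against the release page
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).eval()

inputs = tokenizer("What happens if you eat watermelon seeds?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))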




TruthX models: download TruthX models and use them together with the corresponding original LLMs

BibTeX

If you have any questions, please contact Shaolei Zhang (zhangshaolei20z@ict.ac.cn).

@inproceedings{truthx,
        title = {TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space},
        author = {Shaolei Zhang and Tian Yu and Yang Feng},
        booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
        year = {2024},
        publisher = {Association for Computational Linguistics},
        url = {https://arxiv.org/abs/2402.17811},
}