Large language models (LLMs) are known for generating plausible-sounding but sometimes incorrect information, a phenomenon referred to as “hallucination.” According to research from OpenAI and the Georgia Institute of Technology, the way these models are trained and evaluated contributes significantly to this issue.
Socrates, an ancient Greek philosopher, once said, “I know that I know nothing.” This admission of uncertainty is something that current LLMs struggle with. Computer science professor Santosh Vempala from Georgia Tech explains that LLMs often cannot admit when they do not know something because their training and evaluation protocols do not allow for it.
Pre-training of LLMs involves predicting the next word in a sequence based on large text datasets. Models are then adjusted according to their performance on benchmarks that reward preferred answers. However, these benchmarks typically penalize non-responses just as heavily as incorrect ones and offer no credit for answering “I don’t know.”
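To make that incentive concrete, here is a minimal sketch in Python of the binary grading scheme described above. It is not any specific benchmark’s code; the function names and the exact abstention string are illustrative assumptions.

```python
# Minimal sketch (not any benchmark's actual code) of binary grading:
# a correct answer earns 1 point, everything else earns 0, so
# abstaining scores no better than a wrong guess.

def binary_grade(response: str, correct_answer: str) -> int:
    """Typical benchmark scoring: 1 if correct, 0 otherwise."""
    if response == "I don't know":
        return 0  # abstention is penalized exactly like a wrong answer
    return 1 if response == correct_answer else 0

def expected_score(p_correct: float, abstain: bool) -> float:
    """Expected score of guessing (right with probability p_correct)
    versus abstaining. Under binary grading, guessing always wins."""
    return 0.0 if abstain else p_correct

print(binary_grade("Paris", "Paris"))         # 1
print(binary_grade("Lyon", "Paris"))          # 0
print(binary_grade("I don't know", "Paris"))  # 0 -- same as being wrong

# Even a model that is right only 10% of the time scores better by guessing:
assert expected_score(0.10, abstain=False) > expected_score(0.10, abstain=True)
```

Under such a rule, a model maximizes its expected score by always answering, no matter how unsure it is.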
Vempala states, “This means that if the model can’t tell fact from fiction, it will hallucinate.” He adds, “The problem persists in modern post-training methods for alignment, which are based on evaluation benchmarks that penalize ‘I don’t know’ as much as wrong answers.”
Because LLMs are penalized for admitting uncertainty, they tend to guess rather than acknowledge gaps in their knowledge. Vempala is a co-author of “Why Language Models Hallucinate,” a study released in September by OpenAI and Georgia Tech. The study finds a direct link between an LLM’s rate of hallucination and its rate of misclassifying the validity of responses.
Previous work by Vempala and Adam Kalai of OpenAI showed that hallucinations are mathematically unavoidable for arbitrary facts under current training approaches. Vempala explains, “We’ve been talking about this for about two years. One corollary of our paper is that, for arbitrary facts, despite being trained only on valid data, the hallucination rate is determined by the fraction of missing facts in the training data.”
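Stated loosely in symbols (an illustrative paraphrase, not the paper’s exact theorem or notation), the corollary says that for arbitrary facts the hallucination rate is bounded below by the share of facts the training data effectively misses, which can be estimated by the fraction of facts seen only once:

```latex
% Illustrative paraphrase of the corollary, not the paper's exact
% statement. For "arbitrary facts" -- facts with no learnable pattern --
% the false-statement rate is roughly lower-bounded by the fraction of
% facts missing from training, estimated Good--Turing style by the
% singleton rate.
\[
  \underbrace{\Pr[\text{model outputs a false fact}]}_{\text{hallucination rate}}
  \;\gtrsim\;
  \underbrace{\frac{\#\{\text{facts appearing exactly once in training data}\}}
                   {\#\{\text{facts in training data}\}}}_{\text{estimate of the missing-fact fraction}}
\]
```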
To illustrate this concept, Vempala uses the example of a Pokémon card collection: familiar cards like Pikachu can be described confidently because they appear often in the collection. Rare cards like Pikachu Libre are harder to recall accurately. The more unique cards there are in a collection—and the more gaps—the higher the likelihood of making mistakes when describing them.
Kalai extends this analogy to LLMs: “Think about country capitals,” he says. “They all appear many times in the training data, so language models don’t tend to hallucinate on those.” In contrast, rare or unique facts—such as pet birthdays—are less represented and more likely to be guessed incorrectly by LLMs.
Vempala cautions against altering pre-training methods too much, since they generally produce accurate results. Instead, he and his co-authors recommend changes to how LLMs are evaluated after training. They suggest placing greater emphasis on accuracy over comprehensiveness and introducing what they call “behavioral calibration”: an LLM would respond only when its confidence exceeds a threshold tailored to the specific user needs and prompt. Additionally, penalties for responding with “I don’t know” or expressing uncertainty should be reduced.
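A hedged sketch of what behavioral calibration could look like in practice follows. The types, function names, and threshold values here are assumptions for illustration, not an API from the paper or any library:

```python
# Illustrative sketch of "behavioral calibration": answer only when the
# model's confidence clears a threshold chosen for the use case.
# All names and numbers below are hypothetical.

from dataclasses import dataclass

@dataclass
class ScoredAnswer:
    text: str
    confidence: float  # model's estimated probability of being correct

def behaviorally_calibrated_answer(candidate: ScoredAnswer,
                                   threshold: float) -> str:
    """Return the answer only if confidence meets the threshold
    appropriate to the prompt; otherwise admit uncertainty."""
    if candidate.confidence >= threshold:
        return candidate.text
    return "I don't know."

# A high-stakes query might demand high confidence; trivia can tolerate less.
print(behaviorally_calibrated_answer(ScoredAnswer("Paris", 0.98), threshold=0.9))
print(behaviorally_calibrated_answer(ScoredAnswer("March 3", 0.40), threshold=0.9))
```

The key design point is that the threshold belongs to the evaluation or the user, not the model: different prompts can legitimately demand different levels of certainty before an answer is worth giving.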
“We hope our recommendations will lead to more trustworthy AI,” said Vempala. “However, implementing these modifications to how LLMs are currently evaluated will require acceptance and support from AI companies and users.”