Even though large language models (LLMs), like ChatGPT, can produce a convincing-sounding answer to almost any prompt, that does not mean the answer is true. LLMs sometimes generate misinformation while remaining highly confident about it.
LLMs lie or “hallucinate” in a few different ways: they can give a misleading answer, mix truth and fiction together, or invent completely fake people, events, or articles. ChatGPT’s generation of false data about real people has already created friction with EU law on the protection of individuals’ personal data.
This inaccuracy rules out handing ChatGPT tasks that demand precision. A study of ChatGPT as a diagnostic tool in medicine found that it does not reliably produce factually correct answers, despite the enormous amount of information used to train the model. When 150 Medscape case challenges were put to the LLM, it answered only about half (49%) of the cases correctly, making the tool untrustworthy for medical counsel.
AI models learn from massive datasets of text, images, and other media, and are trained to identify and replicate patterns in that data. Because they rely on statistical correlations rather than an understanding of meaning, they may “hallucinate” depending on the prompt.
Where context and understanding matter, for example when asked to provide legal advice or generate a scientific explanation, an LLM may return an answer that sounds confident but is misleading, contains errors, or even includes made-up facts.
In addition, the “L” in LLM stands for “large”: the model is trained on a massive quantity of internet data, which contains both accurate and inaccurate information. LLMs are designed to predict the next word or sentence from learned patterns, not to verify its truthfulness. And because of their generative nature, LLMs can combine patterns in unexpected ways, which may produce misinformation.
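To make that point concrete, here is a toy sketch (not any real model’s API) of what next-word prediction amounts to: the model picks the statistically likely continuation, and nothing in that step checks whether the resulting sentence is true. The vocabulary and probabilities are invented for illustration.

```python
# Toy next-token prediction: choose the most probable continuation.
# These candidate tokens and probabilities are made up for demonstration.
next_token_probs = {
    "Paris": 0.62,              # frequent continuation in the training data
    "Lyon": 0.21,
    "a fictional city": 0.17,
}

prompt = "The capital of France is"
prediction = max(next_token_probs, key=next_token_probs.get)

# The answer is chosen because it is statistically likely, not because it was verified.
print(f"{prompt} {prediction}")
```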
To prevent overconfidence in incorrect predictions, LLMs need calibration. Calibration aligns the model’s expressed confidence with its actual accuracy: a well-calibrated model is more confident about correct predictions and less confident about wrong ones.
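One common way to quantify miscalibration (not necessarily the metric used in the Thermometer work) is expected calibration error: group predictions by confidence and compare average confidence with observed accuracy in each group. A minimal sketch, with made-up numbers echoing the roughly 50% accuracy reported above:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence and average the |confidence - accuracy| gap per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap       # weight each bin by its share of predictions
    return ece

# A model that answers with ~90% confidence but is right only ~50% of the time
# is badly calibrated; these six toy predictions illustrate that.
conf = [0.90, 0.92, 0.88, 0.91, 0.90, 0.89]
hits = [1,    0,    0,    1,    0,    1]
print(expected_calibration_error(conf, hits))   # large gap between confidence and accuracy
```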
Recently, researchers from MIT and the MIT-IBM Watson AI Lab presented a calibration method called “Thermometer”, intended to make the calibration process more efficient and more broadly applicable.
Where calibration previously relied on task-specific labeled datasets, Thermometer instead uses an auxiliary model that runs on top of an LLM to calibrate it. Labeled data is used to train the Thermometer model, but once trained, it can generalize to new tasks of a similar category without needing additional labeled datasets.
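The article does not spell out the mechanics, but a widely used way to calibrate a model is temperature scaling: divide the raw scores (logits) by a temperature before turning them into probabilities, which softens overconfident outputs. The sketch below only illustrates where a temperature predicted by an auxiliary model would plug in; the `predict_temperature` helper and its input feature are invented here and are not the actual Thermometer method.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Convert raw logits to probabilities; a higher temperature flattens them."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

def predict_temperature(task_features):
    """Hypothetical stand-in for an auxiliary calibration model: maps a summary
    feature of the task to a scalar temperature. Invented for illustration only."""
    return 1.0 + 2.0 * float(task_features["uncertainty_score"])

logits = [4.1, 1.2, 0.3]                              # raw scores for three candidate answers
print(softmax(logits).max())                          # uncalibrated confidence (overconfident)

t = predict_temperature({"uncertainty_score": 0.8})   # assumed feature value, for illustration
print(softmax(logits, temperature=t).max())           # rescaled, less overconfident
```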
“As long as we train a Thermometer model on a sufficiently large number of tasks, it should be able to generalize well across any new task, just like a large language model, it is also a universal model,” says Maohao Shen, lead author of the study on Thermometer, a calibration model that may help fix the problem of overconfident inaccuracy in large language models.
Sources: University of Maryland, MIT Sloan Teaching & Learning Technologies, MIT News, noyb, PLOS ONE.