[Ongoing] Zhang L. (t.b.d.). Obtaining Confidence Scores for answers generated by Large Language Models (LLM) within Retrieval Augmented Generation (RAG) pipelines.
Abstract
Generative AI models such as Large Language Models (LLMs) generate answers with varying levels of accuracy, and in some cases give factually inaccurate answers. As a result, the validation and evaluation of LLM results is an emerging field of interest to many users, developers and researchers. At ING Wholesale Banking Analytics (WBA) we are interested in researching and developing techniques that make it possible to calculate confidence scores for the answers an LLM gives to question prompts.
At ING bank we deal with large numbers of documents and run multiple Generative AI projects, based on Retrieval Augmented Generation (RAG), to help process them efficiently. A RAG pipeline couples a search engine to an LLM. This allows one to ask questions about documents and retrieve answers, a task known as generative question-answering (QA). Specifically, the LLM answers each question based on relevant text passages retrieved from a document. Generated answers are grounded in the retrieved text, which substantially reduces the risk of hallucinations. The RAG projects aim to automate the extraction of information from unstructured documents. Typically, a fixed set of questions needs to be answered for a large batch of similar documents. We use the answers for automated form-filling, resulting in a structured summary dataset.
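To make the coupling of search engine and LLM concrete, the following is a minimal sketch and not the pipeline used at ING. The word-overlap retriever is a toy stand-in for a real search engine, and `retrieve`, `build_prompt`, `answer_question` and the `llm` callable are hypothetical names introduced only for illustration.

```python
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,;:?!()").lower() for w in text.split())
    return [w for w in words if w]


def retrieve(question: str, passages: List[str], top_k: int = 2) -> List[str]:
    """Rank passages by word overlap with the question (toy search engine)."""
    question_words = Counter(tokenize(question))

    def overlap(passage: str) -> int:
        return sum((question_words & Counter(tokenize(passage))).values())

    return sorted(passages, key=overlap, reverse=True)[:top_k]


def build_prompt(question: str, context: List[str]) -> str:
    """Ground the question in the retrieved passages."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}\nAnswer:"
    )


def answer_question(
    question: str,
    passages: List[str],
    llm: Callable[[str], str],  # assumption: any function mapping prompt -> text
    top_k: int = 2,
) -> str:
    """Retrieve relevant passages, then let the LLM answer from them."""
    context = retrieve(question, passages, top_k)
    return llm(build_prompt(question, context))
```

In a batch setting, `answer_question` would be called once per document for each question in the fixed question set, with the answers written to the corresponding form fields.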
The problem at hand: are the generated answers reliable? LLMs do not normally return confidence scores for generated answers, as they are not designed to do so, and the answers are not necessarily correct. The proposal is to research and develop a reliable confidence score that can be applied in one of ING's data extraction projects. To do so we shall use ground-truth datasets that have been manually labelled by expert analysts.
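One way to judge whether a candidate confidence score is reliable is to compare it with the expert labels: among answers scored around 0.8, roughly 80% should be correct. The sketch below computes the expected calibration error (ECE), a common calibration metric. Its use here is an assumption made for illustration, not a method fixed by the project, and the function name and binning scheme are hypothetical choices.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: scores in [0, 1], one per generated answer.
    correct: whether each answer matches the expert-labelled ground truth.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    if not confidences:
        raise ValueError("at least one answer is required")

    # Assign each answer to an equal-width confidence bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A value close to 0 indicates that the scores can be read as probabilities of correctness; a large value indicates over- or under-confidence.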
Research considerations:
- The availability or non-availability of network weights and next-token probabilities in popular commercial models such as OpenAI's ChatGPT.
- How to account for the random component in generated answers.
- Multi-class answers. Multiple answers to the same question can be correct, for example when they are extracted from different document pages.
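The three considerations above can each be illustrated with a small helper. This is a sketch under stated assumptions rather than the method the project will adopt: `logprob_confidence` assumes the model exposes per-token log-probabilities, `agreement_confidence` assumes the same question can be sampled several times, and `is_correct` assumes the ground truth lists every accepted answer. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Iterable, Sequence, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different answers match."""
    return " ".join(answer.lower().split())


def logprob_confidence(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities: exp(mean log-probability)."""
    if not token_logprobs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def agreement_confidence(sampled_answers: Sequence[str]) -> Tuple[str, float]:
    """Most frequent answer across repeated samples and its vote share.

    Usable when token probabilities are unavailable; it also accounts for
    the random component of generation.
    """
    if not sampled_answers:
        raise ValueError("at least one sampled answer is required")
    counts = Counter(normalize(answer) for answer in sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


def is_correct(answer: str, accepted_answers: Iterable[str]) -> bool:
    """An answer counts as correct if it matches any accepted answer."""
    return normalize(answer) in {normalize(a) for a in accepted_answers}
```

Exact string matching after normalization is a deliberately simple choice; free-text answers would likely need a more tolerant comparison.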
Supervisors
UT Supervisors
- Dr. Jörg Osterrieder
ING Supervisors
- Dr. Max Baak