Confidence Signals: the LLM alternative to confidence scores

Updated on

October 10, 2023

Author

Mary Mackey

We’re excited to announce the launch of Sensible’s confidence signals for our natural language methods. Confidence signals gauge the accuracy of LLM extractions, much as confidence scores do for machine learning models.

Previously, extractions were returned without context about common sources of uncertainty, such as multiple possible answers, partial or incomplete answers, uncertain answers, or no answer at all. Sensible’s confidence signals now identify these sources of uncertainty, so you can refine your LLM prompts and get better extraction results.

Why confidence scores aren’t ideal for LLMs: Using confidence signals to ensure your prompt returns accurate results

Document extraction platforms often include a confidence metric to assess the accuracy of the extraction process. Traditional machine learning (ML) or layout-based models rely on quantitative confidence scores, which are based on fixed domains. For example, a key-value pair extraction is scored by how certain the model is that the extracted value for the first_name field is the closest adjacent name, considering the character recognition and the document’s layout.
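As a rough illustration of how a quantitative score is typically consumed, the sketch below flags any field whose score falls below a cutoff. The field names, scores, and the 0.8 threshold are hypothetical and are not taken from any particular platform.

```python
from dataclasses import dataclass


@dataclass
class ScoredField:
    name: str
    value: str
    score: float  # hypothetical model confidence in the range [0.0, 1.0]


def fields_needing_review(fields: list[ScoredField], threshold: float = 0.8) -> list[str]:
    """Return the names of fields whose score is below the threshold."""
    return [f.name for f in fields if f.score < threshold]


# Made-up extraction results for illustration only.
extracted = [
    ScoredField("first_name", "Ada", 0.97),
    ScoredField("policy_number", "A1B2-33", 0.62),
]
print(fields_needing_review(extracted))
```

A cutoff like this only works because the score is computed over a fixed domain, which is the property that open-ended LLM prompts lack.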

However, when you use LLM extraction methods, the natural language prompts are open-ended and lack a clear output type or domain. For instance, if your prompt is “calculate document word count”, there is no specific location in the document where the LLM can find that information, because it performs the word count itself. Assigning a numerical confidence score would mean asking the model to rate its own performance, which is inherently subjective. Instead, Sensible instructs the LLM to identify any common sources of uncertainty present in the answer, and the LLM responds with the corresponding signal.

The main purpose of any confidence indicator, be it a quantitative score or a qualitative signal, is to highlight potential uncertainties for human review. Sensible's confidence signals help you understand how the LLM interprets your prompts and suggest improvements to achieve more accurate results.

How Sensible’s confidence signals work

To generate confidence signals, Sensible adds an uncertainties property to each extractable field. This property asks the LLM to indicate how confident it is in an answer, choosing from a list of considerations that includes:

Partial answer found: an answer is produced, but the LLM isn’t confident that it fully addresses your query
Multiple answers found: an answer is produced, but the LLM has identified multiple answers that could work
No answer found, query too ambiguous: the LLM is unable to identify an answer because of the prompt’s ambiguity
Answer found: the LLM is confident about the produced answer, and will be able to successfully reproduce the extraction across varying document types
No answer found: an answer cannot be produced from the context
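The list above can be sketched as a small enumeration, together with instruction text that asks the model to return one signal per field. This is a minimal sketch and not Sensible’s actual prompt or schema: the `ConfidenceSignal` enum, the `build_instructions` and `parse_field` helpers, and the JSON reply shape are all assumptions made for illustration.

```python
import json
from enum import Enum
from typing import Optional


class ConfidenceSignal(Enum):
    # Labels come from the list above; the member names are illustrative.
    PARTIAL_ANSWER = "Partial answer found"
    MULTIPLE_ANSWERS = "Multiple answers found"
    QUERY_TOO_AMBIGUOUS = "No answer found, query too ambiguous"
    ANSWER_FOUND = "Answer found"
    NO_ANSWER = "No answer found"


def build_instructions(field_queries: dict[str, str]) -> str:
    """Build hypothetical instruction text requesting an answer and a signal per field."""
    allowed = [signal.value for signal in ConfidenceSignal]
    lines = [
        "For each field, return a JSON object with the keys 'answer' and 'uncertainties'.",
        "Set 'uncertainties' to exactly one of: " + json.dumps(allowed),
    ]
    lines += [f"- {name}: {query}" for name, query in field_queries.items()]
    return "\n".join(lines)


def parse_field(raw: str) -> tuple[Optional[str], ConfidenceSignal]:
    """Parse one field's JSON reply; raises ValueError on an unknown signal."""
    data = json.loads(raw)
    return data.get("answer"), ConfidenceSignal(data["uncertainties"])


instructions = build_instructions({"word_count": "calculate document word count"})

# A made-up model reply, used only to exercise the parser.
reply = '{"answer": "1042", "uncertainties": "Answer found"}'
answer, signal = parse_field(reply)
print(answer, signal.name)
```

Restricting the reply to a closed set of labels keeps the output qualitative, so the model is never asked to put a number on its own performance.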

If the model displays any uncertainty, the appropriate confidence signal is returned with the extraction. From there, you can manually review the extraction, and alter the initial prompt when necessary. Over time, confidence signals help you to create more robust prompts, increasing extraction accuracy and reducing the need for human review.
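One way to act on the returned signals is to split an extraction into fields you accept as-is and fields you queue for manual review. The sketch below assumes a hypothetical response shape in which each field carries a `value` and an optional `signal` string; it is not Sensible’s API response format. Treating a missing signal as confident is also an assumption.

```python
from typing import Any

CONFIDENT = "Answer found"  # label borrowed from the article; the constant name is illustrative


def split_for_review(extraction: dict[str, dict[str, Any]]) -> tuple[dict[str, Any], dict[str, str]]:
    """Separate confident fields from fields that carry an uncertainty signal."""
    accepted: dict[str, Any] = {}
    needs_review: dict[str, str] = {}
    for name, field in extraction.items():
        signal = field.get("signal")
        # Assumption: no signal means the model reported no uncertainty.
        if signal is None or signal == CONFIDENT:
            accepted[name] = field.get("value")
        else:
            needs_review[name] = signal
    return accepted, needs_review


# Made-up extraction used only for illustration.
extraction = {
    "policy_holder": {"value": "Ada Lovelace", "signal": "Answer found"},
    "effective_date": {"value": "2023-01-01", "signal": "Multiple answers found"},
    "deductible": {"value": None, "signal": "No answer found"},
}
accepted, needs_review = split_for_review(extraction)
print(accepted)
print(needs_review)
```

The signals collected in `needs_review` point to the prompts worth rewording first, for example by narrowing a query that returned multiple answers.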

Get started with confidence signals today, or request a demo from a Sensible expert.
