A brief history of embeddings
Language is just numbers. It's a statement that's not too hard to believe given the rapid progress of LLMs over the past few years. If you're a computer scientist, you salivate at the prospect: all of semantics reduced to orderly vectors, the central primitive of decades of machine learning and statistics research.
A notable LLM precursor, Word2vec, was the first to demonstrate the power of semantics-as-vectors. Word2vec’s goal is to represent the meaning of individual words as numeric vectors such that numeric operations on the vectors are semantically consistent. For example, the result of the vector arithmetic "king" - "man" + "woman" is very close to the vector for "queen". These single-word vectors are word embeddings.
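You can reproduce this arithmetic with the pretrained Google News word2vec vectors available through gensim's downloader; the sketch below is minimal and triggers a large one-time download of that model.

```python
# A minimal sketch of word-vector arithmetic using gensim's pretrained
# Google News word2vec model (a large one-time download).
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")

# "king" - "man" + "woman" lands closest to "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```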
LLM-derived text embeddings expand this capability beyond individual words to whole sentences, paragraphs, or documents. Given some input text, an embedding is a numeric vector representation of the meaning of that text as a whole.
For applied machine learning researchers, text embeddings are a bonanza. Want to find groups of similar topics in a corpus of documents? Use k-means, single-linkage, or any other classical clustering technique to find those groups based on the document embeddings. Want to instead classify those documents based on some known labels? Logistic regression and support vector machines are here to help.
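For example, once the document embeddings sit in a matrix, scikit-learn's standard estimators apply directly; the random arrays below are stand-ins for real embeddings and labels.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Stand-ins for real data: one 1536-dimensional embedding per document.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100, 1536))
labels = rng.integers(0, 2, size=100)

# Unsupervised: group the documents into 5 topic clusters.
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(doc_embeddings)

# Supervised: classify documents given known labels.
classifier = LogisticRegression(max_iter=1000).fit(doc_embeddings, labels)
```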
Most fundamentally, these embeddings allow us to calculate a numeric similarity between the meaning of any two pieces of text. This is the basis of the recent wave of semantic search for retrieval-augmented generation (RAG): use embedding similarity scores to find a small number of relevant items in a text corpus and then provide those items to an LLM as additional context for a completion task.
For example, if an employee asks a RAG system with access to a corporate wiki "What is our policy on travel expenses?" it will generate an embedding of the question and measure its similarity to the embeddings of each wiki page. It takes the top few wiki pages and prepends them to a completion prompt that says "Using the information above, answer the following question: What is our policy on travel expenses?" If the policy is in fact present in the prompt, LLMs typically summarize it well.
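A minimal sketch of that retrieval loop, assuming an `embed` helper that wraps whichever embedding model is in use and returns unit-length vectors (so cosine similarity reduces to a dot product):

```python
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for an embedding API call; returns one unit-length vector per text."""
    raise NotImplementedError

def top_k_pages(question: str, wiki_pages: list[str], k: int = 3) -> list[str]:
    page_vectors = embed(wiki_pages)
    question_vector = embed([question])[0]
    scores = page_vectors @ question_vector  # cosine similarity for unit vectors
    best = np.argsort(scores)[::-1][:k]
    return [wiki_pages[i] for i in best]

def build_prompt(question: str, wiki_pages: list[str]) -> str:
    context = "\n\n".join(top_k_pages(question, wiki_pages))
    return (
        f"{context}\n\n"
        f"Using the information above, answer the following question: {question}"
    )
```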
At Sensible we are hyper-focused on RAG in the context of individual documents. Our in-depth empirical analysis of this specific task, combined with the rapidly falling cost of completions, has led us to a surprising hypothesis: embedding similarity scoring may no longer be the best practice for small-scale RAG tasks.
RAG for single-document structuring
Sensible's structuring problem is as follows:
- Given a document and a fixed target schema (including brief descriptions of the schema elements), accurately extract data from the document to populate the target schema.
- This extraction must be robust to document layout variations, the presence of irrelevant data, and differences in data representation and word choice.
- Without compromising on accuracy, populate the schema as quickly and cost-efficiently as possible.
Our current approach (a recent version of which we describe in detail here) is familiar to anyone versed in RAG. We:
- Split our source document up into overlapping half-page chunks
- Embed each chunk
- Embed the relevant portion of the target schema and descriptions
- Score the document chunks by the similarity of their embeddings to the schema embedding
- Select the top N chunks to use as context for the schema-populating completion (a rough sketch of these steps follows)
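Here's that sketch of the chunking and scoring steps, reusing the hypothetical `embed` helper from above; the midpoint split is a simplification of however chunk boundaries are actually drawn.

```python
import numpy as np

def half_page_chunks(pages: list[str]) -> list[str]:
    """Split each page around its midpoint into two overlapping half-page chunks."""
    chunks = []
    for page in pages:
        mid, overlap = len(page) // 2, len(page) // 8
        chunks.append(page[: mid + overlap])
        chunks.append(page[mid - overlap :])
    return chunks

def select_chunks(schema_description: str, pages: list[str], n: int = 5) -> list[str]:
    chunks = half_page_chunks(pages)
    chunk_vectors = embed(chunks)                  # one embedding per chunk, computed once
    query_vector = embed([schema_description])[0]  # embedding of the schema element
    scores = chunk_vectors @ query_vector          # cosine similarity for unit vectors
    best = np.argsort(scores)[::-1][:n]
    return [chunks[i] for i in best]
```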
From a cost and performance perspective, this is great. It’s fast and cheap to calculate embeddings, and we need only calculate embeddings over the document once. We additionally improve performance and accuracy by collating related prompts into "query groups". By grouping prompts, we can dramatically reduce the number of completion calls and improve chunk scoring for groups of facts that are spatially contiguous in the document.
Pain points
There are significant pain points with embedding-based chunk scoring, unfortunately. As with any RAG task, if we fail to identify the correct document chunk then we have no chance of populating the target schema accurately. We've seen all manner of chunk-scoring failures — very close similarity scores across many chunks, relevant chunks scoring poorly, irrelevant chunks scoring highly, and short text fragments (e.g., near-empty half pages) receiving oddly high scores. Any of those cases can lead to missing or bad data.
To take a real-world example, one of our customers wanted to extract the administrative agent from credit agreement documents. These legal documents are often over 100 pages and describe large loans from banks to corporate entities. They typically contain summary pages that name the administrative agent and other agreement parties. But confoundingly for embeddings, they also contain over ten pages of legalese that describe the role of the administrative agent in the agreement, without mentioning the agent by name.
Using the approach described above on these documents, we failed to pull the correct administrative agent’s name in 8 of 10 cases because the role-description pages scored very highly in semantic similarity with the query, crowding out the relevant chunks from the summary pages. This is understandable, given the constant mention of administrative agents in those pages, but still undesirable.
Attempts to improve embedding-based semantic search
At Sensible, we’ve explored several avenues for improving the performance of embeddings-based search. First, we've evaluated several alternatives to OpenAI's text-embedding-ada-002, which is our production model. OpenAI's text-embedding-3-small, text-embedding-3-large, and Cohere's Rerank models did not significantly improve results for our labeled test set of 316 queries across 50 business documents.
We also explored alternatives to our default approach of scoring chunks by calculating the cosine similarity between the chunk embedding and the query embedding. Given that chunks often contain both relevant and irrelevant data, we tried embedding each sentence in a chunk separately and taking the maximum similarity between the query and each of those sentences. Unfortunately, this approach also did not yield significant improvements over our production method. Finally, standardized Euclidean distance, which scales each embedding dimension by its variance across the document's chunks in order to emphasize within-document semantic distinctions, also showed no meaningful improvement.
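Sketches of both alternatives, again leaning on the hypothetical `embed` helper; the per-dimension variances in the second function are what SciPy's standardized Euclidean distance expects.

```python
import numpy as np
from scipy.spatial.distance import seuclidean

def max_sentence_similarity(chunk: str, query_vector: np.ndarray) -> float:
    """Score a chunk by its best-matching sentence instead of its whole-chunk embedding."""
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]  # crude sentence split
    sentence_vectors = embed(sentences)
    return float(np.max(sentence_vectors @ query_vector))

def standardized_distances(chunk_embeddings: np.ndarray, query_vector: np.ndarray) -> np.ndarray:
    """Standardized Euclidean distance: each dimension is scaled by its variance
    across this document's chunks, emphasizing dimensions that vary within the document."""
    variances = chunk_embeddings.var(axis=0) + 1e-12  # guard against zero variance
    return np.array([seuclidean(row, query_vector, variances) for row in chunk_embeddings])
```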
Ultimately, chunk embeddings are fundamentally limited because they don’t represent the broader document context of the chunk, only each chunk’s contents.
Completions-only RAG
Given our struggles to improve embedding-based search performance, we've recently experimented with completions-only RAG, which avoids embeddings entirely. Summarization is the basis of this approach:
- Prompt an LLM to summarize each page of the document into a couple sentences.
- This is similar to the step of calculating embeddings for the half-page chunks, and, as with the embeddings, we only need to do it once per document.
- For each question (e.g. “what is the administrative agent’s name?”), use the page summaries to determine which pages are most likely to have the target data.
- Use those pages’ full content as context when posing the question to the LLM (see the sketches below).
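The sketches in the rest of this post assume a small `complete` helper along these lines; the client and model name are placeholders rather than what we necessarily run in production.

```python
from openai import OpenAI

client = OpenAI()

def complete(prompt: str) -> str:
    """Single-turn chat completion; any sufficiently capable model works here."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```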
This is a bit scary for two reasons. First, it feels less principled. We're relying on the much fuzzier summarization capabilities of the LLM. Second, instead of bringing our linear algebra tools to bear on numeric semantic representations that provide strong output guarantees, we need the LLM completions to accurately identify the relevant context pages with a consistent output format, and there's no guarantee that they will do so.
On the other hand, the underlying numeric representation for completions is much richer than it is for embeddings. Any given completion call cycles through many internal embedding representations as it generates tokens, so the amount of information available in the course of answering the question is much higher than with the pure embeddings approach.
Historically this increase in complexity led to a major cost and performance hit, but as completion API calls have gotten cheaper and faster that penalty has eased significantly. And of course we should only expect that trend to continue.
Let's see how we do on the above credit agreement example with the completions-only approach. First, we summarize each page of the credit agreement with a short page-summarization prompt.
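A minimal sketch of that summarization pass, using the `complete` helper above; the prompt wording here is illustrative rather than our exact production prompt, and `pages` is assumed to hold each page's plain text.

```python
# Illustrative summarization prompt; the exact production wording will differ.
SUMMARIZE_PROMPT = """Summarize the following page of a credit agreement in two or
three sentences, naming any parties, roles, dates, and amounts that appear.

Page {page_number}:
{page_text}

Summary:"""

page_summaries = [
    complete(SUMMARIZE_PROMPT.format(page_number=i + 1, page_text=text))
    for i, text in enumerate(pages)
]
```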
Then we use the summaries to select pages that might contain the target data, and pose the question with those pages' full text as context.
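A sketch of that selection step and the final question, again with illustrative prompt wording:

```python
import re

# Illustrative page-selection prompt.
SELECT_PROMPT = """Below are short summaries of each page of a document, labeled by
page number.

{numbered_summaries}

Question: what is the administrative agent's name?

Reply with the page numbers most likely to contain the answer, as a comma-separated
list of numbers and nothing else."""

numbered_summaries = "\n".join(
    f"Page {i + 1}: {summary}" for i, summary in enumerate(page_summaries)
)
selection = complete(SELECT_PROMPT.format(numbered_summaries=numbered_summaries))
selected = [int(n) for n in re.findall(r"\d+", selection)]

# Pose the question with the selected pages' full text as context.
context = "\n\n".join(pages[i - 1] for i in selected if 1 <= i <= len(pages))
answer = complete(
    f"{context}\n\nUsing the document pages above, answer the following question: "
    "what is the administrative agent's name?"
)
```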
Using this approach we find chunks containing the administrative agent's name in all ten samples.
Scalability
Most RAG use cases work over much larger corpora than a single document. So while it's easy for us to suggest going completions only, could this approach ever scale? Certainly not out of the gate for a large corpus.
That said, we believe that one should get to completions-only mode as quickly as possible. What this might look like is precomputing summaries for every element of the corpus alongside embeddings, filtering initial results using embedding similarity, and then switching to completions once the filtered set is small enough. As completions get cheaper and faster, and as context windows and model sophistication increase, that feasibility line will also move.
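One way that hybrid might look, reusing the hypothetical `embed` and `complete` helpers from earlier; the cutoff of 50 coarse candidates is an arbitrary placeholder.

```python
import re
import numpy as np

def hybrid_retrieve(question: str, corpus: list[str], summaries: list[str],
                    coarse_k: int = 50, final_k: int = 3) -> list[str]:
    """Coarse embedding filter over the whole corpus, then a completions pass
    over the survivors' precomputed summaries."""
    corpus_vectors = embed(corpus)
    question_vector = embed([question])[0]
    coarse = np.argsort(corpus_vectors @ question_vector)[::-1][:coarse_k]

    numbered = "\n".join(f"[{i}] {summaries[idx]}" for i, idx in enumerate(coarse))
    reply = complete(
        f"{numbered}\n\nWhich bracketed items are most relevant to the question "
        f'"{question}"? Reply with at most {final_k} bracketed numbers.'
    )
    keep = [int(n) for n in re.findall(r"\d+", reply)][:final_k]
    return [corpus[coarse[i]] for i in keep if i < len(coarse)]
```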
Even in the single-document case there are opportunities for optimization. In the above example we summarize each page individually. In practice it's better to summarize batches of pages at a time with some overlap, which cuts down on the total number of completion calls needed to summarize a document. For example, summarize pages 1-8 of the document, then pages 8-15, then 15-22, etc. The one-page overlap ensures that each batch's summary has access to context from the end of the preceding batch.
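A small helper for generating those overlapping page ranges; the window and overlap sizes match the example above but are otherwise arbitrary.

```python
def page_windows(num_pages: int, window: int = 8, overlap: int = 1) -> list[range]:
    """Overlapping 1-indexed page ranges, e.g. 1-8, 8-15, 15-22 with the defaults."""
    step = window - overlap
    starts = range(1, max(num_pages, 2), step)
    return [range(start, min(start + window, num_pages + 1)) for start in starts]

# For a 20-page document: pages 1-8, 8-15, and 15-20.
print([(w[0], w[-1]) for w in page_windows(20)])  # [(1, 8), (8, 15), (15, 20)]
```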
Conclusion
Applying classic ML and statistics techniques to embedding vectors certainly feels much more principled, familiar, and predictable than a pure completions-based approach to context selection. Linguistic interaction is messier and provides no strict output guarantees. Under the hood, though, our linguistic interaction with an LLM is just numbers. Our words constrain the behavior of the system in a way not totally dissimilar to the control we exercise with numeric techniques, just less explicit.
When we stay in completions-only mode, we bring much richer numeric representations to bear on our problems and lower the knowledge barrier to working on them. For RAG use cases that allow for this approach, we believe there's a lot to recommend it.