Sensible’s new Multimodal Engine uses LLMs to extract data from non-text and partial-text images embedded in a document, including pictures, charts, graphs, and handwriting. The parameter also improves extraction accuracy for documents with challenging layouts, such as overlapping lines, non-standard checkboxes, and signatures. With the new Multimodal Engine, you can extract structured data from previously inaccessible sources within a document, such as details about elements of a non-text image, adding a powerful new automation tool to your document processing toolset.
The Multimodal Engine parameter sends an image of the document region containing the target data to a multimodal LLM, letting you ask questions about non-text and partial-text images. As with standard query groups, Sensible automatically selects a relevant excerpt and its surrounding context based on your natural language queries and sends it to the multimodal LLM as an image. Alternatively, you can set an anchor and use Region parameters to define the image’s location deterministically.
Here are two ways to use Sensible’s Multimodal Engine parameter:
Extract data from images embedded in a document
The Multimodal Engine parameter can extract facts from, or about, an image, or interpret charts and graphs within the context of a query group. Using the following image from a property’s offering memorandum as an example, you can return structured data about the building’s characteristics, including its exterior material, number of stories, and the presence of trees, as well as facts from the community amenities text box, such as ownership updates.
After enabling the Multimodal Engine parameter, use the following configuration to extract data about the building's characteristics:
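The original post presents the configuration as a screenshot; the sketch below approximates it in SenseML. The query IDs, descriptions, and the exact shape of the multimodalEngine parameter (here set to choose the image region automatically) are assumptions based on the description above, so consult Sensible's query group documentation for the precise syntax.

```json
{
  "fields": [
    {
      "method": {
        "id": "queryGroup",
        "multimodalEngine": {
          "region": "automatic"
        },
        "queries": [
          {
            "id": "exterior_material",
            "description": "What material is the building's exterior made of?",
            "type": "string"
          },
          {
            "id": "number_of_stories",
            "description": "How many stories does the building have?",
            "type": "number"
          },
          {
            "id": "trees_present",
            "description": "Are there trees visible around the building? Answer yes or no.",
            "type": "string"
          },
          {
            "id": "ownership_update",
            "description": "What ownership update is listed in the community amenities text box?",
            "type": "string"
          }
        ]
      }
    }
  ]
}
```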
The configuration returns the following output:
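Sensible returns one value per query ID. The shape below follows Sensible's standard extraction output; the values themselves are illustrative placeholders, not results from the actual memorandum image.

```json
{
  "exterior_material": {
    "type": "string",
    "value": "brick and stucco"
  },
  "number_of_stories": {
    "type": "number",
    "value": 3
  },
  "trees_present": {
    "type": "string",
    "value": "yes"
  },
  "ownership_update": {
    "type": "string",
    "value": "recently renovated under new ownership"
  }
}
```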
Other uses for extracting data from non-text images include corroborating insurance claims against submitted damage photos, or extracting data directly from visual charts in a financial report.
Extract data from documents with complex and imprecise layouts
Previously, document formatting issues like overlapping lines, lines between lines, checkboxes, and handwriting made it difficult to reliably extract data. With the new Multimodal Engine parameter, Sensible sends an image of the relevant region to the LLM, which uses context to process the region holistically and extract data much as a human reader would. In the following example, the handwritten form contains imprecise pen marks and checkboxes, as well as some line overlap. After you define the specific region of the form you want to extract, Sensible sends the image to the multimodal LLM, which accurately extracts the data despite the formatting issues.
After enabling the Multimodal Engine parameter, and defining a custom extraction region, use the following configuration to extract the handwritten responses:
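As with the previous example, this is a hypothetical SenseML sketch rather than the exact configuration from the post. The anchor text, the region parameter names (start, offsetX, offsetY, width, height, in inches), and the query IDs are assumptions meant to show how an anchor plus Region parameters might pin the Multimodal Engine to a specific area of the form.

```json
{
  "fields": [
    {
      "anchor": "section 2: coverage selections",
      "method": {
        "id": "queryGroup",
        "multimodalEngine": {
          "region": {
            "start": "below",
            "offsetX": 0,
            "offsetY": 0.25,
            "width": 7.5,
            "height": 2.5
          }
        },
        "queries": [
          {
            "id": "applicant_name",
            "description": "What name is handwritten in the applicant name field?",
            "type": "string"
          },
          {
            "id": "coverage_selected",
            "description": "Which coverage option checkbox is marked?",
            "type": "string"
          },
          {
            "id": "signature_present",
            "description": "Is the signature line signed? Answer yes or no.",
            "type": "string"
          }
        ]
      }
    }
  ]
}
```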
The configuration returns the following output:
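Again, the output keys mirror the query IDs from the configuration sketch above, and the handwritten values shown here are illustrative placeholders rather than results from the actual form.

```json
{
  "applicant_name": {
    "type": "string",
    "value": "Jordan A. Smith"
  },
  "coverage_selected": {
    "type": "string",
    "value": "comprehensive"
  },
  "signature_present": {
    "type": "string",
    "value": "yes"
  }
}
```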
The new Multimodal Engine parameter opens up new possibilities for extracting structured data from non-text and partial-text images and improves extraction accuracy for challenging or complex layouts, enhancing your ability to fully automate data extraction from a wider range of documents.
Try Multimodal Engine support in the LLM Query Group method.