Multimodal Engine for complex document extraction

Updated on

May 28, 2024

min read

Contributors

No items found.

Author

Table of contents

Sensible’s new Multimodal Engine uses LLMs to extract data from non-text and partial text images embedded in a document, including pictures, charts, graphs, and handwriting. This parameter also improves data extraction accuracy from documents with challenging layouts, like overlapping lines, non-standard checkboxes, and signatures. With the new Multimodal Engine, you can extract structured data from previously inaccessible sources within a document, like details about elements of a non-text image, adding a powerful new automation tool to your document processing toolset.

The Multimodal Engine parameter sends an image of the document region containing the target data to a multimodal LLM, allowing you to ask questions about non-text and partial text images. As with query groups, Sensible automatically selects a relevant excerpt and surrounding context from the document to send as an image to the multimodal LLM based on your natural language queries. Alternatively, you can set an anchor and use Region parameters to define an image’s location deterministically.

Here are two ways to use Sensible’s Multimodal Engine parameter:

Extract data from images embedded in a document

The Multimodal Engine parameter can extract facts from – or about – an image, or interpret charts and graphs within the context of a query group. Using the following image from a property’s offering memorandum as an example, you can return structured data about the building’s characteristics, including exterior material, number of stories, and presence of trees, as well as facts from the community amenities text box, like ownership updates.

After enabling the Multimodal Engine parameter, use the following configuration to extract data about the building's characteristics:

{
    "fields": [
        {
            "method": {
                "id": "queryGroup",
                "chunkSize": 1,
                "chunkCount": 2,
                "multimodalEngine": {
                    "region": "automatic"
                },
                "queries": [
                    {
                        "id": "trees_present",
                        "description": "are there trees on the property? respond true or false",
                        "type": "string"
                    },
                    {
                        "id": "multistory",
                        "description": "are the buildings multistory? return true or false",
                        "type": "string"
                    },
                    {
                        "id": "community_amenities",
                        "description": "give one example of a community amenity listed",
                        "type": "string"
                    },
                    {
                        "id": "ownership_upgrades",
                        "description": "give one example of an existing upgrade that current ownership made",
                        "type": "string"
                    },
                    {
                        "id": "exterior",
                        "description": "what is the exterior of the building made of (walls, not roof)?",
                        "type": "string"
                    }
                ]
            }
        }
    ]
}

The configuration returns the following output:

{
    "trees_present": {
        "value": "true",
        "type": "string"
    },
    "multistory": {
        "value": "true",
        "type": "string"
    },
    "community_amenities": {
        "value": "Gated perimeter with key card access",
        "type": "string"
    },
    "ownership_upgrades": {
        "value": "New Signage and Landscaping Enhancements",
        "type": "string"
    },
    "exterior": {
        "value": "Brick",
        "type": "string"
    }
}

Additional instances of extracting data from non-text images include corroborating submitted insurance claims against submitted damage photos, or directly extracting data from visual charts in a financial report.

Extract data from documents with complex and imprecise layouts

Previously, document formatting issues like overlapping lines, lines between lines, checkboxes, and handwriting, made it difficult to reliably extract data. With the new Multimodal Engine parameter, Sensible sends an image of the relevant region to the LLM, which uses context to holistically process and extract data similarly to the way a human would. In the following example, the handwritten form contains imprecise pen marks and checkboxes, as well as some line overlap. After defining the specific region of the form you want to extract, Sensible sends the image to the multimodal LLM to accurately extract the data, despite formatting issues.

After enabling the Multimodal Engine parameter, and defining a custom extraction region, use the following configuration to extract the handwritten responses:

{
    "preprocessors": [
        {
            "type": "nlp",
            "confidenceSignals": true
        }
    ],
    "fields": [
        {
            "method": {
                "id": "queryGroup",
                "multimodalEngine": {
                    "region": "automatic"
                },
                "queries": [
                    {
                        "id": "ownership_type",
                        "description": "What is the type of ownership?",
                        "type": "string"
                    },
                    {
                        "id": "owner_name",
                        "description": "What is the full name of the owner?",
                        "type": "string"
                    }
                ]
            }
        }
    ]
}

The configuration returns the following output:

{
  "ownership_type": {
    "value": "Natural Person(s)",
    "type": "string"
  },
  "owner_name": {
    "value": "Kyle Murray",
    "type": "string"
  }
}

The new Multimodal Engine parameter opens up new possibilities for extracting structured data from non-text and partial text images, as well as improves extraction accuracy for challenging or complex layouts. This enhances your ability to fully automate data extraction from a wider range of documents.

Try Multimodal Engine support in the LLM Query Group method.

Turn documents into structured data

Stop relying on manual data entry. With Sensible, claim back valuable time, your ops team will thank you, and you can deliver a superior user experience. It’s a win-win.

Start Extracting Talk to our team

Multimodal Engine for complex document extraction

Extract data from images embedded in a document

Extract data from documents with complex and imprecise layouts

Turn documents into structured data

Related posts

Introducing Visual Document Extraction: Build Configurations with Cards and Natural Language

Introducing email data extraction

Introducing Human Review: increase extraction accuracy with manual oversight

Recommended Query Groups