Splitting Multi-Document PDFs with LLMs

Updated on

March 28, 2025

min read

Contributors

No items found.

Author

Table of contents

Introduction: The Portfolio PDF Problem

Many businesses deal with PDFs containing multiple document types in a single file. These "portfolio PDFs" are common across industries and create significant processing challenges.

In medical and dental claims billing, portfolios often contain explanations of benefits (EOBs), scanned checks, policy information, and treatment justification documents. In mortgage loan underwriting, portfolios typically include paystubs, bank statements, verification of employment (VOE) forms, and other supporting documents.

Processing these mixed files has traditionally been manual or semi-automated at best, requiring someone to separate and classify different document types before any meaningful data extraction can happen. This is time-consuming, error-prone, and a bottleneck for automation.

At Sensible we've developed a powerful LLM-based approach to automatic portfolio splitting, classification, and extraction, that all takes place in a single API call. This dramatically simplifies the process of automating these complex structured data extraction use cases.

‍

Technical Challenges of Portfolio Processing

Processing portfolio PDFs presents several technical challenges:
‍

Mixed Page Quality

Portfolios often combine digitally-generated PDFs with extractable text alongside scanned documents requiring OCR. This means your processing pipeline needs to make page-by-page decisions about whether to directly extract text or apply OCR, balancing speed and accuracy.
‍

Document Type Identification

Effective document classification requires clear definitions of possible document types. You need to provide the LLM with specific descriptions of each document type to enable accurate classification decisions, especially when dealing with visually similar documents that serve different purposes.
‍

Consistent Data Extraction

After identifying document types, you need to extract structured data from each one. Different document types require different extraction models and output schemas - extracting data from an EOB requires a different approach than processing a bank statement or paystub.
‍

Sensible's LLM-based Portfolio Processing

Sensible's portfolio processing API addresses these challenges through an LLM-powered approach that handles classification, segmentation, and data extraction in a single API call.
‍

How It Works

The LLM-based portfolio processing works in four steps:

Text extraction: Sensible analyzes each page to determine whether to directly extract its text or perform OCR, and to compensate for page orientation and rotation
Page Classification: The LLM classifies each page in the portfolio based on content and layout characteristics
Document Segmentation: The system determines precise page ranges for each document in the portfolio
Data Extraction: Using the appropriate document type configuration (e.g., a bank statement parser for a document classified as a bank statement, vs an EOB parser for an EOB), Sensible processes the portfolio and returns an array of structured JSON data with one entry for each document

‍

‍

Implementation

To implement LLM-based portfolio processing with Sensible:

Define document types: Create document types for each expected document in your portfolios
Add descriptions: In each document type's Settings tab in the Sensible Dashboard, add a clear description of the document
Create configurations: For each document type, create a configuration that describes the structured data you want to capture for that type (e.g., for a bank statement, the statement date, list of withdrawals, account number, &c.)
Make an extraction request: Use the API, SDK, or Sensible app to submit the portfolio with LLM segmentation enabled
‍


curl --location 'https://api.sensible.so/v0/extract_from_url?environment=production&document_name=portfolio_example' \
--header 'Content-Type: application/json' \
--header 'Authorization: YOUR_API_KEY' \
--data '{
    "document_url":"https://example.com/portfolio.pdf",
    "types": [
          "bank_statements",
          "pay_stubs",
          "1040s"
     ],
     "segment_documents_with": "llm"
}'

‍

Example: Document Type Descriptions

Clear document type descriptions are important for high quality classification and splitting. Here are some examples:

Bank Statements: "A financial document summarizing account activity over a defined period. Typically includes the bank’s name/logo, account holder details, statement date, beginning and ending balances, and a detailed list of transactions such as deposits, withdrawals, and fees"
Pay Stubs: "A document issued by an employer that details an employee's earnings, deductions, and net pay for a specific pay period. Typically includes the employer’s name, employee’s name, pay period, gross wages, taxes withheld, deductions, and net pay. The first page contains an overview of earnings and withholdings, while the last section may include employer notes, year-to-date totals, or deposit details"
1040s: "The primary individual income tax return used to report annual earnings, deductions, credits, and tax liabilities. Typically includes taxpayer information, income sources, adjustments, and final tax owed or refunded. The first page contains the filer’s personal details and taxable income summary, while the last page may include signature lines and payment/refund instructions"

These descriptions help the LLM understand exactly what to look for when segmenting the portfolio.

‍

Example: Portfolio JSON Output


{
  "documents": [
    {
      "documentType": "1040s",
      "configuration": "1040_2019",
      "startPage": 0,
      "endPage": 1,
      "output": {
        "parsedDocument": {
          "year": {
            "type": "string",
            "value": "2020"
          },
          "filing_status.married_filing_jointly": {
            "type": "boolean",
            "value": true
          },
          "name": {
            "type": "string", 
            "value": "Kevin Finnerty"
          }
          // Additional fields...
        }
      }
    },
    // Additional documents...
  ]
}

‍

Conclusion

LLM-based portfolio processing eliminates the need for manual document separation and classification. With Sensible's API, you can automatically identify, segment, and extract data from complex multi-document files in a single API call.

For developers and startups dealing with document-heavy processes, this approach offers a pragmatic solution to automate portfolio processing without building complex document classification systems from scratch.

If you're dealing with complex portfolio documents in your business workflows, Sensible's solution engineering team can help you implement an effective document automation pipeline. Schedule a consultation to discuss your specific use case!

‍

Turn documents into structured data

Stop relying on manual data entry. With Sensible, claim back valuable time, your ops team will thank you, and you can deliver a superior user experience. It’s a win-win.

Start Extracting Book a demo