In a typical document extraction scenario, each PDF is a single document, such as a bill of lading, a bank statement, or an insurance policy. In other cases, however, that one-to-one relationship between PDF files and individual documents does not hold. Instead, the PDF contains multiple documents combined into a “portfolio,” which presents unique challenges for document extraction.
The mortgage underwriting process often involves portfolio PDFs. A loan origination application can contain tax forms, bank statements, paystubs, credit checks, inspection reports, and more, collated into one large submission. This creates a problem for underwriters who want to move to software-defined workflows and reduce the manual effort needed to originate a loan.
At Sensible, we've recently released support for these complex, multi-document PDFs, built on our existing SenseML document extraction query language. In doing so, we've solved several challenges unique to multi-document PDFs.
The first challenge is to standardize text extraction across all the constituent documents. With portfolios, some documents may be scans and others may be vector PDFs. Even the vector PDFs can contain meaningful text embedded in images (for example, some bank statements embed their consistent text elements as true text and the variable elements as images). Sensible solves this challenge by detecting whether a PDF portfolio requires OCR on a page-by-page basis. This results in complete data across the entire portfolio while avoiding unnecessary processing time or sources of error introduced by superfluous OCR.
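To make the page-by-page idea concrete, here is a minimal Python sketch. It is an illustration only, not Sensible's actual detection logic: the `Page` type, the `needs_ocr` heuristic, and both thresholds are assumptions chosen for the example. A page is sent to OCR if its text layer is nearly empty (a likely scan) or if images cover enough of the page that they may hold meaningful text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Page:
    """Hypothetical summary of one PDF page (not a Sensible type)."""
    embedded_text: str     # text recovered from the PDF's text layer
    image_coverage: float  # fraction of the page area covered by images, 0.0 to 1.0


def needs_ocr(page: Page, min_chars: int = 20, max_image_coverage: float = 0.2) -> bool:
    """Decide whether a single page should be OCR'd.

    Both thresholds are illustrative assumptions, not tuned values.
    """
    has_text_layer = len(page.embedded_text.strip()) >= min_chars
    image_heavy = page.image_coverage > max_image_coverage
    # Scans have no usable text layer; image-heavy vector pages may hide text in images.
    return (not has_text_layer) or image_heavy


def pages_requiring_ocr(pages: List[Page]) -> List[int]:
    """Return the 0-based indexes of the pages that should go through OCR."""
    return [index for index, page in enumerate(pages) if needs_ocr(page)]


if __name__ == "__main__":
    portfolio = [
        Page("Account statement for January 2021", 0.05),      # clean vector page
        Page("", 0.95),                                         # scanned page
        Page("Statement header text that is constant", 0.50),  # text plus large images
    ]
    print(pages_requiring_ocr(portfolio))
```

In this sketch only the second and third pages are flagged, so the clean vector page skips OCR entirely.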
The second challenge is to segment the portfolio into its constituent documents. Here we’ve expanded our fingerprinting to allow you to specify where in the document you expect to see certain key text (e.g., the first page, the last page, any page, or every page). Using these enhanced fingerprints, we break the portfolio down into an array of individual documents. Fingerprints also ignore irrelevant sections of the portfolio, like fax cover pages.
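The segmentation step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the SenseML fingerprint syntax or Sensible's implementation: the `Fingerprint` and `Segment` types and the `segment_portfolio` function are hypothetical, matching is plain case-insensitive substring search, and only "first page" and "last page" positions are modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Fingerprint:
    """Hypothetical fingerprint: key text tied to a position in a document."""
    doc_type: str
    first_page: str                  # text expected on the document's first page
    last_page: Optional[str] = None  # text expected on its last page, if known


@dataclass
class Segment:
    doc_type: str
    start: int  # 0-based index of the first page, inclusive
    end: int    # 0-based index of the last page, inclusive


def segment_portfolio(page_texts: List[str], fingerprints: List[Fingerprint]) -> List[Segment]:
    """Split a portfolio into documents; pages outside any document are skipped."""
    segments: List[Segment] = []
    current: Optional[Tuple[Fingerprint, int]] = None  # open document and its start page
    for index, text in enumerate(page_texts):
        lowered = text.lower()
        starter = next((f for f in fingerprints if f.first_page.lower() in lowered), None)
        if starter is not None:
            # A first-page match closes any open document and opens a new one.
            if current is not None:
                segments.append(Segment(current[0].doc_type, current[1], index - 1))
            current = (starter, index)
        if current is not None:
            last = current[0].last_page
            if last is not None and last.lower() in lowered:
                # A last-page match closes the open document on this page.
                segments.append(Segment(current[0].doc_type, current[1], index))
                current = None
    if current is not None:
        segments.append(Segment(current[0].doc_type, current[1], len(page_texts) - 1))
    return segments
```

With a fax cover page in front and a trailer page at the end, neither page matches a fingerprint while a document is open, so both are left out of the resulting segments.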
The last challenge is to extract the structured data for each individual document. We solve this by assigning each document to a specific set of SenseML queries based on its fingerprints. Then we run an instance of the Sensible engine that sees only the relevant page ranges, leading to results identical to single-document cases. The ultimate output is an array of document extractions in the order they appear in the portfolio.
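As a rough sketch of this dispatch step, the Python below routes each segmented document to an extractor for its type and passes that extractor only the document's own pages. Everything here is an assumption made for illustration: the `extract_portfolio` function, the `(doc_type, start, end)` tuples, the extractor callables standing in for sets of SenseML queries, and the shape of the output dictionaries are hypothetical and do not reflect Sensible's API response format.

```python
from typing import Callable, Dict, List, Tuple

# An extractor stands in for a set of queries: it receives page texts, returns fields.
Extractor = Callable[[List[str]], dict]
# A segment is (document type, first page index, last page index), both inclusive.
SegmentTuple = Tuple[str, int, int]


def extract_portfolio(
    page_texts: List[str],
    segments: List[SegmentTuple],
    extractors: Dict[str, Extractor],
) -> List[dict]:
    """Run one extraction per document, in the order documents appear."""
    results: List[dict] = []
    for doc_type, start, end in sorted(segments, key=lambda segment: segment[1]):
        # The extractor sees only this document's pages, as in a single-document run.
        document_pages = page_texts[start:end + 1]
        fields = extractors[doc_type](document_pages)
        results.append(
            {"type": doc_type, "start_page": start, "end_page": end, "fields": fields}
        )
    return results


if __name__ == "__main__":
    pages = ["cover", "Form 1040", "schedule", "Account Summary", "End of statement"]
    segments = [("bank_statement", 3, 4), ("tax_form", 1, 2)]
    extractors = {
        "tax_form": lambda doc: {"page_count": len(doc)},
        "bank_statement": lambda doc: {"first_line": doc[0]},
    }
    for extraction in extract_portfolio(pages, segments, extractors):
        print(extraction)
```

Because each extractor receives a slice rather than the whole portfolio, text from a neighboring document cannot leak into its results.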
We have two new API endpoints that specifically support portfolio extractions. To use them, specify the document types you expect to find in your portfolios and set up enhanced fingerprinting in your SenseML queries. We take care of the rest.
Our customers are using our portfolio APIs today to extract data from complex mortgage underwriting documents, and we'd love to chat with you if you have similar needs, whether in loan origination or another domain. Request a demo today.