Companies that build lending and mortgage solutions in proptech often need to automate a solution to parse information in bank statements. By extracting and analyzing financial data from bank statements, these companies can more accurately assess an individual's or business's financial health and risk profile to make better lending and mortgage decisions. In addition to aiding analysis, bank statement data can help streamline the loan application process. Borrowers can automate aspects of review and approval, and potential borrowers can get approved more quickly.
Companies often lack access to bank statements in any format other than PDFs, which makes data extraction a potentially difficult problem. Enter Sensible. With Sensible you can easily extract key information out of bank statement PDFs using SenseML, Sensible’s query language for extracting data from documents. We’ve written a library of open-source SenseML configurations, so you don’t need to write queries from scratch for common documents. From there, your bank data is accessible via API, Sensible’s UI, or 5,000 other software integrations thanks to Zapier.
What we'll cover
This blog post briefly walks you through configuring extractions for Chase bank statements. By the end, you’ll know a couple of SenseML methods and you’ll be on your way to extracting any data you choose using our documentation or our prebuilt open-source configurations, which currently support Chase, Bank of America, and Wells Fargo bank statements.
Write document extraction queries with SenseML
Let's walk through extracting specific pieces of data from a bank statement. Here's an example of a bank statement PDF with redacted or dummy data:
To follow along, you can sign up for a Sensible account, then download an example PDF and upload it to the Sensible app, or import the PDF and prebuilt open-source configurations directly to the Sensible app.
See the following screenshot for an overview of how to extract the date range:
The queries in the left pane in the preceding image searches for dates near the text “through.” The PDF is displayed in the middle pane, and the extracted dates are in the right pane.
To try this out yourself, paste the following queries, or "fields" into the left pane of the Sensible app.
{
/* Sensible uses JSON5 to support code comments */
"fields": [{
/* search for target data
near 1st instance of text "through" */
"anchor": "through",
/* ID for target data */
"id": "start_date",
/* target data is a date, else return null */
"type": "date",
"method": {
/* the target date is in the anchor,
so return, or passthrough, the anchor */
"id": "passthrough",
/* grab the smaller date in the
returned anchor line */
"tiebreaker": "<"
},
},
{
/* same logic as start_date, except return
the greater date */
"id": "end_date",
"type": "date",
"method": {
"id": "passthrough",
"tiebreaker": ">"
},
"anchor": "through"
}
]
}
See the following screenshot for an overview of how to extract the first account type and withdrawals mentioned in the document:
The query in the left pane in the preceding image extracts the withdrawals by finding the last cell in a row to the right of the anchoring text using the Row method.
To try this out yourself, paste the following query into the left pane of the Sensible app:
{
/* SenseML uses JSON5 to support code comments */
"fields": [
{
"id": "electronic_withdrawals",
"anchor": {
/* start looking for anchor after this text */
"start": "account number",
"match": {
/* target is near anchor "electronic withdrawals" */
"text": "electronic withdrawals",
"type": "includes"
},
/* stop looking for anchor at this text */
"end": "ending balance"
},
"type": {
/* target data is currency */
"id": "currency",
/* if the bank statement is scanned,
allow for common currency OCR errors
for more info see https://docs.sensible.so/docs/types#currency */
"relaxedWithCents": true,
/* remove whitespaces between currency symbols and values for better recognition */
"removeSpaces": true
},
"method": {
/* target is in a row, last cell to right of anchor
for more info, see https://docs.sensible.so/docs/row
*/
"id": "row",
"position": "right",
"tiebreaker": "last"
}
},
{
"id": "account_type",
"method": {
/* target is to immediate left of anchor */
"id": "label",
"position": "left"
},
"anchor": {
"start": "account number",
"match": {
/* target data is near text "SUMMARY" */
"text": "SUMMARY",
"type": "endsWith",
"isCaseSensitive": true
}
}
}
]
}
You just saw how to extract a single currency value for an account. In real life, you’ll want to extract each account’s information in the consolidated bank statement. See the following screenshot for an overview of how to extract an array of accounts with details, instead of a single account’s information:
The query in the left pane in the preceding image slices the document into account sections, each of which starts with “account number”. Each account section is its own mini document with its own set of fields. For example, you just learned about using the Row method to get a withdrawal. When you use the Row method in a section, you get back the electronic withdrawal for each account. It happens to be null for the savings account.
To try it out, paste the following query into the left pane of the Sensible app:
{
"fields": [],
/* Get array of accounts, each containing info
like balances and transactions.
Each account is a 'section', or mini-document,
with its own fields isolated from the rest of the doc.
For more info see https://docs.sensible.so/docs/sections */
"sections": [
{
"id": "accounts_sections",
/* Display section boundaries in the Sensible app
as green brackets */
"display": true,
"range": {
"anchor": {
/* start looking for account sections at
the text "account number"
preceded by "TOTAL ASSETS" */
"start": "total assets",
"match": {
"text": "account number:",
"type": "startsWith"
}
},
},
"fields": [
/* same fields extracted in previous example,
now repeated for each account */
{
"id": "account_type",
"method": {
"id": "label",
"position": "left"
},
"anchor": {
"match": {
"text": "SUMMARY",
"type": "endsWith",
"isCaseSensitive": true
}
}
},
{
"id": "electronic_withdrawals",
"type": {
"id": "currency",
"relaxedWithCents": true,
"removeSpaces": true
},
"anchor": {
"match": {
"text": "electronic withdrawals",
"type": "includes"
},
"end": "ending balance"
},
"method": {
"id": "row",
"position": "right",
"tiebreaker": "last"
}
}
]
}
]
}
Each account contains transaction tables. By default, Sensible extracts these tables’ columns as parallel arrays. For an overview of how to zip the “column” arrays into “rows” of corresponding data for each transaction, see the following screenshot:
To try this out yourself, paste the following query into the left pane of the Sensible app:
{
"fields": [
{
"id": "_electronic_withdrawals_table_raw",
/* target data is a table */
"type": "table",
/* table starts after "electronic withdrawals" line
preceded by "total atm" line */
"anchor": {
"start": [
{
"text": "total atm",
"type": "includes"
}
],
"match": {
"text": "electronic withdrawals",
"type": "startsWith"
}
},
"method": {
/* of several table methods,
textTable is fastest */
"id": "textTable",
"columns": [
{
/* column starts 0.3" from left page edge
and ends 1" from left edge */
"id": "date_column",
"minX": 0.3,
"maxX": 1
},
{
/* column starts 6.3" from left page edge
and ends 8.1" from left edge */
"id": "amount_column",
"minX": 6.3,
"maxX": 8.1,
/* cells in columns must be currency */
"type": {
"id": "currency",
"relaxedWithCents": true,
"removeSpaces": true
},
/* if cell doesn't contain currency,
omit its row from output */
"isRequired": true
}
],
/* (recommended for performance)
table ends at "total electronic withdrawals" */
"stop": {
"text": "total electronic withdrawals",
"type": "startsWith"
}
}
}
],
"computed_fields": [
{
/* by default, table methods return
column objects. transform these to
row objects using the Zip method */
"id": "electronic_withdrawals_table",
"method": {
"id": "zip",
"source_ids": [
"_electronic_withdrawals_table_raw"
]
}
},
/* to avoid redundant output, return
the zipped table row objects and suppress the
original column objects */
{
"id": "cleanup_output",
"method": {
"id": "suppressOutput",
"source_ids": [
"_electronic_withdrawals_table_raw"
]
}
}
]
}
We've covered how to extract a few pieces of data from bank statements. Our prebuilt config extracts much more information. Check it out! In the following screenshot, every blue- or green-outlined line is a piece of extracted data:
Start extracting
Congratulations, you've learned some key methods for extracting structured data from bank statements. There's more extraction power for you to uncover. Sign up for a free account (100 docs a month, no credit card required), check out our prebuilt bank statement config in our open-source library, and peruse our docs to start extracting data from your own documents.
Stop relying on manual data entry. With Sensible, claim back valuable time, your ops team will thank you, and you can deliver a superior user experience. It’s a win-win.