How to automate human-in-the-loop review for document processing

Updated on November 8, 2024

When you extract document data at scale using Sensible, automated human-in-the-loop review can become an essential part of your quality-control process. This post covers how to integrate human review into your document processing life cycle. As the following figure shows, you’ll automatically flag document extractions for review, notify reviewers of flagged extractions, and set up webhooks to ingest corrected extractions into your system once reviewers approve them.

Human-in-the-loop review for document data processing

You’ll learn how to take the following steps:

  1. Configure review triggers: Configure extraction quality validation for a document type, for example, tax documents or pay stubs. Any extraction that doesn’t meet your quality validations triggers a human review.
  2. Specify a webhook for each document extraction: When extracting data from a document using Sensible’s API or SDK, specify a webhook destination URL that receives updates to the extraction’s review status. 
  3. Notify a reviewer: When the webhook indicates that a completed extraction needs review and correction, notify a reviewer and send them a link to the review interface.
  4. Ingest corrected extractions: When the webhook indicates that a reviewer approved an extraction, ingest the document data into your system.

(Prerequisite) Configure support for pay stub data extraction

For this tutorial, let’s extract document data from pay stubs. 

To add support for extracting data from pay stubs to your account, follow the steps in Out-of-the-box extractions.

1. Configure review triggers  

To ensure data quality, you can add pass/fail tests for your document extractions, and trigger human review for extractions that fail tests. In this example, you’ll write logic to test that each paystub extraction:

  • Reports a pay period start date
  • Reports a plausible number of hours worked

If either of the preceding tests, or validations, fails, Sensible triggers a "NEEDS_REVIEW" status for the extraction so that a reviewer can correct the errors or reject the extraction completely.

Implement validations

To implement the preceding validations for the pay_stubs document type, take the following steps:

1. In the pay_stubs document type you created in a previous step, click the Validations tab:

Create validation

2. Click Create validation. In the dialog, fill in the fields as follows to implement a test that fails if the paystub extraction is missing a pay period start date, then click Create:

  1. Description: Pay period start date must be present (non-null)
  2. Severity: Error
  3. Condition: 

{
    "exists": [
        {
            "var": "pay_period_start_date.value"
        }
    ]
}

The preceding condition is written in JsonLogic and tests that a value for the extracted data field with the key "pay_period_start_date" exists, that is, that it’s non-null. JsonLogic is a library for processing rules written in JSON. A JsonLogic rule is structured as follows: { "operator" : ["values" ... ] }. For example, { "cat" : ["I love", " pie"] } results in "I love pie". Note the leading space in " pie": the cat operator concatenates its values without adding a separator.

Validation written in JsonLogic
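To make the rule structure concrete, here is a minimal sketch of how a JsonLogic-style rule can be evaluated against extracted data. It is not Sensible's implementation or the json-logic-js library: `evaluate` is a hypothetical helper that supports only the "var", "exists", and "cat" operators, its dot-path lookup is a simplifying assumption, and the passing date value is made up for illustration.

```javascript
// Minimal JsonLogic-style evaluator (illustrative sketch only).
// Supports just three operators: "var", "exists", and "cat".
function evaluate(rule, data) {
  // Primitives (strings, numbers, booleans, null) evaluate to themselves.
  if (rule === null || typeof rule !== "object") return rule;
  if (Array.isArray(rule)) return rule.map((item) => evaluate(item, data));

  // A rule is an object with a single key: the operator.
  const [operator] = Object.keys(rule);
  const rawArgs = rule[operator];

  if (operator === "var") {
    // Assumption: the argument is a string path, walked one key at a time.
    let current = data;
    for (const key of String(rawArgs).split(".")) {
      if (current === null || current === undefined) return null;
      current = current[key];
    }
    return current === undefined ? null : current;
  }

  // Evaluate nested rules before applying the operator.
  const args = [].concat(rawArgs).map((arg) => evaluate(arg, data));
  if (operator === "exists") {
    return args.every((value) => value !== null && value !== undefined);
  }
  if (operator === "cat") return args.join("");
  throw new Error(`Unsupported operator in this sketch: ${operator}`);
}

const startDateCondition = { exists: [{ var: "pay_period_start_date.value" }] };

// The field is null, so the validation fails.
const missingStartDate = { pay_period_start_date: null };
// Hypothetical value, so the validation passes.
const presentStartDate = {
  pay_period_start_date: { value: "2020-07-19T00:00:00.000Z", type: "date" },
};

console.log(evaluate(startDateCondition, missingStartDate)); // false
console.log(evaluate(startDateCondition, presentStartDate)); // true
console.log(evaluate({ cat: ["I love", " pie"] }, {})); // "I love pie"
```

The recursion is the point of the sketch: each operator receives already-evaluated arguments, which is why a "var" lookup can be nested inside "exists".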

3. Create a second validation to check if a paystub reports a plausible number of regular hours worked. The validation assumes that 80 hours is the norm for a two-week paystub, and that a paystub reporting fewer than 1 or more than 80 regular hours indicates an extraction error. Follow the directions in the preceding step to create a second validation with the following fields:

  1. Description: regular hours worked must be 1-80 hrs
  2. Severity: Warning
  3. Condition:

{
    "and": [
        { ">=": [{ "var": "hours.regular.value" }, 1] },
        { "<=": [{ "var": "hours.regular.value" }, 80] }
    ]
}
Second validation
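As a rough cross-check of the range rule, the same logic can be written as a plain JavaScript function. This is a sketch, not how Sensible evaluates validations: `regularHoursPlausible` is a hypothetical helper, and it assumes the extracted data contains a field keyed "hours.regular" whose "value" is a number. Treating a null field as a failure is also an assumption of this sketch.

```javascript
// Returns true when regular hours fall inside the inclusive range [min, max].
function regularHoursPlausible(parsedDocument, min = 1, max = 80) {
  const field = parsedDocument["hours.regular"];
  // Assumption: a null, missing, or non-numeric field counts as a failure.
  if (!field || typeof field.value !== "number") return false;
  return field.value >= min && field.value <= max;
}

console.log(regularHoursPlausible({ "hours.regular": { value: 64, type: "number" } })); // true
console.log(regularHoursPlausible({ "hours.regular": { value: 95, type: "number" } })); // false
console.log(regularHoursPlausible({ "hours.regular": null })); // false
```

Running a check like this against a handful of known-good paystubs can help confirm that the 1 to 80 range suits your pay periods before you rely on it as a review trigger.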

Configure validation-based review triggers

Take the following steps to trigger human review for each extraction that fails either of the tests you created in the previous steps:

1. Click the Human review tab and click Enable Human Review. Select the validations you created in the previous steps:

Trigger review for each extraction that fails a validation

Now Sensible assigns a "NEEDS_REVIEW" status to any pay stub extraction that fails any validations you selected. For example, if a pay stub extraction reports 95 regular hours worked, Sensible flags it for review.

Note that you have options for triggering human review other than selecting individual validations. For example, you can trigger review if an extraction’s share of null data points is too high (that is, it falls below a minimum coverage score), or if it exceeds an acceptable number of failed validations.

2. Specify a webhook for each document extraction

To handle human reviews programmatically, you must specify a webhook destination for each paystub extraction. You can’t specify webhooks using the Sensible app’s extraction UI, so you must use the Sensible API or SDK. The following code example shows how to specify a webhook in an extraction request for a sample paystub document using the Sensible API:


curl --location 'https://api.sensible.so/v0/extract_from_url/pay_stubs?environment=production' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer YOUR_API_KEY' \
--data '{"document_url":"https://github.com/sensible-hq/sensible-docs/raw/main/readme-sync/assets/v0/pdfs/blog_human_review/adp_sample.pdf",
"content_type":"application/pdf", 
"webhook": {"url":"YOUR_WEBHOOK_URL","payload":"some info you want to include in addition to the default payload, which includes extraction id, review status, and parsed doc"}}'

Or in the JavaScript SDK:


import { SensibleSDK } from "sensible-api";

// Replace the placeholder with your API key. If you paste in your key,
// like `SensibleSDK("1ac34b14")`, then secure it in production.

const sensible = new SensibleSDK("YOUR_API_KEY");
const request = await sensible.extract({
      // extract from a local file, or from a document at a specified URL
      url: "https://github.com/sensible-hq/sensible-docs/raw/main/readme-sync/assets/v0/pdfs/blog_human_review/adp_sample.pdf",
      documentType: "pay_stubs",
      environment: "production",
      webhook: { url: "YOUR_WEBHOOK_URL", payload: "some info you want to include in addition to the default payload, which includes extraction id, review status, and parsed doc" },
    });
const results = await sensible.waitFor(request); // polls every 5 seconds. Optional if you configure a webhook
console.log(results);



The preceding code extracts data from a sample paystub PDF using the pay_stubs document type you configured in a previous step, applies the validations and human review triggers you configured, and posts the results to a webhook.
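If you call the API from JavaScript without the SDK, the curl request above can be mirrored as follows. This is a sketch that only builds the request, so you can inspect it before sending: `buildExtractionRequest` is a hypothetical helper, not part of Sensible's SDK, and the key, webhook URL, and payload text are placeholders.

```javascript
// Builds the URL and fetch options that mirror the curl example.
function buildExtractionRequest({ apiKey, documentType, documentUrl, webhookUrl, webhookPayload }) {
  const url =
    "https://api.sensible.so/v0/extract_from_url/" +
    `${encodeURIComponent(documentType)}?environment=production`;
  return {
    url,
    options: {
      method: "POST", // curl sends POST when --data is present
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        document_url: documentUrl,
        content_type: "application/pdf",
        webhook: { url: webhookUrl, payload: webhookPayload },
      }),
    },
  };
}

const extractionRequest = buildExtractionRequest({
  apiKey: "YOUR_API_KEY",
  documentType: "pay_stubs",
  documentUrl:
    "https://github.com/sensible-hq/sensible-docs/raw/main/readme-sync/assets/v0/pdfs/blog_human_review/adp_sample.pdf",
  webhookUrl: "YOUR_WEBHOOK_URL",
  webhookPayload: "some info you want to include in addition to the default payload",
});

console.log(extractionRequest.url);
// To send the request (Node 18+ or a browser):
// const response = await fetch(extractionRequest.url, extractionRequest.options);
```

Keeping request construction separate from sending makes it easy to unit test that every extraction request carries a webhook.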

The example document used in the preceding extraction requests fails the validations you set up in previous steps because it’s missing a pay period start date and has an incorrect number of regular hours:

Sample paystub that fails validations

3. Notify a reviewer

In the Sensible app, reviewers can manually check the Human review tab to view extractions flagged for review. 

If you want to skip this manual process and automatically notify a reviewer that an extraction needs review, you must write code to handle the results posted to the webhook. Your code must check whether the results include the "reviewStatus" parameter (for single-document extractions) or the "reviewStatuses" parameter (for portfolio documents).
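A minimal sketch of that check for single-document extractions might look like the following. `needsReview` is a hypothetical helper; it assumes the webhook body has already been parsed from JSON, and it doesn't cover the "reviewStatuses" parameter for portfolio documents.

```javascript
// Returns true when a completed single-document extraction is flagged for review.
function needsReview(webhookBody) {
  return (
    Boolean(webhookBody) &&
    webhookBody.status === "COMPLETE" &&
    webhookBody.reviewStatus === "NEEDS_REVIEW"
  );
}

console.log(needsReview({ status: "COMPLETE", reviewStatus: "NEEDS_REVIEW" })); // true
console.log(needsReview({ status: "COMPLETE" })); // false: no review status present
```

Checking "status" as well as "reviewStatus" keeps the notification from firing before the extraction has finished.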

Parse the webhook

The following code shows the webhook results for the example paystub document referenced in previous steps. 


{
    "id": "b84bd1c8-113e-4e1e-8462-379f0dde2abf",
    "created": "2024-11-06T19:28:59.390Z",
    "completed": "2024-11-06T19:29:23.034Z",
    "status": "COMPLETE",
    "type": "pay_stubs",
    "document_name": "adp_screenshot",
    "configuration": "adp",
    "configuration_version": "U4.S2CoUTobRmV9omI3XNX_VCMwzwe1x",
    "environment": "production",
    "page_count": 1,
    "parsed_document": {
        "employer_name": {
            "type": "string",
            "value": "SOUTH COAST GLOBAL MEDICAL CENTER, INC."
        },
        "employee_name": {
            "type": "string",
            "value": "Kevin Johnson"
        },
        "employee_address": {
            "value": "223 Ash Drive\nBrenda CA 84880",
            "type": "address"
        },
        "pay_date": {
            "source": "08/07/2020",
            "value": "2020-08-07T00:00:00.000Z",
            "type": "date"
        },
        "pay_period_start_date": null,
        "pay_period_end_date": {
            "source": "08/01/2020",
            "value": "2020-08-01T00:00:00.000Z",
            "type": "date"
        },
        "net_pay": {
            "source": "2,458.32",
            "value": 2458.32,
            "type": "number"
        },
        "hours.regular": {
            "source": "84.00",
            "value": 84,
            "type": "number"
        },
        "hours.sick": null,
        "hours.paid_time_off": null,
        "hours.vacation": null,
        "pay_this_period.regular": {
            "source": "3,237.76",
            "value": 3237.76,
            "unit": "$",
            "type": "currency"
        },
        "pay_this_period.sick": null,
        "pay_this_period.paid_time_off": null,
        "pay_this_period.vacation": null,
        "ytd.regular": {
            "source": "51,789.23",
            "value": 51789.23,
            "unit": "$",
            "type": "currency"
        },
        "ytd.sick": {
            "source": "607.08",
            "value": 607.08,
            "unit": "$",
            "type": "currency"
        },
        "ytd.paid_time_off": null,
        "ytd.vacation": null,
        "ytd.gross": {
            "source": "68,832.27",
            "value": 68832.27,
            "unit": "$",
            "type": "currency"
        }
    },
    "validations": [
        {
            "description": "pay period start date must be present (non-null)",
            "severity": "error",
            "scope": []
        },
        {
            "description": "regular hours must 1 to 80 hrs",
            "severity": "warning",
            "scope": []
        }
    ],
    "validation_summary": {
        "fields": 20,
        "fields_present": 11,
        "errors": 1,
        "warnings": 1,
        "skipped": 0
    },
    "classification_summary": [
        {
            "configuration": "adp",
            "fingerprints": 2,
            "fingerprints_present": 2,
            "score": {
                "value": 9.5,
                "fields_present": 11,
                "penalties": 1.5
            }
        }
    ],
    "errors": [],
    "download_url": "REDACTED",
    "content_type": "application/pdf",
    "file_metadata": {
        "info": {
            "producer": "Foxit PDF Editor Printer Version 11.2.11.4557",
            "creation_date": "2024-11-06T12:06:06.000-07:00",
            "modification_date": "2024-11-06T12:06:06.000-07:00"
        },
        "metadata": {}
    },
    "coverage": 0.475,
    "charged": 1,
    "version_id": "JerNm7X7cmn.QZpEL1uDzxhymJuKKXjc",
    "reviewStatus": "NEEDS_REVIEW"
}


Note that you can trace why the "reviewStatus" parameter is set to "NEEDS_REVIEW" by comparing the "validations" and "validation_summary" parameters to the human review triggers you configured in previous steps. For example, the following entry in the "validations" array tells you that the regular hours reported in this extraction fall outside the expected range:


{
"validations": [
        {
            "description": "regular hours must 1 to 80 hrs",
            "severity": "warning",
            "scope": []
        }
    ]
}
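To give reviewers that context up front, you could summarize the failed validations in the notification. The following sketch assumes the "validations" array has the shape shown in the webhook results above; `summarizeFailedValidations` is a hypothetical helper.

```javascript
// Turns each failed validation into a short, human-readable line.
function summarizeFailedValidations(webhookBody) {
  const validations = Array.isArray(webhookBody.validations) ? webhookBody.validations : [];
  return validations.map((validation) => `[${validation.severity}] ${validation.description}`);
}

const failedValidationLines = summarizeFailedValidations({
  validations: [
    { description: "pay period start date must be present (non-null)", severity: "error", scope: [] },
    { description: "regular hours must 1 to 80 hrs", severity: "warning", scope: [] },
  ],
});

console.log(failedValidationLines.join("\n"));
// [error] pay period start date must be present (non-null)
// [warning] regular hours must 1 to 80 hrs
```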
 

Send the review link

To automate sending reviewers links to failed extractions, your code needs to handle the "id" and "type" parameters in the webhook results. Combine these parameters to create a review link. In the previous example, these are "b84bd1c8-113e-4e1e-8462-379f0dde2abf" and "pay_stubs", respectively, so the review link is:

https://app.sensible.so/editor/review/?d=pay_stubs&b84bd1c8-113e-4e1e-8462-379f0dde2abf

You can then write code to send the link to the reviewer via email or other notification method. Note that the reviewer needs to log into your customer account to access the link.
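Composing the link and a notification message can be sketched as follows. Both helpers are hypothetical, and the URL template simply mirrors the example link shown above, so confirm the current link format in Sensible's documentation before relying on it. Delivery (email, chat, or a ticketing system) is left to your own tooling.

```javascript
// Builds a review link from the "type" and "id" parameters of the webhook results.
// The template copies the format of the example link in this post.
function buildReviewLink(webhookBody) {
  const { id, type } = webhookBody;
  if (!id || !type) throw new Error("Webhook body is missing id or type");
  return `https://app.sensible.so/editor/review/?d=${encodeURIComponent(type)}&${encodeURIComponent(id)}`;
}

// Composes the text of a reviewer notification.
function buildReviewerMessage(webhookBody) {
  return `Extraction ${webhookBody.id} (${webhookBody.type}) needs review: ${buildReviewLink(webhookBody)}`;
}

const sampleWebhookBody = {
  id: "b84bd1c8-113e-4e1e-8462-379f0dde2abf",
  type: "pay_stubs",
  reviewStatus: "NEEDS_REVIEW",
};

console.log(buildReviewLink(sampleWebhookBody));
// https://app.sensible.so/editor/review/?d=pay_stubs&b84bd1c8-113e-4e1e-8462-379f0dde2abf
console.log(buildReviewerMessage(sampleWebhookBody));
```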

4. Ingest corrected extractions

Using the interface in the review link, the reviewer can edit individual failed fields and approve or reject the extraction. For example, if 84 hours were an OCR error for the field "hours.regular" and the correct value were 64, they could edit it to the correct value, 64: 

Edit extracted data

On the other hand, if the original paystub document incorrectly lists 84 hours, as in this example, the reviewer can choose to reject the extraction since the document itself is invalid, and then your system can use business logic to handle the invalid document:

Reject invalid document

Once the reviewer clicks Approve Extraction or Reject Extraction, Sensible posts the updated extraction, including any edited fields and the new review status, to the webhook. Now you can ingest approved extractions into your app, or handle rejected extractions according to your business logic.
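The final routing step can be sketched like this. The status strings "APPROVED" and "REJECTED" are assumptions made for illustration, since this post only shows "NEEDS_REVIEW"; check Sensible's documentation for the exact values. `flattenParsedDocument` and `routeReviewedExtraction` are hypothetical helpers, and the flattening assumes simple fields that are either null or objects with a "value" property.

```javascript
// Reduces each extracted field to its value, keeping nulls as nulls.
function flattenParsedDocument(parsedDocument) {
  const flat = {};
  for (const [key, field] of Object.entries(parsedDocument || {})) {
    flat[key] = field === null ? null : field.value;
  }
  return flat;
}

// Routes a reviewed extraction to your own ingest or rejection logic.
function routeReviewedExtraction(webhookBody, { ingest, handleRejected }) {
  switch (webhookBody.reviewStatus) {
    case "APPROVED": // assumed status name
      ingest(flattenParsedDocument(webhookBody.parsed_document));
      return "ingested";
    case "REJECTED": // assumed status name
      handleRejected(webhookBody.id);
      return "rejected";
    default:
      return "ignored"; // for example, still awaiting review
  }
}

const ingestedRecords = [];
const routingOutcome = routeReviewedExtraction(
  {
    id: "b84bd1c8-113e-4e1e-8462-379f0dde2abf",
    reviewStatus: "APPROVED",
    parsed_document: {
      "hours.regular": { source: "84.00", value: 64, type: "number" },
      "hours.sick": null,
    },
  },
  {
    ingest: (record) => ingestedRecords.push(record),
    handleRejected: (id) => console.warn(`Rejected extraction: ${id}`),
  }
);

console.log(routingOutcome, ingestedRecords);
```

Passing `ingest` and `handleRejected` in as callbacks keeps the routing logic independent of your database or queue, which also makes it straightforward to test.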

Conclusion

By automating human review, you gain quality control at scale for large volumes of document extractions.  Over time, you'll gain insights into common extraction issues, helping you refine your automated processes. Human review is available now in beta for all existing users at no additional cost. After the beta period, it will become an optional add-on, priced at $150 per month, for users on the Scale plan and above.

Ready to enhance your extraction accuracy? Enable Human review today and check out our updated documentation. We're excited to see how this blend of automation and human oversight improves your document processing workflows and, as always, we welcome your feedback as you start using it.
