Using Python to Extract Text from PDFs

Updated on
October 16, 2023

To say that the PDF has historically been a challenge for data integration is an understatement; the format has been described as "where documents go to die." Data contained in PDFs is unstructured, making it far harder to integrate into your system than data delivered through an API. As a result, many organizations turn to manual data entry to ingest PDFs. This approach, however, comes with a host of potential issues: compromised data quality, increased costs, and data-entry delays that can last for hours or even days.

While optical character recognition (OCR) tools can extract text from PDFs, they return raw text only, without the labeled data fields that make the text usable. This leaves developers with the daunting task of parsing the extracted text themselves, which isn't much better than manual entry.

Developer-first platforms, like Sensible, offer an alternative solution. They provide access to the data in PDF documents as easily as calling an API. Sensible’s document query language, SenseML, eliminates the complexities of PDF parsing. In a few seconds, users can extract data from digital and scanned PDFs and seamlessly ingest it into their workflows.

In this article, you'll learn how to use Python to extract text from PDFs with Sensible. After completing this tutorial, you should be able to use Sensible to extract structured data from any document.

Extract Documents in Python Using Sensible

Sensible makes extracting data from PDF files a breeze. In this tutorial, you'll:

  • Create a document configuration in the Sensible app using SenseML, Sensible’s document query language.
  • Parse a simple invoice file to extract key information from the document, like the client’s name, invoice date, and total price.
  • Obtain an endpoint from the Sensible app in order to use Python to extract data from multiple PDF invoices from the same vendor using the Sensible API.

Before you get started, make sure you have a Sensible account and a working Python installation.

Setting Up

After creating a Sensible account, sign in to your Sensible dashboard. To follow along with this tutorial, you need to create a new document type in your Sensible account. A document type is a collection of reference documents and SenseML queries (stored as configurations) to help you extract predefined data fields from PDFs. Sensible’s Configuration library contains configurations for extracting data from hundreds of the most popular documents.

Click New document type to create a new document type. Name the document type invoices. Leave the other fields’ default values as they are, then click Create. The created document type will hold your invoice and its configuration.

Create document type

In this tutorial, you’ll learn how to parse invoices for a fictitious gardening company called Williamson Gardening. The following is an example invoice from this company:

Sample invoice

Download this sample invoice, then return to your Sensible dashboard. Select the newly created invoices document type. Click the Reference documents tab and click Upload document to upload the sample invoice you just downloaded.

Upload document

The extraction configuration contains SenseML queries that extract structured data from your documents. The reference document serves as the source for extracting the data while you write the configuration in the next step.

Creating a Configuration

To create a configuration, click the Configurations tab and select Create configuration. Name the configuration williamson_invoice_config and click Create.

Create configuration

Click your new configuration to edit it. This opens Sensible’s visual editor, a user-friendly interface that extracts data in response to queries written in natural language. Based on your input, the interface automatically generates SenseML queries to retrieve the necessary information.

It’s a simple and effective way to extract data, but understanding how to construct SenseML queries directly offers greater precision for your extraction. To work with SenseML directly, click Switch to SenseML.

Sensible visual editor

A new screen opens with three panes: one for writing the configuration using SenseML, one for viewing the document, and one for showing the data extraction results.

Configuration screen

Writing in SenseML, you’ll define how to find and extract data from the document, as well as the structure of the extracted data. SenseML is a bit like GraphQL in that way, but for querying elements of a document.

Some of the key components of SenseML are:

  • Fields: Fields are the basic units of data extraction in SenseML. They define the specific data you want to extract from a PDF document. Each field has an ID, which is the key for the extracted data, and is defined using the id property.
  • Anchors: Anchors are matched text in the PDF document that locate the data to extract. They are defined using the anchor property and can be an array of matches.
  • Methods: Methods define how to extract the data from the PDF relative to the anchors. They are defined using the method property and have many sub-properties of their own.
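To make these three components concrete, here is a minimal Python sketch that assembles a field as a plain dictionary and serializes it to JSON. The helper name make_field is hypothetical and not part of any Sensible SDK; it only mirrors the id, anchor, and method properties described above, using values from the client example later in this tutorial.

```python
import json


def make_field(field_id, method, anchor=None, field_type=None):
    """Build a SenseML-style field as a plain dict (hypothetical helper)."""
    field = {"id": field_id}
    if field_type is not None:
        field["type"] = field_type
    if anchor is not None:
        field["anchor"] = anchor
    field["method"] = method
    return field


# A field that anchors on "Bill To" and reads the value below it
client_field = make_field(
    "client",
    {"id": "label", "position": "below"},
    anchor="Bill To",
)
config = {"fields": [client_field]}
print(json.dumps(config, indent=2))
```

Building the configuration as data like this is optional; you can just as well type the JSON directly into the Sensible editor.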

The question method is the easiest option for your first configuration. This method allows you to ask simple questions about your data in natural language, which AI interprets to extract the relevant data. Sensible’s question method is powered by OpenAI’s GPT-3.

The following example demonstrates how you can use this powerful method to extract the invoice date:

{
  "fields": [
    {
      /* Unique identifier for this field */
      "id": "invoice_date",
      /* The expected data type of the extracted value */
      "type": {
        /* The id of the data type */
        "id": "date",
        /* The format of the date */
        "format": [
          "%D/%M/%Y"
        ]
      },
      /* The method to use for extracting data */
      "method": {
        /* Use the "question" method to extract the date */
        "id": "question",
        /* The free-text question to ask to extract the date */
        "question": "what's the invoice date"
      }
    }
  ]
}

The SenseML configuration above defines a field with the ID invoice_date and declares its type as a date in day/month/year format. Because the field uses the question method, it doesn't specify an anchor: the configuration simply asks, “What’s the invoice date?”

This query should produce the following results:

Fields found using the SenseML query

The output of this query on the sample document is represented below:

{
  "invoice_date": {
    "source": "31/01/2023",
    "value": "2023-01-31T00:00:00.000Z",
    "type": "date"
  }
}
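If you consume this output in Python, you may want a real datetime object rather than a string. The sketch below assumes the value field always has the ISO 8601 shape shown above (UTC with millisecond precision); that is an assumption drawn from this one sample, and parse_sensible_date is a hypothetical helper name.

```python
from datetime import datetime, timezone

# The field exactly as returned for the sample invoice
field = {
    "source": "31/01/2023",
    "value": "2023-01-31T00:00:00.000Z",
    "type": "date",
}


def parse_sensible_date(value):
    """Parse a timestamp like 2023-01-31T00:00:00.000Z into an aware datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The trailing Z means UTC, so attach that timezone explicitly
    return parsed.replace(tzinfo=timezone.utc)


invoice_date = parse_sensible_date(field["value"])
print(invoice_date.date().isoformat())
```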

While the question method is powerful, it may not always provide reliable results, especially when dealing with complex document layouts. In such cases, SenseML offers more configurable, layout-based extraction methods, such as label and row.

For example, the following query uses SenseML to find the client in the invoice:

{
  "fields": [
    {
      /* Unique identifier for this field */
      "id": "client",
      /* The anchor text to look for */
      "anchor": "Bill To",
      /* The method to use for extracting data */
      "method": {
        /* Extract the label that is below the anchor */
        "id": "label",
        "position": "below"
      }
    }
  ]
}

This SenseML query tells Sensible to find the text Bill To and use it as an anchor. Using this anchor, it looks for a labeled value that’s positioned below the anchor. SenseML returns the value under the JSON key client, and the query returns the following output:

{
  "client": {
    "type": "string",
    "value": "John Doe & Co"
  }
}

Sensible correctly identifies the client as John Doe & Co and its data type as a string.

You can also specify your data type in SenseML to ensure type safety. The example below queries the document for the invoice number, specifying the data type:

{
  "fields": [
    {
      /* Unique identifier for this field */
      "id": "invoice_number",
      /* The expected data type of the extracted value */
      "type": "number",
      /* The anchor text to look for */
      "anchor": "Invoice #",
      /* The method to use for extracting data */
      "method": {
        /* Extract the row that is to the right of the anchor */
        "id": "row",
        "position": "right"
      }
    }
  ]
}

This query looks for a number to the right of the text Invoice # and in the same row. The output is as follows:

{
  "invoice_number": {
    "source": "100",
    "value": 100,
    "type": "number"
  }
}

SenseML has several data types, including phone numbers, dates, and currency. You can use the currency data type to extract pricing details, as in the following example:

{
  "fields": [
    {
      /* Unique identifier for this field */
      "id": "total",
      /* The expected data type of the extracted value */
      "type": "currency",
      /* The anchor text to look for */
      "anchor": "Total",
      /* The method to use for extracting data */
      "method": {
        /* Extract the row that is to the right of the anchor */
        "id": "row",
        "position": "right"
      }
    }
  ]
}

This query produces the following output:

{
  "total": {
    "source": "$600.00",
    "value": 600,
    "unit": "$",
    "type": "currency"
  }
}
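Because the currency type separates the numeric value from its unit, you can do arithmetic on the value and only format it for display at the end. The following is a small sketch of that idea; format_currency is a hypothetical helper, and it assumes the unit is a prefix symbol as in this sample.

```python
from decimal import Decimal

# The field exactly as returned for the sample invoice
total = {"source": "$600.00", "value": 600, "unit": "$", "type": "currency"}


def format_currency(field):
    """Render a currency field with two decimal places and its unit prefix."""
    amount = Decimal(str(field["value"])).quantize(Decimal("0.01"))
    return f'{field["unit"]}{amount}'


formatted = format_currency(total)
print(formatted)
```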

You can use SenseML’s table method to extract tabular data. This method is especially helpful when working with any documents that contain tables, including invoices. It uses a bag-of-words scoring approach to identify the relevant columns based on specified terms or stop terms. This method can even accurately extract tabular data in cases where the column formatting varies, or the table extends across multiple pages.

It’s worth noting that SenseML also has a dedicated invoice method for extracting data from invoices. It’s similar to the table method but specifically focuses on retrieving common invoice items such as customer name, invoice date, invoice items, and totals. This tutorial focuses on the table method as it has a wider range of applications.

The following example demonstrates how to use the table method on the sample document to extract invoice items:

{
  "fields": [
    {
      /* Unique identifier for this field */
      "id": "invoice_items",
      /* The anchor text to look for */
      "anchor": "Williamson Gardening",
      /* The expected data type of the extracted value */
      "type": "table",
      /* The method to use for extracting data */
      "method": {
        /* Use the "table" method to extract the table */
        "id": "table",
        /* Specify the columns to extract from the table */
        "columns": [
          {
            /* Unique identifier for the column */
            "id": "col1_description",
            /* The terms to identify the column header */
            "terms": [
              "Description"
            ]
          },
          {
            /* Unique identifier for the column */
            "id": "col2_amount",
            /* The terms to identify the column header */
            "terms": [
              "Amount"
            ],
            /* The expected data type of the values in this column */
            "type": "currency",
            /* Specify if this column is required */
            "isRequired": true
          }
        ]
      }
    }
  ]
}

The query above tells Sensible to extract data using the table method and defines the text "Williamson Gardening" as the anchor text. It also defines the columns in the table, in this case, the "Description" and "Amount" columns. The data type for the "Amount" column is specified as "currency". The amount field is also set as required, which means that any row without an amount value will be omitted. The result of this query is presented below:

{
  "invoice_items": {
    "columns": [
      {
        "id": "col1_description",
        "values": [
          {
            "value": "Landscaping Design",
            "type": "string"
          },
          {
            "value": "Gardening Services",
            "type": "string"
          }
        ]
      },
      {
        "id": "col2_amount",
        "values": [
          {
            "source": "50.00",
            "value": 50,
            "unit": "$",
            "type": "currency"
          },
          {
            "source": "180.00",
            "value": 180,
            "unit": "$",
            "type": "currency"
          }
        ]
      }
    ]
  }
}
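The table result is organized by column, while application code often wants one record per row. A minimal sketch of that conversion follows; it assumes every column holds the same number of cells and that each cell is an object with a value key, as in the output above. The function name columns_to_rows is hypothetical.

```python
# The table field as returned above, trimmed to the keys this sketch uses
invoice_items = {
    "columns": [
        {
            "id": "col1_description",
            "values": [
                {"value": "Landscaping Design", "type": "string"},
                {"value": "Gardening Services", "type": "string"},
            ],
        },
        {
            "id": "col2_amount",
            "values": [
                {"source": "50.00", "value": 50, "unit": "$", "type": "currency"},
                {"source": "180.00", "value": 180, "unit": "$", "type": "currency"},
            ],
        },
    ]
}


def columns_to_rows(table):
    """Turn column-oriented table output into a list of row dictionaries."""
    column_ids = [column["id"] for column in table["columns"]]
    value_lists = [
        [cell["value"] for cell in column["values"]]
        for column in table["columns"]
    ]
    # zip(*value_lists) walks the columns in parallel, one row at a time
    return [dict(zip(column_ids, row)) for row in zip(*value_lists)]


rows = columns_to_rows(invoice_items)
for row in rows:
    print(row)
```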

Deploying the Configuration

You can combine all the queries into a single file to produce the following configuration:

{
  "fields": [
    {
      "id": "invoice_date",
      "type": {
        "id": "date",
        "format": [
          "%D/%M/%Y"
        ]
      },
      "method": {
        "id": "question",
        "question": "what's the invoice date"
      }
    },
    {
      "id": "client",
      "anchor": "Bill To",
      "method": {
        "id": "label",
        "position": "below"
      }
    },
    {
      "id": "invoice_number",
      "type": "number",
      "anchor": "Invoice #",
      "method": {
        "id": "row",
        "position": "right"
      }
    },
    {
      "id": "total",
      "type": "currency",
      "anchor": "Total",
      "method": {
        "id": "row",
        "position": "right"
      }
    },
    {
      "id": "invoice_items",
      "anchor": "Williamson Gardening",
      "type": "table",
      "method": {
        "id": "table",
        "columns": [
          {
            "id": "col1_description",
            "terms": [
              "Description"
            ]
          },
          {
            "id": "col2_amount",
            "terms": [
              "Amount"
            ],
            "type": "currency",
            "isRequired": true
          }
        ]
      }
    }
  ]
}
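If you keep the individual queries in separate snippets, combining them amounts to concatenating their fields arrays. The sketch below shows one way to do that in Python and to catch duplicate field IDs along the way. It works on plain dictionaries; note that the /* ... */ comments used in the earlier snippets would have to be removed before Python's json module could load them. The merge_configs name is hypothetical.

```python
def merge_configs(*configs):
    """Concatenate the fields of several configurations, rejecting duplicate IDs."""
    merged = []
    seen_ids = set()
    for config in configs:
        for field in config["fields"]:
            if field["id"] in seen_ids:
                raise ValueError(f"duplicate field id: {field['id']}")
            seen_ids.add(field["id"])
            merged.append(field)
    return {"fields": merged}


client_config = {
    "fields": [
        {"id": "client", "anchor": "Bill To",
         "method": {"id": "label", "position": "below"}}
    ]
}
total_config = {
    "fields": [
        {"id": "total", "type": "currency", "anchor": "Total",
         "method": {"id": "row", "position": "right"}}
    ]
}

combined = merge_configs(client_config, total_config)
print([field["id"] for field in combined["fields"]])
```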

Once the combined configuration is in place, click Publish to deploy it to the development environment. There, you can interact with the configuration via the API.

Publish Configuration

Take note of the extraction endpoint, as you’ll use it later.

Retrieving Your Sensible API Key

To get your Sensible API key, navigate to the account page. Click the reveal icon to view and copy the key.

Sensible account page
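Rather than pasting the key into source code, one option is to keep it in an environment variable. The sketch below assumes a variable named SENSIBLE_API_KEY (the name is an arbitrary choice, not something Sensible requires) and builds the same Bearer authorization header that this tutorial's script uses later. The build_headers helper is hypothetical.

```python
import os


def build_headers(api_key=None):
    """Return request headers, reading the key from the environment if not given."""
    key = api_key or os.environ.get("SENSIBLE_API_KEY")
    if not key:
        raise RuntimeError("Set the SENSIBLE_API_KEY environment variable first")
    return {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/pdf",
    }


# Passing the key explicitly here only so the example runs without any setup
headers = build_headers(api_key="demo-key")
print(sorted(headers))
```

Keeping the key out of the script makes it less likely to end up in version control.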

Writing the Python Code

By this point, you've written your SenseML configuration in Sensible to extract data from your sample invoice PDF. Now you can start writing code to extract data from documents in the same format as the sample document. You can download a test document here.

To start, run pip install requests to install the requests library if it's not already in your environment. Next, create a file named sensible.py with the following code:

import json
import requests

# Your extraction URL; substitute yours if it differs
URL = "https://api.sensible.so/v0/extract/invoices?environment=development"
# Path to the PDF file to extract
DOCUMENT_PATH = "test.pdf"
# Your Sensible API key; insert the one you got in the previous step
SENSIBLE_API_KEY = "INSERT YOUR API KEY HERE"

headers = {
    "Authorization": f"Bearer {SENSIBLE_API_KEY}",
    "Content-Type": "application/pdf",
}

# Read the PDF as raw bytes to send as the request body
with open(DOCUMENT_PATH, "rb") as pdf_file:
    body = pdf_file.read()

# Upload the document for extraction
response = requests.post(URL, headers=headers, data=body)

try:
    # Raise an exception for 4xx and 5xx responses
    response.raise_for_status()
except requests.RequestException:
    # Print the error details returned by the API
    print(response.text)
else:
    # Pretty-print the extraction results
    print(json.dumps(response.json(), indent=2))

Note that you should substitute in your extraction URL and the API key that you obtained earlier. Also, ensure the test document you downloaded is in the same folder as this Python file. If not, also adjust the DOCUMENT_PATH variable.

The sensible.py script reads a PDF file and uploads it to Sensible's servers. Sensible extracts data from the document by automatically choosing the williamson_invoice_config configuration you previously defined. The script prints either the results of the extraction or an error message if something goes wrong.
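To process several invoices from the same vendor, you could wrap the upload step in a function and call it for each PDF in a folder. The sketch below shows only the looping and error collection; extract_folder is a hypothetical helper, and fake_extract is a stand-in for the real upload so that the example runs without network access or an API key.

```python
import tempfile
from pathlib import Path


def extract_folder(folder, extract_one):
    """Run extract_one on every PDF in folder; collect results and errors by file name."""
    results, errors = {}, {}
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        try:
            results[pdf_path.name] = extract_one(pdf_path)
        except Exception as exc:  # keep going if a single document fails
            errors[pdf_path.name] = str(exc)
    return results, errors


def fake_extract(pdf_path):
    """Stand-in for the real upload: rejects empty files, accepts the rest."""
    if pdf_path.stat().st_size == 0:
        raise ValueError("empty file")
    return {"status": "COMPLETE"}


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.pdf").write_bytes(b"%PDF-1.4")
    Path(tmp, "b.pdf").write_bytes(b"")
    Path(tmp, "notes.txt").write_text("ignored")
    results, errors = extract_folder(tmp, fake_extract)

print(results)
print(errors)
```

In a real run, you would replace fake_extract with a function that posts the file to your extraction endpoint, as sensible.py does, and returns the parsed JSON response.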

To run the script, open a terminal in the directory that contains your script. Enter the command python sensible.py. You should get the following output:

{
  "id": "691945ac-34fd-4c5c-bd5f-35397767db15",
  "created": "2023-05-19T11:59:19.113Z",
  "completed": "2023-05-19T11:59:26.905Z",
  "status": "COMPLETE",
  "type": "invoices",
  "configuration": "williamson_invoice_config_og",
  "environment": "development",
  "page_count": 1,
  "parsed_document": {
    "invoice_date": {
      "source": "20/03/2023",
      "value": "2023-03-20T00:00:00.000Z",
      "type": "date"
    },
    "client": {
      "type": "string",
      "value": "Bing and Bros"
    },
    "invoice_number": {
      "source": "104",
      "value": 104,
      "type": "number"
    },
    "total": {
      "source": "$230.00",
      "value": 230,
      "unit": "$",
      "type": "currency"
    },
    "invoice_items": {
      "columns": [
        {
          "id": "col1_description",
          "values": [
            {
              "value": "Landscaping Design",
              "type": "string"
            },
            {
              "value": "Gardening Services",
              "type": "string"
            }
          ]
        },
        {
          "id": "col2_amount",
          "values": [
            {
              "source": "50.00",
              "value": 50,
              "unit": "$",
              "type": "currency"
            },
            {
              "source": "180.00",
              "value": 180,
              "unit": "$",
              "type": "currency"
            }
          ]
        }
      ]
    }
  },
  "validations": [],
  "validation_summary": {
    "fields": 5,
    "fields_present": 5,
    "errors": 0,
    "warnings": 0,
    "skipped": 0
  },
  "classification_summary": [
    {
      "configuration": "williamson_invoice_config_og",
      "score": {
        "value": 5,
        "fields_present": 5,
        "penalties": 0
      }
    }
  ],
  "errors": [],
  "file_metadata": {
    "info": {
      "title": "test",
      "producer": "Skia/PDF m115 Google Docs Renderer"
    }
  }
}

The classification_summary field in the response body above provides metadata about Sensible's decision-making process. Sensible compares all available configurations in the document type and selects the best one based on a combination of factors. The score object represents the extraction score, calculated as the number of fields present minus penalties. Sensible chooses the highest-scoring extraction in the document type, considering errors and warnings as penalty points. In cases where two extractions have the same score and fingerprint, Sensible opts for the first configuration in alphabetical order.
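The selection rule described above can be sketched in a few lines of Python. This is a simplified illustration, not Sensible's actual implementation: it ignores fingerprints and breaks every tie alphabetically. The second candidate in the sample data is hypothetical and exists only to show the comparison.

```python
def extraction_score(entry):
    """Score as described above: fields present minus penalties."""
    score = entry["score"]
    return score["fields_present"] - score["penalties"]


def pick_best(classification_summary):
    """Pick the highest-scoring configuration; ties go to the alphabetically first name."""
    return min(
        classification_summary,
        key=lambda entry: (-extraction_score(entry), entry["configuration"]),
    )


summary = [
    {
        "configuration": "williamson_invoice_config_og",
        "score": {"value": 5, "fields_present": 5, "penalties": 0},
    },
    # Hypothetical second candidate, added only for illustration
    {
        "configuration": "generic_invoice",
        "score": {"value": 2, "fields_present": 4, "penalties": 2},
    },
]

best = pick_best(summary)
print(best["configuration"], extraction_score(best))
```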

The parsed_document field contains an object with the extraction results. From the output, you can see that values for the client, invoice number, invoice date, total, and invoice items are all returned. You can also compare this output to your sample PDF.

Sample PDF with extracted data highlighted
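For downstream code, it can be handy to reduce parsed_document to a flat mapping from field ID to value. The following is a minimal sketch under two assumptions: table fields are recognizable by their columns key, and a field that wasn't found may come back as null, which the code handles defensively. The flatten helper is hypothetical, and the sample data is a trimmed copy of the response above.

```python
def flatten(parsed_document):
    """Map each field ID to its plain value; tables become {column_id: [values]}."""
    flat = {}
    for field_id, field in parsed_document.items():
        if field is None:
            flat[field_id] = None
        elif "columns" in field:
            flat[field_id] = {
                column["id"]: [cell["value"] for cell in column["values"]]
                for column in field["columns"]
            }
        else:
            flat[field_id] = field["value"]
    return flat


parsed_document = {
    "invoice_date": {"source": "20/03/2023", "value": "2023-03-20T00:00:00.000Z", "type": "date"},
    "client": {"type": "string", "value": "Bing and Bros"},
    "invoice_number": {"source": "104", "value": 104, "type": "number"},
    "total": {"source": "$230.00", "value": 230, "unit": "$", "type": "currency"},
    "invoice_items": {
        "columns": [
            {"id": "col1_description", "values": [
                {"value": "Landscaping Design", "type": "string"},
                {"value": "Gardening Services", "type": "string"},
            ]},
            {"id": "col2_amount", "values": [
                {"source": "50.00", "value": 50, "unit": "$", "type": "currency"},
                {"source": "180.00", "value": 180, "unit": "$", "type": "currency"},
            ]},
        ]
    },
}

flat = flatten(parsed_document)
print(flat["client"], flat["total"])
```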

You can review the complete codebase with sample documents in this tutorial’s GitHub repo.

Conclusion

By this point, you’ve seen how Sensible automates data extraction from digital and scanned PDFs. You learned how to extract structured data from PDFs using Sensible in Python. You constructed SenseML queries to create configurations that target and extract specific structured data, eliminating the need for manual data entry.

Sensible provides a powerful and user-friendly text extraction solution, allowing for faster, more accurate data analysis and efficient workflows. It not only supports a large number of file formats but also provides third-party integrations and enterprise-level security. With its user-friendly interface and robust feature set, Sensible is a document orchestration platform made for developers and an excellent choice for simplifying your document processing workflow.

Michael Nyamande