Machine Learning is the Wrong Way to Extract Data From Most Documents

Updated on

October 10, 2023

Author

Josh Lewis

Co-Founder, Sensible

Documents have spent decades stubbornly guarding their contents against software. In the late 1960s, the first OCR (optical character recognition) techniques turned scanned documents into raw text. By indexing and searching the text from these digitized documents, software sped up formerly laborious legal discovery and research projects.

Today, Google, Microsoft, and Amazon provide high-quality OCR as part of their cloud services offerings. But documents remain underused in software toolchains, and valuable data languish in trillions of PDFs. The challenge has shifted from identifying text in documents to turning them into structured data that software-based workflows can consume directly or that can be stored directly in a system of record.

The prevailing assumption is that machine learning, often embellished as “AI”, is the best way to achieve this, superseding outdated and brittle template-based techniques. This assumption is misguided. The best way to turn the vast majority of documents into structured data is to use ....

Continue reading on HackerNoon.
