Why Is Document Data Extraction So Complex?

Companies globally are now looking for new, and easier, ways to harness their data. This is because the days when enterprises could rely on business acumen or experience alone to drive growth are over. Utilising data, which inevitably requires some form of data extraction, now enables any type of company to unlock deeper insights into its business.

However, this is easier said than done. The gold standards for data accuracy, speed, and efficiency are rising, and so is the pressure to extract data in the best possible way. But many professionals, with or without data extraction experience, find the process challenging, complex, and time-consuming.

And while there are solutions to overcome these challenges, here are four examples of why people struggle with document data extraction.

Table of contents

Complex document types
- Examples of complex document types
Extracting data from tables
- Types of data you find in tables
Extracting handwritten data
Using standalone OCR
- OCR definition
- Limitations of OCR
- Unable to work at a human level
Solution

Complex document types

Sometimes it’s okay to start with the obvious answer. Complex document types make the data extraction process, well… complex.

Examples of complex document types:

Documents with multiple layouts/structures
Multiple semantics (e.g., if an insurance policy also includes a cancellation request)
Multiple-page documents (yes, this can be a pain to extract data from!)

When searching for a data extraction solution, it is important to specify exactly which types of documents, and data, you want to extract. This is because some data extraction providers can process such complex documents, whereas others, such as those that rely on standalone OCR, cannot.

Extracting data from tables

Image credit to Dimitra Peppa, Unsplash

No, not these tables... the ones you find scattered in your documents.

Tables belong in this discussion because even humans struggle to process the information in them. The truth is that table content mostly gets the manual workflow treatment, usually after hours of stress and complication trying to get software to understand the data.

What data is usually found in tables?

Words
Digits
Numbers
Formulas
Images
Sometimes even handwriting

And while tables often look structured, or at least formatted in a way that data extraction software should understand, the opposite is often true. Tables are unique: a table displays multi-dimensional information in a two-dimensional format. In other words, data extraction software can struggle to tell which parts of a table are structure (such as headers and labels) and which are the actual data.
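To make the point about structure versus data concrete, here is a minimal Python sketch. It assumes a hypothetical table that has already been read into a grid of strings, with column headers in the first row and row labels in the first column. Real documents do not guarantee that layout, which is exactly where extraction tools run into trouble. The function name and the sample values are illustrative only.

```python
from typing import Dict, List, Tuple


def grid_to_records(grid: List[List[str]]) -> Dict[Tuple[str, str], str]:
    """Map each data cell to its (row label, column header) pair.

    Assumption: the first row holds column headers and the first
    column holds row labels. Real tables often break this rule.
    """
    headers = grid[0][1:]
    records: Dict[Tuple[str, str], str] = {}
    for row in grid[1:]:
        row_label, values = row[0], row[1:]
        for header, value in zip(headers, values):
            records[(row_label, header)] = value
    return records


# Hypothetical table as a flat grid of strings (illustrative values).
grid = [
    ["Policy", "Premium", "Status"],
    ["A-100", "250.00", "Active"],
    ["B-200", "310.50", "Cancelled"],
]

records = grid_to_records(grid)
print(records[("B-200", "Status")])  # Cancelled
```

Each value only makes sense together with its row label and its column header, so a tool that mistakes a header row for data (or the other way round) will return every cell with the wrong meaning.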

In addition, depending on how a document was processed, its tables can sometimes appear on a slant instead of being uniformly vertical and horizontal, which can cause further problems with properly detecting, and extracting, data from them.

Furthermore, the data inside tables can get messy. Occasionally a table contains multiple languages, or a combination of computer-generated text and handwriting, which makes extraction extremely difficult for both humans and software.

Extracting handwritten data

Handwritten data can also be problematic to extract, whether by a human or by software. The reason is simple: handwriting can be extremely difficult to decipher if it is not written carefully. And while a human can use common sense to understand pieces of handwritten information, some data extraction tools cannot.

Read how Acodis helped Zurich Children's Hospital extract handwritten data from 5,000+ questionnaires.

Furthermore, original documents may be of poor quality because the paper deteriorated before it was scanned; notes may have been written on the go; and signatures are almost always unreadable (not to mention that half of the population writes "1" as if it were "7" and the other half writes "7" as if it were "4"!).

All jokes aside, we don't concentrate on having good handwriting when filling in forms, and most of the time we don't write within the text fields on the document.

Standalone OCR makes life difficult

Definition: Optical Character Recognition (OCR)

Application software that allows a computer to recognise basic printed or written information (e.g., numbers, letters, and symbols) that is suitable for extraction. This is usually done by scanning documents and requires primarily manual work.

But here comes the tricky part…

Looking back at the topics covered in this article (complex document types, tables, handwriting, etc.), OCR is not an ideal solution to any of these hurdles. This is because OCR is only noticeably useful when handling simple content.

What are the limitations of OCR?

There are still no (standalone) OCR tools that work at a human level in most applications

Errors include misreading letters, skipping over unreadable letters, or combining text from adjacent columns or image captions. While many factors affect the performance of OCR tools, the number of errors depends on the quality and form of the text, including the font used. However, even with high-quality documents, OCR tools can make mistakes because there are a variety of document formats, fonts, and styles for each character.
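One common way to put a number on errors like these is the character error rate (CER): the edit distance between the OCR output and a manually verified reference text, divided by the length of the reference. The sketch below is a plain-Python illustration of that idea, not the method of any particular OCR product, and the sample strings are made up.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from OCR output
                current[j - 1] + 1,      # extra character in OCR output
                previous[j - 1] + cost,  # misread character, or a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings: a "1" misread as "7", and a skipped letter.
misread = character_error_rate("1047", "7047")
skipped = character_error_rate("Invoice", "Invoie")
print(misread)  # 0.25
```

A single misread digit in a four-character field already gives a CER of 25%, which is why small recognition errors matter so much for short values such as amounts or policy numbers.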

Document-based limitations

Coloured backgrounds: Colourful background patterns can be troublesome because they reduce text recognition accuracy
Blurry or glared text: This is challenging to read for humans as well as computers
Skewed or incorrectly oriented documents: When the image is skewed, OCR has a harder time identifying the characters because the text is not aligned
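As a rough illustration of the skew problem in the list above, the following Python sketch estimates how far a line of text is tilted, given two points on its baseline, and rotates a point back to level. The coordinates and function names are hypothetical, and real deskewing works on whole images rather than on two hand-picked points, so treat this as a sketch of the geometry only.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def skew_angle_degrees(left: Point, right: Point) -> float:
    """Angle of a text baseline relative to horizontal, in degrees."""
    dx = right[0] - left[0]
    dy = right[1] - left[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point: Point, angle_degrees: float,
                 centre: Point = (0.0, 0.0)) -> Point:
    """Rotate a point around centre (counter-clockwise in standard axes)."""
    theta = math.radians(angle_degrees)
    x, y = point[0] - centre[0], point[1] - centre[1]
    return (
        centre[0] + x * math.cos(theta) - y * math.sin(theta),
        centre[1] + x * math.sin(theta) + y * math.cos(theta),
    )


# Hypothetical baseline: the line drifts 5 units over a width of 100 units.
left, right = (0.0, 0.0), (100.0, 5.0)
angle = skew_angle_degrees(left, right)

# Rotating by the negative angle brings the right end back level with the left.
levelled = rotate_point(right, -angle, centre=left)
print(round(angle, 2))
```

Even a tilt of a few degrees shifts the end of a line by several units, enough for characters from neighbouring lines or table rows to overlap from the point of view of an OCR tool.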

Solution

So, to recap the issues discussed here (which are only a few of those that exist), it is complex document types, tables, handwriting, and good old standalone OCR that contribute to the complexity of document data extraction.

A tunnel of problems isn’t all that useful without a light at the end of it. And while document data extraction can be a complex process, it doesn’t have to be – at least, when you’re in possession of the right tools.

Software that combines OCR with AI-based machine learning can often be the right way to go. This is often called Intelligent Document Processing (IDP) (at least in our case) and can understand documents and data much as a human does. This means that, with the necessary training, IDP can quickly and efficiently process information stored in any type of complex document, table, and so on.

Sounds a little too good to be true, right? Well, machine learning has come a long way, as has technology in general. Data extraction solutions are becoming more relevant to the modern issues faced by companies.

Contact Acodis if you're looking to simplify how you extract data.