Introduction to PDF Structure Extraction

Dream Haddad
Post by Dream Haddad
January 14, 2022
Introduction to PDF Structure Extraction

Understanding document structures is a crucial step in creating the best data extraction.

Why?

Extracting document structures means that your team can understand, and extract, the entire content of a document, not just individual or specific data points. 

So with Structure Extraction, you get text in contextual blocks including headings, paragraphs, lists, footnotes and tables and other formatting information (without the need to code - this is done by our engineers).

You can export the structured data as a JSON or use the Rest API.


Table of Contents

  • What is a document structure?
  • Why do you need Structure Extraction?
  • Extracting specific data points in eight steps
  • Finalising structured data

What Is a PDF Structure?

Document structures help you organise complex information. By utilising a process called "Layout Segmentation", users can segment a page into individual blocks, such as:

  • Figures 
  • Text blocks 
  • Text lines 
  • Words 
  • Characters 

Now, why do you need structure extraction?

Simplifies PDF data extraction

Structure Extraction replaces all of the nightmares associated with data coding with a no-code method to extract document data in just a few clicks. 

Whether your team of data scientists need to evaluate a sample of documents, or your business team need quick data analytics, Structure Extraction provides the fundamental tool for any team to find, use, and store data.

Understands your data like a human

The Structure Extraction platform is then able to assign meanings to each block, for example: "This text block is a title level one", "This table has the caption 'Durchschnittsprämien' and belongs to chapter X". 

From retrieving this information on what type of data each block contains, Structure Extraction can reconstruct the contents of a document and can infer a more complex analysis.

Analogy: Like skim-reading through a book

We can compare Structure Extraction to someone who is skimming through a book. The person is trying to absorb as much information as possible, searching for relevant pieces of information without needing to read everything. 

But why do people skim through information? To save themselves time and effort; to get straight to what they're looking for. 

Structure Extraction does exactly this for users – it enables people to find specific information quickly, without needing to study their entire document landscape.

Structure Extraction in 8+ Steps

You will have several steps in Acodis that guide you through extracting the entire document structure. The user can choose exactly what they want to extract depending on their goals.

In the end, you have the option to export a JSON or use the API to connect your preferred app. More info on integration/API with Acodis here. 

Document structure extraction process

(The image illustrates an example of options to extract structure with Acodis.)

1. Identify Native Text

Gives you the option to extract all the text identified within your uploaded document.

2. Recognise Hierarchal Order

The reading order simply refers to how readers should perceive the entire document: where the title page is; where the appendix is; etc.

3. Extract Relevant Headers 

Headers can easily define "sections" of any given document and can indicate where specific pieces of data are located. For many who need to analyse the layout of their documents, it is fundamental to understand the information/location of main headers.

And while this process may be easy for single-page documents, it can be a timely task if you are handling hundreds of PDFs at once.

Header Extraction GIF

4. Identify Figures (e.g., tables, images, etc.)

While some documents include an Appendix that indicates where figures are located in a document, it can be time-consuming to actually pinpoint them if the document landscape spans over hundreds of pages. But this is not the case anymore and can be done with a single click.

When users select "Figure", all of the relevant pieces of content will then become highlighted across all document pages.

Good to know: Furthermore, all of the content locked inside of those figures is also transformed into structured information, meaning that you can analyse all of the data within them.

Extracting figures in PDF document (compressed)

 

5. Text Aggregation

When text is laid out on several pages, it undergoes a series of transformations.

For example:

  1. Long words are hyphenated
  2. Paragraphs are split over columns or pages
  3. List(s) of items are split over page breaks
  4. Text is interspersed with figure or table captions

Text aggregation ultimately tries to undo these transformations back to the original text.

6. Removing Noise in PDFs

When we say "noise", we're not referring to your documents being loud, but rather about managing any repeated elements that do not contribute to the normal contents of a document. This can include:

Parts of a page:
  1. Page headers/footers containing numbers
  2. Chapter titles
Full pages:
  1. Intentionally blank pages
  2. Missing items in tables of content

Acodis Document Structure Extraction can identify much of the noise within documents and still manage to analyse any data around it.

7. Analysing Captions

Captions are generally mini explanations that are located under figures in documents.

Analysing them ultimately provides even more context to the figures and therefore improves how far someone can analyse the contents of a document.

Automatically analysing captions in documents

Highlighted in purple, we see that Structure Extraction was able to extract the table's caption.

8. Finalising the structured export

Now you have the option to export the content and the semantic structure as a JSON or connect Acodis via API to your preferred app to use the data in your subsequent processing. 

How You Can Use Structure Extraction

Example: Call Centre Bot

Extract content from Marketing and Sales literature to enable an automated chat-bot to help (prospective) customers learn about your product.

Making this solution happen:
  • Pass PDFs to Acodis Structure Extraction to retrieve text content
  • Extract headings, paragraphs, figures, and charts
  • Use Acodis API to feed data to the chat-bot and call centre software

Benefits:
  • The bot is constantly fed with the latest product information 
  • Information updates are also available to live chat agents



For more information on this and to get your own free demo, contact one of our experts and they'll gladly walk you through it.

CONTACT US

 

Dream Haddad
Post by Dream Haddad
January 14, 2022
Product Marketing Specialist