PDF Assistant

pdf-parsing-assistant

Label and extract data from PDF documents

What is this assistant?

This assistant converts native PDF documents into Kodexa documents so that their contents can be parsed and their text labeled.

Once added to a workspace and linked to a document store, this assistant will:

* Monitor the store for newly uploaded PDF documents
* Parse the contents of each PDF and build a Kodexa document
* Add the Kodexa representation and link it to the original document

Once a PDF has been parsed into the Kodexa document format, you can:

* View and label the contents
* Add new spatial assistants to help you label and extract content
* Perform spatial classification (coming soon)

In the next sections we will set up this assistant.

You can list a component in the marketplace and specify whether it can be used as a template.

✅ Available in Marketplace

✅ Can be used as a template to create a new component

What is an Assistant?

Assistants are the building blocks of the automation system. They are the components that decide the actual work the automation system performs. They can be used for a wide variety of tasks, from sending notifications to building pipelines and organizing workflows.

Options

Below are the options for this assistant.

| Option Name | Default | Required? | Type | Description |
| --- | --- | --- | --- | --- |
| find_multiple_text_columns | False | False | boolean | Checks whether the text on the pages is grouped into multiple text columns. Set this to True when there is a table in one text column and another table in a different text column. |
| max_num_of_columns | -1 | False | number | The maximum number of columns to be extracted from a table. -1 means no limit. |
| pages_with_multiple_text_columns | None | False | string | If find_multiple_text_columns is set to True, the page numbers (index starts at 0) where the system should look for multiple text columns. Example: 2,7 for pages 3 and 8, or 0:4 for a range starting at the first page. Default is all pages. |
| space_multiplier | 1.0 | False | number | A multiplier used in the spacing calculations that identify words. Set it to a smaller number if multiple words are merged into one word, or to a larger number if a word is split into multiple words. |
| line_height_overlap | 0.5 | False | number | The amount of overlap between lines needed to consider them part of the same line. |
| ignore_empty_rows | False | False | boolean | If set to True, empty rows in the data are ignored. |
| use_graphical_nodes | True | False | boolean | If set to True, graphical nodes are captured from the document. |
| template_document | None | False | document | The document that will be used as the template for processing. |
| complete_label | data-labelled | False | string | The label name to apply after processing. |
| taxonomies | None | False | list | Extraction data structure. |
| data_store | None | False | tableStore | An instance of a table store to use. |
| ocr | None | False | boolean | OCR the documents. |
| perform_indexing | False | False | boolean | If set to True, the document is indexed for search. |
| table_analysis | True | False | boolean | Perform the table analysis (if trained). |
| form_analysis | True | False | boolean | Perform the form analysis (if trained). |
| data_helpers | True | False | boolean | Apply any data helpers as defined in the labeling. |

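
As an illustration of how the options listed above fit together, the following minimal Python sketch collects the documented defaults in a plain dictionary and overrides a few of them. The dictionary layout and the `build_options` helper are hypothetical and exist only for this example; they are not part of the Kodexa API, and how options are actually supplied to the assistant depends on your workspace configuration.

```python
# Defaults copied from the options listed above.
# The dict layout itself is an assumption made for illustration.
DEFAULT_OPTIONS = {
    "find_multiple_text_columns": False,
    "max_num_of_columns": -1,            # -1 means no limit
    "pages_with_multiple_text_columns": None,  # None means all pages
    "space_multiplier": 1.0,
    "line_height_overlap": 0.5,
    "ignore_empty_rows": False,
    "use_graphical_nodes": True,
    "template_document": None,
    "complete_label": "data-labelled",
    "taxonomies": None,
    "data_store": None,
    "ocr": None,
    "perform_indexing": False,
    "table_analysis": True,
    "form_analysis": True,
    "data_helpers": True,
}


def build_options(**overrides):
    """Hypothetical helper: merge overrides onto the documented defaults.

    Rejects names that do not appear in the options list, which catches typos.
    """
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise KeyError(f"Unknown option(s): {sorted(unknown)}")
    return {**DEFAULT_OPTIONS, **overrides}


# Example: look for multiple text columns on pages 3 and 8 (zero-based 2 and 7)
# and lower the space multiplier because words are being merged together.
options = build_options(
    find_multiple_text_columns=True,
    pages_with_multiple_text_columns="2,7",
    space_multiplier=0.8,
)
```

Every option is optional, so only the values that differ from the defaults need to be supplied; the rest keep the defaults shown above.
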
## Reactive

This assistant can be triggered by content-based events on the stores it is monitoring.



## Custom Events

This assistant also supports custom events that you can use to trigger automations. These events are listed below.



### Test

No options are available for this component.

### Copy Template Labels

No options are available for this component.