Data Types

Data types are used by the Extraction Engine, which is part of the core platform. The engine runs when you designate one or more data structures and a data store in a pipeline: the final document is passed to the Extraction Engine, which builds the data objects and data attributes and links them back to the labeled document.

Data types affect only the data attribute, which is designed to hold multiple representations of a piece of data. The following data types are currently available:

| Type | Description |
| --- | --- |
| String | The most basic data type; it can hold any type of information as a string of characters |
| Date | Captures a date without a time element; the date is treated as local to UTC |
| Date/Time | Captures a date with a time element; the date is treated as local to UTC |
| Phone Number | Tries to normalize a phone number |
| Email Address | Tries to convert the labeled content to a valid email address |
| Selectable Option | Tries to match the labeled value to a list of available options |
| Number | Tries to convert the labeled content to a number |
| Currency | Tries to convert the labeled content to a valid currency (decimal) |
| Boolean | Tries to convert the labeled content to a boolean value |

Understanding Normalization (Coalescing)

When we label text in a document, it is always a “string”. This just means we are capturing text and not trying to standardize (or normalize) it in any way at all.

However, most systems that use data from Kodexa need to know that the data is of a specific type, such as a number or a date. This conversion is attempted when a data type is set on a data attribute.

The Extraction Engine takes the text that is labeled in the document and tries to coalesce it into a specific form. For example, it might take “1.0” as a string and turn it into a number. This matters because the system consuming the data from Kodexa then knows the data is “valid” for that data type: if the data type is a number, “abc” will not be allowed for that data attribute.
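To make the idea concrete, here is a minimal Python sketch of coalescing a labeled string into a number. The function name `coalesce_number` is hypothetical, and this is not the Extraction Engine's actual implementation; it only illustrates accepting “1.0” and rejecting “abc”.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def coalesce_number(raw: str) -> Optional[Decimal]:
    """Try to turn labeled text into a decimal number.

    Hypothetical helper for illustration; returns None when the text
    cannot be interpreted as a number.
    """
    try:
        number = Decimal(raw.strip())
    except InvalidOperation:
        return None
    # Decimal also accepts "NaN" and "Infinity"; treat those as invalid here.
    return number if number.is_finite() else None
```

Returning `None` for text that cannot be converted is a choice made for this sketch; a real system might instead record a validation error against the attribute.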

Algorithms for Coalescing

The following table breaks down how labeled data is coalesced to each data type.

| Data Type | Description |
| --- | --- |
| Date or Date/Time | The Extraction Engine uses an NLP framework to try to convert the labeled text to a date/time |
| Boolean | If the text, converted to lowercase, is “true” then the value is true; otherwise it is false |
| Currency | Attempt to convert to a decimal |
| Email Address | Use the regular expression ("^[a-zA-Z0-9_!#$%&'*+/=?`{ |
| Number | Parse as a decimal number |
| Phone Number | Parse the phone number using Google’s LibPhoneNumber |
| Selectable Option | No coalescing is applied at present |
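The simpler rules in the table can be sketched as a lookup from data type to a conversion function. This is an illustrative Python sketch only: the names `COALESCERS` and `coalesce` are hypothetical, and the date, email, and phone number rules are left out because they depend on an NLP framework, a regular expression, and LibPhoneNumber respectively.

```python
from decimal import Decimal, InvalidOperation
from typing import Callable, Dict, Optional, Union

Coalesced = Union[bool, Decimal, str, None]


def coalesce_boolean(raw: str) -> bool:
    # Only the text "true" (compared in lowercase) counts as true.
    return raw.lower() == "true"


def coalesce_decimal(raw: str) -> Optional[Decimal]:
    # Used for both Number and Currency; None signals a failed conversion
    # (an assumption of this sketch).
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None


# Hypothetical dispatch table keyed by the data type names used above.
COALESCERS: Dict[str, Callable[[str], Coalesced]] = {
    "Boolean": coalesce_boolean,
    "Currency": coalesce_decimal,
    "Number": coalesce_decimal,
    "Selectable Option": lambda raw: raw,  # no coalescing right now
    "String": lambda raw: raw,
}


def coalesce(data_type: str, raw: str) -> Coalesced:
    return COALESCERS[data_type](raw)
```

Note that under the boolean rule any text other than “true”, including “yes” or “1”, becomes false rather than being rejected.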

How is Typed Data Stored?

A data attribute can store multiple representations of a piece of extracted data. Depending on the data type defined in the data structure, one or more of the properties of the data attribute will be updated.

| Property | Description | Applies to |
| --- | --- | --- |
| value | The raw value that was captured from the label | All |
| stringValue | The raw value as a string | Selectable Option, String |
| dateValue | The date/time in ISO format (YYYY-MM-DD and YYYY-MM-DDThh:mm) | Date, Date/Time |
| booleanValue | The boolean value | Boolean |
| decimalValue | The number or currency value | Currency, Number |
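As a rough illustration of this layout, the Python sketch below models a data attribute whose fields mirror the property names above. The `DataAttribute` class and `store_typed` function are hypothetical stand-ins, not the platform's API. The sketch assumes that YYYY-MM-DD is used for Date and YYYY-MM-DDThh:mm for Date/Time, and that a failed conversion leaves only `value` populated.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal
from typing import Optional, Union


@dataclass
class DataAttribute:
    """Illustrative stand-in; field names mirror the properties above."""
    value: str                               # raw labeled text, always set
    stringValue: Optional[str] = None        # String, Selectable Option
    dateValue: Optional[str] = None          # Date, Date/Time (ISO format)
    booleanValue: Optional[bool] = None      # Boolean
    decimalValue: Optional[Decimal] = None   # Number, Currency


def store_typed(raw: str, data_type: str,
                typed: Union[str, bool, Decimal, date, None]) -> DataAttribute:
    """Place an already-coalesced value in the property for its data type."""
    attribute = DataAttribute(value=raw)
    if typed is None:
        # Assumption: nothing but the raw value is kept when coalescing fails.
        return attribute
    if data_type in ("String", "Selectable Option"):
        attribute.stringValue = typed
    elif data_type == "Date":
        attribute.dateValue = typed.strftime("%Y-%m-%d")
    elif data_type == "Date/Time":
        attribute.dateValue = typed.strftime("%Y-%m-%dT%H:%M")
    elif data_type == "Boolean":
        attribute.booleanValue = typed
    elif data_type in ("Number", "Currency"):
        attribute.decimalValue = typed
    return attribute


example = store_typed("5 Mar 2024 14:30", "Date/Time", datetime(2024, 3, 5, 14, 30))
print(example.value)      # 5 Mar 2024 14:30
print(example.dateValue)  # 2024-03-05T14:30
```

The raw `value` is kept alongside the typed property, so a consumer can always see the original labeled text next to its normalized form.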

Content Source

When we extract data from a document label, we capture the text that is labeled. This is the “raw” value, and it is stored in the “value” property of the data attribute.

However, we also need to understand where that raw value comes from in the document. This is handled by the 'Content Source' property of the taxon.

We support the following types of content source:

| Content Source | Description |
| --- | --- |
| Value or All Content | If the label has been given a value, that value is used. If the label does not have a specified value, all the text that the label has been applied to is used as the value |
| Value Only | If the label has been given a value, that value is used. If the label did not specify a value, null is returned |
| All Content | All the text that the label has been applied to is used as the value |
| Expression | The user defines an expression that is used to capture the value; see Expressions below |
| Script | The user defines a script that is used to capture the value; see Scripts below |
| Metadata | The user chooses a metadata object that is used as the value; see Metadata below |
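The first three content sources amount to a small fallback rule, sketched below in Python. The `Label` class and `resolve_raw_value` function are hypothetical names introduced for illustration; Expression, Script, and Metadata are omitted because they need an evaluation context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Label:
    """Hypothetical label: the text it covers plus an optional explicit value."""
    content: str
    value: Optional[str] = None


def resolve_raw_value(label: Label, content_source: str) -> Optional[str]:
    if content_source == "Value or All Content":
        # Prefer the explicit value, fall back to the labeled text.
        return label.value if label.value is not None else label.content
    if content_source == "Value Only":
        # May be None when the label carries no explicit value.
        return label.value
    if content_source == "All Content":
        return label.content
    # Expression, Script and Metadata are out of scope for this sketch.
    raise NotImplementedError(content_source)
```

For a label covering the text “Total: 42” with no explicit value, “Value Only” yields no value at all, while the other two sources yield the full labeled text.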

Expressions

Expressions define how the value of the data attribute is computed. They are written in the Spring Expression Language (SpEL).

When you are writing an expression the context is the data object that you are working with, and the result of the expression will be the value that is assigned to the attribute.

Since the data object is the context, you can call methods of the data object in the expression. For example, to get the value of another attribute you can use:

getAttribute('attributeName').getValue()

Other objects are also available as variables in the expression. For example, to get a piece of information from the metadata of the source document, you can use the metadata variable:

#metadata['CorrelationId']

The objects that are available to the expression are:

| Object Name | Description |
| --- | --- |
| document | The document that the data object is associated with |
| dataObject | The data object that the expression is being evaluated against |
| metadata | The metadata of the document that the data object is associated with |
| family | The document family that the document is associated with |

Scripts

Scripts define how the value of the data attribute is computed. They are written in the Groovy language.

A script works slightly differently from an expression. The script has the attribute as a variable available to it, and you can assign the value to the attribute directly in the script.

The objects that are available to the script are:

| Object Name | Description |
| --- | --- |
| attribute | The attribute that the script is being evaluated for |
| document | The document that the data object is associated with |
| dataObject | The data object that the script is being evaluated against |
| metadata | The metadata of the document that the data object is associated with |
| family | The document family that the document is associated with |