Document Stores

Document stores hold the original documents from which data is extracted. They act as a repository of documents that can be used both to train models and to extract data.

A document store holds what we call Document Families. These are logical containers that relate both the original file and any of the derived documents that are created from it.

Store Purposes

There are two main purposes for a document store:

  • To hold documents that we will be using for training models
  • To hold documents that we will be using to extract data

On the store object we have a storePurpose property that can be set to either TRAINING or OPERATIONAL. This is used to determine which documents are available for use in the store. The actual functionality of the store itself is the same regardless of the purpose.

Store Options

The document store has a number of options that can be set to control how it behaves. These are set on the store object and are:

  • highQualityPreview - If set to true then the store will generate high quality previews of the documents. This will increase the time it takes to generate the previews but will result in better quality previews. The default value is false. This setting is used in the UI.

  • searchable - If set to true then the store will be searchable. This means that the platform will pass the content of the store's documents to indexing.

  • deleteProtection - If set to true then the store will be protected from deletion. This means that you can't delete the store or delete all its contents. However, you can still delete documents from the store.
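The purpose and options above can be pictured as a small settings object. The following is a minimal sketch in Python, not the platform's client API: the class names and the to_payload helper are hypothetical, and only the property names, the two purpose values, and the false default for highQualityPreview come from the text above. The other defaults are assumptions.

```python
from dataclasses import dataclass, asdict
from enum import Enum


class StorePurpose(str, Enum):
    # The two purposes named in the documentation.
    TRAINING = "TRAINING"
    OPERATIONAL = "OPERATIONAL"


@dataclass
class DocumentStoreSettings:
    # Hypothetical container; only the field names come from the documentation.
    storePurpose: StorePurpose = StorePurpose.OPERATIONAL  # default is an assumption
    highQualityPreview: bool = False  # documented default
    searchable: bool = False          # default is an assumption
    deleteProtection: bool = False    # default is an assumption


def to_payload(settings: DocumentStoreSettings) -> dict:
    """Convert the settings into a plain dictionary with string enum values."""
    payload = asdict(settings)
    payload["storePurpose"] = settings.storePurpose.value
    return payload


# A searchable training store that cannot be deleted as a whole.
training_store = DocumentStoreSettings(
    storePurpose=StorePurpose.TRAINING,
    searchable=True,
    deleteProtection=True,
)
print(to_payload(training_store))
```

Keeping the purpose separate from the options mirrors the point made above: the purpose affects which documents are available for use, while the options control how the store behaves.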

Anatomy of a Document Family

A document family consists of a document and any of the derived documents that are created from it.

Since a document family can contain both a native PDF and the Kodexa Documents derived from it, we have a stereotype we call a content object. A content object points to something that contains content, which can be a file or a document. The content type on the content object is therefore either 'Document' or 'Native'. Here 'Native' means the original file, which could be of any file type.

The document family holds the list of content objects and also a concept called "Document Transitions". A document transition is a link between two content objects that shows how one content object was derived from another, and which assistant (or user) was responsible for the derivation.
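The relationship between a family, its content objects, and its transitions can be sketched as a small data model. This is an illustrative Python sketch only: the class and field names (ContentObject, DocumentTransition, DocumentFamily, derived_from, and the ids used in the example) are hypothetical and are not taken from the platform's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContentObject:
    id: str
    content_type: str  # either 'Document' or 'Native'


@dataclass
class DocumentTransition:
    source_id: str       # content object the derivation started from
    destination_id: str  # content object that was produced
    actor: str           # assistant (or user) responsible for the derivation


@dataclass
class DocumentFamily:
    path: str
    content_objects: List[ContentObject] = field(default_factory=list)
    transitions: List[DocumentTransition] = field(default_factory=list)

    def derived_from(self, content_id: str) -> List[str]:
        """Return the ids of content objects derived directly from content_id."""
        return [t.destination_id for t in self.transitions if t.source_id == content_id]


# A native PDF, a Kodexa Document parsed from it, and a later revision by a user.
family = DocumentFamily(
    path="invoices/example.pdf",
    content_objects=[
        ContentObject("c1", "Native"),
        ContentObject("c2", "Document"),
        ContentObject("c3", "Document"),
    ],
    transitions=[
        DocumentTransition("c1", "c2", actor="example-parser-assistant"),
        DocumentTransition("c2", "c3", actor="example-user"),
    ],
)
print(family.derived_from("c1"))
```

Following the transitions from the 'Native' content object gives the history of how each derived document came to exist.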

TODO Diagram

Expression Labels

When a document (either a native file or a Kodexa document) is added to a document store, we may want to decide whether a label should be added to it. This can be achieved with Label Expressions.

A label expression is defined on a document store and adds a specific label to each new document based on the result of an expression. The expression itself is a Spring Expression Language (https://docs.spring.io/spring-framework/docs/3.2.x/spring-framework-reference/html/expressions.html) expression.

This supports a use case where the application that uploads the document to the platform includes metadata with the upload. This metadata (as well as the document and document family) is then available for the expression to use.

Let’s say we have an application that uploads documents to an instance of Kodexa. With each upload it can associate a metadata value called “ShouldProcessXML”, which can be True or False.

As we load the document into the document store, we want to check whether this metadata flag is present. If it is present and not set to True, we want to add the label dont_publish to the document.

To do this, we create a label expression at the document store level with the following properties:

label: dont_publish

expression:

containsKey('ShouldProcessXML') && ['ShouldProcessXML'].toLowerCase() != 'true'

This expression is then evaluated when the document is added. If it returns True (not case-sensitive), the label is added.

If the expression returns a string value, that value is used as the name of the label. For example, let's say we wanted to add a label whose name is the value of the upload metadata field 'CustomerName'. We would use the expression:

containsKey('CustomerName') ? ['CustomerName'] : null
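The way a result turns into a label can be sketched in Python. This is not the Spring Expression Language evaluator the platform uses: the two expressions above are rewritten by hand as Python functions, and resolve_label is a hypothetical helper. It assumes the configured label is added when the result is true (including the string 'true' in any case), that any other non-empty string becomes the label name, and that null, false, or 'false' adds nothing.

```python
from typing import Any, Dict, Optional


def resolve_label(configured_label: str, result: Any) -> Optional[str]:
    """Map an expression result to the label to add, or None for no label."""
    if result is None or result is False:
        return None
    if result is True:
        return configured_label
    text = str(result)
    if text.lower() == "true":   # assumption: 'True', 'TRUE', ... count as true
        return configured_label
    if text == "" or text.lower() == "false":  # assumption: treated as no label
        return None
    return text  # any other string is used as the label name


def dont_publish_expression(metadata: Dict[str, str]) -> bool:
    # Python stand-in for:
    # containsKey('ShouldProcessXML') && ['ShouldProcessXML'].toLowerCase() != 'true'
    return "ShouldProcessXML" in metadata and metadata["ShouldProcessXML"].lower() != "true"


def customer_name_expression(metadata: Dict[str, str]) -> Optional[str]:
    # Python stand-in for: containsKey('CustomerName') ? ['CustomerName'] : null
    return metadata["CustomerName"] if "CustomerName" in metadata else None


# Hypothetical upload metadata.
upload = {"ShouldProcessXML": "False", "CustomerName": "Acme"}
print(resolve_label("dont_publish", dont_publish_expression(upload)))
print(resolve_label("customer", customer_name_expression(upload)))
```

With this metadata the first call yields the configured label dont_publish and the second yields the customer name, while an upload without either key yields no label at all.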

Expression Labels are part of the Store Metadata, which is available at:

/api/stores/{organizationSlug}/{storeSlug}/metadata
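As a small illustration, the path can be built from the two slugs as follows. The helper name store_metadata_path and the example slugs are hypothetical. The host, authentication, and HTTP method are not covered here, so the sketch stops at the path.

```python
from urllib.parse import quote


def store_metadata_path(organization_slug: str, store_slug: str) -> str:
    """Build the store metadata path, percent-encoding each slug."""
    org = quote(organization_slug, safe="")
    store = quote(store_slug, safe="")
    return f"/api/stores/{org}/{store}/metadata"


# Hypothetical organization and store slugs.
print(store_metadata_path("acme", "invoices"))
```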