`text`; and some metadata, which might vary depending on the element type, file structure, and additional settings that are applied during partitioning, chunking, and enriching. Optionally, the element can also have `embeddings` derived from the `text`; the length of `embeddings` depends on the embedding model that is used.
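As a rough illustration of that shape, the following sketch models a single element as a plain Python dictionary. The key names (`type`, `element_id`, `text`, `metadata`, `embeddings`) are assumed to mirror the JSON output, the helper `describe` is hypothetical, and every value is invented for illustration.

```python
# Hypothetical element; all values are made up for illustration.
element = {
    "type": "NarrativeText",
    "element_id": "example-id",
    "text": "Elements carry text plus metadata.",
    "metadata": {"filename": "example.pdf", "languages": ["eng"]},
    "embeddings": [0.12, -0.03, 0.88],  # real embeddings are much longer
}


def describe(element: dict) -> str:
    """Summarize an element: its type, text length, and embedding length (0 if absent)."""
    dims = len(element.get("embeddings") or [])
    return f'{element["type"]}: {len(element["text"])} chars, {dims} embedding dims'


print(describe(element))
```

Because `embeddings` is optional, the helper reads it with `.get()` rather than indexing, so elements that were never embedded are handled the same way.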
Element type | Description |
---|---|
Address | A text element for capturing physical addresses. |
CodeSnippet | A text element for capturing code snippets. |
EmailAddress | A text element for capturing email addresses. |
FigureCaption | An element for capturing text associated with figure captions. |
Footer | An element for capturing document footers. |
FormKeysValues | An element for capturing key-value pairs in a form. |
Formula | An element containing formulas in a file. |
Header | An element for capturing document headers. |
Image | A text element for capturing image metadata. |
ListItem | ListItem is a NarrativeText element that is part of a list. |
NarrativeText | NarrativeText is an element consisting of multiple, well-formulated sentences. This excludes elements such as titles, headers, footers, and captions. |
PageBreak | An element for capturing page breaks. |
PageNumber | An element for capturing page numbers. |
SceneDescription | An element for capturing scene descriptions, for example a description of a scene in a video. |
Table | An element for capturing tables. |
Title | A text element for capturing titles. |
TranscriptFragment | An element for capturing transcription of speech, for example a speaker’s words in an audio clip or video. |
UncategorizedText | Base element for capturing free text from within files. Applies to extracted text not associated with bounding boxes if the input is a PDF file. |
`SceneDescription` and `TranscriptFragment` are specific to video and audio file processing, which is available only for self-hosted deployments of Unstructured.

Chunking also produces the `CompositeElement` type. `CompositeElement` is a chunk formed from text (non-`Table`) elements. A composite element might be formed by combining one or more sequential elements produced by partitioning. For example, several individual list items might be combined into a single chunk.
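To make the idea of combining sequential elements concrete, here is a minimal sketch. It is not Unstructured's chunking algorithm: the function name `combine_elements`, the `max_characters` parameter, and the blank-line separator are all assumptions chosen for illustration.

```python
def combine_elements(elements: list[dict], max_characters: int = 500) -> list[dict]:
    """Greedily merge sequential non-Table elements into CompositeElement-style chunks."""
    chunks: list[dict] = []
    buffer: list[str] = []

    def flush() -> None:
        # Emit whatever text has accumulated as one composite chunk.
        if buffer:
            chunks.append({"type": "CompositeElement", "text": "\n\n".join(buffer)})
            buffer.clear()

    for element in elements:
        if element["type"] == "Table":
            flush()  # tables are kept separate, never folded into a composite
            chunks.append(element)
            continue
        candidate = "\n\n".join(buffer + [element["text"]])
        if buffer and len(candidate) > max_characters:
            flush()  # adding this element would exceed the limit
        buffer.append(element["text"])
    flush()
    return chunks


items = [
    {"type": "ListItem", "text": "First item"},
    {"type": "ListItem", "text": "Second item"},
    {"type": "Table", "text": "a b"},
    {"type": "Title", "text": "Next"},
]
chunks = combine_elements(items)
print([chunk["type"] for chunk in chunks])
```

In this sketch the two list items end up in one chunk, the table stays on its own, and the trailing title starts a new chunk.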
`metadata`.
Element metadata enables you to do things such as:
The following `metadata` fields are added when the information is available from the source file:
Metadata field name | Description |
---|---|
category_depth | The depth of the element relative to other elements of the same category. It is set by a file partitioner and enables more accurate document hierarchies to be computed after processing. Category depth might be set using native document hierarchies, for example reflecting `<H1>` or `<H2>` tags within an HTML file or the indentation level of a bulleted list item in a Word document. |
coordinates | Any X-Y bounding box coordinates. |
detection_class_prob | The detection model class probabilities. Applies only to Unstructured inference using the High Res strategy. |
emphasized_text_contents | The related emphasized text (bold or italic) in the original file. |
emphasized_text_tags | Any tags on the text that are emphasized in the original file. |
file_directory | The related file’s directory. |
filename | The related file’s filename. |
filetype | The related file’s type. |
is_continuation | True if the element is a continuation of a previous element. Only relevant for chunking, if an element was divided into two due to Max Characters. |
languages | Document languages at the file or element level. The list is ordered by probability of being the primary language of the text. |
last_modified | The related file’s last modified date. |
parent_id | The ID of the element’s parent element. parent_id might be used to infer where an element resides within the overall document hierarchy. For instance, a NarrativeText element might have a Title element as a parent (a “subtitle”), which in turn might have another Title element as its parent (a “title”). |
text_as_html | The HTML representation of the related extracted table. Only applicable to table elements. |
`parent_id` and `category_depth` enhance hierarchy detection to identify the document structure in various file formats by measuring the relative depth of an element within its category. This is especially useful in files with native hierarchies like HTML or Word files, where elements like headings or list items inherently define structure.
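One way to use `parent_id` is to walk from an element up through its ancestors. The sketch below assumes elements are dictionaries with `element_id`, `text`, and `metadata` keys; the function name `title_path` and the sample IDs and text are hypothetical.

```python
def title_path(element: dict, by_id: dict[str, dict]) -> list[str]:
    """Walk parent_id links upward and return ancestor texts, outermost first."""
    path: list[str] = []
    seen: set[str] = set()
    parent_id = element["metadata"].get("parent_id")
    while parent_id in by_id and parent_id not in seen:
        seen.add(parent_id)  # guard against accidental cycles
        parent = by_id[parent_id]
        path.append(parent["text"])
        parent_id = parent["metadata"].get("parent_id")
    return path[::-1]


# Hypothetical elements: a title, a subtitle, and a paragraph under the subtitle.
elements = [
    {"element_id": "t1", "type": "Title", "text": "User Guide", "metadata": {}},
    {"element_id": "t2", "type": "Title", "text": "Installation", "metadata": {"parent_id": "t1"}},
    {"element_id": "n1", "type": "NarrativeText", "text": "Run the installer.", "metadata": {"parent_id": "t2"}},
]
by_id = {e["element_id"]: e for e in elements}
print(title_path(elements[2], by_id))
```

An element with no `parent_id`, or with one that does not match any known element, simply yields an empty path.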
The `coordinates` metadata field contains:

`points`: These specify the corners of the bounding box, starting from the top left corner and proceeding counter-clockwise. The points represent pixels, the origin is in the top left, and the `y` coordinate increases in the downward direction.

`system`: The points have an associated coordinate system. A typical example of a coordinate system is `PixelSpace`, which is used for representing the coordinates of images. The coordinate system has a name, orientation, layout width, and layout height.

Field name | Applicable file types | Description |
---|---|---|
page_number | DOCX, PDF, PPT, XLSX | The related file’s page number. |
page_name | XLSX | The related sheet’s name in an Excel file. |
sent_from | EML | The related email sender. |
sent_to | EML | The related email recipient. |
subject | EML | The related email subject. |
attached_to_filename | MSG | The name of the file that the attached file is attached to. |
header_footer_type | Word Doc | The pages that a header or footer applies to in a Word document: `primary`, `even_only`, and `first_page`. |
link_urls | HTML | The URL that is associated with a link in a document. |
link_texts | HTML | The text that is associated with a link in a document. |
section | EPUB | The book section title corresponding to a table of contents. |
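File-type-specific fields such as `page_number` are only present for some file types, so code that reads them should tolerate their absence. The following is a minimal sketch under that assumption; the function name `texts_by_page` and the sample elements are hypothetical.

```python
from collections import defaultdict


def texts_by_page(elements: list[dict]) -> dict[int, list[str]]:
    """Group element text by metadata.page_number, skipping elements without one."""
    pages: dict[int, list[str]] = defaultdict(list)
    for element in elements:
        page = element.get("metadata", {}).get("page_number")
        if page is not None:
            pages[page].append(element["text"])
    return dict(pages)


# Hypothetical elements; the last one has no page number (as an HTML file might not).
sample = [
    {"text": "Quarterly Report", "metadata": {"page_number": 1}},
    {"text": "Revenue grew.", "metadata": {"page_number": 1}},
    {"text": "Appendix", "metadata": {"page_number": 2}},
    {"text": "A web paragraph.", "metadata": {}},
]
print(texts_by_page(sample))
```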
Email files include `sent_from`, `sent_to`, and `subject` metadata. `sent_from` is a list of strings because the RFC 822 specification for email allows multiple sender addresses.
For Excel files, the metadata contains a `page_name` element, which corresponds to the sheet name in the Excel file.
For Word documents, header and footer elements have a `header_footer_type` indicating which page a header or footer applies to. Valid values are `"primary"`, `"even_only"`, and `"first_page"`.
For video files, the metadata includes `start_time` and `end_time`, representing the start and end times of a clip of video from the parent video file to which this element belongs. Also included are `model_version`, representing the model that was used to generate the element, and `average_log_probability`, representing the model’s overall average confidence level for its output across the document, with values closer to zero indicating higher confidence.
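Because `average_log_probability` is at most zero, it can be easier to reason about after converting it to a score between 0 and 1. The sketch below assumes the value is a natural-log probability, which this page does not state; the function names, the `min_score` threshold, and the sample values are hypothetical.

```python
import math


def confidence_score(average_log_probability: float) -> float:
    """Map an average log probability (at most 0) to a score in (0, 1]."""
    # Assumption: the value is a natural-log probability.
    return math.exp(average_log_probability)


def confident_elements(elements: list[dict], min_score: float = 0.5) -> list[dict]:
    """Keep elements whose metadata carries a log probability scoring at least min_score."""
    kept = []
    for element in elements:
        log_prob = element.get("metadata", {}).get("average_log_probability")
        if log_prob is not None and confidence_score(log_prob) >= min_score:
            kept.append(element)
    return kept


# Hypothetical elements with invented log probabilities.
scenes = [
    {"text": "A person enters a room.", "metadata": {"average_log_probability": -0.1}},
    {"text": "An unclear shot.", "metadata": {"average_log_probability": -2.3}},
]
print([e["text"] for e in confident_elements(scenes)])
```

Elements without the field are dropped in this sketch; a pipeline that mixes video output with other file types would probably want to keep them instead.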
For audio files, the metadata includes `start_time`, `end_time`, and `speaker`, representing the start and end times of a clip of audio made by a specific speaker, as part of the parent audio file to which this element belongs. If the speaker cannot be determined, `speaker` is set to `0` or `unknown`.
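A common use of the `speaker` field is to regroup a transcript by who spoke. The following sketch assumes elements are dictionaries and treats both `0` and `unknown` as the same "unknown" bucket; the function name `fragments_by_speaker` and the sample data, including the time values, are hypothetical.

```python
from collections import defaultdict

# Values that this sketch treats as "speaker could not be determined".
UNKNOWN_SPEAKERS = {0, "0", "unknown"}


def fragments_by_speaker(elements: list[dict]) -> dict[str, list[tuple]]:
    """Group (start_time, end_time, text) tuples by speaker label."""
    grouped: dict[str, list[tuple]] = defaultdict(list)
    for element in elements:
        meta = element.get("metadata", {})
        if "speaker" not in meta:
            continue  # not an audio-derived element
        speaker = meta["speaker"]
        key = "unknown" if speaker in UNKNOWN_SPEAKERS else str(speaker)
        grouped[key].append((meta.get("start_time"), meta.get("end_time"), element["text"]))
    return dict(grouped)


# Hypothetical transcript fragments; times are invented.
fragments = [
    {"text": "Welcome, everyone.", "metadata": {"speaker": 1, "start_time": 0.0, "end_time": 2.5}},
    {"text": "Thanks for having me.", "metadata": {"speaker": 2, "start_time": 2.5, "end_time": 4.0}},
    {"text": "(crosstalk)", "metadata": {"speaker": 0, "start_time": 4.0, "end_time": 5.0}},
    {"text": "A title from another file.", "metadata": {}},
]
print(fragments_by_speaker(fragments))
```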
For `Table` elements, the raw text of the table is stored in the `text` attribute of the element, and an HTML representation of the table is available in the element metadata under `text_as_html`.
Unstructured automatically extracts all tables for all document types if you check Infer Table Structure in the ConnectorSettings area of the Transform section of a workflow.
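If you need the table as rows and cells rather than raw text, `text_as_html` can be parsed with Python's standard library. The following is a minimal sketch that assumes a simple, non-nested table; the class and function names and the sample HTML are hypothetical and not taken from real output.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TableRows(HTMLParser):
    """Collect the cell text of each <tr> in a simple (non-nested) HTML table."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text pieces of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            if not self.rows:
                self.rows.append([])  # tolerate cells that appear outside a <tr>
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def rows_from_html(text_as_html: str) -> List[List[str]]:
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(text_as_html)
    parser.close()
    return parser.rows


# Hypothetical text_as_html value.
html = "<table><tr><th>Name</th><th>Qty</th></tr><tr><td>Apple</td><td>3</td></tr></table>"
print(rows_from_html(html))
```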
Here’s an example of a table element. The `text` of the element will look like this:

The `text_as_html` metadata for the same element will look like this:
date_created
date_modified
date_processed
permissions_data
record_locator
url
version
Source connector | Additional metadata |
---|---|
Azure | `protocol`, `remote_file_path` |
Elasticsearch | `document_id`, `index_name`, `url` |
Google Drive | `drive_id`, `file_id` |
OneDrive | `server_relative_path`, `user_pname` |
S3 | `protocol`, `remote_file_path` |
SharePoint | `server_path`, `site_url` |