Partitioning extracts content from raw unstructured files and outputs that content as structured document elements.
For specific file types, such as image files and PDF files, Unstructured offers special strategies to partition them. Each of these
strategies has trade-offs for output speed, cost to output, and quality of output.
PDF files, for example, vary in quality and complexity. In simple cases, traditional natural language processing (NLP) extraction techniques might
be enough to extract all the text out of a document. In other cases, advanced image-to-text models are required
to process a PDF file. Some of these strategies implement rule-based workflows, which can be faster and cheaper, because they always
extract in the same way, but you might sometimes get lower-quality resolution. Other strategies implement
model-based workflows, which can be slower and costlier because they require a model that performs inference, but you can get higher-quality resolution.
When you choose a partitioning strategy for your files, you should be mindful of these speed, cost, and quality trade-offs.
For example, the Fast strategy can be about 100 times faster than leading image-to-text models.
To choose one of these strategies, select one of the following four Partition Strategy options in the Partitioner node of a workflow.
You can change a workflow’s preconfigured strategy only through Custom workflow settings.
Unstructured recommends that you choose the Auto partitioning strategy in most cases. With Auto, Unstructured does all
the heavy lifting, optimizing at runtime for the highest quality at the lowest cost, page by page.
You should consider the following additional strategies only if you are absolutely sure that your documents are of the same
type. Each of the following strategies is best suited for specific situations. Choosing one of these
strategies other than Auto for sets of documents of different types could produce undesirable results,
including reduction in transformation quality.
VLM: For the highest-quality transformation of these file types: .bmp, .gif, .heic, .jpeg, .jpg, .pdf, .png, .tiff, and .webp.
High Res: For all other supported file types except video and audio files, and for the generation of bounding box coordinates.
Fast: For text-only documents.
Multimedia: For video and audio files.
Video and audio file partitioning is available only for self-hosted deployments of Unstructured.
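The guidance above can be sketched as a small helper that falls back to Auto whenever a batch of files mixes types. This is only an illustration, not part of Unstructured: the function names `strategy_for_file` and `strategy_for_batch` are hypothetical, and the text-only and multimedia extension sets are assumptions, because the list above names extensions only for VLM. The sketch also does not check whether a file type is supported at all.

```python
from pathlib import Path

# File types that the list above associates with the VLM strategy.
VLM_EXTENSIONS = {".bmp", ".gif", ".heic", ".jpeg", ".jpg", ".pdf", ".png", ".tiff", ".webp"}

# Assumption: illustrative extension sets. The source does not enumerate these.
TEXT_ONLY_EXTENSIONS = {".txt", ".md"}
MULTIMEDIA_EXTENSIONS = {".mp3", ".wav", ".mp4", ".mov"}


def strategy_for_file(filename: str) -> str:
    """Map one file to the strategy that the list above pairs with its type."""
    ext = Path(filename).suffix.lower()
    if ext in MULTIMEDIA_EXTENSIONS:
        return "Multimedia"
    if ext in VLM_EXTENSIONS:
        return "VLM"
    if ext in TEXT_ONLY_EXTENSIONS:
        return "Fast"
    # Everything else falls under "all other supported file types".
    return "High Res"


def strategy_for_batch(filenames: list[str]) -> str:
    """Return a specific strategy only when every file agrees; otherwise Auto."""
    strategies = {strategy_for_file(name) for name in filenames}
    if len(strategies) == 1:
        return strategies.pop()
    return "Auto"


print(strategy_for_batch(["scan.png", "report.PDF"]))  # VLM
print(strategy_for_batch(["scan.png", "notes.txt"]))   # Auto
```

Even for a uniform batch, Auto remains the recommended default in most cases. The helper only shows when a specific strategy becomes a reasonable candidate.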
Fast partitioning accepts any text inputs, though automatic language detection of those inputs is restricted to langdetect.
High Res partitioning leverages Tesseract OCR. For the list of languages that Tesseract supports, see:
Languages/Scripts supported in different versions of Tesseract.
Language support for VLM depends on the model used. The list of supported languages for a particular model is maintained by
that model’s provider. For the list of languages that each model supports, see the following, where provided:
Anthropic
Claude 3.5 Sonnet: Arabic, Bengali, Chinese (Simplified), English, French, German, Hindi, Indonesian, Italian, Japanese, Korean,
Portuguese (Brazil), Spanish, Swahili, and Yoruba are mentioned. (Source)