The following information applies only to the Unstructured Ingest CLI and the Unstructured Ingest Python library. The Unstructured SDKs for Python and JavaScript/TypeScript and the Unstructured open-source library do not support this functionality.

Task

You want to process only files with specified extensions, only files at or below a specified size, or both.

Approach

For the Ingest CLI, use the following command options. For the Ingest Python library, use the following parameters for the FiltererConfig object.
  • Use --file-glob (CLI) or file_glob (Python) to specify the glob patterns, such as *.pdf or *.eml, that select which file extensions to process.
  • Use --max-file-size (CLI) or max_file_size (Python) to specify the maximum size of files to process, in bytes.
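
To make the combined effect of the two options concrete, here is a minimal sketch in plain Python. It does not use the Unstructured Ingest library and is not its implementation: passes_filter is a hypothetical helper, and it assumes that patterns are matched against the file name only and that a file exactly at the size limit is still processed.

```python
from fnmatch import fnmatch


def passes_filter(filename, size_bytes, file_glob=None, max_file_size=None):
    """Hypothetical helper: True if the file satisfies every filter that is set."""
    # Assumption: a file passes the glob check if it matches at least one pattern.
    if file_glob and not any(fnmatch(filename, pattern) for pattern in file_glob):
        return False
    # Assumption: files at or below the limit are kept; larger files are skipped.
    if max_file_size is not None and size_bytes > max_file_size:
        return False
    return True


# Made-up (name, size in bytes) pairs, for illustration only.
candidates = [
    ("report.pdf", 80_000),
    ("scan.pdf", 250_000),
    ("note.eml", 4_096),
    ("data.csv", 1_024),
]
kept = [
    name
    for name, size in candidates
    if passes_filter(name, size, file_glob=["*.pdf", "*.eml"], max_file_size=100_000)
]
print(kept)
```

When both options are set, a file must satisfy both conditions to be processed; leaving either argument as None disables that check in this sketch.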

To run this example

The following example processes only .pdf and .eml files that have a file size of 100 KB (100,000 bytes) or less. To run this example, you should have a directory with a mixture of files, including at least one .pdf file and one .eml file, and with at least one of these files having a file size of 100 KB or less.
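
Before running the example, it may help to check that the input directory actually contains qualifying files. The following is a rough sketch using only the Python standard library. preview_matches is a hypothetical helper, not part of the Unstructured Ingest library, and it assumes that only the top level of the directory is scanned and that LOCAL_FILE_INPUT_DIR points to the input directory.

```python
import os
from fnmatch import fnmatch
from pathlib import Path


def preview_matches(input_dir, patterns=("*.pdf", "*.eml"), max_bytes=100_000):
    """Hypothetical helper: names of top-level files matching a pattern and the size limit."""
    matches = []
    for path in Path(input_dir).iterdir():
        # Skip subdirectories; this sketch does not recurse.
        if not path.is_file():
            continue
        matches_pattern = any(fnmatch(path.name, pattern) for pattern in patterns)
        if matches_pattern and path.stat().st_size <= max_bytes:
            matches.append(path.name)
    return sorted(matches)


if __name__ == "__main__":
    # Falls back to the current directory if the variable is not set.
    input_dir = os.getenv("LOCAL_FILE_INPUT_DIR", ".")
    print(preview_matches(input_dir))
```

An empty list suggests that the directory does not yet meet the conditions described above.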

Code

unstructured-ingest \
  local \
    --input-path $LOCAL_FILE_INPUT_DIR \
    --output-dir $LOCAL_FILE_OUTPUT_DIR \
    --file-glob "*.pdf,*.eml" \
    --max-file-size 100000 \
    --partition-by-api \
    --partition-endpoint $UNSTRUCTURED_API_URL \
    --api-key $UNSTRUCTURED_API_KEY \
    --strategy hi_res