unstructured
Author: langgenius
Version: 0.0.1
Type: tool
The Unstructured Partition Endpoint provides parameters to customize the processing of documents.
The only required parameter is the file you wish to process.
| Python / POST | JavaScript/TypeScript | Description |
|---|---|---|
| (shared.Files) | (File, Blob, shared.Files) | Required. The file to process. |
| (str) | (string) | Use one of the supported strategies to chunk the returned elements after partitioning. When not specified, no chunking is performed and any other chunking parameters are ignored. Supported: , , , . |
| (str) | (string) | A hint about the content type to use (such as ), when there are problems processing a specific file. This value is a MIME type in the format . |
| (bool) | (boolean) | Whether to return bounding box coordinates for each element extracted with OCR. Default: . |
| (str) | (string) | The encoding method used to decode the text input. Default: . |
| (List[str]) | (string[]) | The types of elements to extract as Base64 encoded data in element metadata fields, e.g. . Supported filetypes: image and PDF. |
| (str) | (string) | If file is gzipped, use this content type after unzipping. Example: |
| (str) | (string) | The name of the inference model used when strategy is . Options: , . Default: . |
| (bool) | (boolean) | Whether the output includes page breaks, if the file type supports it. Default: . |
| (List[str]) | (string[]) | The languages present in the document, for use in partitioning and OCR. |
| (str) | (string) | The format of the response. Supported: , . Default: . |
| (bool) | (boolean) | Deprecated! Use instead. If and strategy is , any Table elements extracted from a PDF will include an additional metadata field, , with the HTML table. |
| (List[str]) | (string[]) | The document types to skip table extraction for. Default: . |
| (int) | (number) | The page number to assign to the first page in the document. This will be included in elements’ metadata. |
| (str) | (string) | The strategy to use for partitioning PDF and image files. Options: , , , , . Default: . |
| (bool) | (boolean) | Whether to assign UUIDs as element IDs, which guarantees uniqueness. Otherwise, a SHA-256 hash of the element’s text is used. Default: . |
| (str) | (Not yet available) | Applies only when strategy is . The name of the vision language model (VLM) provider to use for partitioning. must also be specified. |
| (str) | (Not yet available) | Applies only when strategy is . The name of the vision language model (VLM) to use for partitioning. must also be specified. |
| (bool) | (boolean) | Whether to retain the XML tags in the output. Otherwise, only the text within the tags is extracted. Applies only to XML documents. |
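
The two element ID schemes described in the table above behave differently when a document repeats the same text. The sketch below illustrates that difference. It assumes the hash is taken over the UTF-8 bytes of the text alone and returned as a full hex digest; the service may truncate the digest or include other inputs. `make_element_id` is a hypothetical helper, not part of the Unstructured SDK.

```python
import hashlib
import uuid


def make_element_id(text: str, unique: bool = False) -> str:
    """Return an element ID for `text` (illustrative helper, not the API's code)."""
    if unique:
        # Random UUID: unique even when two elements share the same text.
        return str(uuid.uuid4())
    # Deterministic: identical text always yields the identical ID.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(make_element_id("Quarterly results"))               # same on every run
    print(make_element_id("Quarterly results", unique=True))  # differs on every run
```

Hash-based IDs are convenient for deduplication and reproducible runs, while UUIDs avoid collisions between repeated headers, footers, or table cells.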
The following parameters only apply when a chunking strategy is specified. Otherwise, they are ignored.
| Python / POST | JavaScript/TypeScript | Description |
|---|---|---|
| (int) | (number) | Applies only when the chunking strategy is set to . Combines small chunks until the combined chunk reaches a length of n characters. Default: same as . |
| (bool) | (boolean) | (default) to have the elements used to form a chunk appear in for that chunk. |
| (int) | (number) | Cut off new sections after reaching a length of n characters. (Hard maximum.) Default: . |
| (bool) | (boolean) | Applies only when the chunking strategy is set to . Determines if a chunk can include elements from more than one page. Default: . |
| (int) | (number) | Applies only when the chunking strategy is specified. Cuts off new sections after reaching a length of n characters. (Soft maximum.) Default: . |
| (int) | (number) | A prefix of this many trailing characters from the prior text-split chunk is applied to second and later chunks formed from oversized elements by text-splitting. Default: none. |
| (bool) | (boolean) | Whether to also apply the overlap to “normal” chunks formed by combining whole elements. Use with caution, as this can introduce noise into otherwise clean semantic units. |
| (float) | (number) | Applies only when the chunking strategy is set to . The minimum similarity that text in consecutive elements must have to be included in the same chunk. Range: 0.01–0.99. Default: . |
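
The interaction between the hard maximum and the overlap is easier to see with a small example. The following is a minimal sketch of text-splitting an oversized element, assuming a purely character-based split. The real chunker may prefer to break on word or sentence boundaries, so treat this as an illustration of the overlap rule only. The function name `split_with_overlap` and its parameter names are hypothetical.

```python
def split_with_overlap(text: str, max_characters: int, overlap: int = 0) -> list[str]:
    """Split `text` into chunks of at most `max_characters` characters.

    Every chunk after the first starts with the last `overlap` characters
    of the chunk before it.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")

    chunks: list[str] = []
    start = 0
    while start < len(text):
        end = start + max_characters
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back so the next chunk repeats the tail of this one.
        start = end - overlap
    return chunks


if __name__ == "__main__":
    print(split_with_overlap("abcdefghij", max_characters=4, overlap=1))
    # ['abcd', 'defg', 'ghij']
```

Note that the overlap counts toward the hard maximum: each chunk, including its repeated prefix, never exceeds `max_characters`.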

The following parameters control client-side splitting of PDF files.

| Python / POST | JavaScript/TypeScript | Description |
|---|---|---|
| (bool) | (boolean) | Whether to split the PDF file client-side. |
| (bool) | (boolean) | When enabled, a failed split request will not stop the processing of the rest of the document. The affected page range will be ignored in the results. Default: . |
| (int) | (number) | The number of split files to be sent concurrently. Default: . Maximum: . |
| (List[int]) | (number[]) | A list of 2 integers within the range . When PDF splitting is enabled, this will send only the specified page range to the API. |
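
As a rough illustration of how a page range and splitting fit together, the sketch below selects the pages to send and groups them into batches. It assumes the range is 1-based and inclusive on both ends, and that each split holds a fixed number of pages; the SDK may size splits differently. `pages_to_send`, `group_pages`, and `pages_per_split` are hypothetical names used only for this example.

```python
def pages_to_send(page_count: int, page_range: list[int]) -> list[int]:
    """Return the 1-based page numbers covered by `page_range` (inclusive)."""
    if len(page_range) != 2:
        raise ValueError("page_range must contain exactly 2 integers")
    first, last = page_range
    if not 1 <= first <= last <= page_count:
        raise ValueError(f"page_range must lie within 1..{page_count}")
    return list(range(first, last + 1))


def group_pages(pages: list[int], pages_per_split: int) -> list[list[int]]:
    """Group pages into consecutive batches, one batch per split request."""
    if pages_per_split < 1:
        raise ValueError("pages_per_split must be at least 1")
    return [pages[i:i + pages_per_split] for i in range(0, len(pages), pages_per_split)]


if __name__ == "__main__":
    selected = pages_to_send(page_count=10, page_range=[3, 7])
    print(group_pages(selected, pages_per_split=2))
    # [[3, 4], [5, 6], [7]]
```

If failed splits are tolerated, the pages in a failed batch would simply be absent from the combined result.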

| Field | Type | Description |
|---|---|---|
| | string | The full parsed content in Markdown format, including images, sections, etc. |
| | array | List of attached files (if any). |
| | array | Structured JSON data (if any). |
| | array of object | List of image objects extracted from the content. |
| | array of object | List of structured content blocks (sections, paragraphs, images, etc.). |

| Field | Type | Description |
|---|---|---|
| | string | File extension (e.g., ) |
| | string | Unique image ID |
| | string | MIME type of the image |
| | string | Image file name |
| | string | URL for image preview |
| | integer | Image file size in bytes |
| | string | Always |
Example:
| Field | Type | Description |
|---|---|---|
| | string | Unique identifier for the element |
| | object | Metadata for the element (see below for details) |
| | string | The text content of the element |
| | string | The type of element (e.g., , , , ) |
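
A common way to consume a list of elements like this is to filter by element type and join the text. The sketch below assumes each element is a dictionary with `type` and `text` keys, and uses `Title` and `NarrativeText` as sample type values. Both the key names and the type values are assumptions for illustration and should be checked against the actual tool output.

```python
from collections import Counter


def text_by_type(elements: list[dict], wanted_types: set[str]) -> str:
    """Join the text of all elements whose type is in `wanted_types`."""
    return "\n".join(
        el["text"]
        for el in elements
        if el.get("type") in wanted_types and el.get("text")
    )


def count_types(elements: list[dict]) -> Counter:
    """Count how many elements of each type were returned."""
    return Counter(el.get("type", "unknown") for el in elements)


if __name__ == "__main__":
    # Hypothetical sample data shaped like the element table above.
    sample = [
        {"type": "Title", "text": "Annual Report"},
        {"type": "NarrativeText", "text": "Revenue grew this year."},
        {"type": "Image", "text": ""},
    ]
    print(text_by_type(sample, {"Title", "NarrativeText"}))
    print(count_types(sample))
```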
The field provides detailed information about the element's origin, layout, and context.
Possible fields include (not all fields are present in every element):
| Field | Type | Description |
|---|---|---|
| | string | Name of the source file (e.g., ) |
| | string | MIME type or file type (e.g., , ) |
| | array | List of detected languages (e.g., ) |
| | integer | Page number in the source file (if applicable) |
| | object | Layout information (see below) |
| | float | Confidence score for element detection (0-1) |
| | string | MIME type for images (e.g., ) |
| | string | Unique file ID for images |
| | string | Preview URL for images |
| | array | List of original sub-elements (for composite elements) |
Describes the position and size of the element in the source file (if available):
| Field | Type | Description |
|---|---|---|
| | integer | Height of the layout (e.g., page height in pixels) |
| | integer | Width of the layout (e.g., page width in pixels) |
| | array | List of four [x, y] coordinate pairs (bounding box) |
| | string | Coordinate system used (e.g., ) |
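
Because the bounding box is expressed in the layout's own units, it is often useful to normalize it to fractions of the page so that boxes from pages of different sizes can be compared. The following is a minimal sketch under the assumption that the four points share the same origin and units as the layout width and height. The function and argument names are hypothetical.

```python
def normalized_bbox(
    points: list[list[float]], layout_width: float, layout_height: float
) -> tuple[float, float, float, float]:
    """Convert four [x, y] corner points into (x_min, y_min, x_max, y_max),
    each expressed as a fraction of the layout size."""
    if len(points) != 4:
        raise ValueError("expected four [x, y] coordinate pairs")
    if layout_width <= 0 or layout_height <= 0:
        raise ValueError("layout dimensions must be positive")

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (
        min(xs) / layout_width,
        min(ys) / layout_height,
        max(xs) / layout_width,
        max(ys) / layout_height,
    )


if __name__ == "__main__":
    corners = [[10, 20], [10, 60], [110, 60], [110, 20]]
    print(normalized_bbox(corners, layout_width=200, layout_height=100))
```

Taking the minimum and maximum of each axis keeps the result independent of the order in which the corners are listed.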