Unstructured
0.0.3

unstructured

langgenius/unstructured · 9 installs

Unstructured Partition API

Author: langgenius
Version: 0.0.1
Type: tool


Partition Input Parameters

The Unstructured Partition Endpoint provides parameters to customize the processing of documents.

The only required parameter is the file you wish to process.


Main Parameters

| Python / POST | JavaScript/TypeScript | Description |
|---|---|---|
| (shared.Files) | (File, Blob, shared.Files) | Required. The file to process. |
| (str) | (string) | Use one of the supported strategies to chunk the returned elements after partitioning. When not specified, no chunking is performed and any other chunking parameters are ignored. Supported: , , , . |
| (str) | (string) | A hint about the content type to use (such as ), when there are problems processing a specific file. This value is a MIME type in the format . |
| (bool) | (boolean) | to return bounding box coordinates for each element extracted with OCR. Default: . |
| (str) | (string) | The encoding method used to decode the text input. Default: . |
| (List[str]) | (string[]) | The types of elements to extract as Base64 encoded data in element metadata fields, e.g. . Supported filetypes: image and PDF. |
| (str) | (string) | If file is gzipped, use this content type after unzipping. Example: |
| (str) | (string) | The name of the inference model used when strategy is . Options: , . Default: . |
| (bool) | (boolean) | for the output to include page breaks if the filetype supports it. Default: . |
| (List[str]) | (string[]) | The languages present in the document, for use in partitioning and OCR. |
| (str) | (string) | The format of the response. Supported: , . Default: . |
| (bool) | (boolean) | Deprecated! Use instead. If and strategy is , any Table elements extracted from a PDF will include an additional metadata field, , with the HTML table. |
| (List[str]) | (string[]) | The document types to skip table extraction for. Default: . |
| (int) | (number) | The page number to assign to the first page in the document. This will be included in elements’ metadata. |
| (str) | (string) | The strategy to use for partitioning PDF and image files. Options: , , , , . Default: . |
| (bool) | (boolean) | to assign UUIDs to element IDs (guarantees uniqueness). Otherwise, a SHA-256 of the element’s text is used. Default: . |
| (str) | (Not yet available) | Applies only when strategy is . The name of the vision language model (VLM) provider to use for partitioning. must also be specified. |
| (str) | (Not yet available) | Applies only when strategy is . The name of the vision language model (VLM) to use for partitioning. must also be specified. |
| (bool) | (boolean) | to retain the XML tags in the output. Otherwise, only the text within the tags is extracted. Only applies to XML documents. |
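
The table above describes two ways element IDs can be produced: random UUIDs, or a SHA-256 of the element's text. The difference can be illustrated with a minimal sketch. The helper `make_element_id` and its `unique` flag are hypothetical names chosen for this example, and the service may hash additional inputs or shorten the digest; the sketch only shows why hash-based IDs repeat for identical text while UUIDs do not.

```python
import hashlib
import uuid


def make_element_id(text: str, unique: bool = False) -> str:
    """Return an illustrative element ID (hypothetical helper, not the service's code)."""
    if unique:
        # Random UUID: differs on every call, even for identical text.
        return str(uuid.uuid4())
    # Deterministic: identical text always yields the same hex digest.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


hash_a = make_element_id("Quarterly results")
hash_b = make_element_id("Quarterly results")
uuid_a = make_element_id("Quarterly results", unique=True)
uuid_b = make_element_id("Quarterly results", unique=True)
```

Hash-based IDs are convenient for deduplication across runs, whereas UUIDs are the safer choice when a document repeats the same text in several places.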

Chunking Parameters

The following parameters only apply when a chunking strategy is specified. Otherwise, they are ignored.

| Python / POST | JavaScript/TypeScript | Description |
|---|---|---|
| (int) | (number) | Applies only when the chunking strategy is set to . Combines small chunks until the combined chunk reaches a length of n characters. Default: same as . |
| (bool) | (boolean) | (default) to have the elements used to form a chunk appear in for that chunk. |
| (int) | (number) | Cut off new sections after reaching a length of n characters. (Hard maximum.) Default: . |
| (bool) | (boolean) | Applies only when the chunking strategy is set to . Determines if a chunk can include elements from more than one page. Default: . |
| (int) | (number) | Applies only when the chunking strategy is specified. Cuts off new sections after reaching a length of n characters. (Soft maximum.) Default: . |
| (int) | (number) | A prefix of this many trailing characters from the prior text-split chunk is applied to second and later chunks formed from oversized elements by text-splitting. Default: none. |
| (bool) | (boolean) | to have an overlap also applied to “normal” chunks formed by combining whole elements. Use with caution, as this can introduce noise into otherwise clean semantic units. |
| (float) | (number) | Applies only when the chunking strategy is set to . The minimum similarity text in consecutive elements must have to be included in the same chunk. Range: 0.01–0.99. Default: . |
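
To make the hard maximum, the overlap prefix, and the combining of small chunks more concrete, here is a rough sketch of how those three ideas interact on plain strings. The function and argument names (`split_with_overlap`, `combine_small`, `max_characters`, `overlap`, `combine_under_n`) are assumptions made for this example, and the logic is a simplification, not the service's chunking implementation.

```python
def split_with_overlap(text: str, max_characters: int, overlap: int = 0) -> list[str]:
    """Split oversized text into pieces of at most max_characters.

    Every piece after the first starts with the last `overlap`
    characters of the piece before it.
    """
    if max_characters <= 0:
        raise ValueError("max_characters must be positive")
    if not 0 <= overlap < max_characters:
        raise ValueError("overlap must be >= 0 and smaller than max_characters")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_characters, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back so the next piece repeats the tail of this one.
        start = end - overlap
    return chunks


def combine_small(sections: list[str], combine_under_n: int, max_characters: int) -> list[str]:
    """Merge a section into the previous one while that one is still short."""
    combined: list[str] = []
    for section in sections:
        fits = bool(combined) and len(combined[-1]) + 1 + len(section) <= max_characters
        if fits and len(combined[-1]) < combine_under_n:
            combined[-1] = combined[-1] + " " + section
        else:
            combined.append(section)
    return combined


pieces = split_with_overlap("abcdefghij", max_characters=4, overlap=1)
merged = combine_small(
    ["Intro", "Short", "A much longer section of text"],
    combine_under_n=12,
    max_characters=30,
)
```

Here `pieces` shows each later piece beginning with the final character of the previous one, and `merged` shows two short sections being joined while the long one stays separate because the hard maximum would be exceeded.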

Client-Specific Parameters (Not sent to server)

| Python / POST | JavaScript/TypeScript | Description |
|---|---|---|
| (bool) | (boolean) | to split the PDF file client-side. |
| (bool) | (boolean) | When , a failed split request will not stop the processing of the rest of the document. The affected page range will be ignored in the results. Default: . |
| (int) | (number) | The number of split files to be sent concurrently. Default: . Maximum: . |
| (List[int]) | (number[]) | A list of 2 integers within the range . When PDF splitting is enabled, this will send only the specified page range to the API. |

Notes

  • For more details on each parameter, refer to the official documentation.
  • Some parameters are only available in specific strategies or file types.
  • Default values are shown where applicable.

API Response Structure

Top-Level Fields

| Field | Type | Description |
|---|---|---|
| text | string | The full parsed content in Markdown format, including images, sections, etc. |
| files | array | List of attached files (if any). |
| json | array | Structured JSON data (if any). |
| images | array of objects | List of image objects extracted from the content. |
| elements | array of objects | List of structured content blocks (sections, paragraphs, images, etc.). |

Field Details

text

  • Type: string
  • Description:
    The entire content, formatted in Markdown. This may include images, headings, bullet points, and other formatting for direct rendering.

files

  • Type: array
  • Description:
    List of attached files. Typically empty in the current output.

json

  • Type: array
  • Description:
    Structured JSON data for advanced use cases. Typically empty in the current output.

images

  • Type: array of objects
  • Description:
    List of images found in the content. Each image object contains the following fields:

| Field | Type | Description |
|---|---|---|
| | string | File extension (e.g., ) |
| | string | Unique image ID |
| | string | MIME type of the image |
| | string | Image file name |
| | string | URL for image preview |
| | integer | Image file size in bytes |
| | string | Always |

Example:


elements

  • Type: array of objects
  • Description:
    List of structured content blocks. Each object represents a section, paragraph, image, or other content element.

| Field | Type | Description |
|---|---|---|
| | string | Unique identifier for the element |
| metadata | object | Metadata for the element (see below for details) |
| | string | The text content of the element |
| | string | The type of element (e.g., , , , ) |

metadata (object)

The metadata field provides detailed information about the element's origin, layout, and context.
Possible fields include (not all fields are present in every element):

| Field | Type | Description |
|---|---|---|
| | string | Name of the source file (e.g., ) |
| | string | MIME type or file type (e.g., , ) |
| | array | List of detected languages (e.g., ) |
| | integer | Page number in the source file (if applicable) |
| coordinates | object | Layout information (see below) |
| | float | Confidence score for element detection (0-1) |
| | string | MIME type for images (e.g., ) |
| | string | Unique file ID for images |
| | string | Preview URL for images |
| | array | List of original sub-elements (for composite elements) |

coordinates (object)

Describes the position and size of the element in the source file (if available):

| Field | Type | Description |
|---|---|---|
| | integer | Height of the layout (e.g., page height in pixels) |
| | integer | Width of the layout (e.g., page width in pixels) |
| | array | List of four [x, y] coordinate pairs (bounding box) |
| | string | Coordinate system used (e.g., ) |
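
Since the coordinates object carries four [x, y] pairs plus the layout width and height, a common follow-up step is to turn it into a normalized bounding box for highlighting. The sketch below assumes the keys are named `points`, `layout_width`, and `layout_height`; those names do not appear on this page and should be checked against a real response before use.

```python
def normalized_bbox(coordinates: dict) -> tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) scaled to the 0-1 range.

    Key names are assumptions for this sketch, not confirmed by this page.
    """
    points = coordinates["points"]
    width = coordinates["layout_width"]
    height = coordinates["layout_height"]
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs) / width, min(ys) / height, max(xs) / width, max(ys) / height)


# Made-up sample values, for illustration only.
sample_coordinates = {
    "points": [[10, 20], [10, 60], [110, 60], [110, 20]],
    "layout_width": 200,
    "layout_height": 100,
}
bbox = normalized_bbox(sample_coordinates)
```

Normalizing against the layout size lets the same box be drawn on a preview rendered at any resolution.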

Example Response


Notes

  • The text field is suitable for direct display in web or app frontends.
  • The elements field is useful for structured processing, highlighting, or further analysis.
  • The images field provides all image resources for preview or download.
  • The files and json fields are reserved for future use or advanced scenarios.
  • The metadata object in each element may contain additional fields depending on the extraction process and file type.
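
As a rough illustration of the structured processing mentioned in the notes, the following sketch groups element text by page. The top-level `elements` key and the per-element `metadata` object come from this page; the `text` and `page_number` keys inside each element are assumptions, and the sample response is made up for the example.

```python
from collections import defaultdict


def text_by_page(response: dict) -> dict:
    """Group element text by page number (None when no page is given)."""
    pages = defaultdict(list)
    for element in response.get("elements", []):
        metadata = element.get("metadata") or {}
        # "page_number" and "text" are assumed key names for this sketch.
        page = metadata.get("page_number")
        text = element.get("text", "")
        if text:
            pages[page].append(text)
    return dict(pages)


# Made-up response, shaped like the structure described above.
sample_response = {
    "text": "# Report\n\nFirst paragraph.\n\nSecond page text.",
    "files": [],
    "json": [],
    "images": [],
    "elements": [
        {"text": "Report", "metadata": {"page_number": 1}},
        {"text": "First paragraph.", "metadata": {"page_number": 1}},
        {"text": "Second page text.", "metadata": {"page_number": 2}},
        {"text": "", "metadata": {}},
    ],
}
grouped = text_by_page(sample_response)
```

Using `.get` throughout keeps the helper tolerant of elements that omit optional metadata, which matches the note that not every field is present in every element.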

Category: Tool
Tags: RAG
Version: 0.0.3
langgenius · 09/30/2025 09:26 AM
Requirements: maximum memory 256MB, maximum storage 1MB