Payloads

Every method that returns a dict returns a real TypedDict. These are runtime classes, introspectable through __annotations__, and fully typed for consumers running mypy --strict.
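As a minimal sketch of what runtime introspection looks like, the block below defines a local stand-in for BoundingBox. The field names x0, y0, x1, y1 are assumptions made for illustration; the real payload's fields are not listed on this page.

```python
from typing import TypedDict, get_type_hints


class BoundingBox(TypedDict):
    # Hypothetical field names, chosen for illustration only.
    x0: float
    y0: float
    x1: float
    y1: float


box: BoundingBox = {"x0": 0.0, "y0": 0.0, "x1": 612.0, "y1": 792.0}

# At runtime a TypedDict value is an ordinary dict.
assert type(box) is dict

# The class itself carries its field declarations.
field_names = sorted(BoundingBox.__annotations__)
field_types = get_type_hints(BoundingBox)
```

Because the values are plain dicts, they serialize with json.dumps without any conversion step, while the class remains available for static checking.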

Geometry

BoundingBox

Bases: TypedDict

Axis-aligned bounding box in page coordinates (points).

PageDimensions

Bases: TypedDict

Effective page dimensions returned by Page.dimensions.

Extracted content

Bases: TypedDict

A hyperlink extracted from a document.

Table

Bases: TypedDict

A reconstructed table, potentially spanning multiple pages.

TableCell

Bases: TypedDict

A single cell inside a Table.

SearchHit

Bases: TypedDict

A single match returned by Document.search or Page.search.

Chunk

Bases: TypedDict

A per-page text chunk returned by Document.chunks_by_page.

OutlineEntry

Bases: TypedDict

A single heading in Document.outline.

ExtractedImage

Bases: TypedDict

A raster image extracted from a document.

alt_text is required to be present as a key but may be None when the source format does not carry alt text.
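The distinction between a required key with a nullable value and an optional key can be sketched as follows. This is a local stand-in, not the library's definition: only alt_text comes from the description above, and the page field and the describe helper are hypothetical.

```python
from typing import Optional, TypedDict


class ExtractedImage(TypedDict):
    # "page" is a hypothetical field added so the stand-in has more than one key.
    page: int
    # Required key, nullable value: the key is always there, the value may be None.
    alt_text: Optional[str]


def describe(image: ExtractedImage) -> str:
    # Indexing is safe because the key is required; only the value needs a None check.
    alt = image["alt_text"]
    return alt if alt is not None else "(no alt text)"


with_alt: ExtractedImage = {"page": 1, "alt_text": "Company logo"}
without_alt: ExtractedImage = {"page": 2, "alt_text": None}
```

Under this shape, consumers never need image.get("alt_text") or a key-presence check, only the None test.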

Processability

HealthLabel module-attribute

HealthLabel = Literal['ok', 'degraded', 'blocked']
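A Literal alias can also be read at runtime with typing.get_args, which is one way to validate untrusted strings before treating them as a HealthLabel. The alias below is copied from the definition above; parse_health_label is a hypothetical helper, not part of the documented API.

```python
from typing import Literal, cast, get_args

HealthLabel = Literal["ok", "degraded", "blocked"]


def parse_health_label(raw: str) -> HealthLabel:
    """Return raw as a HealthLabel, or raise ValueError if it is not one."""
    allowed = get_args(HealthLabel)
    if raw not in allowed:
        raise ValueError(f"unknown health label {raw!r}; expected one of {allowed}")
    # cast only informs the type checker; the membership test above did the real check.
    return cast(HealthLabel, raw)
```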

HealthIssueKind module-attribute

HealthIssueKind = Literal['Encrypted', 'EmptyContent', 'DecodeFailed', 'ApproximatePagination', 'HeuristicStructure', 'PartialExtraction', 'MissingPart', 'UnresolvedStyle', 'UnresolvedRelationship', 'MissingMedia', 'TruncatedContent', 'MalformedContent', 'FilteredArtifact', 'SuspectedArtifact', 'OtherWarnings']

HealthIssueSimple

Bases: TypedDict

Health issue variants that carry no extra payload.

HealthIssueApproximatePagination

Bases: TypedDict

Page boundaries are approximate: the engine inferred them heuristically.

HealthIssueHeuristicStructure

Bases: TypedDict

Document structure was reconstructed heuristically on the pages indicated by the pages field.

HealthIssueCounted

Bases: TypedDict

Every health variant that carries an occurrence count.
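One way a consumer might handle a mix of counted and payload-free variants is sketched below. Everything about the shapes here is an assumption: the page does not list the fields, so the kind and count keys, and the assignment of particular kinds to each variant, are hypothetical stand-ins.

```python
from typing import List, Literal, TypedDict, Union


class HealthIssueSimple(TypedDict):
    # Assumed discriminator key; the kinds listed here are an arbitrary subset.
    kind: Literal["Encrypted", "EmptyContent", "DecodeFailed"]


class HealthIssueCounted(TypedDict):
    # Assumed discriminator key and assumed occurrence-count key.
    kind: Literal["MissingMedia", "TruncatedContent", "MalformedContent"]
    count: int


HealthIssue = Union[HealthIssueSimple, HealthIssueCounted]


def counted_occurrences(issues: List[HealthIssue]) -> int:
    """Sum the occurrence counts of the variants that carry one."""
    total = 0
    for issue in issues:
        # Payload-free variants have no "count" key and are skipped.
        if "count" in issue:
            total += issue["count"]
    return total


issues: List[HealthIssue] = [
    {"kind": "Encrypted"},
    {"kind": "MissingMedia", "count": 3},
    {"kind": "TruncatedContent", "count": 2},
]
```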

Format discrimination

FormatName module-attribute

FormatName = Literal['PDF', 'DOCX', 'XLSX', 'HTML']

FormatHint module-attribute

FormatHint = Literal['pdf', 'docx', 'docm', 'xlsx', 'xls', 'html', 'htm']
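The two aliases can be related with a typed lookup table. The aliases below are copied from the definitions above, but the grouping of hints under format names (for example docm under DOCX and xls under XLSX) is an assumption this page does not state, and format_for_hint is a hypothetical helper.

```python
from typing import Dict, Literal, get_args

FormatName = Literal["PDF", "DOCX", "XLSX", "HTML"]
FormatHint = Literal["pdf", "docx", "docm", "xlsx", "xls", "html", "htm"]

# Assumed mapping from hint to format name; not confirmed by the source.
HINT_TO_FORMAT: Dict[FormatHint, FormatName] = {
    "pdf": "PDF",
    "docx": "DOCX",
    "docm": "DOCX",
    "xlsx": "XLSX",
    "xls": "XLSX",
    "html": "HTML",
    "htm": "HTML",
}

# Guard against the table drifting out of sync with the Literal alias.
assert set(HINT_TO_FORMAT) == set(get_args(FormatHint))


def format_for_hint(hint: FormatHint) -> FormatName:
    """Look up the format name assumed to correspond to a hint."""
    return HINT_TO_FORMAT[hint]
```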