Document¶

An opened document — the primary entry point of the library.

Obtain an instance with `Document.open` (from a filesystem path) or `Document.open_bytes` (from raw bytes). Once opened, the document exposes its text, markdown, per-page content, images, links, tables, full-text search, outline, JSON tree, and a processability health report.

Example

    >>> doc = olgadoc.Document.open("report.pdf")
    >>> doc.format, doc.page_count
    ('PDF', 12)
    >>> hit = doc.search("executive summary")[0]
    >>> hit["page"], hit["snippet"]
    (1, 'Executive summary: ...')

format `property` ¶

format: FormatName

Document format as an uppercase label.

Returns:

Type	Description
`FormatName`	One of `"PDF"`, `"DOCX"`, `"XLSX"` or `"HTML"`.

page_count `property` ¶

page_count: int

Total number of pages in the document.

Returns:

Type	Description
`int`	Page count, greater than or equal to zero.

is_processable `property` ¶

is_processable: bool

Shortcut for doc.processability().is_processable.

Returns:

Type	Description
`bool`	`True` unless the document is encrypted or otherwise blocked.

title `property` ¶

title: str | None

Document title from the underlying metadata, when provided.

Returns:

Type	Description
`str \| None`	The title as a string, or `None` if absent.

file_size `property` ¶

file_size: int

Size of the source document in bytes.

Returns:

Type	Description
`int`	File size, greater than or equal to zero.

encrypted `property` ¶

encrypted: bool

Whether the document is encrypted.

Returns:

Type	Description
`bool`	`True` when the source file is password-protected.

open `staticmethod` ¶

open(path: str) -> Document

Open a document from a filesystem path.

Parameters:

Name	Type	Description	Default
`path`	`str`	Absolute or relative path to the document.	required

Returns:

Type	Description
`Document`	A fully loaded `Document` ready for extraction.

Raises:

Type	Description
`OlgaError`	If the file cannot be read, the format is unsupported, the document is encrypted, or decoding fails.

open_bytes `staticmethod` ¶

open_bytes(data: bytes, format: FormatHint | None = ...) -> Document

Open a document from raw bytes already held in memory.

Useful when the document arrives over HTTP or from a database blob.

Parameters:

Name	Type	Description	Default
`data`	`bytes`	The raw bytes of the document.	required
`format`	`FormatHint \| None`	Optional format hint. When `None`, the format is inferred from magic bytes.	`...`

Returns:

Type	Description
`Document`	A fully loaded `Document`.

Raises:

Type	Description
`OlgaError`	If the hint is unknown, the format is unsupported, or decoding fails.
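To make "inferred from magic bytes" concrete, the sketch below shows one common way such detection can work. It illustrates the general technique only and is not olgadoc's actual detection logic; the helper name `sniff_format` and the ZIP member checks are assumptions made for this example.

```python
import io
import zipfile
from typing import Optional


def sniff_format(data: bytes) -> Optional[str]:
    """Guess a format label from leading bytes (hypothetical helper, not olgadoc API)."""
    if data.startswith(b"%PDF-"):
        return "PDF"
    if data.startswith(b"PK\x03\x04"):
        # DOCX and XLSX are both ZIP containers; tell them apart by member names.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            return None
        if any(name.startswith("word/") for name in names):
            return "DOCX"
        if any(name.startswith("xl/") for name in names):
            return "XLSX"
        return None
    # HTML has no fixed signature; look for a doctype or root tag near the start.
    head = data[:1024].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "HTML"
    return None
```

When the bytes come from a source that already knows the type (an HTTP `Content-Type` header, a database column), passing an explicit `format` hint avoids relying on sniffing at all.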

warnings ¶

warnings() -> list[str]

Diagnostic warnings emitted during decoding and structure analysis.

Returns:

Type	Description
`list[str]`	A list of human-readable strings. Empty for a clean document.

pages ¶

pages() -> list[Page]

All pages in document order.

Returns:

Type	Description
`list[Page]`	A list of `Page` handles, one per page.

page ¶

page(number: int) -> Page | None

Fetch a specific page by its 1-based number.

Parameters:

Name	Type	Description	Default
`number`	`int`	1-based page index.	required

Returns:

Type	Description
`Page \| None`	The `Page` handle, or `None` if `number` is out of range.
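Because `page` returns `None` rather than raising, callers may want to turn an out-of-range lookup into an explicit error. The wrapper below is a minimal sketch; `require_page` is a hypothetical helper, and the stand-in object only mimics the documented `page` and `page_count` members so the example runs without a real file.

```python
from types import SimpleNamespace


def require_page(doc, number: int):
    """Return the page handle, or raise instead of passing None along."""
    page = doc.page(number)
    if page is None:
        raise IndexError(
            f"page {number} is out of range (document has {doc.page_count} pages)"
        )
    return page


# Stand-in for a Document: strings play the role of Page handles here.
_pages = {1: "first page", 2: "second page"}
stand_in = SimpleNamespace(page_count=2, page=lambda number: _pages.get(number))
```

Note that page numbers are 1-based, so `require_page(stand_in, 0)` raises just like a number past the end.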

text ¶

text() -> str

Concatenated plain text of every page.

Returns:

Type	Description
`str`	The whole document as a single UTF-8 string.

markdown ¶

markdown() -> str

Concatenated markdown rendering of every page.

Returns:

Type	Description
`str`	The whole document as GitHub-flavoured markdown.

text_by_page ¶

text_by_page() -> dict[int, str]

Per-page plain text, keyed by 1-based page number.

Returns:

Type	Description
`dict[int, str]`	A dict mapping each page number to its text.
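One use for the per-page mapping is spotting pages that yielded no text, which may indicate scanned or image-only pages. The sketch below works on any dict shaped like the return value of `text_by_page()`; the helper name `pages_without_text` and the sample data are assumptions for illustration.

```python
def pages_without_text(text_by_page: dict[int, str]) -> list[int]:
    """Return the 1-based numbers of pages whose text is empty or whitespace."""
    return sorted(
        number for number, text in text_by_page.items() if not text.strip()
    )


# Sample data in the documented shape; with a real document this would be
# pages_without_text(doc.text_by_page()).
sample = {1: "Introduction", 2: "   \n", 3: "Conclusion"}
empty_pages = pages_without_text(sample)  # [2]
```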

markdown_by_page ¶

markdown_by_page() -> dict[int, str]

Per-page markdown, keyed by 1-based page number.

Returns:

Type	Description
`dict[int, str]`	A dict mapping each page number to its markdown.

images ¶

images() -> list[ExtractedImage]

All raster images found in the document.

Returns:

Type	Description
`list[ExtractedImage]`	A list of `ExtractedImage` dicts.

image_count ¶

image_count() -> int

Total number of images in the document.

Returns:

Type	Description
`int`	Same as `len(doc.images())`, without materialising the list.

links ¶

links() -> list[Link]

All hyperlinks in the document.

Returns:

Type	Description
`list[Link]`	A list of `Link` dicts.

link_count ¶

link_count() -> int

Total number of hyperlinks in the document.

Returns:

Type	Description
`int`	Same as `len(doc.links())`, without materialising the list.

tables ¶

tables() -> list[Table]

All reconstructed tables, including cross-page tables.

Returns:

Type	Description
`list[Table]`	A list of `Table` dicts.

table_count ¶

table_count() -> int

Total number of tables in the document.

Returns:

Type	Description
`int`	Same as `len(doc.tables())`, without materialising the list.
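The count methods make it cheap to summarise a document before deciding what to extract. The following sketch gathers them into one dict; `content_inventory` is a hypothetical helper, and the stand-in object only imitates the documented `page_count`, `image_count`, `link_count` and `table_count` members.

```python
from types import SimpleNamespace


def content_inventory(doc) -> dict[str, int]:
    """Collect the documented counts without materialising any lists."""
    return {
        "pages": doc.page_count,
        "images": doc.image_count(),
        "links": doc.link_count(),
        "tables": doc.table_count(),
    }


# Stand-in for a Document so the example runs without a real file.
stand_in = SimpleNamespace(
    page_count=3,
    image_count=lambda: 2,
    link_count=lambda: 5,
    table_count=lambda: 1,
)
inventory = content_inventory(stand_in)
```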

search ¶

search(query: str) -> list[SearchHit]

Search for a literal substring across the full document.

The match is case-insensitive and substring-based.

Parameters:

Name	Type	Description	Default
`query`	`str`	The text to look for. An empty string returns no hits.	required

Returns:

Type	Description
`list[SearchHit]`	A list of `SearchHit` dicts.
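The matching rules described above (literal, case-insensitive, substring-based, empty query yields nothing) can be mimicked in plain Python as follows. This is a sketch of the semantics only, not the library's implementation; `find_hits` and its `context` parameter are hypothetical, and only the `"page"` and `"snippet"` keys are taken from the example at the top of this page.

```python
import re


def find_hits(text_by_page: dict[int, str], query: str, context: int = 20) -> list[dict]:
    """Literal, case-insensitive substring search over per-page text."""
    if not query:
        return []  # an empty query returns no hits
    # re.escape keeps the query literal; IGNORECASE makes it case-insensitive.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for page in sorted(text_by_page):
        text = text_by_page[page]
        for match in pattern.finditer(text):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            hits.append({"page": page, "snippet": text[start:end]})
    return hits


sample_pages = {1: "Executive Summary: revenue grew.", 2: "No match here."}
sample_hits = find_hits(sample_pages, "executive summary")
```

Escaping the query matters: without it, characters such as `.` or `(` in the search text would be treated as regular-expression syntax rather than literal text.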

chunks_by_page ¶

chunks_by_page() -> list[Chunk]

One text chunk per page, suitable for RAG-style indexing.

Returns:

Type	Description
`list[Chunk]`	A list of `Chunk` dicts.
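For indexing pipelines, a common pattern is to write one chunk per line as JSON Lines before embedding or upload. The sketch below assumes each chunk is a JSON-serialisable dict; the helper `write_jsonl` is hypothetical, and the `"page"` and `"text"` keys in the sample data are placeholders, since the `Chunk` schema is not described in this section.

```python
import io
import json


def write_jsonl(chunks, stream) -> int:
    """Write each chunk as one JSON line; return the number of lines written."""
    count = 0
    for chunk in chunks:
        # Default ASCII escaping keeps every record on a single physical line.
        stream.write(json.dumps(chunk) + "\n")
        count += 1
    return count


# Placeholder chunks; with a real document this would be doc.chunks_by_page().
sample_chunks = [
    {"page": 1, "text": "Introduction"},
    {"page": 2, "text": "Results"},
]
buffer = io.StringIO()
written = write_jsonl(sample_chunks, buffer)
```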

outline ¶

outline() -> list[OutlineEntry]

Hierarchical outline (table of contents) of the document.

Returns:

Type	Description
`list[OutlineEntry]`	A list of `OutlineEntry` dicts.

Raises:

Type	Description
`OlgaError`	If the outline cannot be computed.

to_json ¶

to_json() -> DocumentJson

Full document tree serialised into a JSON-compatible Python object.

The result is a dict / list / scalar structure produced via `json.loads`, so it is safe to re-serialise with `json.dumps`. See `DocumentJson` for the exact schema and `JsonElement` for the discriminated union of element variants.

Returns:

Type	Description
`DocumentJson`	A `DocumentJson` payload carrying document metadata, per-page geometry, structural elements and any warnings.

Raises:

Type	Description
`OlgaError`	If serialisation fails.
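Since the payload is plain dicts, lists and scalars, it can be passed straight to `json.dumps`. The sketch below produces a compact, deterministic encoding, which is convenient for caching or diffing; `reserialise` is a hypothetical helper, and the keys in the stand-in payload are placeholders rather than the real `DocumentJson` schema.

```python
import json


def reserialise(payload) -> str:
    """Serialise a JSON-compatible payload to a compact, key-sorted string."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"))


# Placeholder payload; with a real document this would be doc.to_json().
stand_in_payload = {"warnings": [], "pages": [{"number": 1}]}
encoded = reserialise(stand_in_payload)
```

Sorting keys means two structurally equal payloads always encode to the same string, regardless of dict insertion order.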

processability ¶

processability() -> Processability

Compute a health report for the document.

Call this before paying for downstream work to know whether extraction is reliable, degraded, or outright blocked.

Returns:

Type	Description
`Processability`	A `Processability` instance describing blockers and degradations.
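A typical way to act on the health check is to gate extraction on it and carry any warnings along with the text. The sketch below uses only the documented `is_processable`, `text` and `warnings` members; `safe_extract` is a hypothetical helper, and the stand-in objects exist only so the example runs without a real file.

```python
from types import SimpleNamespace
from typing import Optional


def safe_extract(doc) -> Optional[dict]:
    """Return text plus warnings, or None when the document is blocked."""
    if not doc.is_processable:
        return None  # skip downstream work for blocked documents
    return {"text": doc.text(), "warnings": doc.warnings()}


# Stand-ins for a blocked and a healthy Document.
blocked = SimpleNamespace(is_processable=False, text=lambda: "", warnings=lambda: [])
healthy = SimpleNamespace(
    is_processable=True,
    text=lambda: "Hello",
    warnings=lambda: ["low-confidence table on page 2"],
)
result = safe_extract(healthy)
```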
