Document

An opened document — the primary entry point of the library.

Obtain an instance with Document.open (from a filesystem path) or Document.open_bytes (from raw bytes). Once opened, the document exposes its text, markdown, per-page content, images, links, tables, full-text search, outline, JSON tree, and a processability health report.

Example

>>> import olgadoc
>>> doc = olgadoc.Document.open("report.pdf")
>>> doc.format, doc.page_count
('PDF', 12)
>>> hit = doc.search("executive summary")[0]
>>> hit["page"], hit["snippet"]
(1, 'Executive summary: ...')

format property

format: FormatName

Document format as an uppercase label.

Returns:

Type Description
FormatName

One of "PDF", "DOCX", "XLSX" or "HTML".

page_count property

page_count: int

Total number of pages in the document.

Returns:

Type Description
int

Page count, greater than or equal to zero.

is_processable property

is_processable: bool

Shortcut for doc.processability().is_processable.

Returns:

Type Description
bool

True unless the document is encrypted or otherwise blocked.

title property

title: str | None

Document title from the underlying metadata, when provided.

Returns:

Type Description
str | None

The title as a string, or None if absent.

file_size property

file_size: int

Size of the source document in bytes.

Returns:

Type Description
int

File size, greater than or equal to zero.

encrypted property

encrypted: bool

Whether the document is encrypted.

Returns:

Type Description
bool

True when the source file is password-protected.

open staticmethod

open(path: str) -> Document

Open a document from a filesystem path.

Parameters:

Name Type Description Default
path str

Absolute or relative path to the document.

required

Returns:

Type Description
Document

A fully loaded Document, ready for extraction.

Raises:

Type Description
OlgaError

If the file cannot be read, the format is unsupported, the document is encrypted, or decoding fails.

open_bytes staticmethod

open_bytes(data: bytes, format: FormatHint | None = ...) -> Document

Open a document from raw bytes already held in memory.

Useful when the document arrives over HTTP or from a database blob.

Parameters:

Name Type Description Default
data bytes

The raw bytes of the document.

required
format FormatHint | None

Optional format hint. When None, the format is inferred from magic bytes.

...

Returns:

Type Description
Document

A fully loaded Document.

Raises:

Type Description
OlgaError

If the hint is unknown, the format is unsupported, or decoding fails.
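To illustrate what inferring a format from magic bytes can look like, here is a minimal sketch. It is not the library's implementation: the function name sniff_format and the exact signatures checked are assumptions chosen for illustration. The sketch relies on three well-known facts: PDF files begin with `%PDF-`, DOCX and XLSX files are ZIP containers, and HTML usually opens with a doctype or an `<html>` tag.

```python
from __future__ import annotations

import io
import zipfile


def sniff_format(data: bytes) -> str | None:
    """Guess a format label from leading bytes (hypothetical helper)."""
    if data.startswith(b"%PDF-"):
        return "PDF"
    if data.startswith(b"PK\x03\x04"):
        # DOCX and XLSX share the ZIP signature, so inspect member names.
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = set(archive.namelist())
        except zipfile.BadZipFile:
            return None
        if "word/document.xml" in names:
            return "DOCX"
        if "xl/workbook.xml" in names:
            return "XLSX"
        return None
    # HTML has no fixed signature; check a lowercased, trimmed prefix.
    head = data[:1024].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "HTML"
    return None
```

When the bytes come from a source that already knows the type (for example an HTTP Content-Type header), passing an explicit format hint to open_bytes avoids relying on inference at all.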

warnings

warnings() -> list[str]

Diagnostic warnings emitted during decoding and structure analysis.

Returns:

Type Description
list[str]

A list of human-readable strings. Empty for a clean document.

pages

pages() -> list[Page]

All pages in document order.

Returns:

Type Description
list[Page]

A list of Page handles, one per page.

page

page(number: int) -> Page | None

Fetch a specific page by its 1-based number.

Parameters:

Name Type Description Default
number int

1-based page index.

required

Returns:

Type Description
Page | None

The Page handle, or None if number is out of range.
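The 1-based contract is easy to get wrong when mixing it with Python's 0-based lists. The sketch below models it over a plain list; page_at is a hypothetical helper, not part of the library, and it assumes that 0 and negative numbers count as out of range.

```python
from __future__ import annotations


def page_at(pages: list[str], number: int) -> str | None:
    """Return the item at a 1-based position, or None when out of range."""
    if 1 <= number <= len(pages):
        return pages[number - 1]
    # Unlike pages[-1], a 1-based lookup never wraps around to the end.
    return None
```

In calling code this means checking the result for None rather than catching IndexError.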

text

text() -> str

Concatenated plain text of every page.

Returns:

Type Description
str

The whole document as a single UTF-8 string.

markdown

markdown() -> str

Concatenated markdown rendering of every page.

Returns:

Type Description
str

The whole document as GitHub-flavoured markdown.

text_by_page

text_by_page() -> dict[int, str]

Per-page plain text, keyed by 1-based page number.

Returns:

Type Description
dict[int, str]

A dict mapping each page number to its text.

markdown_by_page

markdown_by_page() -> dict[int, str]

Per-page markdown, keyed by 1-based page number.

Returns:

Type Description
dict[int, str]

A dict mapping each page number to its markdown.
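Because both per-page methods return a dict keyed by 1-based page number, ordinary dict handling is enough to post-process them. The following sketch works on such dicts directly; blank_pages and join_pages are hypothetical helper names, and the separator is an arbitrary choice.

```python
def blank_pages(text_by_page: dict[int, str]) -> list[int]:
    """Page numbers whose text is empty or whitespace only, in order."""
    return sorted(n for n, text in text_by_page.items() if not text.strip())


def join_pages(markdown_by_page: dict[int, str], separator: str = "\n\n---\n\n") -> str:
    """Join per-page markdown in page order with a visible separator."""
    return separator.join(markdown_by_page[n] for n in sorted(markdown_by_page))
```

Typical usage would be blank_pages(doc.text_by_page()) to spot pages that may need OCR or manual review. Sorting the keys explicitly avoids depending on the dict's insertion order.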

images

images() -> list[ExtractedImage]

All raster images found in the document.

Returns:

Type Description
list[ExtractedImage]

A list of ExtractedImage dicts.

image_count

image_count() -> int

Total number of images in the document.

Returns:

Type Description
int

Same as len(doc.images()), without materialising the list.

links

links() -> list[Link]

All hyperlinks in the document.

Returns:

Type Description
list[Link]

A list of Link dicts.

link_count

link_count() -> int

Total number of hyperlinks in the document.

Returns:

Type Description
int

Same as len(doc.links()), without materialising the list.

tables

tables() -> list[Table]

All reconstructed tables, including cross-page tables.

Returns:

Type Description
list[Table]

A list of Table dicts.

table_count

table_count() -> int

Total number of tables in the document.

Returns:

Type Description
int

Same as len(doc.tables()), without materialising the list.
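The count methods are a natural fit for a cheap inventory of a document before deciding what to extract. The sketch below uses only members documented on this page (page_count, image_count, link_count, table_count); the summarise function and the CountStub class are hypothetical, and the stub exists only so the example runs without the library installed.

```python
class CountStub:
    """Stand-in exposing the documented count members; not the real class."""

    page_count = 3

    def image_count(self) -> int:
        return 2

    def link_count(self) -> int:
        return 5

    def table_count(self) -> int:
        return 1


def summarise(doc) -> dict[str, int]:
    """Collect the counts without building the underlying lists."""
    return {
        "pages": doc.page_count,
        "images": doc.image_count(),
        "links": doc.link_count(),
        "tables": doc.table_count(),
    }


summary = summarise(CountStub())
```

With a real document, summarise(doc) would be called on the object returned by Document.open.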

search

search(query: str) -> list[SearchHit]

Search for a literal substring across the full document.

The match is case-insensitive and substring-based.

Parameters:

Name Type Description Default
query str

The text to look for. An empty string returns no hits.

required

Returns:

Type Description
list[SearchHit]

A list of SearchHit dicts.
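The matching rules described above (literal, case-insensitive, substring-based, no hits for an empty query) can be modelled in a few lines. This is a sketch of the documented behaviour over a page-to-text dict, not the library's implementation. The search_pages name and the context width are assumptions; the "page" and "snippet" keys follow the example at the top of this page, and a real SearchHit may carry more fields.

```python
from __future__ import annotations

import re


def search_pages(text_by_page: dict[int, str], query: str, context: int = 20) -> list[dict]:
    """Find literal, case-insensitive occurrences of query on each page."""
    if not query:
        return []  # an empty query yields no hits
    # re.escape keeps the query literal, so "a.b" does not match "axb".
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for page in sorted(text_by_page):
        text = text_by_page[page]
        for match in pattern.finditer(text):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            hits.append({"page": page, "snippet": text[start:end]})
    return hits
```

Matching on the original text with re.IGNORECASE, rather than lowercasing both sides first, keeps the snippet offsets valid for characters whose lowercase form has a different length.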

chunks_by_page

chunks_by_page() -> list[Chunk]

One text chunk per page, suitable for RAG-style indexing.

Returns:

Type Description
list[Chunk]

A list of Chunk dicts.

outline

outline() -> list[OutlineEntry]

Hierarchical outline (table of contents) of the document.

Returns:

Type Description
list[OutlineEntry]

A list of OutlineEntry dicts.

Raises:

Type Description
OlgaError

If the outline cannot be computed.

to_json

to_json() -> DocumentJson

Full document tree serialised into a JSON-compatible Python object.

The result is a dict / list / scalar structure produced via json.loads, so it is safe to re-serialise with json.dumps. See olgadoc.DocumentJson for the exact schema and olgadoc.JsonElement for the discriminated union of element variants.

Returns:

Type Description
DocumentJson

A DocumentJson payload carrying document metadata, per-page geometry, structural elements and any warnings.

Raises:

Type Description
OlgaError

If serialisation fails.
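Since the payload is built from plain dicts, lists and scalars, it should round-trip through the standard json module unchanged. The helper below is a small sketch of re-serialising such a payload; to_json_text is a hypothetical name, and the formatting options are a matter of taste rather than a requirement.

```python
import json


def to_json_text(payload: object) -> str:
    """Serialise a JSON-compatible object to readable UTF-8 text."""
    # ensure_ascii=False keeps non-ASCII document text human-readable.
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

With a real document this would be called as to_json_text(doc.to_json()), and json.loads on the result should give back an equal structure.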

processability

processability() -> Processability

Compute a health report for the document.

Call this before paying for downstream work to know whether extraction is reliable, degraded, or outright blocked.

Returns:

Type Description
Processability

A Processability instance describing blockers and degradations.
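One way to act on the health report is to gate extraction on it. The sketch below uses only the is_processable property and the text() method documented on this page. The function name and the StubDocument class are hypothetical; the stub is there only so the example runs without the library, and the fields of Processability beyond is_processable are not assumed.

```python
from __future__ import annotations


class StubDocument:
    """Stand-in exposing only the members used below; not the real class."""

    def __init__(self, is_processable: bool, body: str) -> None:
        self.is_processable = is_processable
        self._body = body

    def text(self) -> str:
        return self._body


def extract_text_if_processable(doc) -> str | None:
    """Return the document text, or None when extraction is blocked."""
    if not doc.is_processable:
        return None  # skip downstream work for blocked documents
    return doc.text()
```

A caller that also cares about degraded (but not blocked) documents would inspect the full Processability report rather than the boolean shortcut.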