Document
An opened document, the primary entry point of the library.
Obtain an instance with `Document.open` (from a filesystem path)
or `Document.open_bytes` (from raw bytes). Once opened, the
document exposes its text, markdown, per-page content, images, links,
tables, full-text search, outline, JSON tree, and a processability
health report.
Example
    >>> import olgadoc
    >>> doc = olgadoc.Document.open("report.pdf")
    >>> doc.format, doc.page_count
    ('PDF', 12)
    >>> hit = doc.search("executive summary")[0]
    >>> hit["page"], hit["snippet"]
    (1, 'Executive summary: ...')
format
property
format: FormatName
Document format as an uppercase label.
Returns:
| Type | Description |
|---|---|
| `FormatName` | One of the `FormatName` labels (for example `'PDF'`). |
page_count
property
Total number of pages in the document.
Returns:
| Type | Description |
|---|---|
| `int` | Page count, greater than or equal to zero. |
is_processable
property
Shortcut for `doc.processability().is_processable`.
Returns:
| Type | Description |
|---|---|
| `bool` | |
title
property
Document title from the underlying metadata, when provided.
Returns:
| Type | Description |
|---|---|
| `str \| None` | The title as a string, or `None` when the metadata does not provide one. |
file_size
property
Size of the source document in bytes.
Returns:
| Type | Description |
|---|---|
| `int` | File size, greater than or equal to zero. |
encrypted
property
Whether the document is encrypted.
Returns:
| Type | Description |
|---|---|
| `bool` | |
open
staticmethod
open(path: str) -> Document
Open a document from a filesystem path.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Absolute or relative path to the document. | required |
Returns:
| Type | Description |
|---|---|
| `Document` | A fully loaded `Document`. |
Raises:
| Type | Description |
|---|---|
| `OlgaError` | If the file cannot be read, the format is unsupported, the document is encrypted, or decoding fails. |
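Because `open` raises on unreadable, unsupported, encrypted, or undecodable input, batch jobs usually want to catch the error and move on. The sketch below is a generic wrapper, not part of the library: `open_or_none` is a hypothetical helper, and the opener and error type are passed in so the snippet runs without the library installed.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def open_or_none(opener: Callable[[str], T], path: str, error_type: type) -> Optional[T]:
    """Call opener(path); return None instead of propagating error_type.

    Hypothetical helper for illustration only. `opener` stands in for a
    function such as Document.open, `error_type` for its exception class.
    """
    try:
        return opener(path)
    except error_type as exc:
        # Report the failure and let the caller skip this file.
        print(f"could not open {path!r}: {exc}")
        return None
```

With the real library this would presumably be called as `open_or_none(olgadoc.Document.open, "report.pdf", olgadoc.OlgaError)`, assuming `OlgaError` is importable from the top-level `olgadoc` package.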
open_bytes
staticmethod
open_bytes(data: bytes, format: FormatHint | None = ...) -> Document
Open a document from raw bytes already held in memory.
Useful when the document arrives over HTTP or from a database blob.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `bytes` | The raw bytes of the document. | required |
| `format` | `FormatHint \| None` | Optional format hint. When … | `...` |
Returns:
| Type | Description |
|---|---|
| `Document` | A fully loaded `Document`. |
Raises:
| Type | Description |
|---|---|
| `OlgaError` | If the hint is unknown, the format is unsupported, or decoding fails. |
warnings
Diagnostic warnings emitted during decoding and structure analysis.
Returns:
| Type | Description |
|---|---|
| `list[str]` | A list of human-readable strings. Empty for a clean document. |
pages
pages() -> list[Page]
All pages in document order.
Returns:
| Type | Description |
|---|---|
| `list[Page]` | A list of `Page` objects. |
text
Concatenated plain text of every page.
Returns:
| Type | Description |
|---|---|
| `str` | The whole document as a single UTF-8 string. |
markdown
Concatenated markdown rendering of every page.
Returns:
| Type | Description |
|---|---|
| `str` | The whole document as GitHub-flavoured markdown. |
text_by_page
Per-page plain text, keyed by 1-based page number.
Returns:
| Type | Description |
|---|---|
| `dict[int, str]` | A dict mapping each page number to its text. |
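Since the result is a plain `dict[int, str]` keyed by 1-based page number, ordinary dict tooling applies. As a small illustration, the sketch below lists pages whose text is empty or whitespace only, which can hint at scanned or image-only pages. `blank_pages` is a hypothetical helper, and the sample dict stands in for a real result.

```python
def blank_pages(text_by_page: dict[int, str]) -> list[int]:
    """Return the 1-based numbers of pages with no non-whitespace text."""
    return sorted(
        page for page, text in text_by_page.items() if not text.strip()
    )


# Stand-in for the dict described above; real content comes from the document.
sample = {1: "Introduction", 2: "  \n", 3: ""}
print(blank_pages(sample))
```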
markdown_by_page
Per-page markdown, keyed by 1-based page number.
Returns:
| Type | Description |
|---|---|
| `dict[int, str]` | A dict mapping each page number to its markdown. |
images
images() -> list[ExtractedImage]
All raster images found in the document.
Returns:
| Type | Description |
|---|---|
| `list[ExtractedImage]` | A list of `ExtractedImage` objects. |
image_count
Total number of images in the document.
Returns:
| Type | Description |
|---|---|
| `int` | Same as `len(doc.images())`. |
link_count
Total number of hyperlinks in the document.
Returns:
| Type | Description |
|---|---|
| `int` | Same as … |
tables
tables() -> list[Table]
All reconstructed tables, including cross-page tables.
Returns:
| Type | Description |
|---|---|
| `list[Table]` | A list of `Table` objects. |
table_count
Total number of tables in the document.
Returns:
| Type | Description |
|---|---|
| `int` | Same as `len(doc.tables())`. |
search
search(query: str) -> list[SearchHit]
Search for a literal substring across the full document.
The match is case-insensitive and substring-based.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | The text to look for. An empty string returns no hits. | required |
Returns:
| Type | Description |
|---|---|
| `list[SearchHit]` | A list of `SearchHit` objects. |
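To make the documented matching rules concrete (literal, case-insensitive, substring-based, empty query yields nothing), here is a pure-Python sketch of the same semantics over per-page text. It is not the library's implementation: `literal_search` is a hypothetical function, it returns `(page, offset)` pairs rather than `SearchHit` values, and it reports non-overlapping matches only.

```python
import re


def literal_search(text_by_page: dict[int, str], query: str) -> list[tuple[int, int]]:
    """Find case-insensitive literal matches; return (page, start offset) pairs."""
    if not query:
        # Mirror the documented rule: an empty query returns no hits.
        return []
    # re.escape makes the query literal, so "a.b" does not match "axb".
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for page in sorted(text_by_page):
        for match in pattern.finditer(text_by_page[page]):
            hits.append((page, match.start()))
    return hits


pages = {
    1: "Executive summary: revenue",
    2: "See the EXECUTIVE SUMMARY above; summary ends.",
}
print(literal_search(pages, "summary"))
```

Escaping the query is the important design choice here: without it, characters such as `.` or `(` would be interpreted as regular-expression syntax instead of literal text.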
chunks_by_page
chunks_by_page() -> list[Chunk]
One text chunk per page, suitable for RAG-style indexing.
Returns:
| Type | Description |
|---|---|
| `list[Chunk]` | A list of `Chunk` objects. |
outline
outline() -> list[OutlineEntry]
Hierarchical outline (table of contents) of the document.
Returns:
| Type | Description |
|---|---|
| `list[OutlineEntry]` | A list of `OutlineEntry` objects. |
Raises:
| Type | Description |
|---|---|
| `OlgaError` | If the outline cannot be computed. |
to_json
to_json() -> DocumentJson
Full document tree serialised into a JSON-compatible Python object.
The result is a dict / list / scalar structure produced via
`json.loads`, so it is safe to re-serialise with
`json.dumps`. See `olgadoc.DocumentJson` for the
exact schema and `olgadoc.JsonElement` for the discriminated
union of element variants.
Returns:
| Type | Description |
|---|---|
| `DocumentJson` | A `DocumentJson` tree with per-page geometry, structural elements and any warnings. |
Raises:
| Type | Description |
|---|---|
| `OlgaError` | If serialisation fails. |
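Because the tree contains only dicts, lists and scalars, it can be written out with the standard `json` module and read back unchanged. The sketch below shows that round trip. `dump_tree` is a hypothetical helper, and the keys in the stand-in tree are invented for illustration, not taken from the `DocumentJson` schema.

```python
import json
from typing import Any


def dump_tree(tree: Any) -> str:
    """Serialise a JSON-compatible tree to a stable, human-readable string."""
    # ensure_ascii=False keeps non-ASCII text readable; sort_keys gives
    # deterministic output that diffs cleanly between runs.
    return json.dumps(tree, ensure_ascii=False, indent=2, sort_keys=True)


# Stand-in for a to_json() result; the key names here are assumptions.
tree = {"pages": [{"number": 1, "elements": []}], "warnings": []}
encoded = dump_tree(tree)
print(json.loads(encoded) == tree)
```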
processability
processability() -> Processability
Compute a health report for the document.
Call this before paying for downstream work to know whether extraction is reliable, degraded, or outright blocked.
Returns:
| Type | Description |
|---|---|
| `Processability` | A `Processability` health report, including any degradations. |
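One way to act on the health report is to gate the expensive step on it. The sketch below uses only the `is_processable` property and the `to_json()` method described on this page. `json_if_processable` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the snippet runs without a real document.

```python
from types import SimpleNamespace
from typing import Any, Optional


def json_if_processable(doc: Any) -> Optional[Any]:
    """Return doc.to_json() only when the document reports itself processable."""
    if not doc.is_processable:
        # Skip downstream work for documents the health check rejects.
        return None
    return doc.to_json()


# Stand-ins exposing the two attributes the helper relies on.
healthy = SimpleNamespace(is_processable=True, to_json=lambda: {"pages": []})
blocked = SimpleNamespace(is_processable=False, to_json=lambda: {"pages": []})

print(json_if_processable(healthy))
print(json_if_processable(blocked))
```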