Changelog¶
All notable changes to this project are documented here. The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
Starting with v0.2.0, each release will be documented with granular
Added / Changed / Fixed / Removed / Security sections. v0.1.0
is the single foundational cut that establishes the baseline.
[0.1.3] — 2026-05-26¶
Fix : preserve merge span on header cells (
HintKind::TableHeadernow carriesrowspan+colspan).
Fixed¶
- Merged header cells now surface their geometric span faithfully. Prior to v0.1.3,
HintKind::TableHeaderonly carriedcol: u32; the merge span was computed internally by the XLSX decoder (SheetContext::merge_span) and the non-origin cells were correctly skipped (is_merged_non_origin), but the publicapi::TableCell.rowspan/colspanfor header cells were hardcoded to1regardless of the actual merge geometry. Downstream consumers walkingDocument::tables()saw(0,0) "Identité" colspan=1,(0,2) "Tarification" colspan=1,(0,4) "Stock" colspan=1for a row-0 header band that merged(0,0)..(0,1)+(0,2)..(0,3)— the column gaps signaled the merges but the geometric span was lost at the API frontier. This made it harder to write robust consumers for XLSX with category-level header groupings (a common real-world pattern).
Changed¶
HintKind::TableHeaderextended from{ col: u32 }to{ col: u32, rowspan: u32, colspan: u32 }. The two new fields default to1for the non-merged case and reflect the source document's geometric merge span for merged origins. Source-incompatible for consumers destructuring all fields (existing consumers usingmatches!(.., HintKind::TableHeader { col: N, .. })keep working ; consumers using{ col }need to update to{ col, .. }or{ col, rowspan, colspan }).- All 5 emit sites updated to thread the rowspan/colspan through :
formats/xlsx/sheets/emit.rs(heuristic header path +native_table_hint)formats/pdf/tagged/semantics.rs(passescell.rowspan/cell.colspanfrom active cell)formats/html/walker/structures.rs(passes the parsed<th rowspan=… colspan=…>attributes)formats/docx/body/tables.rs(passescell.rowspan/cell.colspanfrom active cell)structure/detectors/table/lattice/detect.rs(defaults to1/1— lattice-detected tables don't carry merge geometry from heuristics)structure/assembler/classify.rspropagates the new fields intoPrimaryRole::TableCell { rowspan, colspan, is_header: true, ... }instead of hardcoding1/1.structure/detectors/table/continuation.rs::promote_to_continuationpreserves the span when promotingTableHeader→TableCellContinuation(instead of dropping to1/1).
Downstream impact¶
Aliya's crates/olga-bridge (Chunk 3.1, ADR-0068 D9.6) discovered this gap empirically while exercising the messy-real-world resilience fixture : the bridge re-export surface olga_bridge::TableCell showed max_colspan = 1 even when the input xlsx had merge_cells("Identité", colspan=2). The fix unblocks the consumer-side T pipeline at Chunk 5.1 onboarding — instead of inferring merges by gap detection, the T-pipeline dispatches directly on TableCell.colspan > 1.
0.1.2 — 2026-05-25¶
Per-format feature flags — opt-out binary size for downstream consumers (e.g. mobile surfaces).
Added¶
[features]block inCargo.tomlexposing 4 per-format feature flags :xlsx/pdf/docx/html. Default feature set is["xlsx", "pdf", "docx", "html"]— full backward compatibility with v0.1.1 behavior (no change for consumers who don't opt out).- A private umbrella
_ooxmlfeature that pulls in the sharedquick-xml+zipdeps that both DOCX and XLSX need ; gated transitively byxlsxanddocxfeatures. Consumers should NOT enable_ooxmldirectly. - Compile-time feature-disabled error path : when a format's feature is opted out,
Document::open()and the CLIselect_decoder()dispatch returnIdpError::UnsupportedFormatcarrying a "feature disabled at compile time" message. TheFormatenum keeps all 4 variants visible so the public API stays stable across feature combinations.
Changed¶
- Format-specific dependencies are now
optional = trueinCargo.toml: calamine→ gated byxlsxpdf_oxide→ gated bypdfscraper+ego-tree→ gated byhtmlquick-xml+zip→ gated by_ooxml(transitively byxlsxanddocx)- The PDF spatial renderer modules (
output::markdown+output::spatial+output::rules+ helperscol_resolve/row_cluster) are now cfg-gated on thepdffeature. These modules consumepdf_oxide::layout::TextChardirectly and have no meaningful rendering path without PDF support. - Integration test files (
tests/docx_tests.rs/tests/pdf_tests.rs/tests/xlsx_tests.rs/tests/html_tests.rs) gated with#![cfg(feature = "X")]— disabled formats compile to an empty test-binary shell with zero tests.
Compatibility¶
Fully backward-compatible with v0.1.1 for consumers using olga = "0.1" without explicit feature manipulation : the default feature set replicates v0.1.1's all-formats behavior verbatim. Consumers opting out of formats (e.g. default-features = false, features = ["xlsx", "html"]) gain measurable binary-size reduction.
Downstream impact¶
Aliya's crates/olga-bridge (Chunk 3.1, ADR-0067) consumes this release to support its mobile-surface feature-flag discipline. The bridge propagates the 4 feature flags one-to-one so Aliya mobile builds can opt out of PDF + DOCX while keeping XLSX + HTML for the dominant onboarding flow (Excel catalog import + product-page web scraping).
0.1.1 — 2026-04-21¶
Docs patch — surface the independent v0.1.0 benchmark on PyPI and crates.io.
No engine changes. This release publishes the independent,
reproducible post-release audit at
olga_v0.1.0_benchmark/
and links it from the crate README, the PyPI README, the MkDocs
landing page, and BENCHMARKS.md. Headline result on a 50-file
mixed-format corpus: 1.62× faster and 2.62× more extracted content
than a hand-routed best-of-breed pipeline. The crate metadata author
field is also corrected from "Hugues Tankouo" to "Hugues Dtankouo".
0.1.0 — 2026-04-21¶
First public release — the end-to-end Olga pipeline in one cut.
Olga's first public release. This foundational cut ships the full
intelligent document processing pipeline: a Rust core that parses PDF,
DOCX, XLSX, and HTML with provenance tracking and table
reconstruction, a Python distribution (olgadoc) with a strictly-typed
API surface, an olga CLI for inspection, extraction, search, and
page-level access, runnable examples, end-to-end regression coverage,
an MkDocs site, and a full CI/CD pipeline publishing to crates.io and
PyPI. The public API is stable enough for evaluation and prototyping;
expect minor breaking changes on the path to 1.0.