documaris — Architecture
- Date: 2026-05-07 (updated from 2026-04-26)
- Status: Maritime pipeline live (PIER71); BCA Green Mark profile live (BEAMP demo)
- Delivery: Web app (Cloudflare Pages) with DuckDB WASM; future: native desktop with local LLM
- Key invariants: data pulled from R2; only BLAKE3 hash transits the network for PII fields;
check()/build_audit_payload()/seal() are profile-agnostic
System overview
flowchart TD
indago["<b>REMOTE: indago</b>\nvessel · voyage · cargo · events"]
r2["<b>REMOTE: documaris R2 bucket</b>\nread-only for app\nvessels / voyages / cargo / events (Parquet)"]
cache["Local cache\n(R2 snapshot)"]
crew["User-provided crew JSON\n⚠ PII — never leaves app"]
pipeline["<b>LOCAL: documaris native app</b>\n1. Data Fetch\n2. Field Mapping\n3. AI Fill (local OSS model)\n4. Trust Layer (BLAKE3 + Ed25519)\n5. Regulatory Alert\n6. Render → PDF"]
pdf["PDF\n→ local file system"]
local_log["<b>Local audit log</b>\nappend-only · tamper-evident\nagent's own record · always available"]
remote_store["<b>REMOTE: tamper-proof audit store</b>\nappend-only\nappend-only R2 bucket (MVP) → immugate (future)\nqueryable by authorities & P&I Clubs"]
indago -->|"push — indago copy job"| r2
r2 -->|"download on first run / refresh"| cache
cache --> pipeline
crew --> pipeline
pipeline --> pdf
pipeline -->|"AuditRecord\n(no PII, no raw content)"| local_log
local_log -.->|"edgesentry-audit store-and-forward\nqueued if offline"| remote_store
Multi-profile architecture
The pipeline core (check, build_audit_payload, seal) is profile-agnostic. Profiles differ only in parser, fill function, HTML template, and rules JSON.
| Profile | Sector | Parser | Template | Rules | Programme |
|---|---|---|---|---|---|
| fal-form-1 | Maritime port call | parse_maritime_csv | fal-form-1.html | sg-port-compliance/rules.json | PIER71 |
| sg-bca-greenmark | Built environment | parse_bca_csv | sg-bca-greenmark.html | sg-bca-greenmark/rules.json | BEAMP |
Adding a new profile = new parser + HTML template + rules JSON in edgesentry-rs, new run*Pipeline() in pipeline.ts, new data source in *Data.ts. No changes to check(), build_audit_payload(), or seal().
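The separation can be sketched as a profile descriptor consumed by a generic entry point. All names here (ProfileSpec, run_pipeline, parse_kv) are illustrative assumptions, not the actual edgesentry-rs or pipeline.ts API; the point is that the core never branches on the profile:

```rust
// Hypothetical sketch — a profile bundles the only pieces that vary.
struct ProfileSpec {
    id: &'static str,
    parser: fn(&str) -> Vec<(String, String)>, // raw CSV text -> flat field pairs
    template: &'static str,                    // HTML template path
    rules: &'static str,                       // rules JSON path
}

// The core stages consume whatever the profile's parser produced;
// check() / build_audit_payload() / seal() would run here unchanged.
fn run_pipeline(profile: &ProfileSpec, raw: &str) -> usize {
    let fields = (profile.parser)(raw);
    fields.len()
}

// Toy stand-in for parse_maritime_csv / parse_bca_csv.
fn parse_kv(raw: &str) -> Vec<(String, String)> {
    raw.lines()
        .filter_map(|l| l.split_once(','))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

fn main() {
    let fal = ProfileSpec {
        id: "fal-form-1",
        parser: parse_kv,
        template: "fal-form-1.html",
        rules: "sg-port-compliance/rules.json",
    };
    assert_eq!(run_pipeline(&fal, "vessel_name,MV EXAMPLE\nflag,SG"), 2);
}
```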
Compliance check types
| Check | Fires when | Example |
|---|---|---|
| not_null | Field is missing or empty | crew_count not provided |
| not_expired | Date field is in the past | BWM certificate expired |
| not_true | Boolean field is true | dangerous goods declared |
| above_threshold | Numeric field exceeds threshold value | EUI 122 > 115 kWh/m²/year |
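A minimal sketch of how the four check types might be evaluated. The enum, value type, and fires function are illustrative assumptions for this document, not the edgesentry-rs rules-engine API:

```rust
// Illustrative check types; dates are days-since-epoch for simplicity.
#[derive(Clone, Copy)]
enum Check {
    NotNull,
    NotExpired { today: i64 },
    NotTrue,
    AboveThreshold { limit: f64 },
}

// Deliberately minimal field value type for the sketch.
enum Value {
    Missing,
    Text(&'static str),
    Date(i64),
    Bool(bool),
    Num(f64),
}

/// Returns true when the check FIRES, i.e. a compliance finding is raised.
fn fires(check: Check, value: &Value) -> bool {
    match (check, value) {
        (Check::NotNull, Value::Missing) => true,
        (Check::NotNull, Value::Text(t)) => t.is_empty(),
        (Check::NotExpired { today }, Value::Date(d)) => *d < today,
        (Check::NotTrue, Value::Bool(b)) => *b,
        (Check::AboveThreshold { limit }, Value::Num(n)) => *n > limit,
        _ => false,
    }
}

fn main() {
    assert!(fires(Check::NotNull, &Value::Missing));               // crew_count not provided
    assert!(!fires(Check::NotNull, &Value::Text("22")));           // provided: no finding
    assert!(fires(Check::NotExpired { today: 20_000 }, &Value::Date(19_999))); // cert expired
    assert!(fires(Check::NotTrue, &Value::Bool(true)));            // dangerous goods declared
    assert!(fires(Check::AboveThreshold { limit: 115.0 }, &Value::Num(122.0))); // EUI over limit
}
```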
Layer 1 — Data Fetch
documaris reads exclusively from its own Cloudflare R2 bucket. indago is responsible for copying the data documaris needs into this bucket. This keeps the dependency clean: indago serves multiple applications (arktrace, documaris, and future products) and adding direct cross-app R2 bucket access would create tight coupling between consumers.
Responsibility split:
| Responsibility | Owner |
|---|---|
| Ingesting raw vessel, voyage, cargo, and AIS data | indago |
| Transforming and writing data to the documaris R2 bucket | indago (copy job) |
| Reading from the documaris R2 bucket | documaris app only |
| Schema of the documaris R2 bucket | Agreed jointly at M0; owned by documaris |
documaris R2 layout (target schema — copy job implemented by indago):
s3://documaris-bucket/
vessels/vessel_id=IMO1234567/data.parquet ← name, flag, IMO, GT, LOA, certificates
voyages/voyage_id=V20260424/data.parquet ← departure/arrival port, ETA, ETD
cargo/voyage_id=V20260424/data.parquet ← HS codes, quantities, DG flags, BL refs
events/vessel_id=IMO1234567/2026-04-24.json ← AIS position fixes, port entry/exit
Schema contract (M0): the documaris R2 partition layout is the interface contract between indago and documaris. It must be agreed before Milestone 0 completes. indago's existing R2 output (AIS and vessel scoring data for arktrace) uses a different schema; the copy job for documaris is a separate pipeline that indago must implement without modifying its existing outputs.
DuckDB runs in-process (Rust duckdb crate, bundled feature) to JOIN across Parquet files with a single SQL query and output a flat JSON record. The object_store crate (aws feature) handles S3-compatible download from the documaris R2 bucket; swapping to a local file system for development requires no code change.
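The flattening step can be illustrated by assembling the JOIN as a SQL string. Column names and the trivial join conditions are placeholders — the real schema is the M0 contract, and the partition paths already narrow each file to one vessel/voyage. In the app the same string would be passed to the duckdb crate's prepare(); building it as a string keeps this sketch runnable without the crate:

```rust
// Hypothetical single flattening query over the partitioned Parquet layout.
fn flatten_query(vessel_id: &str, voyage_id: &str) -> String {
    format!(
        "SELECT v.name, v.flag, v.gt, y.eta, y.etd, c.dg_flag \
         FROM read_parquet('vessels/vessel_id={vid}/data.parquet') v \
         JOIN read_parquet('voyages/voyage_id={yid}/data.parquet') y ON true \
         JOIN read_parquet('cargo/voyage_id={yid}/data.parquet') c ON true",
        vid = vessel_id,
        yid = voyage_id,
    )
}

fn main() {
    let sql = flatten_query("IMO1234567", "V20260424");
    assert!(sql.contains("vessel_id=IMO1234567"));
    assert!(sql.contains("voyage_id=V20260424"));
}
```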
Crew PII is never stored in R2. It is provided by the user directly inside the native app and never leaves the local machine (see Layer 6 and the privacy boundary section).
Key dependencies:
object_store = { version = "0.10", features = ["aws"] }
duckdb = { version = "1.1", features = ["bundled"] }
tokio = { version = "1", features = ["full"] }
Layer 2 — Field Mapping
Each document type has a field_map.json that maps every form field to its indago source and specifies how it should be filled:
{
"form_field": "brief_cargo_description",
"source": "indago.cargo.manifest_summary",
"type": "llm_summarise",
"llm_required": true,
"llm_prompt": "Summarise the cargo manifest in one line suitable for IMO FAL Form 1 field 13."
}
Field types: direct (copy as-is) · llm_summarise · llm_translate · llm_infer · computed.
This schema is the formal contract between indago's data layout and documaris's form templates. It must be agreed before Milestone 0 begins. Field source paths use the indago.* namespace (e.g. indago.cargo.manifest_summary).
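The five field types reduce to a small dispatch; a sketch with assumed names (FieldType, field_type, needs_llm are not the real documaris types):

```rust
// Illustrative mirror of the "type" values in field_map.json.
#[derive(PartialEq, Debug)]
enum FieldType {
    Direct,
    LlmSummarise,
    LlmTranslate,
    LlmInfer,
    Computed,
}

// Maps the JSON "type" string to the enum; unknown strings are rejected.
fn field_type(s: &str) -> Option<FieldType> {
    match s {
        "direct" => Some(FieldType::Direct),
        "llm_summarise" => Some(FieldType::LlmSummarise),
        "llm_translate" => Some(FieldType::LlmTranslate),
        "llm_infer" => Some(FieldType::LlmInfer),
        "computed" => Some(FieldType::Computed),
        _ => None,
    }
}

// Corresponds to the "llm_required" flag in the field map entry.
fn needs_llm(t: &FieldType) -> bool {
    matches!(
        t,
        FieldType::LlmSummarise | FieldType::LlmTranslate | FieldType::LlmInfer
    )
}

fn main() {
    assert_eq!(field_type("llm_summarise"), Some(FieldType::LlmSummarise));
    assert!(needs_llm(&FieldType::LlmSummarise));
    assert!(!needs_llm(&FieldType::Direct));
    assert!(field_type("unknown").is_none());
}
```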
Layer 3 — AI Fill
The AI fill layer is decoupled from any specific model or delivery mechanism behind a Rust trait:
#[async_trait]
pub trait LlmProvider: Send + Sync {
async fn fill_field(&self, req: &FieldFillRequest) -> Result<FieldFillResponse, LlmError>;
async fn extract_image(&self, image: &[u8], schema_hint: &str) -> Result<Value, LlmError>;
}
Swapping local vs. cloud, or native app vs. server, is a config.toml change; no code change required.
⚠ Implementation under review: The specific model selection (local open-source model vs. cloud API) and delivery mechanism (native app vs. web app) are being evaluated. Options under consideration include distributing a permissively licensed (Apache 2.0 / MIT) model with the application to eliminate cloud API costs and network dependencies. Model names and provider details will be specified once the architecture decision is finalised.
Apache 2.0 / MIT compatible model candidates:
| Model | Licence | Size | Notes |
|---|---|---|---|
| Llama 3.2 3B Instruct | Llama 3.2 Community (≈Apache 2.0 for <700M MAU) | 3B | Good instruction following; runs on CPU via llama.cpp / MLX; already used in clarus explain step |
| Qwen 2.5 3B Instruct | Apache 2.0 | 3B | Strong multilingual (EN/JA/ZH); relevant for Japan NACCS Phase 2 |
| Qwen 2.5 7B Instruct | Apache 2.0 | 7B | Higher accuracy for llm_summarise / llm_translate fields; requires ~6 GB RAM |
| Mistral 7B Instruct v0.3 | Apache 2.0 | 7B | Strong instruction following; well-tested with llama.cpp GGUF; European regulatory text |
| Phi-3.5 Mini Instruct | MIT | 3.8B | Compact; runs efficiently on CPU; good for short structured-output tasks (field inference) |

Selection criteria: the chosen model must handle llm_summarise, llm_translate (EN/JA minimum), and llm_infer field types at acceptable quality. The offline-first constraint favours 3B–7B GGUF models runnable via llama.cpp with no GPU requirement.
Capability requirements (delivery-mechanism-independent):
| Task | Requirement |
|---|---|
| Direct field copy | No AI needed |
| Cargo summary, FAL free-text | Multilingual text generation (English / Japanese) |
| Japanese field fill / translation | Japanese language support required |
| Regulatory conflict detection | Structured JSON output with confidence score |
| Japanese handwriting OCR + hanko (Phase 2) | Vision / multimodal capability required |
| Long-context multi-document reasoning | Extended context window required |
All prompts request structured JSON output with a confidence field. Low-confidence fields surface as UI warnings and are never silently auto-submitted.
Layer 4 — Trust Layer
Implemented by reusing edgesentry-audit — the shared Rust crate from edgesentry-rs (blake3 = "1.5", ed25519-dalek = "2.1"). No new crypto code is written in documaris.
[dependencies]
edgesentry-audit = { git = "https://github.com/edgesentry/edgesentry-rs", tag = "v0.1.0" }
Remote audit store — MVP and future:
MVP: an append-only Cloudflare R2 bucket. Tamper-evidence comes from the hash chain and Ed25519 signatures produced by edgesentry-audit — not from R2's storage guarantees. Records are write-once (the bucket permits no DELETE and no overwrite); any modification to a stored record breaks the chain and is detectable.
Future: immugate — a commercial service to be built by this team, providing a dedicated tamper-proof audit log with a public Merkle-tree verification API. immugate is not yet built; the R2 bucket is the production interim. The store-and-forward endpoint in edgesentry-audit is the only change needed when immugate ships.
Responsibility boundary — edgesentry-audit is domain-agnostic:
edgesentry-audit is an independent library. It knows nothing about vessels, voyages, documents, or AI fields. It receives opaque bytes and returns a sealed record. All maritime semantics live in documaris.
flowchart TD
payload["<b>documaris</b> constructs DocumentAuditPayload\nvessel_id · voyage_id · doc_type\ngenerated_by · generated_at\nai_field_values · llm_confidence_flags\nfields_modified · regulatory_alerts\n<i>(Class C only — no PII)</i>"]
bytes["serialize → opaque bytes"]
seal["<b>edgesentry-audit</b>\nseal(payload_bytes, prev_hash, signing_key)\n<i>domain-agnostic — knows nothing about maritime fields</i>"]
record["AuditRecord\npayload_hash (BLAKE3)\nprev_record_hash\nsignature (Ed25519)\nseq · ts"]
xmp["hash embedded in\nPDF XMP /DocumentHash"]
local["<b>[1] LOCAL audit log</b>\nnative app · append-only\nwritten first · always available"]
remote["<b>[2] REMOTE audit store</b>\nappend-only R2 bucket (MVP)\nimmugate · future commercial service"]
payload --> bytes --> seal --> record
record --> xmp
record --> local
local -.->|"store-and-forward\nqueued if offline"| remote
edgesentry-audit's public interface (simplified):
// All documaris-specific fields are in the opaque payload_bytes.
// edgesentry-audit seals whatever bytes it receives.
fn seal(payload_bytes: &[u8], prev_hash: Hash32, key: &SigningKey) -> AuditRecord;
fn verify(record: &AuditRecord, payload_bytes: &[u8]) -> bool;
// store-and-forward: knows only the endpoint URL, not the payload semantics
fn queue_and_sync(record: AuditRecord, payload_bytes: Vec<u8>, endpoint: &Url);
Tamper-evidence is structural, not policy:
- Hash chain: each record includes prev_record_hash. Inserting, modifying, or deleting any record breaks all subsequent hashes in the chain — detectable by any party holding a copy.
- Ed25519 signature: each record is signed with the operator's key. Modifying a record breaks its signature.
- Dual copy: the local log and the remote log can be cross-verified against each other. An attacker would need to compromise both simultaneously to suppress evidence.
- Sequence numbers: gaps in the sequence are detectable — records cannot be silently dropped.
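The chain property can be demonstrated with a self-contained sketch. The standard library's DefaultHasher stands in for BLAKE3 and the Ed25519 signature is omitted, so this shows only the detection mechanics, not the real edgesentry-audit types:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Toy record: DefaultHasher as a BLAKE3 stand-in, no signature.
struct Record {
    seq: u64,
    payload_hash: u64,
    prev_record_hash: u64,
}

// Hash the payload together with the previous hash and sequence number.
fn h(bytes: &[u8], prev: u64, seq: u64) -> u64 {
    let mut s = DefaultHasher::new();
    bytes.hash(&mut s);
    prev.hash(&mut s);
    seq.hash(&mut s);
    s.finish()
}

// Append a record whose hash commits to everything before it.
fn append(chain: &mut Vec<Record>, payload: &[u8]) {
    let (seq, prev) = match chain.last() {
        Some(r) => (r.seq + 1, r.payload_hash),
        None => (0, 0),
    };
    chain.push(Record { seq, payload_hash: h(payload, prev, seq), prev_record_hash: prev });
}

// Re-derive every hash from the claimed payloads; any edit, insertion,
// deletion, or gap fails the walk.
fn chain_intact(chain: &[Record], payloads: &[&[u8]]) -> bool {
    let mut prev = 0u64;
    for (r, p) in chain.iter().zip(payloads) {
        if r.prev_record_hash != prev || r.payload_hash != h(p, prev, r.seq) {
            return false;
        }
        prev = r.payload_hash;
    }
    chain.len() == payloads.len()
}

fn main() {
    let mut chain = Vec::new();
    append(&mut chain, b"doc generated");
    append(&mut chain, b"field corrected");
    let ok: [&[u8]; 2] = [b"doc generated", b"field corrected"];
    assert!(chain_intact(&chain, &ok));
    // Tampering with an earlier payload breaks every later check.
    let tampered: [&[u8]; 2] = [b"doc generated (edited)", b"field corrected"];
    assert!(!chain_intact(&chain, &tampered));
}
```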
What the audit log records (no PII, full action trace):
| What happened | What's recorded |
|---|---|
| Document generated | who, when, which vessel/voyage (by ID), document type |
| AI filled a field | what text was generated, confidence score |
| Reviewer accepted a low-confidence field | that they accepted it, the confidence at the time |
| Reviewer corrected a field | before value, after value, editor identity |
| Regulatory alert raised | severity, rule triggered, resolution action |
| MEDIUM alert overridden | reason code entered by reviewer, their identity |
| Document hash embedded in PDF | the hash (not the document content) |
Root cause analysis: agents and authorities can query either copy to reconstruct the exact sequence of actions that produced a document — without retrieving any crew PII. Whether a port rejection was caused by an AI error, a reviewer override, post-generation tampering, or a source data issue is answerable from the audit log alone.
Verification: GET /audit/verify?hash=<blake3_hex> → { "verified": true, "chain_intact": true, … } — served by the remote audit store, independent of documaris.
documaris also auto-generates an AIS Voyage Evidence Summary companion document — a natural-language summary of the vessel's AIS track (departure port/time, transit, arrival, port stay duration), generated from indago's AIS event Parquet data via the AI fill layer. The summary is treated as a payload and sealed by edgesentry-audit identically to any other document — the library does not distinguish it from a FAL form. This turns a form generator into a verifiable audit instrument: false declarations become detectable.
TrustSG / IMDA alignment: the Trust Layer directly addresses two TrustSG pillars — Authenticity (Ed25519 signature proves the document originated from verified vessel data) and Integrity (BLAKE3 hash + append-only audit log proves no post-generation modification). This positions documaris as national-grade trust infrastructure for maritime document exchange, not a convenience tool.
Layer 5 — Regulatory Alert
At generation time, the AI fill layer cross-references the vessel snapshot against a per-port JSON regulatory knowledge base and returns a structured conflict list:
flowchart LR
vessel["vessel snapshot"]
kb["port regulatory KB\n(JSON, per port)"]
llm["AI conflict check"]
high["🔴 HIGH\nblock submission"]
medium["🟡 MEDIUM\nwarn · reviewer override\nreason code · audit-logged"]
low["🟢 LOW\nnote in PDF cover sheet"]
vessel --> llm
kb --> llm
llm --> high
llm --> medium
llm --> low
No hard-coded rule logic; the AI model evaluates natural-language rule descriptions against vessel data. The knowledge base is updated by a combination of automated port-notice monitoring and manual review.
Example rules: BWM D-2 certificate validity, crew document expiry windows within port-specific minimum periods, DG cargo restrictions under current port circulars, quarantine pre-notification window compliance.
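The three-tier gate reduces to a small decision function. A hedged sketch — the enum and can_submit are illustrative, not the documaris API:

```rust
// 🔴 HIGH blocks, 🟡 MEDIUM needs a reviewer override with a reason code
// (which is audit-logged), 🟢 LOW only adds a note to the PDF cover sheet.
enum Severity {
    High,
    Medium,
    Low,
}

fn can_submit(sev: Severity, override_reason: Option<&str>) -> bool {
    match sev {
        Severity::High => false,                       // block submission outright
        Severity::Medium => override_reason.is_some(), // override requires a reason code
        Severity::Low => true,                         // cover-sheet note only
    }
}

fn main() {
    assert!(!can_submit(Severity::High, Some("any reason"))); // HIGH cannot be overridden
    assert!(!can_submit(Severity::Medium, None));
    assert!(can_submit(Severity::Medium, Some("DG exemption letter on file")));
    assert!(can_submit(Severity::Low, None));
}
```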
This layer shifts documaris from "document automation tool" (commoditised) to "compliance advisor" (high switching cost). A single avoided port detention justifies an annual subscription many times over.
Layer 6 — Render
All forms — including those containing crew PII — are rendered inside the native app. There is no server-side rendering path and no split between PII and non-PII forms. The server-side / client-side duality that a browser-based approach required is eliminated.
flowchart TD
vessel_json["vessel/voyage JSON\n(local cache)"]
crew_json["crew JSON\n(user-provided · local only)\n⚠ PII — never transmitted"]
render["Field map → Tera template\n→ HTML → native PDF renderer"]
trust["Trust Layer\nBLAKE3 hash in XMP · Ed25519 signature"]
pdf_out["PDF → local file system"]
local_log["<b>[1] LOCAL audit log</b>\nwritten immediately\nagent's own record"]
remote_store["<b>[2] REMOTE audit store</b>\nappend-only R2 bucket (MVP)\nimmugate · future"]
vessel_json --> render
crew_json --> render
render --> trust
trust --> pdf_out
trust -->|"AuditRecord + payload_bytes"| local_log
local_log -.->|"edgesentry-audit\nstore-and-forward"| remote_store
Offline-first: The entire pipeline — data fetch cache, AI fill model, PDF render, signing key, and local audit log write — runs without a network connection. A ship's steel engine room with no signal is a supported environment. The remote audit log sync is the only network-dependent step, and it is queued with store-and-forward (via edgesentry-audit) until connectivity resumes. The local audit log is always written first, so the agent's own tamper-evident record is available immediately regardless of connectivity.
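The store-and-forward behaviour can be sketched with a plain queue and an injected uplink. Types and names are illustrative, not edgesentry-audit's API:

```rust
use std::collections::VecDeque;

// Records queue locally and drain only when an uplink probe succeeds.
struct SyncQueue {
    pending: VecDeque<Vec<u8>>, // serialised AuditRecords awaiting upload
}

impl SyncQueue {
    fn new() -> Self {
        Self { pending: VecDeque::new() }
    }

    // The local log write has already happened; the record is merely queued
    // for the remote store.
    fn enqueue(&mut self, record: Vec<u8>) {
        self.pending.push_back(record);
    }

    // Drain in order while the uplink accepts records; stop on first failure
    // so remote ordering — and therefore the hash chain — is preserved.
    fn sync(&mut self, mut uplink: impl FnMut(&[u8]) -> bool) -> usize {
        let mut sent = 0;
        while let Some(r) = self.pending.front() {
            if !uplink(r.as_slice()) {
                break;
            }
            self.pending.pop_front();
            sent += 1;
        }
        sent
    }
}

fn main() {
    let mut q = SyncQueue::new();
    q.enqueue(b"record-1".to_vec());
    q.enqueue(b"record-2".to_vec());
    assert_eq!(q.sync(|_| false), 0); // offline in the engine room: nothing lost
    assert_eq!(q.pending.len(), 2);
    assert_eq!(q.sync(|_| true), 2); // connectivity resumes: queue drains in order
    assert_eq!(q.pending.len(), 0);
}
```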
Privacy: Because everything runs inside the native app process, there is no server-side code path at all for document generation. The privacy guarantee is structurally enforced, not a matter of configuration. Veson Nautical, ShipNet, and Helm CONNECT all require active server connectivity to render documents; documaris eliminates that dependency entirely.
ZKP Portfolio Attestation
documaris can verify BCA Green Mark compliance attestations from the clarus WORM audit chain — without accessing raw sensor data. This is the fourth trust layer alongside BLAKE3 hash, Ed25519 signature, and audit chain integrity.
clarus edge (on-premises):
    sensor data (private)
    → GreenMarkProgram.prove()
    → ZkProof { public_values }        ← cert_level, pass/fail only
    raw EUI/COP/LPD: never stored

clarus WORM chain (R2):
    chains/{site}/{run}/{seq}.json
    └── zk_proof.public_values = base64(GreenMarkAttestation)

documaris AttestationView:
    decodes public_values
    displays cert level + PASS/FAIL
    raw sensor data: never fetched
fetchSiteAttestation(siteId) discovery order:
1. zkp-latest/{site}.json — edge-written pointer; single GET (strongly consistent)
2. /api/audit-summary?site=X — clarus Pages Function (run list)
3. /api/audit-index?site=X — key listing fallback
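The discovery order is a first-success scan over ordered sources. A minimal sketch — the function shape is an assumption; the real fetchSiteAttestation also performs the HTTP calls and parses the responses:

```rust
// Try each source in priority order and return the first hit.
fn first_success<T>(sources: &[&dyn Fn() -> Option<T>]) -> Option<T> {
    sources.iter().find_map(|f| f())
}

fn main() {
    // 1. zkp-latest/{site}.json pointer: miss in this scenario
    let pointer = || -> Option<&'static str> { None };
    // 2. /api/audit-summary: hit — later sources are never consulted
    let summary = || -> Option<&'static str> { Some("run-2026-05-01") };
    // 3. /api/audit-index fallback: unreached
    let index = || -> Option<&'static str> { Some("run-from-index") };

    let sources: Vec<&dyn Fn() -> Option<&'static str>> = vec![&pointer, &summary, &index];
    assert_eq!(first_success(&sources), Some("run-2026-05-01"));
}
```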
URL routing: ?mode=bca navigates directly to BCA Green Mark + ZKP view. URL updates on tab switch so BCA-focused teams can bookmark directly.
API: GET /api/bca-portfolio/:owner → PortfolioAttestation JSON — for BCA integration and downstream audit systems.
OCR / Reverse Ingestion (Phase 2 — post-PIER71 roadmap)
image input (JPEG / PNG / PDF scan)
— smartphone photo, flatbed scanner, MFP scan-to-email, digital camera
│
▼ vision-capable AI model (local, multimodal — model TBD)
"Extract fields from this Japanese maritime form. Return structured JSON."
│
▼ JSON extraction with per-field confidence + hanko_verification:
{
"vessel_name": { "value": "...", "confidence": "high" },
"hanko_verification": {
"detected": true,
"clarity_score": 0.87,
"overlap_score": 0.12,
"naccs_risk": "low"
}
}
│
▼ Intermediate JSON review UI (native app)
All fields editable; low-confidence fields highlighted
Hanko-Confidence Score meter + NACCS risk indicator
"Confirm and proceed" gate before NACCS format conversion
│
▼ NACCS-formatted output
The Hanko-Confidence Score (0.0–1.0) measures the presence, clarity, and text-overlap of a hanko stamp and predicts the risk of NACCS automated-check rejection. This directly addresses Japan's paper-authentication culture and closes the trust gap between paper and digital workflows. No competing maritime software offers this.
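A purely hypothetical mapping from the sub-scores to the naccs_risk band — the real thresholds are not specified in this document; the numbers below are chosen only to be consistent with the worked JSON example above:

```rust
// Hypothetical risk banding: thresholds are illustrative placeholders.
fn naccs_risk(detected: bool, clarity_score: f64, overlap_score: f64) -> &'static str {
    if !detected {
        "high" // no hanko found at all
    } else if clarity_score >= 0.8 && overlap_score <= 0.2 {
        "low" // clear stamp, minimal text overlap
    } else if clarity_score >= 0.5 {
        "medium"
    } else {
        "high"
    }
}

fn main() {
    // Matches the example above: clarity 0.87, overlap 0.12 → "low".
    assert_eq!(naccs_risk(true, 0.87, 0.12), "low");
    assert_eq!(naccs_risk(false, 0.0, 0.0), "high");
}
```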
Compliance and Operations Policy
Data classification (Class A/B/C), PII boundary, access control, human-in-the-loop gates, audit trail schema, incident response SLAs, and regulatory compliance (PDPA / APPI / GDPR) are documented in ref-compliance-policy.md.
Cargo Workspace
Design note: documaris does not yet have a native Rust crate. The workspace layout below describes the intended structure if a native app is adopted at M0. Until then, edgesentry-audit is referenced as a git dependency (see Layer 4 above).
If a native app is built, the recommended approach is a git dependency pointing to edgesentry-rs:
[dependencies]
edgesentry-audit = { git = "https://github.com/edgesentry/edgesentry-rs", tag = "v0.1.0" }
Alternatively, a shared workspace root:
# /edgesentry/Cargo.toml
[workspace]
members = [
"edgesentry-rs/crates/edgesentry-audit",
"documaris/crates/documaris-core",
"documaris/crates/documaris-cli",
]
One Cargo.lock for the entire repo; all products share dependency versions.
Technology stack summary
| Component | Technology |
|---|---|
| App delivery | Native desktop app (macOS / Windows / Linux); distributable installer |
| Core pipeline | Rust (documaris-core + documaris-cli crates) |
| AI fill — text | Local open-source model, Apache 2.0 / MIT licence (model TBD); bundled or downloaded on first run; runs fully offline via LlmProvider trait |
| AI fill — vision / OCR (Phase 2) | Local multimodal model (model TBD) |
| PDF render | Native PDF library (all forms, including PII; single render path) |
| Template engine | Tera (Rust) |
| Document hashing + signing | edgesentry-audit (git dependency; workspace path dep if adopted) — BLAKE3 + Ed25519 |
| Data fetch | object_store crate, aws feature (S3-compatible; reads from documaris R2 bucket only) |
| In-process query | DuckDB (duckdb crate, bundled feature) |
| Local data cache | App-local directory; vessel/voyage/cargo Parquet snapshots from documaris R2 |
| Regulatory KB | JSON per port, bundled with app; AI eval at generation time |
| Audit log sync | edgesentry-audit store-and-forward; queued locally when offline |
| Data lake | Cloudflare R2 — documaris bucket (indago copy job writes; documaris app reads) |
See also: ref-background.md · roadmap/index.md
Full technical detail per layer: _outputs/document-generation-architecture.md