Architecture
System Overview
┌─────────────────────────────────────────────────────────────────┐
│ PUBLIC DATA SOURCES │
│ │
│ AIS (aisstream.io WebSocket with --bbox override; │
│ Marine Cadastre Parquet for US waters only) │
│ Sanctions (OFAC SDN, EU, UN, OpenSanctions CC0) │
│ Vessel registry (Equasis, ITU MMSI) │
│ Trade flow (UN Comtrade API) │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ INGESTION LAYER (src/ingest/) │
│ │
│ AIS positions ──────────────────► DuckDB (ais_positions table) │
│ Sanctions entities ─────────────► DuckDB (sanctions_entities) │
│ Vessel ownership chains ────────► Lance Graph (on-disk files) │
│ Trade flow by route ────────────► DuckDB (trade_flow table) │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ FEATURE ENGINEERING (src/features/) │
│ │
│ AIS behavioral features ───────► Polars DataFrame │
│ · gap count / max gap hours │
│ · position jump count (spoofing) │
│ · STS candidate events │
│ · port call ratio │
│ │
│ Identity volatility features ───► Polars DataFrame │
│ · flag_changes_2y │
│ · name_changes_2y │
│ · owner_changes_2y │
│ │
│ Ownership graph features ───────► Lance Graph (Polars joins) │
│ · sanctions_distance (min hops to sanctioned entity) │
│ · cluster_sanctions_ratio │
│ │
│ Trade mismatch features ────────► Polars + DuckDB │
│ · route_cargo_mismatch │
└──────────────────────────┬──────────────────────────────────────┘
│ combined feature matrix (Polars)
▼
┌─────────────────────────────────────────────────────────────────┐
│ SCORING ENGINE (src/score/) │
│ │
│ HDBSCAN ── normal MPOL baseline (per vessel type / route) │
│ Isolation Forest ── anomaly_score ∈ [0,1] │
│ Lance Graph ── graph_risk_score ∈ [0,1] │
│ C3 DiD model ─ calibrate graph_risk_score weight (→ composite) │
│ Composite ── confidence = w_a·anomaly + w_g·graph │
│ + w_i·identity_volatility │
│ (weights calibrated by causal_sanction.py) │
│ SHAP ── top_signals JSON per vessel │
└──────────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ OUTPUT │
│ │
│ data/processed/candidate_watchlist.parquet │
│ FastAPI + HTMX dashboard (src/api/) → http://localhost:8000 │
│ │
│ ↕ (context window — no external calls) │
│ │
│ Ollama / MLX LLM → analyst brief + streaming chat │
└─────────────────────────────────────────────────────────────────┘
│
▼ handoff
┌─────────────────────────────────────────────────────────────────┐
│ PHYSICAL INVESTIGATION (edgesentry-app / edgesentry-rs) │
│ (out of scope for this repo — see roadmap.md) │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ STORAGE LAYER (cross-cutting; local or S3-compatible) │
│ │
│ OLAP — DuckDB + Parquet │
│ · local: data/processed/mpol.duckdb │
│ · Parquet outputs: data/processed/*.parquet │
│ or s3://arktrace/processed/ (MinIO / S3) │
│ │
│ Graph — Lance (embedded, serverless) │
│ · local: data/processed/mpol_graph/ │
│ data/processed/gdelt.lance │
│ · remote: s3://arktrace/mpol_graph/ │
│ s3://arktrace/gdelt.lance (MinIO / S3) │
│ │
│ Object store — MinIO localhost:9000 (docker-compose.yml) │
│ · bucket: arktrace (created by minio_init on first run) │
│ · console: localhost:9001 │
└─────────────────────────────────────────────────────────────────┘
Data Storage Design
DuckDB (data/processed/mpol.duckdb)
DuckDB is the primary analytical store. It runs in-process with no server and queries Parquet files directly. Multi-region deployments use separate DuckDB files per region (e.g. data/processed/europe.duckdb) — every script accepts a --db flag to target the correct file. See regional-playbooks.md for per-region paths and bbox values.
Parquet persistence: all pipeline output files (watchlist, causal effects, validation metrics) are written by src/storage/config.py. When S3_BUCKET is set, output goes to s3://<bucket>/processed/<filename> via MinIO or any S3-compatible store; otherwise it writes to data/processed/<filename> on the local filesystem. No code changes are required to switch between the two modes.
| Table | Key columns | Source |
|---|---|---|
ais_positions |
mmsi, timestamp, lat, lon, sog, cog, nav_status, ship_type |
aisstream.io (all regions); Marine Cadastre Parquet (US waters only) |
sanctions_entities |
entity_id, name, mmsi, imo, flag, type, list_source |
OFAC, EU, UN, OpenSanctions |
trade_flow |
reporter, partner, hs_code, period, trade_value_usd, route_key |
UN Comtrade |
vessel_meta |
mmsi, imo, name, flag, ship_type, gross_tonnage |
Equasis + ITU MMSI |
vessel_features |
one row per MMSI, all engineered features | Computed by src/features/ |
Lance Graph (data/processed/mpol_graph/)
Lance Graph stores the vessel ownership graph as columnar Lance datasets — no external server or Docker container required. The graph directory is written by src/ingest/vessel_registry.py and read by src/features/ownership_graph.py and src/features/identity.py.
Storage backend: src/storage/config.py exposes lance_graph_uri(stem) and lance_db_uri() which resolve to local paths (data/processed/mpol_graph/, data/processed/gdelt.lance) when running without S3, and to s3://arktrace/mpol_graph/ / s3://arktrace/gdelt.lance when S3_BUCKET is set. Lance's built-in object store support handles the S3 read/write transparently.
Node datasets (one Lance file each):
- Vessel {mmsi, imo, name}
- Company {id, name, country}
- Country {code}
- VesselName {name}
- Address {address_id, street}
- SanctionsRegime {name}
Relationship datasets (src_id → dst_id plus edge properties):
- OWNED_BY — (Vessel.mmsi) → (Company.id) with {since, until}
- MANAGED_BY — (Vessel.mmsi) → (Company.id) with {since, until}
- REGISTERED_IN — (Company.id) → (Country.code)
- CONTROLLED_BY — (Company.id) → (Company.id) — beneficial ownership layers
- ALIAS — (Vessel.mmsi) → (VesselName.name) with {date}
- SANCTIONED_BY — (Vessel.mmsi | Company.id) → (SanctionsRegime.name) with {list, date}
- REGISTERED_AT — (Company.id) → (Address.address_id) — shared-address clustering
- STS_CONTACT — (Vessel.mmsi) → (Vessel.mmsi) — co-location events
Key graph queries (implemented as Polars joins in src/features/):
# Minimum BFS distance from vessel to any sanctioned company
# 0 = directly sanctioned, 1 = 1-hop owner/manager, 2 = 2-hop via CONTROLLED_BY, 99 = none
Feature Design
AIS Behavioral Features
Computed with Polars over a rolling 30-day window per MMSI.
| Feature | Definition | Shadow fleet signal |
|---|---|---|
ais_gap_count_30d |
Gaps > 6h in AIS signal while in open sea | STS transfer or deliberate dark period |
ais_gap_max_hours |
Longest single gap | Severity indicator |
position_jump_count |
Consecutive positions implying > 50 knots | GPS spoofing |
sts_candidate_count |
Co-located drift events (2 vessels within 0.5nm, both drifting, at sea) | Illicit STS transfer |
port_call_ratio |
AIS-declared port calls ÷ detected anchorage events | Port declaration fraud |
loitering_hours_30d |
Hours at < 2 knots outside port boundaries | Waiting for STS opportunity |
Identity Volatility Features
Computed from Equasis historical data via Lance Graph datasets.
| Feature | Definition |
|---|---|
flag_changes_2y |
Number of flag state changes in rolling 2 years |
name_changes_2y |
Number of name changes in rolling 2 years (from ALIAS dataset) |
owner_changes_2y |
Number of registered owner changes (from OWNED_BY dataset) |
high_risk_flag_ratio |
Fraction of time under flags with weak PSC oversight |
ownership_depth |
Number of beneficial ownership layers to natural person |
Ownership Graph Features
Computed by Polars joins over Lance Graph datasets.
| Feature | Definition |
|---|---|
sanctions_distance |
Min BFS hops from vessel to any sanctioned entity (0 = vessel itself sanctioned) |
cluster_sanctions_ratio |
Fraction of vessels in same ownership cluster that are sanctioned |
shared_manager_risk |
Max sanctions_distance among all vessels sharing the same manager |
shared_address_centrality |
Number of distinct vessels sharing the same registered address as any company in this vessel's ownership chain |
sts_hub_degree |
Number of distinct vessels this vessel has been co-located with (STS_CONTACT degree) — identifies laundering hubs |
Trade Flow Mismatch Features
Computed by joining AIS route segments to UN Comtrade flow data.
| Feature | Definition |
|---|---|
route_cargo_mismatch |
Declared cargo type vs modal cargo on detected origin→destination route |
declared_vs_estimated_cargo_value |
AIS-implied cargo volume vs UN Comtrade flow value for that route/period |
Scoring Design
MPOL Baseline (HDBSCAN)
HDBSCAN clusters vessels by behavioral profile (speed pattern, route regularity, gap frequency) stratified by vessel type and route corridor. The resulting cluster labels define "normal" MPOL for each segment. Vessels that fall outside all clusters (noise points) are assigned higher anomaly weight.
Anomaly Score (Isolation Forest)
Isolation Forest is trained on the full feature matrix of vessels with sanctions_distance ≥ 3 (assumed clean) to learn normal behavior. The resulting anomaly scores are calibrated to [0,1].
C3 · Causal Sanction-Response Model (DiD)
src/score/causal_sanction.py quantifies whether AIS gap frequency causally increases after sanction announcements for vessels connected (within 2 graph hops) to sanctioned entities. This is used to calibrate the graph_risk_score weight in the composite formula.
For each regime (OFAC Iran, OFAC Russia, UN DPRK) the model fits a Difference-in-Differences (DiD) regression:
outcome_{it} = β₀ + β₁·treated_i + β₂·post_t + β₃·(treated_i × post_t)
+ vessel_type FEs + route_corridor FEs + ε_{it}
where β₃ (ATT) is the sanction-attributable increase in AIS gaps per 30 days. OLS is estimated with HC3 heteroskedasticity-robust standard errors. Multiple announcement dates per regime are pooled via inverse-variance weighting.
Weight calibration: calibrate_graph_weight(effects) maps the fraction of positive-significant ATT estimates to a w_graph value in [0.20, 0.65]. Pass it to compute_composite_scores() via --w-graph:
# Calibrate then score
uv run python src/score/causal_sanction.py --output data/processed/causal_effects.parquet
uv run python src/score/composite.py --w-graph <calibrated_value>
Outputs: data/processed/causal_effects.parquet — regime, n_treated, n_control, ATT estimate, 95% CI, p-value, is_significant, calibrated_weight.
Dashboard exposure: the file is served via GET /api/causal-effects and rendered in the vessel review panel as per-regime ATT badges:
⚡ OFAC Iran ATT = +0.42 95% CI [+0.31, +0.53] p < 0.001
⚡ OFAC Russia ATT = +0.15 95% CI [-0.02, +0.32] p = 0.09 n.s.
Significant regimes (p < 0.05) are highlighted in indigo; non-significant regimes are rendered in grey. Returns {"available": false} if the file does not yet exist (e.g. before the first pipeline run).
Composite Score
confidence = w_anomaly × anomaly_score
+ w_graph × graph_risk_score
+ w_identity × identity_volatility_score
Default weights: w_anomaly = 0.4, w_graph = 0.4, w_identity = 0.2. All three are configurable via --w-anomaly, --w-graph, --w-identity CLI flags on src/score/composite.py. The C3 causal model provides a data-driven w_graph calibration (see section above and roadmap.md Phase C, C3).
Per-region weight tuning recommendations are in regional-playbooks.md.
Explainability (SHAP)
SHAP TreeExplainer computes per-feature contributions to the anomaly score for each vessel. The top 5 contributing features are serialised as top_signals JSON in the watchlist output and served via GET /api/vessels/{mmsi}/signals. The review panel renders them as a mini-table (Feature / Value / SHAP contribution / bar) so a duty officer can understand why a vessel was flagged without reading raw feature values.
LLM Integration
The LLM converts a deterministic, structured risk assessment into readable English for the analyst. All scoring decisions are made before the LLM is called; the model receives a pre-computed context window and cannot modify scores or access external data.
Use cases:
| Code | Input | Output |
|---|---|---|
| C2 — Analyst brief | Vessel profile + SHAP top_signals + 3 GDELT events |
One-paragraph risk summary per vessel |
| C6 — Analyst chat | Fleet overview + optional vessel detail + analyst question | Grounded factual answer |
Provider selection: controlled by LLM_PROVIDER environment variable.
| Value | Backend |
|---|---|
llamacpp (default) |
Bundled llamacpp server — no external process required |
ollama |
Ollama local server |
anthropic |
Anthropic API (requires LLM_API_KEY) |
gemini |
Google Gemini API (requires LLM_API_KEY) |
openai |
Any OpenAI-compatible endpoint |
Recommended local model: Gemma 4 4B Instruct (Q4_K_M) via llamacpp — downloaded automatically on first docker compose up. Context window fits within ~1 200 tokens; no GPU required.
No cloud dependency: inference runs entirely on-device by default. The LLM has no tool access, no function calling, and no internet connectivity during inference. Context is injected via the context window only.
See docs/local-llm-setup.md for model recommendations, hardware requirements, and setup instructions.