Development Guide

Project overview, repo layout, commands, coding conventions, and links to full documentation.

What This Repo Is

A shadow fleet candidate screening pipeline. Ingests public AIS, sanctions, vessel registry, and trade flow data → produces a ranked watchlist of candidate shadow fleet vessels with SHAP-explained confidence scores.

For context on the problem and full architecture, read docs/index.md and docs/architecture.md.

Repo Layout

_inputs/        Challenge docs (Cap Vista Solicitation 5.0 — do not edit)
docs/           Project documentation (source of truth for design decisions)
scripts/        Operator-facing CLI tools (run_pipeline.py, run_backtracking.py, …)
src/
  graph/        Lance Graph storage layer (store.py — node/relationship schemas, read/write)
  ingest/       Data ingestion scripts (AIS, sanctions, registry, trade flow)
  features/     Feature engineering (Polars + Lance Graph)
  score/        Scoring engine (HDBSCAN, Isolation Forest, SHAP, composite, causal DiD)
  analysis/     Post-confirmation intelligence (label_propagation, causal_rewind, backtracking_runner)
  api/          FastAPI + HTMX dashboard (src/api/main.py → http://localhost:8000)
data/
  raw/          Downloaded raw data (gitignored)
  processed/    DuckDB files, Parquet outputs, Lance Graph datasets (<region>_graph/)
tests/
pyproject.toml

Key References

Procedures

Run the full screening pipeline

The easiest way to run the full pipeline is the interactive CLI, which handles region selection and passes all flags automatically:

uv run python scripts/run_pipeline.py                          # interactive region selection
uv run python scripts/run_pipeline.py --region singapore --non-interactive
uv run python scripts/run_pipeline.py --region japan --non-interactive

Available regions: singapore, japan, middleeast, europe, gulf. See regional-playbooks.md for per-region parameter details.

Alternatively, run each step manually:

uv run python src/ingest/schema.py             # initialise DuckDB schema
uv run python src/ingest/marine_cadastre.py    # load historical AIS
uv run python src/ingest/sanctions.py          # load sanctions entities
uv run python src/ingest/vessel_registry.py    # load Equasis + ITU MMSI → Lance Graph
uv run python src/ingest/eo_gfw.py --bbox 95,1,110,6 --days 30  # EO detections (requires GFW_API_TOKEN in .env)
uv run python src/ingest/eo_gfw.py --csv data/raw/eo_detections_sample.csv  # EO detections via local CSV (no token needed)
uv run python src/features/ais_behavior.py     # compute AIS behavioral features
uv run python src/features/identity.py         # identity volatility features (Lance Graph)
uv run python src/features/ownership_graph.py  # Lance Graph ownership features
uv run python src/features/trade_mismatch.py   # trade flow mismatch features
uv run python src/score/mpol_baseline.py       # HDBSCAN baseline
uv run python src/score/anomaly.py             # Isolation Forest scoring
uv run python src/score/causal_sanction.py     # C3: DiD causal model → calibrated w_graph
uv run python src/score/composite.py           # composite score + SHAP (pass --w-graph from above)
uv run python src/score/watchlist.py           # output candidate_watchlist.parquet

Run the dashboard

uv run uvicorn src.api.main:app --reload
# open http://localhost:8000

Run the operations shell (menu-driven jobs)

bash scripts/run_operations_shell.sh

Covers Full Screening, Review-Feedback Evaluation, Historical Backtesting, and Demo/Smoke. See pipeline-operations.md.

Run the delayed-label intelligence loop (backtracking)

# Full pass (all confirmed labels):
uv run python scripts/run_backtracking.py --db data/processed/mpol.duckdb

# Incremental (only labels confirmed since a checkpoint):
uv run python scripts/run_backtracking.py --since 2026-04-01T00:00:00Z

See backtracking-runbook.md for full options and output format.

Run tests

uv run pytest tests/

Coding Conventions

  • Polars: use the lazy API (pl.scan_parquet, .lazy(), .collect()) for all large AIS queries; avoid .to_pandas().
  • DuckDB: use parameterised queries; never interpolate user-supplied strings into SQL.
  • Lance Graph: read datasets via src.graph.store.load_tables(db_path); write via write_tables(db_path, tables). Graph features are implemented as Polars joins — no external graph server.
  • Output: all intermediate outputs are Parquet in data/processed/; no CSV outputs.
  • Secrets: API keys (aisstream.io, Equasis, GFW) go in .env (gitignored); read via python-dotenv. For EO fusion without a GFW token, pass --skip-eo or use --csv with a local detections file.

Out of Scope

Do not implement physical vessel inspection, edge sensor measurement, or VDES communication in this repo. Those belong in edgesentry-rs / edgesentry-app. If you need to reference those requirements, see field-investigation.md.