Historical Backtesting Validation

This document defines the reproducible offline evaluation workflow used to validate shadow-fleet candidate detection quality against historical public evidence.

Why Backtesting

We cannot know all true shadow-fleet vessels in real time. Backtesting provides a practical validation loop by replaying historical windows with known outcomes and measuring ranking quality.

Primary objective: maximize operational triage utility (a high hit rate in the top-N candidates), not perfect 100% classification.

Inputs

  1. A versioned manifest file listing evaluation windows
  2. A watchlist parquet per window
  3. A labels CSV per window with evidence-backed positive/negative labels

Templates:

  • config/evaluation_manifest.sample.json
  • config/eval_labels.template.csv
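For orientation, a single manifest window might look like the sketch below. The authoritative schema lives in the sample template above; the field names here are illustrative assumptions, not the project's actual keys:

```json
{
  "windows": [
    {
      "window_id": "2024-Q1-demo",
      "end_date": "2024-03-31",
      "watchlist_path": "data/processed/watchlist_2024q1.parquet",
      "labels_path": "data/processed/eval_labels_2024q1.csv"
    }
  ]
}
```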

Automation Boundary

This section clarifies what can be automated end-to-end and what still requires human judgment.

Can be automated

  1. Data extraction and file generation
    • Generate draft labels CSV rows from sanctions tables (MMSI/IMO/name/list source).
    • Generate manifest windows with watchlist and label file paths.
    • Validate required columns and file shape before running the backtest.
  2. Backtest execution and metric reporting
    • Run historical window evaluation from the manifest.
    • Compute ranking and classification metrics (Precision@K, Recall@K, AUROC, PR-AUC, calibration error).
    • Generate threshold suggestions for fixed review capacities.
    • Export JSON reports for CI artifacts and dashboards.
  3. Regression monitoring
    • Run repeatable checks in CI on curated historical windows.
    • Alert when key metrics drift below agreed thresholds.
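The pre-run shape validation can be sketched as a plain column and value check. This is a minimal illustration using the label-policy fields; the `mmsi`/`imo` column names are assumptions and the project's actual validator may differ:

```python
import csv

# Required columns: label/label_confidence/evidence_* follow the label policy;
# the mmsi/imo column names are illustrative assumptions.
REQUIRED = {"mmsi", "imo", "label", "label_confidence", "evidence_source", "evidence_url"}
VALID_LABELS = {"positive", "negative"}

def validate_labels_csv(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes shape checks."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            problems.append(f"missing columns: {sorted(missing)}")
            return problems
        for i, row in enumerate(reader, start=2):  # row 1 is the header
            if row["label"] not in VALID_LABELS:
                problems.append(f"row {i}: bad label {row['label']!r}")
            if not (row["mmsi"] or row["imo"]):
                problems.append(f"row {i}: needs at least one of mmsi/imo")
    return problems
```

Running this before the backtest turns a confusing mid-run failure into an actionable list of label-file defects.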

Cannot be fully automated

  1. Ground-truth certainty
    • Public data is incomplete and delayed; many true shadow-fleet outcomes are never formally published.
    • A fully complete positive/negative truth set therefore cannot be auto-derived.
  2. High-confidence negative labeling
    • "No public evidence" is not the same as "truly negative."
    • Negative labels with high confidence require analyst review and policy criteria.
  3. Evidence quality and temporal validity
    • Source credibility, evidence freshness, and timeline consistency require human validation.
    • Leakage checks (ensuring evidence was known within the historical window) require governance decisions.
  4. Operational decisioning
    • The model provides ranked candidates and scores; officers decide investigation priority.
    • Final status assignment (confirmed/cleared/inconclusive) is a human-in-the-loop decision.

Division of labor:

  • Automation handles: candidate generation, metric computation, report generation, regression checks.
  • Human review handles: evidence adjudication, label confidence assignment, final investigative decisions.
  • Feedback loop combines both: human outcomes are fed back into periodic model/threshold updates.

Label Policy

  • label: positive or negative
  • label_confidence: high, medium, weak (or unknown)
  • evidence_source/evidence_url: public source traceability

Recommended:

  • Use only evidence available up to each window end date
  • Keep label confidence explicit to avoid over-claiming
  • Prefer MMSI and IMO where possible

Public Data for Identified Vessels

A useful labeled set can be built from public data.

Practical positive-label sources

  1. Sanctions lists (machine-readable, strongest baseline)
    • OFAC SDN (US)
    • UN sanctions lists
    • EU sanctions lists
  2. Government and intergovernmental disclosures
    • Enforcement advisories and designation notices
    • Public case summaries naming vessels, IMO, or MMSI
  3. Reputable investigative datasets/reports
    • Open investigations that provide vessel identifiers and dated evidence

How to use these sources in evaluation

  • Treat sanctions/designations as high-confidence positives when vessel identifiers are present.
  • Include source URL and publication date in labels.
  • Map identifiers by MMSI and IMO (prefer both when available).
  • Freeze each evaluation window by date to prevent future information leakage.
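Freezing a window to prevent leakage amounts to dropping any label whose evidence was published after the window's end date. A minimal sketch, assuming each label row carries a `published_date` ISO string (the field name is illustrative, not the project's actual schema):

```python
from datetime import date

def freeze_labels_to_window(labels: list[dict], window_end: date) -> list[dict]:
    """Keep only labels whose public evidence was published on or before window_end."""
    kept = []
    for row in labels:
        published = date.fromisoformat(row["published_date"])
        if published <= window_end:
            kept.append(row)  # evidence was knowable within the window
    return kept
```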

Limits to keep in mind

  • Public data will not cover all true shadow-fleet vessels.
  • Some records are delayed, incomplete, or ambiguous.
  • Therefore, backtesting measures practical ranking utility, not perfect population recall.

Critical evaluation caveat

  • Cases that are neither detected by our algorithm nor publicly confirmed (not identified/caught in open sources) are treated as unknown and excluded from strict success/failure judgment.
  • The primary objective is to detect cases that are publicly confirmed with credible evidence.
  • Beyond that boundary, public data alone cannot provide rigorous ground truth for complete-recall evaluation.

Confidence tier definitions:

  • high: explicitly sanctioned/officially designated vessel with an identifier match
  • medium: multiple credible public sources with strong identifier evidence
  • weak: plausible but incomplete evidence (keep for analysis, not as a primary KPI)

Run Backtest

uv run python -m src.score.backtest \
  --manifest config/evaluation_manifest.sample.json \
  --output data/processed/backtest_report.json \
  --review-capacities 25,50,100

Output

data/processed/backtest_report.json includes:

  • Window-level metrics
  • Cross-window summary with mean and 95% CI (when multiple windows exist)
  • Stratified metrics by vessel type
  • False-positive/false-negative example rows
  • Operational threshold suggestions by review capacity
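The cross-window summary's mean and 95% CI can be sketched with a normal approximation (z = 1.96). This is an illustration; the actual pipeline may use a t-interval or bootstrap instead:

```python
import statistics

def mean_ci95(values: list[float]) -> tuple[float, float, float]:
    """Mean with a normal-approximation 95% confidence interval.

    Returns (mean, lower, upper); collapses to the mean when fewer
    than two windows are available.
    """
    m = statistics.mean(values)
    if len(values) < 2:
        return m, m, m
    half = 1.96 * statistics.stdev(values) / len(values) ** 0.5
    return m, m - half, m + half
```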

Core metrics reported:

  • precision_at_50
  • precision_at_100
  • recall_at_100
  • recall_at_200
  • auroc
  • pr_auc
  • calibration_error (ECE)
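The ranking metrics follow the standard definitions; as a reference sketch (not the project's exact implementation), Precision@K and Recall@K over a scored, binary-labeled candidate list are:

```python
def precision_at_k(scores: list[float], labels: list[int], k: int) -> float:
    """Fraction of the top-k scored candidates that are labeled positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    top = ranked[:k]
    return sum(lbl for _, lbl in top) / max(len(top), 1)

def recall_at_k(scores: list[float], labels: list[int], k: int) -> float:
    """Fraction of all labeled positives captured within the top-k candidates."""
    positives = sum(labels)
    if positives == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(lbl for _, lbl in ranked[:k]) / positives
```

Precision@K mirrors the analyst experience ("how much of my review queue pays off"), while Recall@K measures coverage of the known positives at a given capacity.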

Threshold Recommendation Policy

The report includes:

  1. recommended_threshold: score threshold maximizing F1 on labeled set
  2. ops_thresholds: min score and hit-rate for specific review capacities

Use ops_thresholds for deployment defaults when analyst capacity is fixed.
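The F1-maximizing threshold scan behind recommended_threshold can be sketched as below; this is an illustration of the policy, not the project's exact code:

```python
def recommend_threshold(scores: list[float], labels: list[int]) -> float:
    """Scan each distinct score as a candidate threshold and return the one
    that maximizes F1 on the labeled set."""
    best_thr, best_f1 = 0.0, -1.0
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr
```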

CI Integration

Unit tests validate backtest metric/report generation (tests/test_backtest.py).

For full offline evaluations in CI, add a scheduled job with curated historical artifacts and publish backtest_report.json as an artifact.

Periodic Reviewed-Outcome Evaluation Loop

To close the feedback loop from analyst decisions, run the reviewed-outcome evaluator. This job consumes the latest vessel_reviews snapshot from DuckDB, joins it with regional watchlists, and emits:

  • Tier-aware reporting (review tier mix + top-k tier mix)
  • Operations-aware metrics (capacity hit-rate and min-score thresholds)
  • Region/capacity threshold recommendations with support counts
  • Drift/regression checks against a prior report baseline
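The drift/regression comparison against a prior baseline can be sketched as a per-metric tolerance check. The metric names and the tolerance value are illustrative assumptions:

```python
def check_regression(current: dict, baseline: dict, max_drop: float = 0.05) -> dict:
    """Compare current metrics to a baseline report.

    Returns a pass/fail flag per baseline metric: a check fails when the
    metric is missing or has dropped by more than max_drop.
    """
    checks = {}
    for name, base_val in baseline.items():
        cur_val = current.get(name)
        checks[name] = cur_val is not None and (base_val - cur_val) <= max_drop
    return checks
```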

Run:

uv run python scripts/run_review_feedback_evaluation.py \
  --db data/processed/mpol.duckdb \
  --output data/processed/review_feedback_evaluation.json \
  --review-capacities 25,50,100 \
  --baseline-report data/processed/review_feedback_evaluation_prev.json \
  --fail-on-regression

Key reproducibility controls:

  • --as-of-utc to freeze the review snapshot boundary
  • Stable region-to-watchlist mapping via --watchlist region=path
  • Persisted JSON artifact containing config, mappings, and tier-label policy

Report output:

  • summary: snapshot size and labeled coverage
  • regions[]: tier-aware metrics, ops thresholds, recommended threshold evidence
  • drift_regression_checks: pass/fail checks versus baseline by region

Public Data Integration Test (Opt-in)

We provide an opt-in integration test that actually downloads public sanctions data, loads DuckDB, and evaluates found-vs-missed outcomes against practical positive-label sources.

Because OpenSanctions ingestion can take time, prepare a persistent DB once and reuse it in later tests.

uv run python scripts/prepare_public_sanctions_db.py \
  --db data/processed/public_eval.duckdb

This writes:

  1. Persistent DB: data/processed/public_eval.duckdb
  2. Cached raw file: data/raw/sanctions/opensanctions_entities.jsonl
  3. Metadata snapshot: data/processed/public_eval_metadata.json

To refresh data:

uv run python scripts/prepare_public_sanctions_db.py \
  --db data/processed/public_eval.duckdb \
  --force-download \
  --force-reload

Run manually:

RUN_PUBLIC_DATA_TESTS=1 \
  PUBLIC_SANCTIONS_DB=data/processed/public_eval.duckdb \
  uv run --group dev python -m pytest tests/test_public_data_backtest_integration.py -v

Optional fallback (not recommended for daily runs): if PUBLIC_SANCTIONS_DB does not exist, you can allow the test to prepare data on demand by setting PREPARE_PUBLIC_DATA_IF_MISSING=1.

Analyst Pre-Label Holdout Evaluation

The public-data backtest measures detection of already-confirmed cases. The pre-label holdout evaluation adds a leading-indicator slice: vessels the analyst suspects before any public confirmation.

Pre-labels use a three-tier taxonomy (suspected-positive / uncertain / analyst-negative) with analyst confidence tiers (high / medium / weak). A leakage guard ensures evidence_timestamp <= window_end_date.
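Combining the leakage guard with the minimum confidence tier can be sketched as a single filter. Field names (`evidence_timestamp`, `confidence_tier`) follow the taxonomy above but are illustrative, not the project's exact schema:

```python
from datetime import date

TIER_RANK = {"weak": 0, "medium": 1, "high": 2}

def filter_prelabels(rows: list[dict], window_end: date, min_tier: str) -> list[dict]:
    """Drop rows whose evidence postdates the window (leakage) or whose
    analyst confidence tier falls below the requested minimum."""
    kept = []
    for row in rows:
        if date.fromisoformat(row["evidence_timestamp"]) > window_end:
            continue  # leakage: evidence arrived after the window closed
        if TIER_RANK[row["confidence_tier"]] < TIER_RANK[min_tier]:
            continue  # below the --min-confidence-tier cutoff
        kept.append(row)
    return kept
```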

Initial curated set: 60 vessels across 3 regions, stored in data/demo/analyst_prelabels_demo.csv.

Run against a watchlist:

uv run python -m src.score.prelabel_evaluation \
  --watchlist data/processed/candidate_watchlist.parquet \
  --prelabels-csv data/demo/analyst_prelabels_demo.csv \
  --output data/processed/prelabel_evaluation.json \
  --end-date 2025-11-15 \
  --min-confidence-tier medium \
  --review-capacities 25,50,100

Output includes:

  • Pre-label slice metrics: precision_at_50, recall_at_100, auroc, pr_auc
  • Disagreement analysis: model_high_analyst_negative + model_low_analyst_positive
  • Leakage report: count of labels dropped for the window
  • Confidence tier breakdown (high / medium / weak precision reported separately)

Do not merge this slice with public-label metrics. Run both and compare — divergence between the two slices signals novel evasion patterns not yet captured by public lists.

Full governance policy (leakage rules, review cadence, versioning): docs/prelabel-governance.md.


Demo-size Sample Dataset

For demos, you can build a small sample DB from the prepared public DB.

uv run python scripts/build_public_sanctions_demo_sample.py \
  --source-db data/processed/public_eval.duckdb \
  --demo-db data/demo/public_eval_demo.duckdb \
  --max-rows 300

This is useful for fast demos and local smoke checks without full-size ingestion. The data/demo/ folder is intended to be committed to Git as portable demo fixtures.

Bundled dashboard fixture:

  • data/demo/candidate_watchlist_demo.parquet

To load it into the dashboard input path quickly:

uv run python scripts/use_demo_watchlist.py --backup

Main-merge Integration Batch (Known-case Check)

Run a medium-scale batch that:

  1. Reuses (or refreshes) the public sanctions DB.
  2. Runs multi-region pipeline output generation.
  3. Builds public-overlap labels per region.
  4. Executes backtesting and verifies a minimum known-case floor.

Local equivalent run:

uv run python scripts/run_public_backtest_batch.py \
  --regions singapore,japan,middleeast,europe,gulf \
  --gdelt-days 14 \
  --seed-dummy \
  --max-known-cases 200 \
  --min-known-cases 30 \
  --strict-known-cases

Outputs:

  • data/processed/evaluation_manifest_public_integration.json
  • data/processed/backtest_report_public_integration.json
  • data/processed/backtest_public_integration_summary.json
  • data/processed/eval_labels_public_*_integration.csv

GitHub Actions workflow:

  • .github/workflows/public-backtest-integration.yml

Execution policy:

  • This integration batch runs automatically on push to main (post-merge).
  • It is not scheduled as a nightly cron job.

If your target is "tens to hundreds" of known cases, tune:

  • --max-known-cases (upper cap)
  • --min-known-cases (required floor)
  • --regions and --gdelt-days (candidate pool breadth)

What this test checks:

  1. Public sanctions data is downloadable and loadable into DuckDB.
  2. Labels can be derived from practical positive-label sources (OFAC/UN/EU-like tags).
  3. The backtest report includes:
    • source_positive_coverage.matched_total (found via algorithm output overlap)
    • source_positive_coverage.missed_total (publicly identified positives not found)

Boundary reminder:

  • Cases not publicly identified/caught are outside strict pass/fail ground truth.
  • This test evaluates detection of publicly evidenced cases, which is the reliable scope for open-data validation.