Analyst Pre-Label Holdout Governance
This document defines the policy for creating, maintaining, and using the analyst-curated pre-label holdout set introduced in #62.
Purpose
Public-data backtesting (C9 / issue #53) measures how well the model ranks vessels that are already on sanctions lists. That approach has a structural blind spot: the most operationally valuable cases are vessels that are not yet publicly confirmed but show the same behavioural indicators.
The analyst pre-label holdout set addresses this by providing leading-indicator ground truth:
- Analyst labels vessels before they appear on any sanctions list.
- The model is evaluated against those labels to measure early-detection capability.
- As labels mature (confirmed or cleared by public evidence), they convert to confirmed labels in the standard backtest pipeline.
Pre-Label Taxonomy
| Pre-label | Meaning | Maps to in evaluation |
|---|---|---|
suspected-positive |
Analyst believes vessel is conducting evasion, with documented evidence, but no public confirmation yet | y_true = 1 |
analyst-negative |
Analyst has reviewed and determined vessel is not a shadow-fleet candidate, despite model signals | y_true = 0 |
uncertain |
Insufficient evidence to decide either way; kept for monitoring but excluded from binary metrics | y_true = None (excluded) |
Confidence Tiers
| Tier | Criteria |
|---|---|
high |
Multiple independent corroborating signals (e.g. AIS manipulation + STS satellite imagery + ownership evasion pattern) with specific evidence links |
medium |
Two or more indicators, with at least one documented source, but some ambiguity remains |
weak |
Single indicator only, or significant alternative explanation exists; included for monitoring but not primary KPI |
Use --min-confidence-tier medium in evaluation runs to exclude weak labels from precision/recall metrics. Use weak only for exploratory analysis.
Evidence Requirements
Every pre-label entry must include:
| Field | Requirement |
|---|---|
mmsi |
9-digit MMSI (mandatory) |
imo |
IMO number where known |
pre_label |
One of suspected-positive, uncertain, analyst-negative |
confidence_tier |
One of high, medium, weak |
region |
One of singapore, middleeast, europe, japan, gulf |
evidence_notes |
Human-readable summary of the evidence basis (mandatory) |
source_urls |
At least one URL or internal report reference for high/medium labels |
analyst_id |
Analyst identifier (e.g. analyst-a) |
evidence_timestamp |
ISO-8601 timestamp — when the evidence was gathered, not when the row was entered |
The evidence_timestamp is the leakage control gate. Any pre-label with evidence_timestamp > window_end_date is automatically dropped before evaluation.
Leakage Policy
A pre-label is only valid for evaluating a window if the analyst had access to the evidence before the window closed.
evidence_timestamp <= window_end_date
This prevents future knowledge from inflating metrics. The evaluation pipeline enforces this automatically and reports the count of dropped labels in leakage_report.labels_dropped.
Consequences:
- Use --end-date (or window_end_date in manifests) when running evaluation.
- Analysts must record evidence_timestamp as the date the evidence was observed, not the date of entry.
- Back-filling with post-hoc analysis is not permitted.
Holdout Dataset Provenance
The initial curated set is stored at:
data/demo/analyst_prelabels_demo.csv
| Attribute | Value |
|---|---|
| Vessel count | 60 |
| Regions | singapore (20), middleeast (20), europe (20) |
| Class breakdown | ~30 suspected-positive, ~10 uncertain, ~15 analyst-negative |
| Evidence window | 2025-09 through 2025-11 |
| Version | v1.0 |
The demo CSV is a portable fixture for development and testing. Operational pre-labels reside in the analyst_prelabels DuckDB table (persisted in data/processed/mpol.duckdb).
To load the demo set into the database:
import duckdb, csv
con = duckdb.connect("data/processed/mpol.duckdb")
with open("data/demo/analyst_prelabels_demo.csv") as f:
rows = list(csv.DictReader(f))
for row in rows:
con.execute(
"INSERT INTO analyst_prelabels "
"(mmsi, imo, pre_label, confidence_tier, region, evidence_notes, "
" source_urls_json, analyst_id, evidence_timestamp) "
"VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)",
[row["mmsi"], row.get("imo"), row["pre_label"], row["confidence_tier"],
row.get("region"), row.get("evidence_notes"), row.get("source_urls"),
row["analyst_id"], row["evidence_timestamp"]],
)
con.close()
Review Cadence
| Frequency | Action |
|---|---|
| Monthly | Review all uncertain labels; upgrade to suspected-positive or analyst-negative where evidence has developed |
| Monthly | Check suspected-positive labels against updated sanctions lists; convert to confirmed backtest labels if publicly confirmed |
| Quarterly | Full holdout set audit: verify evidence links still resolve, update confidence tiers, remove stale entries |
| Triggered | When a vessel in the holdout set appears on a public sanctions list, immediately reclassify and flag for retrospective analysis |
Assigned: The analyst who created the label is responsible for the monthly review. The senior analyst covers labels where the original analyst is unavailable.
Versioning
When significant changes are made to the holdout set (new entries, confidence upgrades, reclassifications):
- Export the current state to a versioned CSV:
data/demo/analyst_prelabels_v{N}.csv - Record the manifest entry in
data/processed/prelabel_manifest.json(auto-generated byscripts/run_prelabel_evaluation.py) - Commit the versioned CSV under
data/demo/with a descriptive message
Running the Pre-Label Evaluation
Against the demo CSV:
uv run python -m src.score.prelabel_evaluation \
--watchlist data/processed/candidate_watchlist.parquet \
--prelabels-csv data/demo/analyst_prelabels_demo.csv \
--output data/processed/prelabel_evaluation.json \
--end-date 2025-11-15 \
--min-confidence-tier medium \
--review-capacities 25,50,100
Against the database:
uv run python -m src.score.prelabel_evaluation \
--watchlist data/processed/candidate_watchlist.parquet \
--db data/processed/mpol.duckdb \
--output data/processed/prelabel_evaluation.json \
--end-date 2025-11-15 \
--region singapore \
--min-confidence-tier medium \
--review-capacities 25,50,100
Disagreement Analysis
The evaluation report includes a disagreement section highlighting cases where the model and analyst diverge:
model_high_analyst_negative: Vessels the model scored ≥ threshold that the analyst cleared. Review these to identify model false positives that may warrant feature adjustment.model_low_analyst_positive: Vessels the analyst suspects but the model ranked low. These are high-value misses — review signals and consider feature uplift.
The disagreement threshold defaults to the best-F1 threshold on the labeled set. Override with --disagreement-threshold.
Integration with Public-Data Backtest
The pre-label evaluation is a separate reporting slice from the public-label backtest. Do not merge the two:
| Slice | Label source | Purpose |
|---|---|---|
| Public-label backtest | OFAC / UN / EU sanctions | Measures confirmed-case recall; lagging indicator |
| Pre-label holdout | Analyst curation | Measures early-detection precision; leading indicator |
Run both and compare. If the public-label backtest is strong but the pre-label precision is low, the model is detecting known entities but missing novel evasion patterns.