Feature Engineering
The arktrace pipeline computes 21 features across five families for every vessel MMSI. All features are written to the vessel_features DuckDB table by src/features/build_matrix.py.
Feature families
| Family | Module | Features | Backend |
|---|---|---|---|
| AIS Behavioral | ais_behavior.py | 6 | DuckDB / Polars |
| Identity Volatility | identity.py | 5 | Lance Graph + DuckDB |
| Ownership Graph | ownership_graph.py | 5 | Lance Graph (Polars joins) |
| Trade Flow Mismatch | trade_mismatch.py | 2 | DuckDB + Comtrade API |
| EO Fusion | eo_fusion.py | 2 | DuckDB (GFW API / CSV) |
| Total | | 21 | |
AIS Behavioral features
Source: ais_positions table, computed over a rolling window (default 30 days, configurable with --window).
ais_gap_count_30d
Count of AIS transmission gaps longer than the configured threshold (default 6 hours) in the last 30 days.
Shadow fleet signal: Deliberate AIS switch-off is the primary evasion technique for sanctioned tankers. A gap of 6+ hours during transit is operationally unusual for a compliant vessel and strongly correlated with dark-transfer events.
Implementation: The Polars lazy pipeline computes the time delta between consecutive position rows for each MMSI. Deltas above the threshold are counted per vessel over the rolling window.
ais_gap_max_hours
Duration in hours of the longest single AIS gap in the window.
Shadow fleet signal: Compliant vessels rarely go dark for more than 2–4 hours. Gaps above 12 hours indicate a port call without AIS, an at-sea dark period, or equipment failure. Gaps above 24 hours in open water are a strong evasion indicator.
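The two gap features can be illustrated with a minimal sketch. It uses plain Python rather than the Polars lazy pipeline, and the function name `ais_gap_stats` is hypothetical. It also assumes that the longest gap is taken only over deltas that exceed the threshold, which the text above does not state explicitly.

```python
from datetime import datetime, timedelta


def ais_gap_stats(timestamps, threshold_hours=6.0):
    """Return (gap_count, max_gap_hours) for one vessel's position timestamps.

    Assumption: only deltas above the threshold count as gaps, so a vessel
    with no qualifying gap reports a maximum of 0.0 hours.
    """
    ordered = sorted(timestamps)
    gap_count = 0
    max_gap_hours = 0.0
    # Compare each position report with the one that follows it.
    for prev, curr in zip(ordered, ordered[1:]):
        delta_hours = (curr - prev).total_seconds() / 3600.0
        if delta_hours > threshold_hours:
            gap_count += 1
            max_gap_hours = max(max_gap_hours, delta_hours)
    return gap_count, max_gap_hours
```

The caller is expected to pass only the timestamps that fall inside the rolling window.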
position_jump_count
Count of consecutive position pairs where the implied speed exceeds 50 knots (calculated via Haversine distance / elapsed time).
Shadow fleet signal: GPS spoofing is endemic in the Taiwan Strait, Black Sea, and Persian Gulf approaches. A vessel that "jumps" 200 km in 30 minutes without leaving any intermediate positions is almost certainly receiving a spoofed GPS signal, often to mask its true location during a dark STS transfer.
Implementation: Uses a 1-hour sliding window for robustness against occasional timestamp errors.
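The implied-speed check can be sketched as follows. This is an illustration in plain Python with hypothetical function names; it compares consecutive position pairs directly and does not reproduce the 1-hour sliding window mentioned above.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
KM_PER_NM = 1.852            # kilometres per nautical mile


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def position_jump_count(track, max_knots=50.0):
    """Count consecutive pairs whose implied speed exceeds max_knots.

    track: list of (epoch_seconds, lat, lon) tuples sorted by time.
    """
    jumps = 0
    for (t1, lat1, lon1), (t2, lat2, lon2) in zip(track, track[1:]):
        elapsed_h = (t2 - t1) / 3600.0
        if elapsed_h <= 0:
            continue  # skip duplicate or out-of-order timestamps
        knots = haversine_km(lat1, lon1, lat2, lon2) / KM_PER_NM / elapsed_h
        if knots > max_knots:
            jumps += 1
    return jumps
```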
sts_candidate_count
Count of distinct vessels that have occupied the same H3 hexagon (resolution 8, average cell area ≈ 0.74 km², roughly 0.5 km edge) within 2 hours of the subject vessel.
Shadow fleet signal: Ship-to-Ship transfers occur at anchorages and in open water. Two tankers sharing the same resolution-8 cell for a sustained period without a declared port call are STS candidates. Resolution 8 is chosen so that a cell comfortably contains a VLCC at anchor (beam ≈ 60 m) together with a vessel alongside it.
Implementation: H3 hexagon IDs are pre-computed for all positions; a self-join on hexagon + time window identifies co-located vessels.
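The self-join on hexagon and time window can be sketched in plain Python. The function name and the row layout `(mmsi, cell_id, epoch_seconds)` are assumptions made for illustration, and the cell IDs are taken as already computed by an H3 library.

```python
from collections import defaultdict


def sts_candidate_counts(rows, window_s=2 * 3600):
    """Return {mmsi: number of distinct co-located vessels}.

    rows: iterable of (mmsi, cell_id, epoch_seconds), with cell_id being a
    pre-computed H3 resolution-8 cell (hypothetical input layout).
    """
    # Group visits by cell so that only same-cell positions are compared.
    by_cell = defaultdict(list)
    for mmsi, cell, ts in rows:
        by_cell[cell].append((ts, mmsi))

    partners = defaultdict(set)
    for visits in by_cell.values():
        visits.sort()  # chronological order within the cell
        for i, (ts_a, a) in enumerate(visits):
            for ts_b, b in visits[i + 1:]:
                if ts_b - ts_a > window_s:
                    break  # later visits are even further away in time
                if a != b:
                    partners[a].add(b)
                    partners[b].add(a)
    return {mmsi: len(others) for mmsi, others in partners.items()}
```

Vessels with no co-located partner are absent from the result, so a caller would read them with a default of 0.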
port_call_ratio
Fraction of time in the window spent within 5 nm of a known port, as a proxy for legitimate port activity.
Shadow fleet signal: Shadow fleet tankers minimise declared port calls to avoid physical inspection and AIS-based monitoring by port state control authorities. A low port_call_ratio combined with high loitering hours suggests the vessel is active at sea but avoiding port records.
loitering_hours_30d
Total hours spent moving slower than 2 knots outside declared moorage areas, accumulated over the window.
Shadow fleet signal: Loitering at sea at very low SOG (below steerage way) is a behavioural precursor to dark STS. Genuine commercial tankers loiter only when waiting for a berth, which shows up near ports. Open-water low-speed drifting suggests rendezvous behaviour.
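One way to accumulate loitering time is sketched below. It assumes each report carries a speed over ground and a hypothetical `in_moorage` flag, and that the interval up to the next report is attributed to the earlier report's speed. None of these details are confirmed by the pipeline description.

```python
def loitering_hours(reports, max_sog_knots=2.0):
    """Sum hours spent below max_sog_knots outside declared moorage areas.

    reports: list of (epoch_seconds, sog_knots, in_moorage) sorted by time.
    Assumption: the time between two reports is credited to the first one.
    """
    total_s = 0
    for (t1, sog, in_moorage), (t2, _, _) in zip(reports, reports[1:]):
        if sog < max_sog_knots and not in_moorage:
            total_s += t2 - t1
    return total_s / 3600.0
```

A production version would probably also cap the credited interval, so that a long AIS gap following a slow report is not counted as loitering.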
Identity Volatility features
Source: Lance Graph datasets (ownership changes, name aliases) + vessel_meta DuckDB table. Computed over a 2-year lookback.
flag_changes_2y
Number of flag state changes recorded in the vessel registry over the past 2 years.
Shadow fleet signal: Legitimate shipping companies rarely reflag vessels. Repeated reflagging — especially to open-registry states (Panama, Marshall Islands, Comoros) — is a known evasion technique to escape the watch-list of any single port state authority, reset OFAC exposure tracking, and complicate due-diligence checks.
name_changes_2y
Number of vessel name changes in 2 years.
Shadow fleet signal: Name changes are used to break continuity between a vessel's current identity and its history of sanctioned voyages. A vessel renamed from "ATLANTIC SUN" to "PACIFIC STAR" can avoid automated blocklist checks that match on vessel name.
owner_changes_2y
Number of registered owner changes in 2 years.
Shadow fleet signal: Ownership obfuscation through rapid beneficial-owner changes is a key sanctions evasion technique. This feature counts distinct ownership transitions recorded in the Lance Graph OWNED_BY dataset over 2 years.
high_risk_flag_ratio
Fraction of companies in the vessel's full ownership chain that are registered in high-risk flag states.
High-risk flags: KP, IR, VE, SY, CU, RU, KM, GA, CM, PW, KI, TG, SL, ST
Shadow fleet signal: Even if the vessel itself flies a neutral flag, shell companies up the ownership chain may be registered in North Korea, Iran, or Venezuela. This ratio surfaces ownership-level exposure that vessel-flag screening misses.
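As a small illustration, the ratio can be computed from the list of registration country codes along the ownership chain. The function name and input shape are hypothetical; the flag set is the one listed above.

```python
HIGH_RISK_FLAGS = frozenset(
    {"KP", "IR", "VE", "SY", "CU", "RU", "KM", "GA", "CM", "PW", "KI", "TG", "SL", "ST"}
)


def high_risk_flag_ratio(chain_country_codes):
    """Fraction of companies in the chain registered in a high-risk flag state.

    chain_country_codes: ISO 3166-1 alpha-2 codes, one per company in the chain.
    An empty chain yields 0.0.
    """
    if not chain_country_codes:
        return 0.0
    hits = sum(1 for code in chain_country_codes if code.upper() in HIGH_RISK_FLAGS)
    return hits / len(chain_country_codes)
```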
ownership_depth
BFS path length from the vessel to the ultimate beneficial owner, capped at 5.
Shadow fleet signal: The average legitimate tanker has an ownership chain of 2–3 hops (vessel → shipowner → holding company). Chains of 4 or more hops (the feature is capped at 5) suggest deliberate opacity: SPVs nested inside other SPVs to frustrate beneficial ownership disclosure requirements.
Ownership Graph features
Source: Lance Graph datasets, computed via Polars joins. All graph features use sanctions_distance from the merged OpenSanctions dataset.
sanctions_distance
Minimum BFS hop count from the vessel to any node in the ownership graph that carries a SANCTIONED_BY relationship.
| Value | Meaning |
|---|---|
| 0 | Vessel itself is directly designated |
| 1 | Registered owner or manager is designated |
| 2 | Parent company or beneficial owner is designated |
| 99 | No graph connection to any sanctioned entity |
Shadow fleet signal: This is the strongest individual predictor in the model. A vessel 1–2 hops from an OFAC/EU/UN entity has a >60% empirical probability of appearing in open-source shadow fleet incident reports.
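The hop-count definition can be illustrated with a plain breadth-first search. This sketch uses an in-memory adjacency dict rather than the Polars joins over Lance Graph datasets, and the function name and graph layout are hypothetical.

```python
from collections import deque

NO_CONNECTION = 99  # sentinel for "no path to any sanctioned entity"


def sanctions_distance(start, edges, sanctioned):
    """Minimum hop count from start to any sanctioned node, or 99 if none.

    edges: dict mapping a node to the nodes it links to (owner, manager, ...).
    sanctioned: set of nodes that carry a SANCTIONED_BY relationship.
    """
    if start in sanctioned:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt in seen:
                continue
            if nxt in sanctioned:
                return dist + 1  # BFS guarantees this is the shortest path
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return NO_CONNECTION
```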
cluster_sanctions_ratio
Fraction of vessels sharing the same registered owner (via the OWNED_BY dataset) that are individually sanctioned (i.e. have sanctions_distance = 0).
Shadow fleet signal: Sanctioned fleets tend to operate in clusters. If 50% of the vessels sharing a registered owner are on the OFAC list, the remaining 50% are likely operating on behalf of the same beneficial owner but have not yet been individually designated.
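A minimal sketch of the cluster ratio follows. It assumes the ratio is taken over every vessel in the owner cluster, the subject vessel included, which the definition above leaves open. Names and input shapes are hypothetical.

```python
from collections import defaultdict


def cluster_sanctions_ratio(owner_of, distance):
    """Return {vessel: fraction of same-owner vessels with sanctions_distance 0}.

    owner_of: dict vessel -> registered owner.
    distance: dict vessel -> sanctions_distance (missing entries treated as 99).
    Assumption: the subject vessel is counted as part of its own cluster.
    """
    members = defaultdict(list)
    for vessel, owner in owner_of.items():
        members[owner].append(vessel)

    ratio = {}
    for vessels in members.values():
        sanctioned = sum(1 for v in vessels if distance.get(v, 99) == 0)
        for v in vessels:
            ratio[v] = sanctioned / len(vessels)
    return ratio
```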
shared_manager_risk
Minimum sanctions_distance across all vessels co-managed with this vessel.
Shadow fleet signal: A vessel managed by a company that also manages an OFAC-listed tanker inherits operational risk even if its own ownership chain looks clean.
shared_address_centrality
Count of distinct vessels sharing the same registered company address.
Shadow fleet signal: Shell companies used as nominee owners for sanctioned fleets frequently register multiple vessels at the same address. High centrality (> 5 vessels at one address) is a red flag for a nominee ownership structure.
sts_hub_degree
Count of distinct vessels with which this vessel has had AIS-confirmed STS proximity events (from sts_candidate_count data).
Shadow fleet signal: A vessel that repeatedly co-locates with many different partner vessels is functioning as an STS hub — a central intermediary in a dark transfer network. Hub degree > 3 is rare in legitimate bunkering operations.
Trade Flow Mismatch features
Source: DuckDB trade_flow table (populated from the UN Comtrade+ REST API, free tier 500 requests/day). Restricted to crude oil (HS 2709) and petroleum products (HS 2710).
route_cargo_mismatch
Three-level score (0.0, 0.5, or 1.0) indicating whether the vessel is a tanker operating on routes from sanctioned exporters with no corresponding bilateral trade record in Comtrade.
| Value | Condition |
|---|---|
| 1.0 | Tanker (AIS type 80–89) from a sanctioned flag state (KP, IR, VE, SY, CU, RU) with zero Comtrade crude imports from that flag in the period |
| 0.5 | Some trade volume but below expected for vessel size |
| 0.0 | Not a tanker, or not from a sanctioned flag |
Shadow fleet signal: Iranian crude exports have been ~0 in official UN Comtrade records since 2019, yet an estimated ~1.5 million barrels per day of Iranian crude moves via dark tanker networks. A tanker arriving from Iranian waters with no matching Comtrade import record is operating off the books.
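The value table above maps onto a short scoring function. This is a sketch with hypothetical parameter names. The `expected_volume` input stands in for whatever "expected for vessel size" means in the real pipeline, and the final branch (trade volume at or above expectation) is an assumption, since the table does not cover it.

```python
SANCTIONED_FLAGS = frozenset({"KP", "IR", "VE", "SY", "CU", "RU"})


def route_cargo_mismatch(ais_type, flag, comtrade_volume, expected_volume):
    """Score a vessel as 1.0, 0.5, or 0.0 following the value table."""
    is_tanker = 80 <= ais_type <= 89  # AIS ship types 80-89 are tankers
    if not is_tanker or flag not in SANCTIONED_FLAGS:
        return 0.0
    if comtrade_volume == 0:
        return 1.0  # no recorded crude imports at all
    if comtrade_volume < expected_volume:
        return 0.5  # some trade, but below expectation
    return 0.0  # assumption: recorded trade covers the expected volume
```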
declared_vs_estimated_cargo_value
Difference (USD) between the declared cargo value from AIS voyage data and the UN Comtrade statistical estimate for the same route.
Shadow fleet signal: Deliberate under-declaration of cargo value is used to reduce tax and duty exposure in destination countries. A large shortfall of the declared value relative to the estimate (declared < estimated) is consistent with dark oil sales.
EO Fusion features
Source: eo_detections DuckDB table, populated from the Global Fishing Watch Vessel Presence API or a local CSV fallback. Computed over a 30-day rolling window by src/features/eo_fusion.py.
Requires: GFW_API_TOKEN in .env for live ingestion, or a local CSV via --csv. Pass --skip-eo to build_matrix.py to skip this family entirely (features default to 0).
eo_dark_count_30d
Count of EO (electro-optical satellite imagery) vessel detections in the last 30 days that could not be matched to an AIS broadcast within 0.1° and 120 minutes, and that were attributed to this vessel because they fell within 0.5° of its last known position during an AIS gap.
Shadow fleet signal: A vessel detected by satellite imagery that is simultaneously dark on AIS is operating without a transponder — the clearest observable indicator of intentional AIS manipulation. Each such unmatched detection during an AIS gap is a direct observation of dark-vessel behaviour.
Implementation: GFW detections are matched to AIS broadcasts by position (≤ 0.1°) and time (≤ 120 min). Unmatched detections within 0.5° of a vessel's last known position during an AIS gap are attributed to that vessel. The 30-day count is written to vessel_features.
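The two-stage match-then-attribute logic can be sketched as follows. The function names and tuple layouts are hypothetical, and the sketch assumes the degree thresholds apply independently to latitude and longitude (a bounding box), which the description does not specify.

```python
def is_ais_matched(detection, broadcasts, max_deg=0.1, max_minutes=120):
    """True if any AIS broadcast lies within the position and time tolerance.

    detection and each broadcast: (epoch_seconds, lat, lon).
    """
    t, lat, lon = detection
    return any(
        abs(bt - t) <= max_minutes * 60
        and abs(blat - lat) <= max_deg
        and abs(blon - lon) <= max_deg
        for bt, blat, blon in broadcasts
    )


def eo_dark_count(detections, broadcasts, gaps, attrib_deg=0.5):
    """Count unmatched detections attributed to one vessel.

    gaps: list of (gap_start_s, gap_end_s, last_lat, last_lon) for that vessel.
    """
    dark = 0
    for detection in detections:
        if is_ais_matched(detection, broadcasts):
            continue  # the detection corresponds to a known AIS broadcast
        t, lat, lon = detection
        # Attribute only if the detection falls inside a gap and near the
        # vessel's last known position before it went dark.
        if any(
            start <= t <= end
            and abs(lat - glat) <= attrib_deg
            and abs(lon - glon) <= attrib_deg
            for start, end, glat, glon in gaps
        ):
            dark += 1
    return dark
```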
eo_ais_mismatch_ratio
Fraction of all EO detections attributed to this vessel (matched + unmatched) that were unmatched (dark): eo_dark_count_30d / total_attributed_detections.
Shadow fleet signal: A vessel that appears in satellite imagery only when it is also broadcasting on AIS has a ratio near 0 — consistent with compliant behaviour. A vessel with a ratio above 0.5 is dark during more than half its satellite observations, indicating a systematic pattern of AIS suppression rather than occasional equipment failure.
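The ratio itself is simple arithmetic. The sketch below adds a guard for vessels with no attributed detections and returns 0.0 in that case, which is an assumption chosen to agree with the stated default of 0 for skipped EO features.

```python
def eo_ais_mismatch_ratio(dark_count, matched_count):
    """Fraction of attributed EO detections that had no AIS match.

    Assumption: a vessel with no attributed detections scores 0.0.
    """
    total = dark_count + matched_count
    return dark_count / total if total else 0.0
```

For example, a vessel with 3 dark and 1 matched detections scores 3 / 4 = 0.75, which is above the 0.5 level described above.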
Build matrix
src/features/build_matrix.py merges all five feature families on MMSI using DuckDB JOINs and writes the result to the vessel_features table. Missing values are filled with the following defaults:
| Column | Default when missing |
|---|---|
| sanctions_distance | 99 (no graph connection) |
| cluster_sanctions_ratio | 0.0 |
| shared_manager_risk | 99 |
| high_risk_flag_ratio | 0.0 |
| ownership_depth | 1 |
| All count features | 0 |
Pass --skip-graph to run without loading Lance Graph datasets (graph features default to safe values).
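The default-filling rule can be illustrated on a single row represented as a dict. This is a plain-Python sketch with hypothetical names; in the DuckDB JOIN itself the same effect would more likely come from COALESCE expressions.

```python
DEFAULTS = {
    "sanctions_distance": 99,        # no graph connection
    "cluster_sanctions_ratio": 0.0,
    "shared_manager_risk": 99,
    "high_risk_flag_ratio": 0.0,
    "ownership_depth": 1,
}


def fill_defaults(row, count_columns):
    """Return a copy of row with missing (None or absent) values filled.

    count_columns: names of count features, which default to 0.
    """
    filled = dict(row)
    for column, default in DEFAULTS.items():
        if filled.get(column) is None:
            filled[column] = default
    for column in count_columns:
        if filled.get(column) is None:
            filled[column] = 0
    return filled
```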