Operations Runbook

This page covers observability wiring, alert thresholds, and backup/restore procedures for a production EdgeSentry-RS deployment.


Observability

Structured logging with tracing

EdgeSentry-RS uses the tracing facade. No subscriber is bundled; deployers wire up the backend of their choice at application startup. When no subscriber is registered, the instrumentation adds negligible overhead.

Recommended subscriber for production (JSON over stdout, ingested by Loki / CloudWatch):

# Cargo.toml of the host application
tracing-subscriber = { version = "0.3", features = ["env-filter", "json"] }

// main.rs of the host application
use tracing_subscriber::{fmt, EnvFilter};

fn main() {
    fmt()
        .json()
        .with_env_filter(EnvFilter::from_default_env()) // RUST_LOG=edgesentry_rs=info
        .init();
    // ...
}

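Set RUST_LOG=edgesentry_rs=info for production and RUST_LOG=edgesentry_rs=debug for incident investigation. Events logged at DEBUG are filtered out at the info level, so any metric or alert derived from a DEBUG event needs that target enabled explicitly, for example RUST_LOG=edgesentry_rs=info,edgesentry_rs::ingest::verify=debug.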

Structured log events emitted by the library

All events include the module path as target. Key events:

| Level | Target | Event | Key fields |
|---|---|---|---|
| DEBUG | edgesentry_rs::agent | signing record | device_id, sequence, payload_bytes |
| DEBUG | edgesentry_rs::ingest::storage | ingest started | device_id, sequence, object_ref, payload_bytes |
| WARN | edgesentry_rs::ingest::storage | payload hash mismatch — record rejected | device_id, sequence |
| WARN | edgesentry_rs::ingest::storage | integrity policy rejected record | device_id, sequence, reason |
| ERROR | edgesentry_rs::ingest::storage | raw data store write failed | device_id, sequence, error |
| ERROR | edgesentry_rs::ingest::storage | audit ledger append failed | device_id, sequence, error |
| ERROR | edgesentry_rs::ingest::storage | operation log write failed | device_id, sequence, error |
| INFO | edgesentry_rs::ingest::storage | record accepted | device_id, sequence, object_ref |
| DEBUG | edgesentry_rs::ingest::verify | signature verification failed | device_id, sequence |
| DEBUG | edgesentry_rs::ingest::verify | duplicate record rejected | device_id, sequence |
| DEBUG | edgesentry_rs::ingest::verify | sequence out of order | device_id, expected, actual |
| DEBUG | edgesentry_rs::ingest::verify | prev_record_hash mismatch — chain broken | device_id, sequence |
| DEBUG | edgesentry_rs::ingest::verify | record verified and accepted | device_id, sequence |

Use a log-to-metrics pipeline (e.g. Promtail + Loki, or Vector) to derive counters from structured log events:

| Metric | How to derive | Alert threshold |
|---|---|---|
| edgesentry_ingest_accepted_total | Count INFO "record accepted" events | (none) |
| edgesentry_ingest_rejected_total{reason} | Count WARN rejection events, label by reason field | > 10/min sustained → P1 alert |
| edgesentry_ingest_error_total{component} | Count ERROR storage failure events, label by component (raw_data_store / audit_ledger / operation_log) | Any occurrence → P0 alert |
| edgesentry_chain_break_total | Count DEBUG "prev_record_hash mismatch" events | Any occurrence → P0 alert |
| edgesentry_signature_fail_total | Count DEBUG "signature verification failed" events | > 5/min sustained → P1 alert |
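
The mapping in the table above can be sketched as a small classifier that a custom log-to-metrics stage might apply to each parsed event. This is only an illustration: metric_for_event and error_component are hypothetical helpers, not part of EdgeSentry-RS, and the sketch assumes the pipeline has already extracted the level and message strings from the JSON log line.

```rust
/// Maps one structured log event to the counter it should increment.
/// Metric names and message strings follow the tables above; the function
/// itself is a hypothetical helper, not part of the EdgeSentry-RS API.
pub fn metric_for_event(level: &str, message: &str) -> Option<&'static str> {
    match (level, message) {
        ("INFO", "record accepted") => Some("edgesentry_ingest_accepted_total"),
        // Both WARN rejection events contain the word "rejected".
        ("WARN", m) if m.contains("rejected") => Some("edgesentry_ingest_rejected_total"),
        // All three ERROR storage failures end with "failed".
        ("ERROR", m) if m.ends_with("failed") => Some("edgesentry_ingest_error_total"),
        ("DEBUG", m) if m.starts_with("prev_record_hash mismatch") => {
            Some("edgesentry_chain_break_total")
        }
        ("DEBUG", "signature verification failed") => Some("edgesentry_signature_fail_total"),
        _ => None,
    }
}

/// Derives the `component` label for an ERROR storage failure message.
pub fn error_component(message: &str) -> Option<&'static str> {
    if message.starts_with("raw data store") {
        Some("raw_data_store")
    } else if message.starts_with("audit ledger") {
        Some("audit_ledger")
    } else if message.starts_with("operation log") {
        Some("operation_log")
    } else {
        None
    }
}
```

Events that match no rule, such as the DEBUG "duplicate record rejected" event, return None and increment no counter. In a Promtail or Vector deployment the same rules would normally be written in that tool's own configuration language.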

OpenTelemetry (tracing spans)

The IngestService::ingest method emits a tracing span. Wire it to an OTLP exporter for distributed tracing:

opentelemetry = "0.26"
opentelemetry-otlp = { version = "0.26", features = ["grpc-tonic"] }
tracing-opentelemetry = "0.27"

Alert Definitions

| Alert | Condition | Severity | Response |
|---|---|---|---|
| IngestStorageError | Any ERROR-level storage failure | P0 | Check DB/S3 connectivity; verify disk and credentials |
| ChainBreak | Any prev_record_hash mismatch event | P0 | Investigate tamper or replay; preserve logs before any restart |
| HighRejectionRate | Rejection rate > 10/min for 5 min | P1 | Check device firmware; look for misconfigured signing key rotation |
| SignatureFailureSurge | Signature failures > 5/min for 5 min | P1 | Possible key compromise or active spoofing attempt |
| AuditLedgerLag | Postgres operation_logs insert latency > 2 s p99 | P1 | Check DB query plan; autovacuum contention |
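
The "for 5 min" conditions above are sustained-rate checks: the alert should fire only when every minute in the window exceeds the threshold, not on a single spike. A minimal sketch of that evaluation follows; sustained_above is a hypothetical helper, and it assumes the counters have already been bucketed into per-minute counts, oldest first.

```rust
/// Returns true when each of the most recent `window` per-minute counts
/// exceeds `threshold`, meaning the rate was sustained for the whole window.
/// Hypothetical helper; real deployments would usually express this as an
/// alerting rule in the monitoring system instead.
pub fn sustained_above(per_minute_counts: &[u64], threshold: u64, window: usize) -> bool {
    // Not enough history yet: do not fire.
    if window == 0 || per_minute_counts.len() < window {
        return false;
    }
    per_minute_counts[per_minute_counts.len() - window..]
        .iter()
        .all(|&count| count > threshold)
}
```

For HighRejectionRate the call would be sustained_above(&counts, 10, 5), and for SignatureFailureSurge sustained_above(&counts, 5, 5).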

Recovery Objectives

| Objective | Target | Basis |
|---|---|---|
| RTO (recovery time) | < 30 minutes | Time to restore Postgres from pg_basebackup + WAL replay |
| RPO (recovery point) | < 5 minutes | WAL archiving with a segment shipped at least every 5 minutes (for example, archive_timeout = 300) |

Backup Runbook

PostgreSQL — audit ledger and operation log

Prerequisites: WAL archiving enabled (archive_mode = on, archive_command shipping to S3 or equivalent).

1. Take a base backup

pg_basebackup \
  --host=<DB_HOST> \
  --username=<DB_USER> \
  --pgdata=/backup/pg_base_$(date +%Y%m%d_%H%M%S) \
  --format=tar \
  --gzip \
  --wal-method=stream \
  --checkpoint=fast \
  --progress

2. Verify the backup

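# Check gzip integrity, then list the first entries of the archive
gzip -t /backup/pg_base_<timestamp>/base.tar.gz && tar -tzf /backup/pg_base_<timestamp>/base.tar.gz | head -20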

3. Archive WAL continuously

Ensure the archive_command in postgresql.conf ships WAL segments to durable storage (e.g. S3):

archive_command = 'aws s3 cp %p s3://<BUCKET>/wal/%f'

4. Retention policy

| Backup type | Retention |
|---|---|
| Base backup | 30 days |
| WAL archive | 30 days |
| Logical dump (pg_dump) | 7 days (weekly) |
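
A pruning job could encode the retention table as follows. This is a sketch under stated assumptions: BackupKind, retention_days, and is_expired are hypothetical names, and the age of each backup in whole days is assumed to be computed elsewhere.

```rust
/// Backup categories from the retention table (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BackupKind {
    BaseBackup,
    WalArchive,
    LogicalDump,
}

/// Retention period in days for each backup category.
pub fn retention_days(kind: BackupKind) -> u64 {
    match kind {
        BackupKind::BaseBackup | BackupKind::WalArchive => 30,
        BackupKind::LogicalDump => 7,
    }
}

/// A backup is eligible for deletion once it is older than its retention period.
pub fn is_expired(kind: BackupKind, age_days: u64) -> bool {
    age_days > retention_days(kind)
}
```

When pruning WAL, keep every segment written since the start of the oldest retained base backup, because that base backup cannot be restored without them.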

S3 / MinIO — raw payload store

Enable versioning and cross-region replication on the bucket:

# Enable versioning
aws s3api put-bucket-versioning \
  --bucket <BUCKET> \
  --versioning-configuration Status=Enabled

# Enable replication (requires a destination bucket and IAM role configured separately)
aws s3api put-bucket-replication \
  --bucket <BUCKET> \
  --replication-configuration file://replication.json

Minimum replication target: one additional region. For CLS Level 3 evidence integrity, ensure object lock or versioning is enabled so payloads cannot be silently overwritten.


Restore Runbook

PostgreSQL — point-in-time recovery (PITR)

# 1. Stop the Postgres service
systemctl stop postgresql

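# 2. Restore the base backup into an empty data directory, then the WAL streamed with it
tar -xzf /backup/pg_base_<timestamp>/base.tar.gz -C /var/lib/postgresql/data/
tar -xzf /backup/pg_base_<timestamp>/pg_wal.tar.gz -C /var/lib/postgresql/data/pg_wal/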

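# 3. Create recovery config (PostgreSQL 12+: settings go in postgresql.auto.conf plus a
#    recovery.signal file; on 11 and earlier, write the same settings to recovery.conf)
cat >> /var/lib/postgresql/data/postgresql.auto.conf <<EOF
restore_command = 'aws s3 cp s3://<BUCKET>/wal/%f %p'
recovery_target_time = '<TARGET_TIMESTAMP>'
recovery_target_action = 'promote'
EOF
touch /var/lib/postgresql/data/recovery.signal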

# 4. Start Postgres — it will replay WAL to the target time
systemctl start postgresql

# 5. Verify: query the last accepted sequence per device
psql -U <DB_USER> -d <DB_NAME> \
  -c "SELECT device_id, MAX(sequence) FROM audit_records GROUP BY device_id;"

Recovery verification checklist

  • Last record sequence per device matches pre-incident snapshot
  • Hash chain continuity verified: eds verify-chain <exported-records.json>
  • Operation log shows no unexpected gaps (check timestamps around recovery target)
  • Alert suppression lifted after verification completes
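
To make the hash chain continuity item concrete, the sketch below checks an exported run of records for one device: sequences must increase by one, and each record's prev_record_hash must equal the previous record's hash. It is an illustration only and not the implementation of eds verify-chain. The ExportedRecord type and its record_hash field are assumptions about the export format, and the check compares stored hashes without recomputing them.

```rust
/// One exported audit record for a single device (hypothetical shape).
#[derive(Debug, Clone)]
pub struct ExportedRecord {
    pub sequence: u64,
    pub record_hash: String,
    pub prev_record_hash: String,
}

impl ExportedRecord {
    pub fn new(sequence: u64, record_hash: &str, prev_record_hash: &str) -> Self {
        Self {
            sequence,
            record_hash: record_hash.to_string(),
            prev_record_hash: prev_record_hash.to_string(),
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum ChainError {
    /// The next record's sequence is not exactly previous + 1.
    SequenceGap { expected: u64, actual: u64 },
    /// The next record does not point at the previous record's hash.
    HashMismatch { sequence: u64 },
}

/// Checks linkage between adjacent records, which must be sorted by sequence.
/// Stored hashes are compared as-is; they are not recomputed from payloads.
pub fn check_chain(records: &[ExportedRecord]) -> Result<(), ChainError> {
    for pair in records.windows(2) {
        let (prev, next) = (&pair[0], &pair[1]);
        let expected = prev.sequence + 1;
        if next.sequence != expected {
            return Err(ChainError::SequenceGap { expected, actual: next.sequence });
        }
        if next.prev_record_hash != prev.record_hash {
            return Err(ChainError::HashMismatch { sequence: next.sequence });
        }
    }
    Ok(())
}
```

A restore that stopped short of the target time typically shows up as a lower final sequence than the pre-incident snapshot, while a HashMismatch points at tampering or a mixed-up export. Treat eds verify-chain as the authoritative check.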

S3 / MinIO — object restore

# Restore a specific object version
aws s3api get-object \
  --bucket <BUCKET> \
  --key <OBJECT_KEY> \
  --version-id <VERSION_ID> \
  <OUTPUT_FILE>

Failure Drill Schedule

Run the following drills quarterly to verify runbook accuracy:

| Drill | Procedure | Pass criterion |
|---|---|---|
| DB failover | Stop primary Postgres; promote replica | Ingest resumes in < 30 min |
| DB restore | PITR to 1 hour ago on staging | Chain continuity verified in < 30 min |
| S3 object recovery | Restore a deleted test object | Object byte-identical to original |
| Alert fire | Inject a bad signature via test harness | P1 alert fires within 2 min |
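
For the S3 object recovery drill, the byte-identical criterion can be checked by comparing the restored file against a local copy of the original. The sketch below uses only the Rust standard library; files_identical is a hypothetical helper, and it assumes both objects have already been downloaded to local paths.

```rust
use std::fs::File;
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Returns Ok(true) when both files have the same length and the same bytes.
/// Streams through buffered readers, so large payloads are not loaded whole.
pub fn files_identical(a: &Path, b: &Path) -> io::Result<bool> {
    let (file_a, file_b) = (File::open(a)?, File::open(b)?);
    // Cheap early exit: different sizes can never be byte-identical.
    if file_a.metadata()?.len() != file_b.metadata()?.len() {
        return Ok(false);
    }
    let mut bytes_a = BufReader::new(file_a).bytes();
    let mut bytes_b = BufReader::new(file_b).bytes();
    loop {
        match (bytes_a.next(), bytes_b.next()) {
            (None, None) => return Ok(true),
            (Some(x), Some(y)) => {
                if x? != y? {
                    return Ok(false);
                }
            }
            // One file ended before the other.
            _ => return Ok(false),
        }
    }
}
```

Comparing a cryptographic digest of each file is an equally valid way to run this check and avoids keeping the original on local disk.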