BSFG Operations Runbook

Audience: Operators, support engineers. Use: Follow routine operating procedures and incident-response steps for BSFG systems.

Operational Overview

BSFG's operational model is simple because its architecture is stateless at the boundary. The main operational concerns are durability and connectivity, not complex choreography.

Key Metrics

Monitor these metrics for each BSFG node and zone:

Replication Lag

The delay between a fact being appended to a store buffer (ISB/ESB) and its confirmation at the corresponding forward buffer (IFB/EFB).

Healthy: < 100ms (normal operation)
Warning: 100ms–1s (minor delay, check network)
Alert: > 1s (network partition or boundary issue)
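The bands above can be expressed as a small classifier. This is a minimal sketch; the function name and the choice of append and confirmation timestamps as inputs are assumptions for illustration, not part of the BSFG API.

```python
from datetime import datetime, timedelta


def classify_replication_lag(appended_at: datetime, confirmed_at: datetime) -> str:
    """Map the append-to-confirm delay onto the three bands listed above."""
    lag_seconds = (confirmed_at - appended_at).total_seconds()
    if lag_seconds < 0.1:      # under 100 ms
        return "healthy"
    if lag_seconds <= 1.0:     # 100 ms to 1 s
        return "warning"
    return "alert"             # more than 1 s
```

A lag of exactly 1 second is treated as a warning here, since the alert band is defined as strictly greater than 1 second.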

Consumer Backlog

The number of unconfirmed facts at a forward buffer (IFB/EFB).

Healthy: 0–100 facts (consumers keeping up)
Warning: 100–10,000 facts (consumer lag, check processing)
Alert: > 10,000 facts (consumer stalled or boundary closed)

Buffer Fill Percentage

Store buffers (ISB/ESB) have configurable capacity (e.g., 100GB). Track fill ratio:

Healthy: < 50% capacity
Warning: 50–80% capacity (approaching limits)
Alert: > 80% capacity (trigger backpressure or cleanup)
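As an illustration of the fill-ratio arithmetic, the sketch below computes the percentage from byte counts and maps it onto the bands above. The function names and the byte-based inputs are hypothetical; a real deployment would read these values from whatever the store buffer exposes.

```python
def buffer_fill_percent(used_bytes: int, capacity_bytes: int) -> float:
    """Return the fill ratio of a store buffer as a percentage."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100.0 * used_bytes / capacity_bytes


def classify_buffer_fill(percent: float) -> str:
    """Map a fill percentage onto the healthy / warning / alert bands."""
    if percent < 50.0:
        return "healthy"
    if percent <= 80.0:
        return "warning"
    return "alert"
```

For a 100GB buffer holding 45GB of unconfirmed facts, the ratio is 45%, which falls in the healthy band.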

Confirmation Rate

Facts confirmed per second at each forward buffer.

Healthy: Steady rate matching producer throughput
Warning: Rate drops suddenly (network issue or consumer failure)
Alert: Rate drops to zero for > 1 minute (boundary closed or consumers stopped)
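One way to detect the "zero for more than 1 minute" condition is to count confirmations in a trailing window. The sketch below assumes confirmation times are available as seconds on a monotonic clock; the function names and the 60-second default are illustrative choices based on the alert band above.

```python
from typing import Sequence


def confirmation_rate(confirm_times: Sequence[float], now: float,
                      window_seconds: float = 60.0) -> float:
    """Confirmations per second over the trailing window ending at `now`."""
    recent = [t for t in confirm_times if now - window_seconds < t <= now]
    return len(recent) / window_seconds


def is_stalled(confirm_times: Sequence[float], now: float,
               window_seconds: float = 60.0) -> bool:
    """True when no confirmation was seen in the trailing window."""
    return confirmation_rate(confirm_times, now, window_seconds) == 0.0
```

Comparing this rate against producer throughput over the same window gives the "steady rate matching producer throughput" check described for the healthy band.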

TLS Handshake Errors

mTLS connection failures:

Certificate validation failures (expired, wrong issuer, untrusted CA)
Peer identity mismatch (zone_id does not match claim)
Alert: Any TLS error (alert immediately)

RPC Latency

Time to complete RPC operations (AppendFact, FetchFacts, ConfirmReceipt, PutObject):

Healthy: p99 < 50ms
Warning: p99 50–500ms (check network or load)
Alert: p99 > 500ms or timeout (network partition or overload)
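The p99 figure can be computed from raw latency samples with the nearest-rank method, as sketched below. The function names are hypothetical, and nearest-rank is only one of several percentile definitions; a metrics backend may use interpolation or histograms instead.

```python
from typing import Sequence


def p99(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of a non-empty list of latencies (ms)."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    # ceil(0.99 * n) computed with integers to avoid float rounding.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def classify_rpc_latency(p99_ms: float) -> str:
    """Map a p99 latency onto the healthy / warning / alert bands."""
    if p99_ms < 50.0:
        return "healthy"
    if p99_ms <= 500.0:
        return "warning"
    return "alert"
```

Timeouts are not represented in this sketch; the alert band above treats a timeout as an alert regardless of the measured p99.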

Alert Thresholds

Condition	Threshold	Action
Replication lag	> 1 second	Check network, verify boundary connectivity
Consumer backlog	> 10,000 facts	Check consumer health, verify processing
Buffer fill	> 80% capacity	Check retention policy, trigger cleanup
TLS handshake error	Any failure	Respond immediately: verify certificates and CA trust
Certificate expiry	< 30 days remaining	Begin renewal workflow
Fact TTL expiry	7 days (default TTL)	Alert if unconfirmed facts will be truncated
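The TTL row is the only one that needs a look-ahead: the alert is useful only if it fires before truncation. The sketch below flags unconfirmed facts that are close to the 7-day default TTL. The 12-hour warning margin, the function name, and the field names are assumptions for illustration.

```python
from datetime import datetime, timedelta

DEFAULT_TTL = timedelta(days=7)          # default TTL from the table above
WARNING_MARGIN = timedelta(hours=12)     # hypothetical lead time for the alert


def at_risk_of_truncation(appended_at: datetime, now: datetime, confirmed: bool,
                          ttl: timedelta = DEFAULT_TTL,
                          margin: timedelta = WARNING_MARGIN) -> bool:
    """True when an unconfirmed fact is within `margin` of its TTL."""
    if confirmed:
        return False  # confirmed facts can be truncated safely
    return now - appended_at >= ttl - margin
```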

Backpressure Policy

When buffer capacity approaches limits, BSFG enforces a backpressure policy.

Standard Deployment (Non-Safety-Critical)

if (buffer_fill >= 80%) {
  // Two options (choose one):
  option_1: reject new AppendFact calls
            (return error to producer)
  option_2: drop oldest unacknowledged facts
            (truncate without waiting for confirmation)
}

Safety-Critical / SIL-Regulated Deployment

In regulated environments (FDA, IEC 61508):

if (buffer_fill >= 80%) {
  MUST: reject new AppendFact calls
  MUST_NOT: drop unacknowledged data
  ACTION: alert operations team, trigger manual intervention
}
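The two policies above can be sketched as a single decision function. This is an illustration only: the enum, the function name, and the returned action strings are hypothetical and do not reflect BSFG configuration keys.

```python
from enum import Enum


class Overflow(Enum):
    REJECT_NEW = "reject_new"      # option 1: fail AppendFact
    DROP_OLDEST = "drop_oldest"    # option 2: truncate unacknowledged facts


BACKPRESSURE_THRESHOLD = 80.0  # percent, from the policy above


def backpressure_action(fill_percent: float, policy: Overflow,
                        safety_critical: bool = False) -> str:
    """Decide how to handle an AppendFact call at the given fill level."""
    # Safety-critical deployments must never be configured to drop data.
    if safety_critical and policy is Overflow.DROP_OLDEST:
        raise ValueError("drop-oldest is not allowed in safety-critical deployments")
    if fill_percent < BACKPRESSURE_THRESHOLD:
        return "accept"
    if safety_critical:
        return "reject_and_alert"  # reject, then page operations
    return "reject" if policy is Overflow.REJECT_NEW else "drop_oldest"
```

Validating the policy before checking the fill level means a misconfigured safety-critical node fails at startup or first call, rather than at the moment the buffer fills.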

Failure Mode Analysis

1. ISB Crash (Store Buffer Failure)

Behavior:
  - Producers unable to append facts (AppendFact fails)
  - Existing facts in ISB are lost (if not replicated)
  - Consumers can still fetch from IFB (if facts already transferred)

Recovery:
  1. Detect: AppendFact returns error for > 1 minute
  2. Alert operations team
  3. Restart ISB (with data recovery if applicable)
  4. Producers retry AppendFact (idempotent)
  5. Confirm replication lag recovers
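Step 4 relies on AppendFact being idempotent, so a producer can resend the same fact until the ISB is back. A minimal retry sketch follows; `append_fact` stands in for whatever client call the producer uses, and the exception type, attempt count, and backoff values are assumptions.

```python
import time
from typing import Any, Callable


def append_with_retry(append_fact: Callable[[Any], Any], fact: Any,
                      attempts: int = 5, base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> Any:
    """Resend the same fact with exponential backoff until it is accepted."""
    for attempt in range(attempts):
        try:
            # The fact (and its idempotency key) is unchanged between tries,
            # so a retry after a partially applied append is harmless.
            return append_fact(fact)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # give up and surface the failure to the caller
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter exists so the backoff can be replaced in tests; production callers would leave it at the default.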

2. IFB Crash (Forward Buffer Failure)

Behavior:
  - Consumers unable to fetch facts (FetchFacts fails)
  - Cursor does not advance (confirmations stall)
  - ISB continues accepting writes (but will overflow if IFB stays down)

Recovery:
  1. Detect: FetchFacts returns error or consumer backlog > 10,000 facts
  2. Alert operations team
  3. Restart IFB (with data recovery)
  4. Verify cursor is recovered from checkpoint
  5. Consumers retry FetchFacts
  6. Confirm confirmation rate recovers

3. Network Partition (Boundary Unreachable)

Behavior:
  - Zone A BSFG cannot reach Zone B BSFG
  - Gate closes: autonomous mode activated
  - Producers in Zone A continue writing to ISB
  - Consumers in Zone A continue reading from IFB
  - Replication lag stalls (frontier does not advance)
  - Buffer fill increases over time (facts not replicated)

Typical duration: minutes to hours

Recovery:
  1. Monitor: replication lag > 30s = probable partition
  2. Verify: ping / traceroute to peer zone
  3. Check: firewall rules, TLS certificate validity, peer availability
  4. Fix: repair network, restore DNS, update firewall rules
  5. Reconnect: Reconciliation mode activates
  6. Replay: store buffer replays unconfirmed facts to forward buffer
  7. Confirm: cursor advances, buffer drains, replication lag returns to normal
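Steps 6 and 7 can be pictured with a toy model: the store buffer resends everything past the last confirmed offset, and the forward buffer ignores facts it already holds. The data shapes and names below are hypothetical, and the sketch advances the cursor as soon as a fact is delivered, whereas a real node would wait for the peer's confirmation.

```python
from typing import Dict, List, Tuple

Fact = Tuple[int, str, str]  # (offset, idempotency_key, payload)


def replay_unconfirmed(store: List[Fact], forward: Dict[str, str], cursor: int) -> int:
    """Replay facts with offset > cursor into the forward buffer.

    Returns the new cursor. `store` must be ordered by offset.
    """
    for offset, key, payload in store:
        if offset <= cursor:
            continue  # already confirmed before the partition
        # putIfAbsent semantics: a fact delivered just before the
        # partition is not stored twice.
        forward.setdefault(key, payload)
        cursor = offset
    return cursor
```

Because delivery is idempotent, it is safe to replay from a cursor that is slightly stale; the cost is redundant traffic, not duplicate facts.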

4. Hash Collision (Idempotency Key Collision)

Behavior (extremely rare):
  - Two different facts hash to the same idempotency_key
  - putIfAbsent rejects the second fact (already exists)
  - Producer sees "AlreadyExists" error

Prevention:
  - Use strong hash (SHA-256, not MD5 or CRC)
  - Use explicit producer event IDs (not payload hash) if hash collisions are a concern
  - Monitor for unusual rejection rates

Recovery:
  - Producer should emit a new fact with a different message_id
  - Update business process to avoid the collision
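The prevention advice can be sketched as follows: derive the key from an explicit producer event ID with SHA-256, and compare payloads on a key match so that a benign retry can be told apart from a true collision. The function names, the key layout, and the three return values are assumptions; the source only states that putIfAbsent rejects the second fact.

```python
import hashlib
import json
from typing import Dict


def idempotency_key(producer_id: str, event_id: str) -> str:
    """SHA-256 over an unambiguous encoding of (producer_id, event_id)."""
    encoded = json.dumps([producer_id, event_id]).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def put_if_absent(store: Dict[str, bytes], key: str, payload: bytes) -> str:
    """Store the payload unless the key exists; classify the outcome."""
    existing = store.get(key)
    if existing is None:
        store[key] = payload
        return "stored"
    # Same key and same payload is an ordinary retry; a different
    # payload under the same key is the rare collision case.
    return "duplicate" if existing == payload else "collision"
```

Counting "collision" outcomes separately from "duplicate" outcomes is one way to implement the "monitor for unusual rejection rates" item above.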

5. Buffer Exhaustion (Capacity Limits Exceeded)

Behavior:
  - Buffer reaches 100% capacity (e.g., 100GB ISB full)
  - Backpressure policy (triggered at 80% fill) is in force: reject or drop
  - Producers may experience failures or data loss (if drop-oldest is enabled)

Root causes:
  - Consumers are dead or hung (not confirming)
  - TTL too long (facts retained too long)
  - Throughput too high for capacity

Recovery:
  1. Alert: buffer_fill == 100%
  2. Diagnosis: check consumer status, confirmation rate, and backlog
  3. Action:
     - If consumer down: restart consumer
     - If throughput high: increase capacity or reduce TTL
     - If misconfigured: review retention policy
  4. Drain: buffer fill decreases as facts are confirmed and truncated

Node Upgrades and Restarts

Rolling Upgrade (HA Setup)

If a zone has multiple BSFG node instances:

1. Drain traffic from node 1 (stop accepting new connections)
2. Wait for existing RPC calls to complete (grace period: 30 seconds)
3. Upgrade node 1 (binary, configuration, dependencies)
4. Restart node 1
5. Verify connectivity: test RPC calls to peer zones
6. Re-enable traffic to node 1
7. Repeat for nodes 2, 3, etc.
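The rolling procedure can be sketched as a loop that halts as soon as one node fails verification, so the remaining nodes keep serving on the old version. All callables are placeholders for site-specific tooling; `drain` is assumed to include the 30-second grace period.

```python
from typing import Callable, Iterable, List


def rolling_upgrade(nodes: Iterable[str],
                    drain: Callable[[str], None],
                    upgrade: Callable[[str], None],
                    restart: Callable[[str], None],
                    verify: Callable[[str], bool],
                    enable: Callable[[str], None]) -> List[str]:
    """Upgrade nodes one at a time; return the nodes that were upgraded."""
    upgraded: List[str] = []
    for node in nodes:
        drain(node)      # stop new connections, wait out in-flight RPCs
        upgrade(node)    # binary, configuration, dependencies
        restart(node)
        if not verify(node):  # e.g. test RPC calls to peer zones
            # Leave the node out of rotation and stop the rollout.
            raise RuntimeError(f"{node} failed verification; halting rollout")
        enable(node)
        upgraded.append(node)
    return upgraded
```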

Single Node Upgrade (No HA)

Without HA, the upgrade causes temporary unavailability:

1. Plan upgrade during maintenance window
2. Notify consumers and producers (may see timeouts)
3. Upgrade and restart BSFG node
4. Verify recovery: check replication lag, consumer backlog
5. Confirm zone is healthy before allowing normal traffic

Certificate Rotation

Planned Rotation (Before Expiry)

1. Generate new certificate with same CN (zone identity)
2. Install new certificate on BSFG node (or all instances)
3. Reload or restart BSFG service
4. Verify TLS handshake succeeds with peer zones
5. Confirm RPC connectivity with peers
6. Archive old certificate for audit trail
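To decide when a planned rotation is due, the 30-day threshold from the alert table can be checked against the certificate's expiry time. The sketch below assumes the notAfter timestamp has already been extracted from the certificate as a timezone-aware datetime; the function names are hypothetical.

```python
from datetime import datetime, timezone

RENEWAL_WINDOW_DAYS = 30  # threshold from the alert table


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Days remaining before the certificate expires (negative if expired)."""
    return (not_after - now).total_seconds() / 86400


def needs_renewal(not_after: datetime, now: datetime,
                  window_days: int = RENEWAL_WINDOW_DAYS) -> bool:
    """True when the certificate expires within the renewal window."""
    return days_until_expiry(not_after, now) < window_days
```

Passing `now` explicitly keeps the check deterministic; a scheduled job would call it with `datetime.now(timezone.utc)`.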

Emergency Rotation (Compromised Key)

1. Immediately generate new certificate (new key)
2. Install new certificate
3. Restart BSFG node (force reconnection with peer zones)
4. Monitor for connection errors (peers must trust new certificate)
5. If peers use pinned CA, notify them to reload CA root
6. Destroy old private key securely

Operational Checklist

☐ Set up monitoring dashboard (replication lag, backlog, buffer fill, RPC latency)
☐ Configure alerts for each threshold
☐ Document alert response procedures
☐ Set up certificate expiry alerts (email reminder 30 days before expiry)
☐ Create runbook for each failure mode above
☐ Test failure mode recovery (simulate network partition, crashes)
☐ Document upgrade procedures (single node, HA rolling upgrade)
☐ Train operations team on BSFG troubleshooting
☐ Set up audit logging for RPC calls, zone identities, confirmations
