Audience: Operators, support engineers. Use: Follow routine operating procedures and incident-response steps for BSFG systems.
Operational Overview
BSFG has a simple operational model because of its stateless-at-boundary architecture. The key operational concerns are durability and connectivity, not complex choreography.
Key Metrics
Monitor these metrics for each BSFG node and zone:
Replication Lag
The delay between a fact being appended to ISB/ESB and confirmed at IFB/EFB.
- Healthy: < 100ms (normal operation)
- Warning: 100ms–1s (minor delay, check network)
- Alert: > 1s (network partition or boundary issue)
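As a rough illustration, the lag thresholds above could be applied as follows. The function names are hypothetical (not part of BSFG), and the sketch assumes both timestamps are in milliseconds from a comparable clock.

```python
def replication_lag_ms(appended_at_ms: int, confirmed_at_ms: int) -> int:
    """Delay between append at the store buffer and confirmation at the forward buffer."""
    # Clamp at zero so small clock skew never yields a negative lag.
    return max(0, confirmed_at_ms - appended_at_ms)


def classify_replication_lag(lag_ms: float) -> str:
    """Map a lag value onto the Healthy / Warning / Alert bands listed above."""
    if lag_ms < 100:
        return "healthy"
    if lag_ms <= 1000:
        return "warning"   # minor delay, check network
    return "alert"         # network partition or boundary issue
```

For example, a fact appended at t=1000 ms and confirmed at t=1040 ms has a lag of 40 ms, which falls in the healthy band.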
Consumer Backlog
The number of unconfirmed facts at a forward buffer (IFB/EFB).
- Healthy: 0–100 facts (consumers keeping up)
- Warning: 100–10,000 facts (consumer lag, check processing)
- Alert: > 10,000 facts (consumer stalled or boundary closed)
Buffer Fill Percentage
Store buffers (ISB/ESB) have configurable capacity (e.g., 100GB). Track fill ratio:
- Healthy: < 50% capacity
- Warning: 50–80% capacity (approaching limits)
- Alert: > 80% capacity (trigger backpressure or cleanup)
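A minimal sketch of the fill-ratio calculation, assuming the node can report used and configured bytes for a store buffer. The function names are illustrative, not BSFG APIs.

```python
def buffer_fill_percent(used_bytes: int, capacity_bytes: int) -> float:
    """Fill ratio of a store buffer as a percentage of configured capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 100.0 * used_bytes / capacity_bytes


def classify_buffer_fill(fill_percent: float) -> str:
    """Map a fill percentage onto the bands listed above."""
    if fill_percent < 50:
        return "healthy"
    if fill_percent <= 80:
        return "warning"   # approaching limits
    return "alert"         # trigger backpressure or cleanup
```

With a 100GB buffer holding 45GB of facts, `buffer_fill_percent(45 * 10**9, 100 * 10**9)` gives 45.0, which is in the healthy band.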
Confirmation Rate
Facts confirmed per second at each forward buffer.
- Healthy: Steady rate matching producer throughput
- Warning: Rate drops suddenly (network issue or consumer failure)
- Alert: Rate drops to zero for > 1 minute (boundary closed or consumers stopped)
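The "zero for more than 1 minute" condition needs a little state, because a single zero sample is not yet an alert. The sketch below is one possible way to track it; the class and its parameters are assumptions for illustration.

```python
from typing import Optional


class ConfirmationStallDetector:
    """Flags a forward buffer whose confirmation count has not moved for too long."""

    def __init__(self, stall_after_s: float = 60.0) -> None:
        self.stall_after_s = stall_after_s
        self._last_progress_at: Optional[float] = None

    def observe(self, now_s: float, confirmed_since_last: int) -> None:
        # The first observation starts the clock even if nothing was confirmed.
        if confirmed_since_last > 0 or self._last_progress_at is None:
            self._last_progress_at = now_s

    def is_stalled(self, now_s: float) -> bool:
        if self._last_progress_at is None:
            return False
        return now_s - self._last_progress_at > self.stall_after_s
```

Feed `observe` from whatever polling loop scrapes the confirmation counter, and check `is_stalled` on each tick.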
TLS Handshake Errors
mTLS connection failures:
- Certificate validation failures (expired, wrong issuer, untrusted CA)
- Peer identity mismatch (zone_id does not match claim)
- Alert immediately on any TLS error
RPC Latency
Time to complete RPC operations (AppendFact, FetchFacts, ConfirmReceipt, PutObject):
- Healthy: p99 < 50ms
- Warning: p99 50–500ms (check network or load)
- Alert: p99 > 500ms or timeout (network partition or overload)
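If the metrics pipeline exposes raw latency samples rather than a precomputed p99, the percentile can be derived with the nearest-rank method. This is a sketch under that assumption; the helper names are hypothetical.

```python
import math
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def classify_rpc_latency(p99_ms: float, timed_out: bool = False) -> str:
    """Map p99 latency (or a timeout) onto the bands listed above."""
    if timed_out or p99_ms > 500:
        return "alert"
    if p99_ms >= 50:
        return "warning"
    return "healthy"
```

Note that one slow call in 100 does not move the p99 under nearest-rank, but two slow calls in 100 do.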
Alert Thresholds
| Condition | Threshold | Action |
|---|---|---|
| Replication lag > 1 second | 1 second | Check network, verify boundary connectivity |
| Consumer backlog > 10,000 | 10,000 facts | Check consumer health, verify processing |
| Buffer fill > 80% | 80% capacity | Check retention policy, trigger cleanup |
| TLS handshake error | Any failure | Alert immediately; verify certificates and CA trust |
| Certificate expires in < 30 days | 30 days | Begin renewal workflow |
| Fact TTL expires (7 days default) | TTL threshold | Alert if unconfirmed facts will be truncated |
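The table can be encoded as a single evaluation function. The sketch below is illustrative only: `ZoneMetrics` and its field names are hypothetical, and the one-day TTL warning margin is an assumption, since the table does not say how far ahead of truncation to alert.

```python
from dataclasses import dataclass
from typing import List

FACT_TTL_DAYS = 7.0             # default TTL from the table
TTL_WARNING_MARGIN_DAYS = 1.0   # assumption: warn one day before truncation


@dataclass(frozen=True)
class ZoneMetrics:
    replication_lag_s: float
    consumer_backlog: int
    buffer_fill_percent: float
    tls_handshake_errors: int
    cert_days_to_expiry: float
    oldest_unconfirmed_age_days: float


def evaluate_alerts(m: ZoneMetrics) -> List[str]:
    """Return the actions from the table whose conditions currently hold."""
    actions = []
    if m.replication_lag_s > 1:
        actions.append("Check network, verify boundary connectivity")
    if m.consumer_backlog > 10_000:
        actions.append("Check consumer health, verify processing")
    if m.buffer_fill_percent > 80:
        actions.append("Check retention policy, trigger cleanup")
    if m.tls_handshake_errors > 0:
        actions.append("Verify certificates, CA trust")
    if m.cert_days_to_expiry < 30:
        actions.append("Begin renewal workflow")
    if m.oldest_unconfirmed_age_days >= FACT_TTL_DAYS - TTL_WARNING_MARGIN_DAYS:
        actions.append("Unconfirmed facts will be truncated")
    return actions
```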
Backpressure Policy
When buffer capacity approaches limits, BSFG enforces a backpressure policy.
Standard Deployment (Non-Safety-Critical)
if (buffer_fill >= 80%) {
    // Two options (choose one):
    option_1: reject new AppendFact calls
              (return error to producer)
    option_2: drop oldest unacknowledged facts
              (truncate without waiting for confirmation)
}
Safety-Critical / SIL-Regulated Deployment
In regulated environments (FDA, IEC 61508):
if (buffer_fill >= 80%) {
    MUST: reject new AppendFact calls
    MUST_NOT: drop unacknowledged data
    ACTION: alert operations team, trigger manual intervention
}
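The two policies can be sketched in runnable form as below. This is not BSFG's implementation: the class names are hypothetical, and capacity is counted in facts rather than bytes to keep the example short.

```python
from collections import deque
from enum import Enum


class Policy(Enum):
    REJECT_NEW = "reject_new"
    DROP_OLDEST = "drop_oldest"


class BufferFullError(Exception):
    """Returned to the producer when an append is rejected."""


class StoreBuffer:
    THRESHOLD_PERCENT = 80

    def __init__(self, capacity: int, policy: Policy, safety_critical: bool = False) -> None:
        # Safety-critical deployments MUST NOT drop unacknowledged data.
        if safety_critical and policy is Policy.DROP_OLDEST:
            raise ValueError("drop-oldest is not allowed in safety-critical deployments")
        self.capacity = capacity
        self.policy = policy
        self.facts = deque()

    def _over_threshold(self) -> bool:
        # Integer comparison avoids floating-point edge cases at exactly 80%.
        return len(self.facts) * 100 >= self.THRESHOLD_PERCENT * self.capacity

    def append_fact(self, fact) -> None:
        if self._over_threshold():
            if self.policy is Policy.REJECT_NEW:
                raise BufferFullError("buffer fill >= 80%")
            self.facts.popleft()  # drop the oldest unacknowledged fact
        self.facts.append(fact)
```

Rejecting the drop-oldest policy at construction time means a misconfigured safety-critical node fails at startup rather than silently losing data under load.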
Failure Mode Analysis
1. ISB Crash (Store Buffer Failure)
Behavior:
- Producers unable to append facts (AppendFact fails)
- Existing facts in ISB are lost (if not replicated)
- Consumers can still fetch from IFB (if facts already transferred)
Recovery:
1. Detect: AppendFact returns error for > 1 minute
2. Alert operations team
3. Restart ISB (with data recovery if applicable)
4. Producers retry AppendFact (idempotent)
5. Confirm replication lag recovers
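Step 4 relies on AppendFact being idempotent, so a producer can safely resend the same fact after the ISB comes back. A minimal producer-side sketch follows, assuming a hypothetical `append_fact(key, payload)` callable that raises `ConnectionError` while the ISB is down. The backoff values are illustrative.

```python
import time
from typing import Callable


def append_with_retry(
    append_fact: Callable[[str, bytes], str],
    idempotency_key: str,
    payload: bytes,
    max_attempts: int = 5,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry an append with exponential backoff, reusing the same idempotency key."""
    for attempt in range(max_attempts):
        try:
            return append_fact(idempotency_key, payload)
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up and surface the error to the caller
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Because the key does not change between attempts, a retry that arrives after an earlier attempt actually succeeded does not create a duplicate fact.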
2. IFB Crash (Forward Buffer Failure)
Behavior:
- Consumers unable to fetch facts (FetchFacts fails)
- Cursor does not advance (confirmations stall)
- ISB continues accepting writes (but will overflow if IFB stays down)
Recovery:
1. Detect: FetchFacts returns error or consumer backlog > 10,000 facts
2. Alert operations team
3. Restart IFB (with data recovery)
4. Verify cursor is recovered from checkpoint
5. Consumers retry FetchFacts
6. Confirm confirmation rate recovers
3. Network Partition (Boundary Unreachable)
Behavior:
- Zone A BSFG cannot reach Zone B BSFG
- Gate closes: autonomous mode activated
- Producers in Zone A continue writing to ISB
- Consumers in Zone A continue reading from IFB
- Replication lag stalls (frontier does not advance)
- Buffer fill increases over time (facts not replicated)
Typical duration: minutes to hours.
Recovery:
1. Monitor: replication lag > 30s = probable partition
2. Verify: ping / traceroute to peer zone
3. Check: firewall rules, TLS certificate validity, peer availability
4. Fix: repair network, restore DNS, update firewall rules
5. Reconnect: Reconciliation mode activates
6. Replay: store buffer replays unconfirmed facts to forward buffer
7. Confirm: cursor advances, buffer drains, replication lag returns to normal
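Steps 6 and 7 can be pictured as a replay of everything past the last confirmed offset, with the forward buffer deduplicating by idempotency key. The sketch below models both buffers with plain Python containers. The data layout is an assumption for illustration, not the BSFG wire format.

```python
from typing import Dict, Iterable, Tuple

Fact = Tuple[int, str, bytes]  # (offset, idempotency_key, payload)


def replay_unconfirmed(
    store_facts: Iterable[Fact],
    confirmed_frontier: int,
    forward_buffer: Dict[str, bytes],
) -> int:
    """Replay facts past the confirmed frontier; return how many were newly delivered."""
    delivered = 0
    for offset, key, payload in store_facts:
        if offset <= confirmed_frontier:
            continue  # already confirmed before the partition
        if key not in forward_buffer:  # putIfAbsent semantics
            forward_buffer[key] = payload
            delivered += 1
    return delivered
```

Facts that reached the forward buffer before the partition but were never confirmed are replayed and silently skipped, which is why replay is safe to repeat.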
4. Hash Collision (Idempotency Key Collision)
Behavior (extremely rare):
- Two different facts hash to the same idempotency_key
- putIfAbsent rejects the second fact (already exists)
- Producer sees "AlreadyExists" error
Prevention:
- Use strong hash (SHA-256, not MD5 or CRC)
- Use explicit producer event IDs (not payload hash) if hash collisions are a concern
- Monitor for unusual rejection rates
Recovery:
- Producer should emit a new fact with a different message_id
- Update business process to avoid the collision
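The two key strategies from the prevention list can be sketched as follows. The in-memory `FactStore` and the separate "Collision" status are assumptions made for illustration: the text above only says the producer sees "AlreadyExists", so comparing payloads to tell a benign retry from a true collision is a possible refinement, not documented BSFG behavior.

```python
import hashlib
from typing import Dict


def payload_key(payload: bytes) -> str:
    """Idempotency key derived from the payload with SHA-256."""
    return hashlib.sha256(payload).hexdigest()


def event_key(producer_id: str, event_id: int) -> str:
    """Idempotency key built from an explicit producer event ID (no hashing involved)."""
    return f"{producer_id}:{event_id}"


class FactStore:
    def __init__(self) -> None:
        self._facts: Dict[str, bytes] = {}

    def put_if_absent(self, key: str, payload: bytes) -> str:
        existing = self._facts.get(key)
        if existing is None:
            self._facts[key] = payload
            return "Created"
        if existing == payload:
            return "AlreadyExists"  # same fact resent: a normal retry
        return "Collision"          # same key, different content: investigate
```

Tracking the rate of the "Collision" outcome separately from ordinary retries is one way to implement the "monitor for unusual rejection rates" item.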
5. Buffer Exhaustion (Capacity Limits Exceeded)
Behavior:
- Buffer reaches 100% capacity (e.g., 100GB ISB full)
- Backpressure policy is in effect: new AppendFact calls are rejected or the oldest unacknowledged facts are dropped
- Producers may experience failures or data loss (if drop-oldest is enabled)
Root causes:
- Consumers are dead or hung (not confirming)
- TTL too long (facts retained too long)
- Throughput too high for capacity
Recovery:
1. Alert: buffer_fill == 100%
2. Diagnosis: check consumer status, confirmation rate, and backlog
3. Action:
- If consumer down: restart consumer
- If throughput high: increase capacity or reduce TTL
- If misconfigured: review retention policy
4. Drain: buffer fill decreases as facts are confirmed and truncated
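The decision in step 3 can be written as a small helper. The inputs are assumptions: `sustainable_rate` stands for whatever ingest rate the configured capacity and TTL can absorb, which this document does not define precisely.

```python
def recovery_action(consumer_alive: bool, ingest_rate: float, sustainable_rate: float) -> str:
    """Pick the step-3 action for a full buffer, checking the cheapest cause first."""
    if not consumer_alive:
        return "restart consumer"
    if ingest_rate > sustainable_rate:
        return "increase capacity or reduce TTL"
    return "review retention policy"
```

Consumer health is checked first because a dead or hung consumer fills the buffer regardless of throughput or retention settings.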
Node Upgrades and Restarts
Rolling Upgrade (HA Setup)
If a zone has multiple BSFG node instances:
- Drain traffic from node 1 (stop accepting new connections)
- Wait for existing RPC calls to complete (grace period: 30 seconds)
- Upgrade node 1 (binary, configuration, dependencies)
- Restart node 1
- Verify connectivity: test RPC calls to peer zones
- Re-enable traffic to node 1
- Repeat for nodes 2, 3, etc.
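The sequence above can be automated along these lines. All callables (`drain`, `upgrade`, `restart`, `verify`, `enable`) are hypothetical hooks into whatever tooling manages the nodes. The sketch only fixes the ordering and stops on the first failed connectivity check.

```python
import time
from typing import Callable, Iterable


def rolling_upgrade(
    nodes: Iterable[str],
    drain: Callable[[str], None],
    upgrade: Callable[[str], None],
    restart: Callable[[str], None],
    verify: Callable[[str], bool],
    enable: Callable[[str], None],
    grace_period_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Upgrade nodes one at a time so the zone keeps serving traffic."""
    for node in nodes:
        drain(node)             # stop accepting new connections
        sleep(grace_period_s)   # let in-flight RPC calls complete
        upgrade(node)
        restart(node)
        if not verify(node):    # test RPC calls to peer zones
            raise RuntimeError(f"{node} failed connectivity check; halting upgrade")
        enable(node)            # re-enable traffic before moving on
```

Halting on a failed check leaves the remaining nodes on the old version and still serving, which keeps a bad release from taking down the whole zone.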
Single Node Upgrade (No HA)
Without HA, the upgrade causes temporary unavailability:
- Plan upgrade during maintenance window
- Notify consumers and producers (may see timeouts)
- Upgrade and restart BSFG node
- Verify recovery: check replication lag, consumer backlog
- Confirm zone is healthy before allowing normal traffic
Certificate Rotation
Planned Rotation (Before Expiry)
- Generate new certificate with same CN (zone identity)
- Install new certificate on BSFG node (or all instances)
- Reload or restart BSFG service
- Verify TLS handshake succeeds with peer zones
- Confirm RPC connectivity with peers
- Archive old certificate for audit trail
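To decide when planned rotation is due, the certificate's `notAfter` field can be compared against the 30-day renewal threshold. The sketch below uses the standard library's `ssl.cert_time_to_seconds`. The function names and the way the `notAfter` string is obtained (for example, from a parsed certificate) are assumptions.

```python
import ssl

SECONDS_PER_DAY = 86_400


def days_until_expiry(not_after: str, now_epoch: float) -> float:
    """Days remaining before a certificate's notAfter time.

    `not_after` uses the certificate text format, e.g. "Jan  5 09:34:43 2018 GMT".
    """
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / SECONDS_PER_DAY


def needs_renewal(not_after: str, now_epoch: float, threshold_days: float = 30.0) -> bool:
    """True once the certificate is inside the renewal window."""
    return days_until_expiry(not_after, now_epoch) < threshold_days
```

In production, pass `time.time()` as `now_epoch`. Taking it as a parameter keeps the check easy to test.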
Emergency Rotation (Compromised Key)
- Immediately generate new certificate (new key)
- Install new certificate
- Restart BSFG node (force reconnection with peer zones)
- Monitor for connection errors (peers must trust new certificate)
- If peers pin the CA, notify them to reload the CA root
- Destroy old private key securely
Operational Checklist
- ☐ Set up monitoring dashboard (replication lag, backlog, buffer fill, RPC latency)
- ☐ Configure alerts for each threshold
- ☐ Document alert response procedures
- ☐ Set up certificate expiry alerts (email reminder 30 days before expiry)
- ☐ Create runbook for each failure mode above
- ☐ Test failure mode recovery (simulate network partition, crashes)
- ☐ Document upgrade procedures (single node, HA rolling upgrade)
- ☐ Train operations team on BSFG troubleshooting
- ☐ Set up audit logging for RPC calls, zone identities, confirmations