Audience: Operators, SREs, platform engineers. Use: Understand the monitoring model, alerting signals, and operational visibility expectations.
Overview
BSFG deployments require operational discipline to maintain reliability in industrial environments. Operators must be able to detect issues, understand system state, and perform safe upgrades without data loss.
This guide provides practical procedures for:
- Monitoring replication health and consumer lag
- Detecting and diagnosing operational failures
- Responding to incidents with confidence
- Performing safe node restarts and upgrades
- Auditing cross-zone synchronization
Monitoring Architecture
Typical BSFG monitoring architecture involves:
BSFG Node
├── Metrics (Prometheus format)
│   └→ Prometheus Server
│       └→ Grafana Dashboards
└── Logs (structured JSON or syslog)
    └→ Log Aggregation (ELK, OpenSearch, Splunk, Loki)
Components
- BSFG Node: Emits metrics and logs during operation
- Metrics Exporter: Exposes metrics in Prometheus format on `/metrics` endpoint
- Prometheus: Scrapes metrics at regular intervals (e.g., every 15 seconds)
- Grafana: Visualizes metrics in dashboards, defines alerts
- Log Aggregation: Centralizes logs from all zones for analysis and audit
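The metrics endpoint in this architecture can be sketched with the standard library alone: render gauges and counters in the Prometheus text exposition format and serve them on `/metrics`. The metric names come from the tables below; the values, the `serve` helper, and port 9100 are illustrative assumptions, not part of BSFG itself.

```python
# Minimal sketch of a Prometheus /metrics endpoint, stdlib only.
# Metric names follow this guide; values here are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values: dict) -> str:
    """Render a flat dict of metric name -> value as Prometheus text."""
    lines = [f"{name} {value}" for name, value in sorted(values.items())]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics({
            "bsfg_node_up": 1,
            "bsfg_replication_lag_seconds": 0.4,
            "bsfg_store_buffer_fill_percent": 45,
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9100) -> None:
    """Block forever serving /metrics (call from a main entry point)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

In practice the node would export its real counters here and a Prometheus scrape job would target this port; a production exporter would also emit `# HELP`/`# TYPE` comments and labels.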
Core Metrics
BSFG should expose metrics in Prometheus format. Key metrics to monitor:
Replication Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_replication_lag_seconds | Gauge | Seconds between fact appended to ISB and confirmed at IFB | < 1 second |
| bsfg_replication_lag_high_watermark | Gauge | Highest replication lag observed (for alerting on spikes) | < 5 seconds |
| bsfg_frontier_offset | Gauge | Current frontier (highest contiguous committed offset) | Increasing monotonically |
| bsfg_fetch_requests_total | Counter | Total FetchFacts RPC calls (peer-to-peer) | Steady or increasing |
| bsfg_confirm_requests_total | Counter | Total ConfirmReceipt RPC calls | Matches fetch activity |
Ingestion Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_append_requests_total | Counter | Total AppendFact RPC calls (producer ingestion) | Matches producer throughput |
| bsfg_append_failures_total | Counter | AppendFact failures (retryable or permanent) | Zero or very low |
| bsfg_store_buffer_fill_percent | Gauge | Percentage of ISB/ESB capacity used | < 50% |
| bsfg_append_latency_seconds | Histogram | Time to append fact (p50, p95, p99) | p99 < 100ms |
Consumer Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_consumer_backlog_size | Gauge | Unconfirmed facts waiting for consumer processing | < 100 facts |
| bsfg_consumer_lag_seconds | Gauge | Age of oldest unconfirmed fact at IFB/EFB | < 10 seconds |
| bsfg_confirm_rate_per_second | Gauge | Facts confirmed per second (consumer throughput) | Steady, matching producer rate |
Artifact Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_artifact_upload_total | Counter | Total PutObject calls (artifact uploads) | Matches producer artifact activity |
| bsfg_artifact_upload_failures | Counter | Failed artifact uploads | Zero or very low |
| bsfg_artifact_retrieval_total | Counter | Total GetObject calls (artifact downloads) | Matches consumer artifact activity |
| bsfg_artifact_retrieval_failures | Counter | Failed artifact retrievals (missing or inaccessible) | Zero |
| bsfg_object_store_usage_bytes | Gauge | Artifact storage used by bucket | Trending within capacity |
System Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_node_up | Gauge | Node health indicator (1 = up, 0 = down) | 1 (up) |
| bsfg_certificate_expiry_seconds | Gauge | Seconds until mTLS certificate expires | > 2,592,000 (30 days) |
| bsfg_tls_handshake_errors_total | Counter | mTLS handshake failures (peer auth failure) | Zero |
| bsfg_rpc_latency_seconds | Histogram | RPC call duration (AppendFact, FetchFacts, etc.) | p99 < 500ms |
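A quick health probe can scrape `/metrics`, parse the simple unlabeled series, and compare them against the healthy ranges above. This is a stdlib-only sketch; the probe URL and the choice of which ranges to check are assumptions, and labeled series (histograms, quantiles) are deliberately skipped.

```python
# Sketch of a health probe: scrape, parse, and check healthy ranges.
import urllib.request


def parse_metrics(text: str) -> dict:
    """Parse 'name value' lines; skips comments and labeled series."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        if "{" in name:  # labeled series (histograms etc.) not handled here
            continue
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics


def check_health(metrics: dict) -> list:
    """Return human-readable problems per the healthy ranges above."""
    problems = []
    if metrics.get("bsfg_replication_lag_seconds", 0) >= 1:
        problems.append("replication lag above 1s healthy range")
    if metrics.get("bsfg_store_buffer_fill_percent", 0) >= 50:
        problems.append("store buffer fill above 50% healthy range")
    if metrics.get("bsfg_node_up", 1) != 1:
        problems.append("node reports down")
    return problems


def probe(url: str = "http://localhost:9100/metrics") -> list:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return check_health(parse_metrics(resp.read().decode()))
```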
Alerting Rules
Configure alerts based on these thresholds. Adjust thresholds per deployment based on expected throughput and latency.
| Condition | Threshold | Severity | Action |
|---|---|---|---|
| Replication lag > 5 seconds | 5 sec | Warning | Check network, verify peer connectivity |
| Replication lag > 30 seconds | 30 sec | Critical | Check for network partition, node failure, or consumer backlog |
| Consumer backlog > 1000 facts | 1000 | Warning | Check consumer health; may be processing slowly or stalled |
| Consumer lag > 60 seconds | 60 sec | Critical | Consumer is severely behind; investigate failure |
| Store buffer fill > 80% | 80% | Warning | Buffer approaching capacity; check replication and consumer progress |
| Store buffer fill > 95% | 95% | Critical | Buffer near exhaustion; risk of producer backpressure or data loss |
| Append failures > 0 (per minute) | > 0 | Warning | Producer experiencing errors; check why AppendFact is failing |
| Artifact retrieval failures > 0 | > 0 | Critical | Consumer cannot retrieve artifact; storage issue or missing artifact |
| TLS handshake errors > 0 | > 0 | Critical | Peer authentication failing; check certificates and CA trust |
| Certificate expiry in < 30 days | 30 days | Warning | Begin certificate renewal process |
| Node down (bsfg_node_up = 0) | N/A | Critical | Node is unreachable; check health, restart if necessary |
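The alert table can be encoded as data so thresholds live in one tunable place, with the highest-severity rule per metric winning (a lag of 40 s should fire the critical alert, not both). This is a sketch of that evaluation logic under the default thresholds above; wiring the fired alerts to a pager or chat channel is left out.

```python
# Alert rules as data: (metric, threshold, severity, action).
# Per metric, rules are ordered highest threshold first so the most
# severe matching rule fires and lower ones are suppressed.
RULES = [
    ("bsfg_replication_lag_seconds", 30, "critical",
     "check for network partition, node failure, or consumer backlog"),
    ("bsfg_replication_lag_seconds", 5, "warning",
     "check network, verify peer connectivity"),
    ("bsfg_consumer_backlog_size", 1000, "warning", "check consumer health"),
    ("bsfg_consumer_lag_seconds", 60, "critical", "consumer severely behind"),
    ("bsfg_store_buffer_fill_percent", 95, "critical", "buffer near exhaustion"),
    ("bsfg_store_buffer_fill_percent", 80, "warning", "buffer approaching capacity"),
]


def evaluate(metrics: dict) -> list:
    """Return (metric, severity, action) for each fired alert."""
    fired, suppressed = [], set()
    for name, threshold, severity, action in RULES:
        if name in suppressed:
            continue  # a more severe rule for this metric already fired
        if metrics.get(name, 0) > threshold:
            fired.append((name, severity, action))
            suppressed.add(name)
    return fired
```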
Log Structure
BSFG nodes should emit structured logs for observability. Logs should include:
Log Fields
- timestamp: RFC3339 format (e.g., 2026-03-06T14:30:00Z)
- level: INFO, WARN, ERROR, FATAL
- message_id: Unique identifier for the fact (if applicable)
- from_zone: Source zone (producer or peer)
- to_zone: Destination zone (consumer or peer)
- operation: AppendFact, FetchFacts, ConfirmReceipt, PutObject
- predicate: Fact predicate (if applicable)
- result: success, retryable_error, permanent_error
- error_message: Human-readable error (if error)
- duration_ms: Operation latency
Example Log Entry
{
  "timestamp": "2026-03-06T14:30:45.123Z",
  "level": "INFO",
  "zone": "enterprise-bsfg",
  "message_id": "msg_abc123def456",
  "operation": "AppendFact",
  "predicate": "order_created",
  "result": "success",
  "duration_ms": 12,
  "buffer_fill_percent": 45
}
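Log lines in this shape can be produced with the standard `logging` module and a JSON formatter that merges structured fields passed via `extra=`. A minimal sketch, assuming the zone name is injected as a constant and only a subset of the fields above; a real implementation would also carry `from_zone`, `to_zone`, and error fields.

```python
# Sketch of a JSON formatter for structured BSFG-style logs.
import json
import logging
import time

STRUCTURED_FIELDS = ("message_id", "operation", "predicate",
                     "result", "error_message", "duration_ms")


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "zone": "enterprise-bsfg",  # example value; inject per node
            "message": record.getMessage(),
        }
        # Fields passed via logger.info(..., extra={...}) land on the record.
        for field in STRUCTURED_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


logger = logging.getLogger("bsfg")
_handler = logging.StreamHandler()
_handler.setFormatter(JsonFormatter())
logger.addHandler(_handler)
logger.setLevel(logging.INFO)

logger.info("fact appended", extra={
    "message_id": "msg_abc123def456",
    "operation": "AppendFact",
    "predicate": "order_created",
    "result": "success",
    "duration_ms": 12,
})
```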
Troubleshooting Scenarios
Replication Has Stopped (Lag Growing)
Symptom: Replication lag increases continuously and facts stop flowing between zones.
Diagnosis:
- Check replication lag metric: is it > 30 seconds?
- Check consumer backlog: is it growing?
- Check network connectivity: ping between BSFG nodes
- Check firewall rules: is traffic allowed on RPC port (9443)?
- Check certificates: are they valid and trusted?
- Check logs for TLS errors or connection timeouts
Resolution:
- If network unreachable: restore connectivity, replication resumes automatically
- If certificate expired: rotate certificate, restart node
- If consumer backlog growing: restart consumer, monitor confirmation rate
- If no obvious cause: restart the BSFG node (durability guarantees mean no data is lost)
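The connectivity step above can be scripted: verify TCP reachability of each peer on the RPC port (9443, per this guide) before digging into certificates or logs. The peer hostnames here are hypothetical examples.

```python
# Sketch of a peer-reachability check on the BSFG RPC port.
import socket

RPC_PORT = 9443


def check_tcp(host: str, port: int = RPC_PORT, timeout: float = 3.0) -> bool:
    """True if a TCP connection to the peer's RPC port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused, or timeout
        return False


def report(peers) -> dict:
    """Map each peer hostname to its reachability."""
    return {peer: check_tcp(peer) for peer in peers}
```

A reachable port with continued TLS errors in the logs points at certificates rather than the network, which narrows the diagnosis quickly.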
Consumer Backlog Growing (Not Draining)
Symptom: Consumer backlog continuously increases; facts are fetched but not confirmed.
Diagnosis:
- Check confirm rate: is it zero or very low?
- Check consumer process: is it running or hung?
- Check logs for consumer errors or exceptions
- Check that the consumer is processing idempotently (it should skip duplicates rather than stall on them)
Resolution:
- Restart consumer process
- Review consumer code for blocking operations or deadlocks
- Verify consumer can write to its destination (database, file system)
- Monitor confirm rate recovery after restart
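The idempotency point above is the usual culprit: a consumer that refuses to confirm a duplicate never drains its backlog. A minimal sketch of a drain loop that processes each fact at most once (keyed by `message_id`) but confirms every fetched fact; `handle` and `confirm` are hypothetical stand-ins for the consumer's real processing and ConfirmReceipt calls.

```python
# Sketch of an idempotent consumer drain loop.
def drain(facts, handle, confirm, seen: set) -> int:
    """Process unconfirmed facts; return how many were newly processed.

    facts:   iterable of fact dicts with a "message_id" key
    handle:  callable applied once per unique fact
    confirm: callable invoked for every fact, duplicates included
    seen:    durable set of already-processed message_ids
    """
    processed = 0
    for fact in facts:
        mid = fact["message_id"]
        if mid not in seen:  # skip duplicates instead of stalling on them
            handle(fact)
            seen.add(mid)
            processed += 1
        confirm(mid)  # always confirm, so the backlog keeps draining
    return processed
```

In a real consumer the `seen` set would be persisted alongside the consumer's output (e.g. in the destination database transaction) so a restart does not reprocess facts.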
Artifact Retrieval Failures
Symptom: GetObject calls fail with "not found" or "access denied" errors.
Diagnosis:
- Check artifact reference in fact: bucket, key, digest
- Check object store: does the bucket exist? Is the key present?
- Check storage access: can BSFG node read from the bucket?
- Check retention policy: has the artifact been garbage collected?
Resolution:
- If artifact missing: verify producer uploaded it correctly
- If access denied: check IAM permissions for BSFG node
- If storage offline: restore storage, consumer retries automatically
- If artifact expired: extend the retention policy so artifacts outlive the facts that reference them
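When the object is present but suspect, verifying it against the digest in the fact's artifact reference separates corruption from access problems. A sketch, assuming the reference carries a hex-encoded SHA-256 digest (the actual digest algorithm is deployment-specific).

```python
# Sketch of artifact integrity verification against a fact's reference.
import hashlib


def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """Compare the SHA-256 of retrieved bytes to the fact's digest.

    Assumes a hex-encoded SHA-256 digest in the artifact reference;
    substitute the algorithm your deployment actually records.
    """
    return hashlib.sha256(data).hexdigest() == expected_digest
```

A mismatch here means the producer uploaded different bytes than the fact describes, or the object was overwritten, which is a different incident than a missing or inaccessible object.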
TLS Handshake Failures
Symptom: TLS handshake errors in logs, replication unable to establish connections.
Diagnosis:
- Check certificate validity: is it expired?
- Check certificate CN: does it match the peer's zone identity?
- Check CA trust: is the certificate CA trusted by peers?
- Check certificate chain: is the full chain available?
Resolution:
- Renew certificate if expired
- Verify CN matches expected zone identity
- Ensure all peers have updated CA root certificate
- Restart BSFG node to reload certificates
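The expiry check above pairs naturally with the 30-day renewal threshold from the alert table. Python's `ssl.cert_time_to_seconds` parses the `notAfter` timestamp format returned by both the `ssl` module's `getpeercert()` and `openssl x509 -enddate`, so the check reduces to date arithmetic; this is a sketch, not a full chain or CN validator.

```python
# Sketch: days until certificate expiry vs. the 30-day renewal threshold.
import ssl
import time

RENEWAL_THRESHOLD_DAYS = 30


def days_until_expiry(not_after: str, now=None) -> float:
    """not_after in notAfter format, e.g. 'Jun 15 12:00:00 2027 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    current = now if now is not None else time.time()
    return (expiry - current) / 86400


def needs_renewal(not_after: str) -> bool:
    """True once the certificate is inside the renewal window (or expired)."""
    return days_until_expiry(not_after) < RENEWAL_THRESHOLD_DAYS
```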
Store Buffer Exhaustion
Symptom: Store buffer fill > 95%, risk of backpressure or data loss.
Diagnosis:
- Check replication lag: facts being produced faster than replicated?
- Check consumer backlog: facts being replicated faster than consumed?
- Check retention policy: TTL too long, facts not being truncated?
Resolution:
- Increase buffer capacity if throughput permanently increased
- Reduce TTL if facts are being retained longer than needed
- Tune replication and consumer concurrency
- Verify backpressure policy (reject vs. drop oldest) is appropriate for SIL level
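During diagnosis it helps to estimate how much time remains before the buffer fills, from the current fill level and the net inflow rate. A back-of-envelope sketch; the capacity and rates below are illustrative numbers, not BSFG defaults.

```python
# Sketch: time-to-exhaustion estimate for a store buffer.
def hours_to_exhaustion(capacity_facts: float, fill_percent: float,
                        inflow_per_s: float, outflow_per_s: float) -> float:
    """Hours until the buffer hits 100%; inf if draining or stable.

    inflow:  appends plus replicated-in facts per second
    outflow: replicated-out plus truncated facts per second
    """
    net = inflow_per_s - outflow_per_s
    if net <= 0:
        return float("inf")
    free = capacity_facts * (1 - fill_percent / 100)
    return free / net / 3600


# e.g. a 1M-fact buffer at 95% fill with a net inflow of 10 facts/s:
# 50,000 free slots / 10 per second ~= 5,000 s ~= 1.4 hours to exhaustion.
```

An estimate in the low hours means the backpressure policy will engage before any fix lands, so mitigation (raising capacity, pausing producers) should start immediately.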
Operational Playbooks
Playbook: Safe Node Restart
Restarting a BSFG node is safe due to durability guarantees:
- Record the current replication lag as a pre-restart baseline
- Stop BSFG service gracefully (send SIGTERM, wait for shutdown)
- Verify service stopped (check process list)
- Start BSFG service (service restart or systemctl restart)
- Wait 30 seconds for startup and peer reconnection
- Check replication lag metric: should return to baseline within 1 minute
- Check consumer backlog: should drain as normal
- Check logs for any errors during startup
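The post-restart verification steps amount to polling the lag metric until it returns to the recorded baseline within the one-minute budget. A sketch of that wait loop; `read_lag` is a hypothetical stand-in for a scrape of `bsfg_replication_lag_seconds`.

```python
# Sketch: poll a condition until it holds or a deadline passes.
import time


def wait_until(predicate, timeout_s: float = 60.0,
               interval_s: float = 5.0) -> bool:
    """Poll predicate() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return predicate()  # one final check at the deadline


def verify_restart(read_lag, baseline_s: float,
                   timeout_s: float = 60.0, interval_s: float = 5.0) -> bool:
    # Lag should return to (near) the pre-restart baseline within a minute.
    return wait_until(lambda: read_lag() <= baseline_s + 1.0,
                      timeout_s, interval_s)
```

If `verify_restart` returns False, treat the restart as a failed change: check startup logs and peer connectivity before restarting anything else.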
Playbook: Certificate Rotation (Planned)
Rotate certificates before expiration without service interruption:
- Generate new certificate and key for zone (e.g., enterprise-bsfg)
- Sign certificate with enterprise PKI CA
- Copy certificate and key to BSFG node(s)
- Reload BSFG configuration (SIGHUP or restart)
- Verify certificate loaded: check certificate_expiry_seconds metric
- Test peer connectivity: RPC calls succeed
- Archive old certificate for audit trail
- Document rotation in change log
Playbook: Emergency Certificate Rotation (Key Compromise)
If a node's private key is compromised, rotate immediately:
- Generate new certificate and key immediately
- Copy new cert/key to node
- Restart BSFG service (forces reconnection with new certificate)
- Verify peers accept new certificate (check TLS handshake success)
- Revoke old certificate in PKI system
- Destroy old private key securely
- Alert security team and document incident
Playbook: Rolling Upgrade (Multi-Node HA)
Upgrade BSFG binary or dependencies with zero downtime:
- Plan upgrade during low-traffic period (if possible)
- Drain traffic from node 1 (stop accepting new RPC calls)
- Wait 30 seconds for in-flight RPC to complete
- Stop node 1 gracefully
- Upgrade binary and dependencies on node 1
- Start node 1, wait for peer reconnection (30 seconds)
- Verify replication resumes: check lag metric
- Re-enable traffic to node 1
- Repeat for nodes 2, 3, etc. (one at a time)
Monitoring Dashboard Recommendations
A comprehensive Grafana dashboard should include:
- Replication Health: Lag trend, frontier progress, fetch/confirm rates
- Ingestion Health: Append rate, append failures, buffer fill
- Consumer Health: Backlog trend, lag trend, confirm rate
- Artifact Status: Upload/retrieval counts and failures, storage usage
- System Health: Node up/down, certificate expiry, TLS errors, RPC latency
- Logs: Recent errors, warnings, and notable events
Pre-Deployment Checklist
- ☐ Monitoring stack deployed (Prometheus, Grafana, or equivalent)
- ☐ Metrics endpoint configured on each BSFG node
- ☐ Prometheus scrape jobs defined for all zones
- ☐ Grafana dashboards created for key metrics
- ☐ Alert rules configured with appropriate thresholds
- ☐ Alert destinations configured (PagerDuty, Slack, email, etc.)
- ☐ Log aggregation pipeline deployed
- ☐ Structured logging enabled in BSFG configuration
- ☐ Log retention policy defined (at least 30 days)
- ☐ Audit logging enabled (all RPC operations logged)
- ☐ Runbooks documented for common scenarios
- ☐ On-call rotation and escalation procedures defined
- ☐ Incident response procedures documented
- ☐ Monitoring tools access and authentication configured
Cross-Links to Related Documentation
- Operations Runbook — Detailed failure modes and recovery
- Enterprise + IDMZ + 2 Plants — Deployment reference
- Peer Replication — Understanding replication protocol
- Identity Model — Certificate management
- Network Policy — Connectivity and firewall