Audience: Operators, SREs, platform engineers. Use: Understand the monitoring model, alerting signals, and operational visibility expectations.
Overview
BSFG deployments require operational discipline to maintain reliability in industrial environments. Operators must be able to detect issues, understand system state, and perform safe upgrades without data loss.
This guide provides practical procedures for:
- Monitoring replication health and consumer lag
- Detecting and diagnosing operational failures
- Responding to incidents with confidence
- Performing safe node restarts and upgrades
- Auditing cross-zone synchronization
Monitoring Architecture
Typical BSFG monitoring architecture involves:
BSFG Node
├── Metrics (Prometheus format)
│   └→ Prometheus Server
│       └→ Grafana Dashboards
└── Logs (structured JSON or syslog)
    └→ Log Aggregation (ELK, OpenSearch, Splunk, Loki)
Components
- BSFG Node: Emits metrics and logs during operation
- Metrics Exporter: Exposes metrics in Prometheus format on `/metrics` endpoint
- Prometheus: Scrapes metrics at regular intervals (e.g., every 15 seconds)
- Grafana: Visualizes metrics in dashboards, defines alerts
- Log Aggregation: Centralizes logs from all zones for analysis and audit
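The metrics endpoint in this architecture can be sketched with the standard library alone: render gauges and counters in the Prometheus text exposition format and serve them on `/metrics`. The metric names come from the tables below; the values, the `serve` helper, and port 9100 are illustrative assumptions, not part of BSFG itself.

```python
# Minimal sketch of a Prometheus /metrics endpoint, stdlib only.
# Metric names follow this guide; values here are placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(values: dict) -> str:
    """Render a flat dict of metric name -> value as Prometheus text."""
    lines = [f"{name} {value}" for name, value in sorted(values.items())]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics({
            "bsfg_node_up": 1,
            "bsfg_replication_lag_seconds": 0.4,
            "bsfg_store_buffer_fill_percent": 45,
        }).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9100) -> None:
    """Block forever serving /metrics (call from a main entry point)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

In practice the node would export its real counters here and a Prometheus scrape job would target this port; a production exporter would also emit `# HELP`/`# TYPE` comments and labels.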
Core Metrics
BSFG should expose metrics in Prometheus format. Key metrics to monitor:
Replication Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_replication_lag_seconds | Gauge | Seconds between fact appended to ISB and confirmed at IFB | < 1 second |
| bsfg_replication_lag_high_watermark | Gauge | Highest replication lag observed (for alerting on spikes) | < 5 seconds |
| bsfg_frontier_offset | Gauge | Current frontier (highest contiguous committed offset) | Increasing monotonically |
| bsfg_fetch_requests_total | Counter | Total FetchFacts RPC calls (peer-to-peer) | Steady or increasing |
| bsfg_confirm_requests_total | Counter | Total ConfirmReceipt RPC calls | Matches fetch activity |
Ingestion Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_append_requests_total | Counter | Total AppendFact RPC calls (producer ingestion) | Matches producer throughput |
| bsfg_append_failures_total | Counter | AppendFact failures (retryable or permanent) | Zero or very low |
| bsfg_store_buffer_fill_percent | Gauge | Percentage of ISB/ESB capacity used | < 50% |
| bsfg_append_latency_seconds | Histogram | Time to append fact (p50, p95, p99) | p99 < 100ms |
Consumer Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_consumer_backlog_size | Gauge | Unconfirmed facts waiting for consumer processing | < 100 facts |
| bsfg_consumer_lag_seconds | Gauge | Age of oldest unconfirmed fact at IFB/EFB | < 10 seconds |
| bsfg_confirm_rate_per_second | Gauge | Facts confirmed per second (consumer throughput) | Steady, matching producer rate |
Artifact Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_artifact_upload_total | Counter | Total PutObject calls (artifact uploads) | Matches producer artifact activity |
| bsfg_artifact_upload_failures | Counter | Failed artifact uploads | Zero or very low |
| bsfg_artifact_retrieval_total | Counter | Total GetObject calls (artifact downloads) | Matches consumer artifact activity |
| bsfg_artifact_retrieval_failures | Counter | Failed artifact retrievals (missing or inaccessible) | Zero |
| bsfg_object_store_usage_bytes | Gauge | Artifact storage used by bucket | Trending within capacity |
System Metrics
| Metric Name | Type | Description | Healthy Range |
|---|---|---|---|
| bsfg_node_up | Gauge | Node health indicator (1 = up, 0 = down) | 1 (up) |
| bsfg_certificate_expiry_seconds | Gauge | Seconds until mTLS certificate expires | > 2,592,000 (30 days) |
| bsfg_tls_handshake_errors_total | Counter | mTLS handshake failures (peer auth failure) | Zero |
| bsfg_rpc_latency_seconds | Histogram | RPC call duration (AppendFact, FetchFacts, etc.) | p99 < 500ms |
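A quick health probe can scrape `/metrics`, parse the simple unlabeled series, and compare them against the healthy ranges above. This is a stdlib-only sketch; the probe URL and the choice of which ranges to check are assumptions, and labeled series (histograms, quantiles) are deliberately skipped.

```python
# Sketch of a health probe: scrape, parse, and check healthy ranges.
import urllib.request


def parse_metrics(text: str) -> dict:
    """Parse 'name value' lines; skips comments and labeled series."""
    metrics = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        if "{" in name:  # labeled series (histograms etc.) not handled here
            continue
        try:
            metrics[name] = float(value)
        except ValueError:
            continue
    return metrics


def check_health(metrics: dict) -> list:
    """Return human-readable problems per the healthy ranges above."""
    problems = []
    if metrics.get("bsfg_replication_lag_seconds", 0) >= 1:
        problems.append("replication lag above 1s healthy range")
    if metrics.get("bsfg_store_buffer_fill_percent", 0) >= 50:
        problems.append("store buffer fill above 50% healthy range")
    if metrics.get("bsfg_node_up", 1) != 1:
        problems.append("node reports down")
    return problems


def probe(url: str = "http://localhost:9100/metrics") -> list:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return check_health(parse_metrics(resp.read().decode()))
```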
Alerting Rules
Configure alerts based on these thresholds. Adjust thresholds per deployment based on expected throughput and latency.
| Condition | Threshold | Severity | Action |
|---|---|---|---|
| Replication lag > 5 seconds | 5 sec | Warning | Check network, verify peer connectivity |
| Replication lag > 30 seconds | 30 sec | Critical | Check for network partition, node failure, or consumer backlog |
| Consumer backlog > 1000 facts | 1000 | Warning | Check consumer health; may be processing slowly or stalled |
| Consumer lag > 60 seconds | 60 sec | Critical | Consumer is severely behind; investigate failure |
| Store buffer fill > 80% | 80% | Warning | Buffer approaching capacity; check replication and consumer progress |
| Store buffer fill > 95% | 95% | Critical | Buffer near exhaustion; risk of producer backpressure or data loss |
| Append failures > 0 (per minute) | > 0 | Warning | Producer experiencing errors; check why AppendFact is failing |
| Artifact retrieval failures > 0 | > 0 | Critical | Consumer cannot retrieve artifact; storage issue or missing artifact |
| TLS handshake errors > 0 | > 0 | Critical | Peer authentication failing; check certificates and CA trust |
| Certificate expiry in < 30 days | 30 days | Warning | Begin certificate renewal process |
| Node down (bsfg_node_up = 0) | N/A | Critical | Node is unreachable; check health, restart if necessary |
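The alert table can be encoded as data so thresholds live in one tunable place, with the highest-severity rule per metric winning (a lag of 40 s should fire the critical alert, not both). This is a sketch of that evaluation logic under the default thresholds above; wiring the fired alerts to a pager or chat channel is left out.

```python
# Alert rules as data: (metric, threshold, severity, action).
# Per metric, rules are ordered highest threshold first so the most
# severe matching rule fires and lower ones are suppressed.
RULES = [
    ("bsfg_replication_lag_seconds", 30, "critical",
     "check for network partition, node failure, or consumer backlog"),
    ("bsfg_replication_lag_seconds", 5, "warning",
     "check network, verify peer connectivity"),
    ("bsfg_consumer_backlog_size", 1000, "warning", "check consumer health"),
    ("bsfg_consumer_lag_seconds", 60, "critical", "consumer severely behind"),
    ("bsfg_store_buffer_fill_percent", 95, "critical", "buffer near exhaustion"),
    ("bsfg_store_buffer_fill_percent", 80, "warning", "buffer approaching capacity"),
]


def evaluate(metrics: dict) -> list:
    """Return (metric, severity, action) for each fired alert."""
    fired, suppressed = [], set()
    for name, threshold, severity, action in RULES:
        if name in suppressed:
            continue  # a more severe rule for this metric already fired
        if metrics.get(name, 0) > threshold:
            fired.append((name, severity, action))
            suppressed.add(name)
    return fired
```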
Log Structure
BSFG nodes should emit structured logs for observability. Logs should include:
Log Fields
- timestamp: RFC3339 format (e.g., 2026-03-06T14:30:00Z)
- level: INFO, WARN, ERROR, FATAL
- message_id: Unique identifier for the fact (if applicable)
- from_zone: Source zone (producer or peer)
- to_zone: Destination zone (consumer or peer)
- operation: AppendFact, FetchFacts, ConfirmReceipt, PutObject
- predicate: Fact predicate (if applicable)
- result: success, retryable_error, permanent_error
- error_message: Human-readable error (if error)
- duration_ms: Operation latency
Example Log Entry
{
  "timestamp": "2026-03-06T14:30:45.123Z",
  "level": "INFO",
  "zone": "enterprise-bsfg",
  "message_id": "msg_abc123def456",
  "operation": "AppendFact",
  "predicate": "order_created",
  "result": "success",
  "duration_ms": 12,
  "buffer_fill_percent": 45
}
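Log lines in this shape can be produced with the standard `logging` module and a JSON formatter that merges structured fields passed via `extra=`. A minimal sketch, assuming the zone name is injected as a constant and only a subset of the fields above; a real implementation would also carry `from_zone`, `to_zone`, and error fields.

```python
# Sketch of a JSON formatter for structured BSFG-style logs.
import json
import logging
import time

STRUCTURED_FIELDS = ("message_id", "operation", "predicate",
                     "result", "error_message", "duration_ms")


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname,
            "zone": "enterprise-bsfg",  # example value; inject per node
            "message": record.getMessage(),
        }
        # Fields passed via logger.info(..., extra={...}) land on the record.
        for field in STRUCTURED_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


logger = logging.getLogger("bsfg")
_handler = logging.StreamHandler()
_handler.setFormatter(JsonFormatter())
logger.addHandler(_handler)
logger.setLevel(logging.INFO)

logger.info("fact appended", extra={
    "message_id": "msg_abc123def456",
    "operation": "AppendFact",
    "predicate": "order_created",
    "result": "success",
    "duration_ms": 12,
})
```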
Troubleshooting Scenarios
Replication Has Stopped (Lag Growing)
Symptom: Replication lag increases continuously and facts stop flowing between zones.
Diagnosis:
- Check replication lag metric: is it > 30 seconds?
- Check consumer backlog: is it growing?
- Check network connectivity: ping between BSFG nodes
- Check firewall rules: is traffic allowed on RPC port (9443)?
- Check certificates: are they valid and trusted?
- Check logs for TLS errors or connection timeouts
Resolution:
- If network unreachable: restore connectivity, replication resumes automatically
- If certificate expired: rotate certificate, restart node
- If consumer backlog growing: restart consumer, monitor confirmation rate
- If no obvious cause: restart the BSFG node (durability guarantees mean no data is lost)
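The connectivity step above can be scripted: verify TCP reachability of each peer on the RPC port (9443, per this guide) before digging into certificates or logs. The peer hostnames here are hypothetical examples.

```python
# Sketch of a peer-reachability check on the BSFG RPC port.
import socket

RPC_PORT = 9443


def check_tcp(host: str, port: int = RPC_PORT, timeout: float = 3.0) -> bool:
    """True if a TCP connection to the peer's RPC port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # DNS failure, refused, or timeout
        return False


def report(peers) -> dict:
    """Map each peer hostname to its reachability."""
    return {peer: check_tcp(peer) for peer in peers}
```

A reachable port with continued TLS errors in the logs points at certificates rather than the network, which narrows the diagnosis quickly.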
Consumer Backlog Growing (Not Draining)
Symptom: Consumer backlog continuously increases; facts are fetched but not confirmed.
Diagnosis:
- Check confirm rate: is it zero or very low?
- Check consumer process: is it running or hung?
- Check logs for consumer errors or exceptions
- Check that the consumer is processing idempotently (it should skip duplicates rather than stall on them)
Resolution:
- Restart consumer process
- Review consumer code for blocking operations or deadlocks
- Verify consumer can write to its destination (database, file system)
- Monitor confirm rate recovery after restart
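The idempotency point above is the usual culprit: a consumer that refuses to confirm a duplicate never drains its backlog. A minimal sketch of a drain loop that processes each fact at most once (keyed by `message_id`) but confirms every fetched fact; `handle` and `confirm` are hypothetical stand-ins for the consumer's real processing and ConfirmReceipt calls.

```python
# Sketch of an idempotent consumer drain loop.
def drain(facts, handle, confirm, seen: set) -> int:
    """Process unconfirmed facts; return how many were newly processed.

    facts:   iterable of fact dicts with a "message_id" key
    handle:  callable applied once per unique fact
    confirm: callable invoked for every fact, duplicates included
    seen:    durable set of already-processed message_ids
    """
    processed = 0
    for fact in facts:
        mid = fact["message_id"]
        if mid not in seen:  # skip duplicates instead of stalling on them
            handle(fact)
            seen.add(mid)
            processed += 1
        confirm(mid)  # always confirm, so the backlog keeps draining
    return processed
```

In a real consumer the `seen` set would be persisted alongside the consumer's output (e.g. in the destination database transaction) so a restart does not reprocess facts.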
Artifact Retrieval Failures
Symptom: GetObject calls fail with "not found" or "access denied" errors.
Diagnosis:
- Check artifact reference in fact: bucket, key, digest
- Check object store: does the bucket exist? Is the key present?
- Check storage access: can BSFG node read from the bucket?
- Check retention policy: has the artifact been garbage collected?
Resolution:
- If artifact missing: verify producer uploaded it correctly
- If access denied: check IAM permissions for BSFG node
- If storage offline: restore storage, consumer retries automatically
- If artifact expired: extend the retention policy so artifacts outlive the facts that reference them
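When the object is present but suspect, verifying it against the digest in the fact's artifact reference separates corruption from access problems. A sketch, assuming the reference carries a hex-encoded SHA-256 digest (the actual digest algorithm is deployment-specific).

```python
# Sketch of artifact integrity verification against a fact's reference.
import hashlib


def verify_artifact(data: bytes, expected_digest: str) -> bool:
    """Compare the SHA-256 of retrieved bytes to the fact's digest.

    Assumes a hex-encoded SHA-256 digest in the artifact reference;
    substitute the algorithm your deployment actually records.
    """
    return hashlib.sha256(data).hexdigest() == expected_digest
```

A mismatch here means the producer uploaded different bytes than the fact describes, or the object was overwritten, which is a different incident than a missing or inaccessible object.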
TLS Handshake Failures
Symptom: TLS handshake errors in logs, replication unable to establish connections.
Diagnosis:
- Check certificate validity: is it expired?
- Check certificate CN: does it match the peer's zone identity?
- Check CA trust: is the certificate CA trusted by peers?
- Check certificate chain: is the full chain available?
Resolution:
- Renew certificate if expired
- Verify CN matches expected zone identity
- Ensure all peers have updated CA root certificate
- Restart BSFG node to reload certificates
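The expiry check above pairs naturally with the 30-day renewal threshold from the alert table. Python's `ssl.cert_time_to_seconds` parses the `notAfter` timestamp format returned by both the `ssl` module's `getpeercert()` and `openssl x509 -enddate`, so the check reduces to date arithmetic; this is a sketch, not a full chain or CN validator.

```python
# Sketch: days until certificate expiry vs. the 30-day renewal threshold.
import ssl
import time

RENEWAL_THRESHOLD_DAYS = 30


def days_until_expiry(not_after: str, now=None) -> float:
    """not_after in notAfter format, e.g. 'Jun 15 12:00:00 2027 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    current = now if now is not None else time.time()
    return (expiry - current) / 86400


def needs_renewal(not_after: str) -> bool:
    """True once the certificate is inside the renewal window (or expired)."""
    return days_until_expiry(not_after) < RENEWAL_THRESHOLD_DAYS
```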
Store Buffer Exhaustion
Symptom: Store buffer fill > 95%, risk of backpressure or data loss.
Diagnosis:
- Check replication lag: facts being produced faster than replicated?
- Check consumer backlog: facts being replicated faster than consumed?
- Check retention policy: TTL too long, facts not being truncated?
Resolution:
- Increase buffer capacity if throughput permanently increased
- Reduce TTL if facts are being retained longer than needed
- Tune replication and consumer concurrency
- Verify backpressure policy (reject vs. drop oldest) is appropriate for SIL level
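During diagnosis it helps to estimate how much time remains before the buffer fills, from the current fill level and the net inflow rate. A back-of-envelope sketch; the capacity and rates below are illustrative numbers, not BSFG defaults.

```python
# Sketch: time-to-exhaustion estimate for a store buffer.
def hours_to_exhaustion(capacity_facts: float, fill_percent: float,
                        inflow_per_s: float, outflow_per_s: float) -> float:
    """Hours until the buffer hits 100%; inf if draining or stable.

    inflow:  appends plus replicated-in facts per second
    outflow: replicated-out plus truncated facts per second
    """
    net = inflow_per_s - outflow_per_s
    if net <= 0:
        return float("inf")
    free = capacity_facts * (1 - fill_percent / 100)
    return free / net / 3600


# e.g. a 1M-fact buffer at 95% fill with a net inflow of 10 facts/s:
# 50,000 free slots / 10 per second ~= 5,000 s ~= 1.4 hours to exhaustion.
```

An estimate in the low hours means the backpressure policy will engage before any fix lands, so mitigation (raising capacity, pausing producers) should start immediately.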
Operational Playbooks
Playbook: Safe Node Restart
Restarting a BSFG node is safe due to durability guarantees:
- Record the current replication lag as a pre-restart baseline
- Stop BSFG service gracefully (send SIGTERM, wait for shutdown)
- Verify service stopped (check process list)
- Start BSFG service (service restart or systemctl restart)
- Wait 30 seconds for startup and peer reconnection
- Check replication lag metric: should return to baseline within 1 minute
- Check consumer backlog: should drain as normal
- Check logs for any errors during startup
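The post-restart verification steps amount to polling the lag metric until it returns to the recorded baseline within the one-minute budget. A sketch of that wait loop; `read_lag` is a hypothetical stand-in for a scrape of `bsfg_replication_lag_seconds`.

```python
# Sketch: poll a condition until it holds or a deadline passes.
import time


def wait_until(predicate, timeout_s: float = 60.0,
               interval_s: float = 5.0) -> bool:
    """Poll predicate() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return predicate()  # one final check at the deadline


def verify_restart(read_lag, baseline_s: float,
                   timeout_s: float = 60.0, interval_s: float = 5.0) -> bool:
    # Lag should return to (near) the pre-restart baseline within a minute.
    return wait_until(lambda: read_lag() <= baseline_s + 1.0,
                      timeout_s, interval_s)
```

If `verify_restart` returns False, treat the restart as a failed change: check startup logs and peer connectivity before restarting anything else.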
Playbook: Certificate Rotation (Planned)
Rotate certificates before expiration without service interruption:
- Generate new certificate and key for zone (e.g., enterprise-bsfg)
- Sign certificate with enterprise PKI CA
- Copy certificate and key to BSFG node(s)
- Reload BSFG configuration (SIGHUP or restart)
- Verify certificate loaded: check certificate_expiry_seconds metric
- Test peer connectivity: RPC calls succeed
- Archive old certificate for audit trail
- Document rotation in change log
Playbook: Emergency Certificate Rotation (Key Compromise)
If a node's private key is compromised, rotate immediately:
- Generate new certificate and key immediately
- Copy new cert/key to node
- Restart BSFG service (forces reconnection with new certificate)
- Verify peers accept new certificate (check TLS handshake success)
- Revoke old certificate in PKI system
- Destroy old private key securely
- Alert security team and document incident
Playbook: Rolling Upgrade (Multi-Node HA)
Upgrade BSFG binary or dependencies with zero downtime:
- Plan upgrade during low-traffic period (if possible)
- Drain traffic from node 1 (stop accepting new RPC calls)
- Wait 30 seconds for in-flight RPC to complete
- Stop node 1 gracefully
- Upgrade binary and dependencies on node 1
- Start node 1, wait for peer reconnection (30 seconds)
- Verify replication resumes: check lag metric
- Re-enable traffic to node 1
- Repeat for nodes 2, 3, etc. (one at a time)
Monitoring Dashboard Recommendations
A comprehensive Grafana dashboard should include:
- Replication Health: Lag trend, frontier progress, fetch/confirm rates
- Ingestion Health: Append rate, append failures, buffer fill
- Consumer Health: Backlog trend, lag trend, confirm rate
- Artifact Status: Upload/retrieval counts and failures, storage usage
- System Health: Node up/down, certificate expiry, TLS errors, RPC latency
- Logs: Recent errors, warnings, and notable events
Pre-Deployment Checklist
- ☐ Monitoring stack deployed (Prometheus, Grafana, or equivalent)
- ☐ Metrics endpoint configured on each BSFG node
- ☐ Prometheus scrape jobs defined for all zones
- ☐ Grafana dashboards created for key metrics
- ☐ Alert rules configured with appropriate thresholds
- ☐ Alert destinations configured (PagerDuty, Slack, email, etc.)
- ☐ Log aggregation pipeline deployed
- ☐ Structured logging enabled in BSFG configuration
- ☐ Log retention policy defined (at least 30 days)
- ☐ Audit logging enabled (all RPC operations logged)
- ☐ Runbooks documented for common scenarios
- ☐ On-call rotation and escalation procedures defined
- ☐ Incident response procedures documented
- ☐ Monitoring tools access and authentication configured
Cross-Links to Related Documentation
- Operations Runbook — Detailed failure modes and recovery
- Enterprise + IDMZ + 2 Plants — Deployment reference
- Peer Replication — Understanding replication protocol
- Identity Model — Certificate management
- Network Policy — Connectivity and firewall