Architecture Decision Record

ADR-0018: Operational Visibility Uses Metrics and Structured Logs

Status: Accepted · Date: 2026-03-06

Context

BSFG is a boundary appliance. Its failure modes are operational before they are semantic: append failures, replication lag, confirmation gaps, object-store write issues, mTLS problems, and retention pressure. Operators need enough visibility to answer:

The observability model must therefore support routine operations and incident response without turning BSFG into a telemetry-heavy platform in its own right.

Options Considered

| Option | Description | Benefits | Drawbacks |
|---|---|---|---|
| Logs only | Emit application logs and rely on downstream log search for all operational diagnosis. | | |
| Metrics only | Expose counters, gauges, and histograms, but avoid detailed logs. | | |
| Metrics + logs + full distributed tracing | Adopt tracing across every request, append, fetch, confirm, and object-store operation. | | |
| Metrics + structured logs (Selected) | Expose operational metrics for monitoring and alerting, plus structured logs for event-level diagnosis. | | |

Decision

BSFG will expose metrics and structured logs as its standard operational visibility model.

Metrics are used for health, capacity, throughput, lag, and error-rate monitoring. Structured logs are used for append failures, conflicting duplicates, object-store errors, authorization failures, and operator diagnosis.

Example metrics include:

bsfg_append_rate
bsfg_fetch_rate
bsfg_confirm_rate
bsfg_dedupe_hits
bsfg_conflicting_duplicates
bsfg_replication_lag
bsfg_auth_failures
object_store_put_rate
object_store_put_failures
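
As a rough illustration of how a few of these metrics might be recorded and exposed, the sketch below keeps counters and gauges in a tiny in-process registry and renders them in a Prometheus-style text format. The `Metrics` class, the choice of exposition format, and the unit of `bsfg_replication_lag` are assumptions made for this example; this ADR does not prescribe a metrics library or backend.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Metrics:
    """Hypothetical in-process registry; a real deployment would likely use a metrics library."""

    counters: Dict[str, float] = field(default_factory=dict)
    gauges: Dict[str, float] = field(default_factory=dict)

    def inc(self, name: str, amount: float = 1.0) -> None:
        # Counters only ever go up (dedupe hits, conflicting duplicates, failures).
        if amount < 0:
            raise ValueError("counter increments must be non-negative")
        self.counters[name] = self.counters.get(name, 0.0) + amount

    def set_gauge(self, name: str, value: float) -> None:
        # Gauges hold the latest observed value (for example, current lag).
        self.gauges[name] = value

    def render(self) -> str:
        # Emit a "# TYPE" line followed by "name value" for every series.
        lines = []
        for kind, series in (("counter", self.counters), ("gauge", self.gauges)):
            for name in sorted(series):
                lines.append(f"# TYPE {name} {kind}")
                lines.append(f"{name} {series[name]}")
        return "\n".join(lines) + "\n"


metrics = Metrics()
metrics.inc("bsfg_dedupe_hits")
metrics.inc("bsfg_conflicting_duplicates")
metrics.inc("object_store_put_failures", 2)
metrics.set_gauge("bsfg_replication_lag", 3.5)  # unit (seconds) is an assumption
print(metrics.render())
```

In Prometheus-style systems, rate metrics such as `bsfg_append_rate` are commonly derived by the monitoring backend from monotonically increasing counters rather than computed in the process; whether BSFG does so is not stated here.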

Structured log entries should include enough correlation data to connect an operational event to its semantic context, for example:
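
A minimal sketch of what one such entry could look like, assuming JSON-lines output built on Python's standard `logging` module. The field names (`event`, `topic`, `message_id`, `reason`) and the logger name `bsfg.example` are hypothetical and chosen only for illustration; this ADR does not define them.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Correlation data is passed via extra={"fields": {...}} (hypothetical convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("bsfg.example")
    logger.handlers.clear()      # avoid duplicate handlers on repeated setup
    logger.propagate = False     # keep output confined to our handler
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.error(
    "append_failed",
    # Hypothetical correlation fields linking the failure to its semantic context.
    extra={"fields": {"topic": "orders", "message_id": "m-123", "reason": "disk_full"}},
)
print(buffer.getvalue(), end="")
```

Keeping the event name in a fixed field and the correlation identifiers in separate keys lets downstream log search filter on either without parsing free-form text.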

Full distributed tracing is not part of the baseline architecture, but can be introduced later if operational evidence shows it is necessary.

Consequences

Benefits:

Tradeoffs: