BSFG ADR-0018

Status: Accepted

Date: 2026-03-06

Context

BSFG is a boundary appliance. Its failure modes are operational before they are semantic: append failures, replication lag, confirmation gaps, object-store write issues, mTLS problems, and retention pressure. Operators need enough visibility to answer:

is the boundary service healthy?
are facts being appended and confirmed?
is cross-zone lag growing?
are attachments being stored successfully?
are retries, duplicates, or auth failures increasing?

The observability model must therefore support routine operations and incident response without turning BSFG into a telemetry-heavy platform in its own right.

Options Considered

| Option | Description | Benefits | Drawbacks |
| --- | --- | --- | --- |
| Logs only | Emit application logs and rely on downstream log search for all operational diagnosis. | simple implementation; minimal surface area | weak aggregate visibility; poor alerting basis; lag and throughput trends are hard to see |
| Metrics only | Expose counters, gauges, and histograms, but avoid detailed logs. | good dashboards and alerting; small telemetry volume | poor incident forensics; hard to explain individual failures or rejected operations |
| Metrics + logs + full distributed tracing | Adopt tracing across every request, append, fetch, confirm, and object-store operation. | maximal visibility; strong end-to-end request analysis | heavier operational footprint; more moving parts; overkill for the current appliance scope |
| Metrics + structured logs (Selected) | Expose operational metrics for monitoring and alerting, plus structured logs for event-level diagnosis. | good operational baseline; supports dashboards and alerting; preserves incident-level detail; keeps telemetry surface moderate | correlation across systems is not as rich as full tracing; log structure must be governed consistently |

Decision

BSFG will expose metrics and structured logs as its standard operational visibility model.

Metrics are used for health, capacity, throughput, lag, and error-rate monitoring. Structured logs are used for append failures, conflicting duplicates, object-store errors, authorization failures, and operator diagnosis.

Example metrics include:

bsfg_append_rate
bsfg_fetch_rate
bsfg_confirm_rate
bsfg_dedupe_hits
bsfg_conflicting_duplicates
bsfg_replication_lag
bsfg_auth_failures
object_store_put_rate
object_store_put_failures
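
As a minimal sketch of how such metrics could be held in-process, the example below keeps counters and gauges in a small registry and renders them in a Prometheus-style text format. The ADR does not name a metrics backend or exposition format, so the `MetricsRegistry` class, its methods, and the assumption that `bsfg_replication_lag` is a gauge in seconds are all hypothetical. The rate-style metrics in the list could plausibly be derived downstream from counters like these rather than computed inside BSFG.

```python
import threading
from collections import defaultdict


class MetricsRegistry:
    """Hypothetical in-process registry: cumulative counters plus gauges."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = defaultdict(float)
        self._gauges = {}

    def inc(self, name, amount=1.0):
        # Counters only go up; a monitoring system can derive rates
        # from the difference between successive readings.
        if amount < 0:
            raise ValueError("counter increments must be non-negative")
        with self._lock:
            self._counters[name] += amount

    def set_gauge(self, name, value):
        # Gauges hold the latest observed value and may go up or down.
        with self._lock:
            self._gauges[name] = float(value)

    def render(self):
        """Render all metrics as text, sorted by name within each type."""
        with self._lock:
            lines = []
            for name in sorted(self._counters):
                lines.append(f"# TYPE {name} counter")
                lines.append(f"{name} {self._counters[name]}")
            for name in sorted(self._gauges):
                lines.append(f"# TYPE {name} gauge")
                lines.append(f"{name} {self._gauges[name]}")
        return "\n".join(lines) + "\n"


# Usage: metric names come from the ADR; the values are made up.
registry = MetricsRegistry()
registry.inc("bsfg_dedupe_hits")
registry.inc("object_store_put_failures", 2)
registry.set_gauge("bsfg_replication_lag", 4.5)  # assumed unit: seconds
print(registry.render(), end="")
```

Keeping counters cumulative and leaving rate computation to the monitoring side is one common design choice; it avoids choosing a rate window inside the appliance.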

Structured log entries should include enough correlation data to connect an operational event to its semantic context, for example:

message_id
from_zone
to_zone
stream
subject
predicate
correlation_id
error_code where applicable
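
One way to carry these fields is to emit each log entry as a single JSON object. The sketch below uses Python's standard `logging` module with a custom formatter; the field names are the ones listed above, while the `JsonFormatter` class, the `build_logger` helper, the `level` and `event` keys, and all sample values are assumptions made for illustration rather than part of the ADR.

```python
import json
import logging

# Correlation fields taken from the ADR's list.
CONTEXT_FIELDS = (
    "message_id", "from_zone", "to_zone", "stream",
    "subject", "predicate", "correlation_id", "error_code",
)


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log entry."""

    def format(self, record):
        entry = {"level": record.levelname, "event": record.getMessage()}
        # Values passed via `extra=` become attributes on the record.
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value  # omit fields that do not apply
        return json.dumps(entry, sort_keys=True)


def build_logger(name="bsfg"):
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: every value below is invented for the example.
log = build_logger()
log.warning(
    "append_failed",
    extra={
        "message_id": "m-001",
        "from_zone": "zone-a",
        "to_zone": "zone-b",
        "stream": "facts",
        "correlation_id": "c-42",
        "error_code": "APPEND_TIMEOUT",
    },
)
```

Fixing the set of field names in one place, as `CONTEXT_FIELDS` does here, is one way to keep log structure consistent across call sites.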

Full distributed tracing is not part of the baseline architecture, but can be introduced later if operational evidence shows it is necessary.

Consequences

Benefits:

clear alerting and dashboard foundation
sufficient forensic detail for most incidents
moderate operational complexity for a boundary appliance
separation between aggregate monitoring and per-event diagnosis

Tradeoffs:

end-to-end request reconstruction across multiple systems remains less direct than with tracing
structured logging discipline must be maintained over time
teams may still need to introduce tracing later for specific failure modes
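
To illustrate what reconstruction without tracing can look like, the sketch below groups JSON log lines collected from several systems by `correlation_id` and orders each group by time. It assumes one JSON object per line and a hypothetical `ts` field holding an ISO-8601 UTC timestamp, which is not among the fields the ADR lists; the `group_by_correlation` function is likewise illustrative.

```python
import json
from collections import defaultdict


def group_by_correlation(lines):
    """Group JSON log lines by correlation_id, ordered by `ts` within each group."""
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines instead of failing the whole scan
        if not isinstance(entry, dict):
            continue
        cid = entry.get("correlation_id")
        if cid is None:
            continue  # entries without a correlation_id cannot be joined
        groups[cid].append(entry)
    for entries in groups.values():
        # ISO-8601 UTC strings of the same shape sort correctly as text.
        entries.sort(key=lambda e: e.get("ts", ""))
    return dict(groups)


# Usage: sample lines are invented for the example.
sample = [
    '{"ts": "2026-03-06T10:00:02Z", "correlation_id": "c-42", "event": "confirm_missing"}',
    '{"ts": "2026-03-06T10:00:01Z", "correlation_id": "c-42", "event": "append_ok"}',
    '{"ts": "2026-03-06T10:00:03Z", "correlation_id": "c-43", "event": "auth_failed"}',
    "plain text line that is not JSON",
]
timeline = group_by_correlation(sample)
for entry in timeline["c-42"]:
    print(entry["ts"], entry["event"])
```

Compared with tracing, this depends on every system logging the same `correlation_id` and on comparable clocks, which is part of why reconstruction remains less direct.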
