Concept

Peer Replication Protocol

How BSFG nodes synchronize facts across zones

Audience: Platform engineers, implementers. Use: Understand cross-zone replication flow, cursors, confirmations, and recovery behavior.

Context: Multi-Zone Deployments

Industrial deployments typically span multiple zones separated by trust, network, and operational boundaries:

Enterprise Zone          IDMZ              Plant OT Zone
(IT Systems)        (Mediation Layer)    (Operational Equip)
     │                    │                      │
     ├── BSFG Node        ├── BSFG Node          ├── BSFG Node
     ├── Durable          ├── Durable            ├── Durable
     │   Fact Store       │   Fact Store         │   Fact Store
     └── Apps             └── Gateway            └── PLC/Historian

Each zone runs a local BSFG node with its own durable substrate implementing the BSFG storage ports. Zones are operationally independent: each can survive the failure or disconnection of any other zone.

Yet facts must flow between zones reliably. The mechanism for this flow is the peer replication protocol — a receiver-driven replay mechanism where BSFG nodes pull facts from each other asynchronously, confirm receipt, and advance shared cursors.

Protocol Overview

The peer replication protocol is fundamentally pull-driven. The receiving zone initiates fetch requests; the sending zone returns facts without pushing.

Basic Replication Relationship

┌────────────────────┐
│   Sending Zone     │  (holds authoritative facts)
│   BSFG Node        │
│                    │
│   ISB (store) ─┐   │
│                │   │
│   IFB ─────────┼─→ Cursor Tracker (frontier)
└────────────────┼───┘
                 │
  FetchFacts(consumer_name, limit)
                 │
                 ▼
┌────────────────────┐
│ Receiving Zone     │  (pulls facts)
│ BSFG Node          │
│                    │
│ ESB (store) ───┐   │
│                │   │
│ EFB ───────────┼─→ Cursor Tracker
└────────────────┼───┘
                 │
  ConfirmReceipt(consumer_name, offset)
                 │
                 ▼
         Cursor advances
         (frontier moves)
    

Key Characteristics

  - Receiver-driven: the receiving zone initiates every fetch; the sender never pushes
  - Durable cursors: each replication relationship tracks a confirmed offset that survives restarts
  - At-least-once delivery with idempotent append: duplicates are tolerated, facts are never lost
  - Asynchronous: confirmation means durable receipt, not consumer processing
  - Independent relationships: each sender/receiver pair can fail and recover on its own

Replication Loop: Step by Step

The replication loop runs continuously or on-demand. Each iteration moves facts from sender to receiver.

Step 1: Receiver Requests New Facts

Receiver calls:
  FetchFacts(
    consumer_name: "plant-a-receiver",
    limit: 100
  )

Parameters:
  consumer_name — identifies the replication relationship
  limit — batch size (e.g., 100 facts per fetch)
    

The consumer name is durable. It tells the sender which cursor position to use. The sender looks up the last confirmed offset for this consumer and returns facts from last_confirmed_offset + 1 onward.
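Assuming an in-memory stand-in for the sender's store buffer and cursor tracker (class and method names here are illustrative, not the BSFG API), the cursor lookup behind a fetch might look like:

```python
class SenderNode:
    """Illustrative sketch of the sending zone's fetch path."""

    def __init__(self):
        self.buffer = []    # store buffer: list index == offset
        self.cursors = {}   # consumer_name -> last confirmed offset

    def append_fact(self, fact):
        # The store buffer assigns immutable, sequential offsets.
        self.buffer.append(fact)
        return len(self.buffer) - 1

    def fetch_facts(self, consumer_name, limit):
        # Resume from last_confirmed_offset + 1 (-1 means "never confirmed").
        start = self.cursors.get(consumer_name, -1) + 1
        # An empty result signals "no new facts at this moment".
        return [(start + i, fact)
                for i, fact in enumerate(self.buffer[start:start + limit])]
```

Note that the sender holds no per-request state: the durable cursor alone determines where the next batch starts.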

Step 2: Sender Returns Facts

Sender responds:
  batch = [
    {offset: 100, fact: {...}, envelope: {...}},
    {offset: 101, fact: {...}, envelope: {...}},
    ...
    {offset: 199, fact: {...}, envelope: {...}}
  ]

Response includes:
  offsets — immutable, assigned by sender's store buffer
  facts — the appended messages
  envelopes — metadata, from_zone, message_id, etc.
    

The sender returns up to limit facts. If fewer than limit facts are available, it returns all available facts. An empty batch signals "no new facts at this moment."

Step 3: Receiver Appends Facts Locally

For each fact in batch:
  AppendFact(
    envelope: {...},
    fact: {...}
  ) → local_offset

Receiver's local append:
  - facts go into receiver's store buffer (ESB/ISB)
  - offsets are assigned locally (independent of sender's offsets)
  - facts are durably persisted
    

The receiver's local offsets are independent of the sender's offsets. The mapping between sender offset and receiver offset is tracked internally (or not at all — only the sender's offset matters for cursor advancement).
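A minimal sketch of the receiver's local append, under the assumption of a plain in-memory buffer (names are illustrative, not the BSFG API). The sender's offset travels along in the envelope but plays no role in where the fact lands locally:

```python
class ReceiverStore:
    """Sketch of the receiving zone's local append."""

    def __init__(self):
        self.facts = []   # local store buffer; list index == local offset

    def append_fact(self, envelope, fact):
        # Offsets are assigned locally; the sender's offset (carried in the
        # envelope) is only used later, when confirming receipt.
        self.facts.append((envelope, fact))
        return len(self.facts) - 1
```

A batch fetched at sender offsets 100-102 lands at local offsets 0-2; only the sender offset 102 is reported back in the confirmation.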

Step 4: Receiver Confirms Receipt

After all facts in batch are durably appended:
  ConfirmReceipt(
    consumer_name: "plant-a-receiver",
    offset: 199  — highest offset from the batch
  ) → {cursor_advanced_to: 199}

Confirmation includes:
  consumer_name — identifies the relationship
  offset — highest offset successfully appended locally
    

Confirmation is the critical signal. It tells the sender: "I have received and durably stored facts up to offset 199."

Step 5: Sender Advances Frontier

Sender updates cursor tracker:
  Cursor(consumer_name="plant-a-receiver") = 199

Frontier semantics:
  highest_contiguous_committed_offset = 199

Sender can now:
  - truncate facts 0-199 from its store buffer (if TTL allows)
  - reduce storage pressure
  - know which facts have been safely replicated
    
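The truncation decision can be sketched with a hypothetical helper (a real sender would also take the minimum frontier across all consumers and honor TTL before deleting anything):

```python
def truncate_to_frontier(buffer, base_offset, frontier):
    """Drop facts whose offset <= frontier; return the new base offset.

    `buffer` holds facts starting at `base_offset` (index 0 == base_offset).
    Illustrative sketch only: TTL checks and multi-consumer minimums omitted.
    """
    drop = max(frontier + 1 - base_offset, 0)
    del buffer[:drop]
    return base_offset + drop
```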

Step 6: Loop Continues

The receiver calls FetchFacts again. The sender returns facts from offset 200 onward. The loop repeats until there are no more facts to replicate.
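Steps 1 through 6 can be condensed into one hedged sketch, with plain Python lists and dicts standing in for the durable substrate:

```python
def replicate_once(sender_buffer, cursors, receiver_store, consumer_name, limit=100):
    """Run one fetch -> append -> confirm iteration; return facts moved."""
    # Steps 1-2: fetch from last_confirmed_offset + 1
    start = cursors.get(consumer_name, -1) + 1
    batch = sender_buffer[start:start + limit]
    if not batch:
        return 0                        # Step 6: caught up; try again later
    # Step 3: append locally (the receiver assigns its own offsets)
    receiver_store.extend(batch)
    # Steps 4-5: confirm the highest sender offset; the frontier advances
    cursors[consumer_name] = start + len(batch) - 1
    return len(batch)
```

Running this in a loop until it returns 0 drains the sender's backlog in `limit`-sized batches.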

Cursor Semantics and Frontier Rules

Cursors are the mechanism that enables safe, deterministic replication. Each replication relationship maintains two related positions:

Cursor Position (Receiving Zone View)

Receiver tracks:
  durable_consumer_position = last_confirmed_offset

After confirming offset 199:
  receiver next calls: FetchFacts(consumer_name="plant-a-receiver", limit=100)
  sender will return: facts from offset 200 onward
    

Frontier Position (Sending Zone View)

Sender tracks (per receiving zone):
  highest_contiguous_committed_offset = 199

This frontier is used for:
  - truncation decisions (can delete offsets 0-199)
  - replication progress monitoring
  - recovery checkpoint
    

Contiguous Frontier Rule (ADR-0004)

The frontier advances only over contiguous confirmed offsets. If the receiver confirms offsets 100, 102, and 103 but crashes before confirming 101, the frontier stays at 100; it cannot move past the gap until 101 is confirmed.

Sender store buffer: [0, 1, 2, ..., 100, 101, 102, 103]
Confirmations:       [Y, Y, Y, ..., Y,   N,   Y,   Y  ]  (101 missing)
Frontier:            100  (stops at first gap)

After receiver retries and confirms 101:
Frontier:            103  (advances to contiguous prefix 0-103)
    

This rule ensures that truncation is safe: we only delete facts that have been confirmed by all consumers, in order.
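The contiguous-frontier rule reduces to finding the end of the longest confirmed prefix. A hedged sketch (function name assumed, and a set standing in for the sender's confirmation tracking):

```python
def frontier(confirmed, start=0):
    """Highest contiguous confirmed offset, or start - 1 if none.

    `confirmed` is the set of offsets individually confirmed by the
    receiver; the frontier stops at the first gap (ADR-0004).
    """
    offset = start
    while offset in confirmed:
        offset += 1
    return offset - 1
```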

Failure Recovery

The peer replication protocol is designed to tolerate failures. Recovery is deterministic and requires no external orchestration.

Network Partition

Scenario: Receiving zone loses connectivity to sending zone

Behavior:
  - FetchFacts calls timeout
  - Receiver stops pulling facts
  - Sending zone buffer accumulates facts
  - Receiver continues locally (no blocking)

Duration: Minutes to hours

Recovery:
  1. Connectivity restored
  2. Receiver retries FetchFacts
  3. Sender returns facts from (last_confirmed_offset + 1) onward
  4. Receiver re-appends facts (idempotent)
  5. Confirmations resume
  6. Frontier advances
    
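Recovery needs no special-case logic on the receiver: the same fetch, retried, resumes at the right place because the sender consults the durable cursor. A sketch of a retry wrapper (illustrative; the error type and backoff values are assumptions):

```python
import time

def fetch_with_retry(fetch, consumer_name, limit, retries=5, initial_backoff_s=0.01):
    """Retry a FetchFacts-style call across a transient partition.

    `fetch` is any callable that raises ConnectionError while the link is
    down; after reconnection, a retry returns exactly the facts the
    receiver is missing, since the sender resumes from the durable cursor.
    """
    delay = initial_backoff_s
    for attempt in range(retries):
        try:
            return fetch(consumer_name, limit)
        except ConnectionError:
            if attempt == retries - 1:
                raise                  # partition outlasted the retry budget
            time.sleep(delay)
            delay *= 2                 # exponential backoff
```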

Receiver (BSFG Node) Crash

Scenario: Receiving zone BSFG node crashes during a fetch/confirm cycle

Behavior:
  - Receiver restarts
  - The last confirmed offset is recovered from the durable consumer state
  - Receiver resumes fetching from (last_confirmed_offset + 1)
  - Some facts in new batch may have been partially appended
  - Re-append them (idempotent append prevents duplicates)
  - Confirm again
    

Sender (BSFG Node) Crash

Scenario: Sending zone BSFG node crashes with unconfirmed facts in buffer

Behavior:
  - Sender restarts
  - Store buffer is recovered from durable storage
  - Cursor tracker is recovered
  - Receiver calls FetchFacts
  - Sender returns unconfirmed facts (from last confirmed offset + 1 onward)
  - Receiver appends, confirms
  - Frontier advances normally
    

Consumer Backlog Accumulation

Scenario: Receiving zone consumer processes facts very slowly

Behavior:
  - Sender accumulates unconfirmed facts
  - Buffer fill percentage increases
  - Replication lag grows
  - Receiving zone continues operating independently

Duration: Hours to days (slow consumer)

Recovery:
  - Consumer accelerates
  - Confirmations resume
  - Frontier advances
  - Sending zone buffer drains
    

Delivery Guarantees

The peer replication protocol provides clear, testable guarantees about how facts move between zones.

At-Least-Once Delivery

Every fact successfully appended to the sender's store buffer will be delivered at least once to the receiver. Facts are never lost due to network partition or sender/receiver restart.

Idempotent Append

If the same fact is sent twice (due to receiver retry, sender recovery), the receiver's append is idempotent: the fact is stored once.

Sender offset 100 contains fact F with message_id "evt-123"

Scenario A: Normal flow
  Receiver fetches offset 100 → appends F locally with message_id "evt-123"
  Receiver confirms offset 100

Scenario B: Receiver crashes and retries
  Receiver recovers from durable consumer → position is offset 99
  Receiver fetches from offset 100 again → gets F again
  Receiver tries to append F with message_id "evt-123"
  Forward buffer's putIfAbsent rejects it (already exists)
  Receiver confirms offset 100 again
  Result: ONE copy of F in receiver's buffer, not two
    
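The putIfAbsent step in Scenario B can be sketched as follows (hypothetical helper; the real forward buffer is a durable structure, not a dict):

```python
def put_if_absent(index, buffer, message_id, fact):
    """Store `fact` once per message_id; return (local_offset, newly_stored)."""
    if message_id in index:
        return index[message_id], False   # duplicate delivery: keep one copy
    buffer.append(fact)
    index[message_id] = len(buffer) - 1
    return index[message_id], True
```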

Eventual Consistency

All replication relationships converge to the same state given enough time and connectivity.

No Exactly-Once

BSFG does not guarantee exactly-once delivery. Receivers must tolerate and handle duplicate delivery by implementing idempotent processing (deduplication by message_id or natural key).

No Synchronous Confirmation

FetchFacts and ConfirmReceipt are asynchronous. Confirmation does not mean the receiving zone's consumer has processed the facts, only that they are durably received.

Multi-Zone Topology

In a three-zone deployment, replication relationships form a network:

Enterprise ↔ IDMZ ↔ Plant A
    └────(optional direct)───┘

Replication relationships:
  - Enterprise ↔ IDMZ (either zone may pull from the other)
  - IDMZ ↔ Plant A
  - (optional) Enterprise ↔ Plant A (direct, bypassing IDMZ)

Each relationship is independent:
  - separate cursors
  - separate confirmation state
  - can fail/recover separately
    

Performance Characteristics

Under healthy, normal operation:
  - replication lag is bounded by the fetch interval and the batch limit
  - confirmations advance the frontier promptly, so the sender's buffer stays shallow
  - an empty fetch response is cheap and simply means "caught up"

Under partition or overload:
  - unconfirmed facts accumulate in the sender's store buffer (fill percentage rises)
  - replication lag grows until connectivity or consumer throughput recovers
  - zones continue operating independently; producers are never blocked on replication

Replication vs. Consumption

It's important to distinguish between replication (zone-to-zone) and consumption (application processing):

  Replication — moves facts between BSFG nodes via FetchFacts/ConfirmReceipt; progress is tracked by the sender's frontier for each receiving zone.

  Consumption — an application processes facts already stored in its local zone; progress is tracked by that consumer's own cursor.

Replication and consumption are decoupled. A fact may be replicated to the receiving zone long before any consumer processes it. This allows producers to complete without waiting for consumer processing.
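A tiny sketch of the decoupling (all names illustrative): the replication position and a consumer's position advance independently over the same local buffer.

```python
# Facts already replicated into this zone's local buffer.
local_buffer = ["f0", "f1", "f2", "f3"]

# Replication is done: all four facts are durably received and confirmed.
replication_position = 3

# The application consumer has only processed f0 so far.
consumer_position = 0

# Work still pending for the consumer, even though replication is complete.
backlog = local_buffer[consumer_position + 1:]
```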

Related Concepts

The peer replication protocol builds on several foundational BSFG concepts:

  - durable fact stores and the storage ports each zone's node implements
  - store buffers (ISB/ESB) and forward buffers (IFB/EFB)
  - durable consumers and cursor semantics
  - the contiguous frontier rule (ADR-0004)