HELIX OS — Zero-Copy Memory Architecture

Technical Specification: Binary Interface Between Rust Core, GPU, and Disk

Document ID: HELIX-ARCH-0002
Status: Normative
Revision: 1.4.0
Authors: Systems Architecture Group, HELIX OS
Applies To: helix-core ≥ 2.4, helix-hal ≥ 2.4, helix-gpu ≥ 2.4

Scope. This document defines the complete binary interface for the HELIX OS zero-copy data pipeline. It is the authoritative specification for memory layout, alignment requirements, atomic coordination logic, GPU buffer interop, and page-lifecycle safety. Any implementation that deviates from this specification without a corresponding revision to this document is non-conformant and may produce silent data corruption, audit ledger gaps, or undefined GPU behavior.

This is not a tutorial. It is a contract.

Architectural Context
Alignment & Cache-Line Requirements
Core Data Structures
Lock-Free SPSC Ring Buffer
GPU Interop — Zero-Copy Buffer Views
Page Lifecycle & Safety Protocol
Error Conditions & Recovery
Benchmarks & Validation Targets
Appendix A — Full Struct Reference
Appendix B — Memory Map Diagram

1. Architectural Context

HELIX OS moves data from instrument hardware to three concurrent consumers — the 3D rendering pipeline, the digital twin ODE solver, and the audit ledger writer — without copying bytes between them. The mechanism is a set of memory-mapped ring buffers where all consumers hold read-only views into the same physical pages that the hardware abstraction layer (HAL) wrote into.

The pipeline has exactly one producer per ring buffer (SPSC — Single-Producer, Single-Consumer is a simplification; in practice each ring has one write owner and N read owners coordinated by the hazard pointer subsystem). The producer is always a HAL driver thread. The consumers are:

Consumer	Binding	Access Pattern
Vulkan/Metal Render Pipeline	`VkBuffer` / `MTLBuffer` external import	Sequential, frame-aligned reads
Digital Twin ODE Solver	Direct pointer dereference via `Arc<MappedRegion>`	Random access within sliding window
Audit Ledger Writer	`O_DIRECT` scatter-gather to NVMe	Sequential, page-aligned reads

The invariant the entire architecture is built to maintain:

No byte of instrument telemetry or simulation output is copied more than once after it exits the kernel DMA buffer.

The HAL writes once. Every downstream consumer reads from that write. When the ring buffer page is recycled, every consumer that may have been reading it must have either completed their read or registered an intent-to-read that blocks recycling. The mechanism that enforces this is described in §6.

2. Alignment & Cache-Line Requirements

A ring buffer's correctness depends on two atomic variables: the head pointer (owned by the consumer, advanced on read) and the tail pointer (owned by the producer, advanced on write). If these two variables share a cache line, every write to tail by the producer invalidates the cache line in the consumer's L1 cache — even though the consumer only cares about head. This is false sharing, and it degrades throughput by forcing cache-line ping-pong between producer and consumer cores across the interconnect.

At HELIX OS's target throughput (>10,000 telemetry frames/sec per instrument stream, running on a multi-socket EPYC node where producer and consumer threads may be on physically separate dies), false sharing on the ring buffer control structure is not a theoretical concern. It is a measured ~40% throughput regression in early prototypes before the alignment policy defined here was adopted.

2.2 The 64-Byte Cache-Line Rule

Modern x86-64 and ARM64 processors use 64-byte cache lines universally. The HELIX OS alignment policy is:

Rule 1 — Control Variables Are Cache-Line Isolated. Every atomic variable that is written by one thread and read by a different thread MUST be the sole occupant of its cache line. This is achieved by padding to 64 bytes.

Rule 2 — Data Structures Are Cache-Line Aligned. All structs that are stored in ring buffer slots MUST begin at a 64-byte-aligned address. This ensures that a single-field read does not straddle a cache line boundary and does not share a cache line with a field from an adjacent slot that is being written concurrently.

Rule 3 — Ring Buffer Slots Are a Multiple of Cache-Line Size. Slot size MUST be a multiple of 64 bytes. A slot that is 96 bytes wastes 32 bytes of padding but guarantees that no two adjacent slots share a cache line.

Rule 4 — GPU-Mapped Regions Are Page-Aligned. Any memory region that will be imported into a Vulkan VkBuffer or Metal MTLBuffer MUST begin at a page-aligned address (4096 bytes on x86-64 and ARM64). This is required by both the Vulkan and Metal specifications for external memory import. HELIX OS uses 2 MB huge pages for GPU-mapped regions where the OS supports it (Linux MAP_HUGETLB, macOS large page support via VM_FLAGS_SUPERPAGE_SIZE_2MB), reducing TLB pressure for the render pipeline's sequential access pattern.

2.3 Rust Enforcement

These rules are enforced at compile time using #[repr(C, align(N))] and validated at runtime using debug_assert! in ring buffer constructors. A test in helix-core/tests/alignment.rs runs under cargo test and cargo miri to verify layout assumptions:

// helix-core/src/ring/alignment_test.rs
#[cfg(test)]
mod tests {
    use super::*;
    use std::mem;

    #[test]
    fn telemetry_frame_alignment() {
        assert_eq!(mem::align_of::<TelemetryFrame>(), 64,
            "TelemetryFrame must be 64-byte aligned for cache-line safety");
        assert_eq!(mem::size_of::<TelemetryFrame>() % 64, 0,
            "TelemetryFrame size must be a multiple of 64 bytes");
    }

    #[test]
    fn ring_header_pointer_isolation() {
        let h = mem::offset_of!(RingBufferHeader, head);
        let t = mem::offset_of!(RingBufferHeader, tail);
        // head and tail must not share a cache line
        assert!(
            (h / 64) != (t / 64),
            "head ({h}) and tail ({t}) share a cache line — false sharing guaranteed"
        );
    }

    #[test]
    fn sim_result_alignment() {
        assert_eq!(mem::align_of::<SimResult>(), 64);
        assert_eq!(mem::size_of::<SimResult>() % 64, 0);
    }
}

3. Core Data Structures

All structs in this section use #[repr(C)] to guarantee a stable, C-ABI-compatible memory layout. #[repr(Rust)] layout is undefined and cannot be relied upon for mmap'd regions that are read by the GPU driver, the ledger writer, or any out-of-process consumer.

3.1 `TelemetryFrame`

A TelemetryFrame is the atomic unit of instrument telemetry in the HELIX OS data model. It represents a single sensor sample from a single instrument channel at a single point in time.

Design constraints:

Must fit in exactly 2 cache lines (128 bytes) to allow the render pipeline to load a complete frame in 2 cache misses maximum.
Must be C-ABI compatible for GPU shader access.
The quality_flags field encodes OPC-UA quality codes, NAMUR NE107 status, and HELIX-internal validity bits.

// helix-core/src/schema/telemetry.rs

/// A single sensor sample. Atomic unit of the telemetry pipeline.
///
/// # Layout
/// Total size: 128 bytes (2 cache lines).
/// Alignment: 64 bytes.
///
/// # Invariants
/// - `timestamp_ns` is a monotonic clock reading (CLOCK_MONOTONIC_RAW on Linux,
///   mach_continuous_time on macOS). It is NOT wall-clock time.
/// - `wall_timestamp_ns` is TAI-corrected UTC nanoseconds since Unix epoch.
///   The distinction matters for FDA audit records where UTC provenance is required.
/// - `value` is always in SI base units or the unit indicated by `unit_code`.
///   The HAL driver is responsible for conversion before writing to the ring buffer.
/// - `sequence` is monotonically increasing per (instrument_id, channel_id) pair.
///   A gap in sequence numbers indicates dropped frames and MUST trigger an alarm.
#[repr(C, align(64))]
#[derive(Debug, Clone, Copy)]
pub struct TelemetryFrame {
    // ── Cache Line 0 (bytes 0–63) ────────────────────────────────────────────

    /// Monotonic timestamp. Nanosecond resolution. Never goes backward.
    /// Set by the HAL driver immediately after DMA completion.
    pub timestamp_ns: u64,                  // offset 0,  size 8

    /// Wall-clock timestamp (TAI → UTC). Set by the HAL's timekeeping subsystem.
    /// Do not use for causal ordering. Use `timestamp_ns` for that.
    pub wall_timestamp_ns: i64,             // offset 8,  size 8

    /// Unique identifier for the physical instrument (assigned at HAL registration).
    /// Packed as [site_id: u16, rack_id: u8, slot_id: u8, instrument_type: u16, _pad: u16]
    pub instrument_id: u64,                 // offset 16, size 8

    /// Channel index within the instrument (e.g., 0 = DO, 1 = pH, 2 = temp).
    pub channel_id: u16,                    // offset 24, size 2

    /// OPC-UA quality code (see OPC UA Part 8, §7.3).
    /// 0x00C0 = Good. Values < 0x0080 indicate uncertain or bad quality.
    pub quality_flags: u16,                 // offset 26, size 2

    /// Unit of measure. Encoded as UCUM code packed into u32.
    /// 0x0001 = dimensionless, 0x0002 = Celsius, 0x0003 = pH, 0x0004 = bar,
    /// 0x0005 = L/min, 0x0006 = RPM, 0x0007 = mg/L (DO), 0x0008 = OD600
    pub unit_code: u32,                     // offset 28, size 4

    /// The measured value in the unit indicated by `unit_code`.
    pub value: f64,                         // offset 32, size 8

    /// Monotonically increasing sequence number. Scoped to (instrument_id, channel_id).
    pub sequence: u64,                      // offset 40, size 8

    /// Reserved for hardware-layer metadata (ADC raw count, calibration revision, etc.)
    pub hw_metadata: u64,                   // offset 48, size 8

    /// CRC-32C (Castagnoli) over bytes 0–55. Validated by consumers before use.
    /// Computed by the HAL; re-validated by ledger writer before persistence.
    pub crc32c: u32,                        // offset 56, size 4

    /// Explicit padding to 64-byte boundary.
    pub _pad0: [u8; 4],                     // offset 60, size 4

    // ── Cache Line 1 (bytes 64–127) ──────────────────────────────────────────

    /// Raw setpoint value active at time of measurement (for context; not controlled).
    pub active_setpoint: f64,               // offset 64, size 8

    /// Digital twin predicted value at this timestamp (written by DT engine, 0.0 if unavailable).
    pub dt_predicted: f64,                  // offset 72, size 8

    /// Deviation of `value` from `dt_predicted`, in sigma units of the process model.
    /// Values > 3.0 trigger a predictive alarm in the alarm subsystem.
    pub dt_deviation_sigma: f32,            // offset 80, size 4

    /// Alarm state bitmap. Bit 0: Low Low, Bit 1: Low, Bit 2: High, Bit 3: High High.
    /// Bit 4: Rate-of-change alarm. Bit 5: DT prediction alarm. Bits 6–15: reserved.
    pub alarm_bitmap: u16,                  // offset 84, size 2

    /// SPC control chart signals. Bit flags for Western Electric rules 1–8 in low byte,
    /// CUSUM signal in bit 8, EWMA signal in bit 9, bits 10–15 reserved.
    pub spc_signals: u16,                   // offset 86, size 2

    /// Explicit padding to cache-line-align and fill to 128 bytes.
    pub _pad1: [u8; 40],                    // offset 88, size 40
    //                                      total: 128 bytes ✓
}

// Compile-time layout assertions. These fail at build time if the struct
// layout changes, alerting the engineer to update this spec and the GPU shaders.
const _: () = {
    assert!(core::mem::size_of::<TelemetryFrame>() == 128);
    assert!(core::mem::align_of::<TelemetryFrame>() == 64);
    assert!(core::mem::offset_of!(TelemetryFrame, value) == 32);
    assert!(core::mem::offset_of!(TelemetryFrame, crc32c) == 56);
    assert!(core::mem::offset_of!(TelemetryFrame, dt_predicted) == 72);
};

GPU Shader Binding (WGSL)

The TelemetryFrame layout is replicated in WGSL for the render pipeline. The two layouts MUST remain in sync. A mismatch is a silent rendering error.

// helix-gpu/shaders/telemetry_frame.wgsl
//
// CRITICAL: This struct MUST match TelemetryFrame in helix-core/src/schema/telemetry.rs.
// If you change either, you MUST change both and update HELIX-ARCH-0002.
//
// std430 layout rules apply. All fields are naturally aligned.

struct TelemetryFrame {
    timestamp_ns:       u64,     // offset 0
    wall_timestamp_ns:  i64,     // offset 8
    instrument_id:      u64,     // offset 16
    channel_id:         u32,     // offset 24  (u16 promoted to u32 by WGSL std430)
    quality_flags:      u32,     // offset 28  (u16 promoted to u32 by WGSL std430)
    // NOTE: WGSL promotes u16 fields to u32 in buffer layouts.
    // The Rust struct uses explicit u32 for unit_code which aligns correctly.
    unit_code:          u32,     // offset 28... (see note)
    // Full binding omitted for brevity — see helix-gpu/shaders/include/telemetry.wgsl
    value:              f64,     // offset 32
    sequence:           u64,     // offset 40
    // ... remaining fields
}

⚠️ Implementation Note on u16 → u32 Promotion: WGSL's std430 layout rules promote u16 to u32 alignment. The Rust struct is designed to avoid u16 fields that would cause WGSL alignment divergence in the second cache line. Any future modifications to TelemetryFrame that introduce u16 fields must account for this and maintain byte-offset compatibility by consulting this document and updating the GPU shader bindings.

3.2 `SimResult`

A SimResult encapsulates the output of a single trajectory segment from the molecular dynamics engine. It is larger than TelemetryFrame and is routed through a separate ring buffer (helix://rings/sim-results) with a larger slot size.

Design constraints:

Must accommodate a 128-element energy gradient vector without heap allocation.
Must be passable to the GPU compute pipeline for MSM transition matrix accumulation.
The collective_variables array maps to the GPU's WGSL array<f32, 16> type exactly.

// helix-core/src/schema/simulation.rs

/// Output of a single molecular dynamics trajectory segment.
///
/// # Layout
/// Total size: 512 bytes (8 cache lines).
/// Alignment: 64 bytes.
///
/// # Note on `energy_gradient`
/// The gradient vector is 128 × f32 = 512 bytes, which would make this struct
/// very large. Instead, `energy_gradient` is a [f32; 8] "reduced gradient"
/// representing the 8 most significant principal components of the full gradient.
/// The full gradient is written to the simulation data partition separately and
/// referenced by `trajectory_chunk_id`.
#[repr(C, align(64))]
#[derive(Debug, Clone, Copy)]
pub struct SimResult {
    // ── Cache Line 0 (bytes 0–63) ────────────────────────────────────────────

    /// Unique ID for the parent simulation ensemble this segment belongs to.
    pub ensemble_id: u128,                  // offset 0,  size 16

    /// Unique ID for this trajectory segment within the ensemble.
    pub segment_id: u64,                    // offset 16, size 8

    /// Monotonic timestamp of segment completion (from the simulation thread's clock).
    pub completed_at_ns: u64,              // offset 24, size 8

    /// Simulated time covered by this segment (femtoseconds).
    pub simulated_time_fs: f64,            // offset 32, size 8

    /// Number of atoms in the system.
    pub n_atoms: u32,                      // offset 40, size 4

    /// Force field identifier. See `helix_core::forcefield::ForceFieldId`.
    pub force_field_id: u16,               // offset 44, size 2

    /// Termination reason for this segment.
    /// 0 = completed normally, 1 = time limit, 2 = energy blowup,
    /// 3 = constraint failure, 4 = external interrupt
    pub termination_reason: u8,            // offset 46, size 1

    /// Compression applied to trajectory data in the sim-data partition.
    /// 0 = none, 1 = zstd-3, 2 = zstd-19, 3 = lz4-hc
    pub compression_codec: u8,             // offset 47, size 1

    /// Pointer (as u64 file offset) into the trajectory chunk store for full coordinate data.
    pub trajectory_chunk_id: u64,          // offset 48, size 8

    /// CRC-32C over the fixed fields (bytes 0–55). Validated before GPU dispatch.
    pub crc32c: u32,                       // offset 56, size 4

    /// Padding to 64-byte boundary.
    pub _pad0: [u8; 4],                    // offset 60, size 4

    // ── Cache Line 1 (bytes 64–127) ──────────────────────────────────────────

    /// Potential energy at segment end (kJ/mol).
    pub potential_energy: f64,             // offset 64, size 8

    /// Kinetic energy at segment end (kJ/mol).
    pub kinetic_energy: f64,               // offset 72, size 8

    /// Temperature (K). Computed from kinetic energy and degrees of freedom.
    pub temperature_k: f64,                // offset 80, size 8

    /// Pressure (bar). From virial theorem.
    pub pressure_bar: f64,                 // offset 88, size 8

    /// Root-mean-square deviation from reference structure (nm).
    pub rmsd_nm: f32,                      // offset 96, size 4

    /// Radius of gyration (nm).
    pub rg_nm: f32,                        // offset 100, size 4

    /// Solvent-accessible surface area (nm²).
    pub sasa_nm2: f32,                     // offset 104, size 4

    /// Padding to complete cache line 1.
    pub _pad1: [u8; 20],                   // offset 108, size 20
    //                                      cache line 1 end: 128 bytes ✓

    // ── Cache Lines 2–3 (bytes 128–255) — Collective Variables ───────────────

    /// Up to 16 collective variable values at segment end.
    /// Which CVs are populated is indicated by `cv_mask` below.
    /// Indexed as: [phi, psi, chi1, chi2, end-to-end_dist, rg, sasa, ...user-defined...]
    pub collective_variables: [f64; 16],   // offset 128, size 128

    // ── Cache Lines 4–5 (bytes 256–383) — Reduced Energy Gradient ────────────

    /// 8-component reduced gradient (principal component projection).
    /// Full gradient is in the trajectory chunk store at `trajectory_chunk_id`.
    pub reduced_gradient: [f64; 8],        // offset 256, size 64

    /// Bitmask indicating which entries of `collective_variables` are valid.
    pub cv_mask: u16,                      // offset 320, size 2

    /// MSM microstate assignment (assigned by the state discretizer post-processing).
    /// u32::MAX if not yet assigned.
    pub msm_microstate: u32,               // offset 322, size 4

    /// Padding.
    pub _pad2: [u8; 58],                   // offset 326, size 58
    //                                      end: 384 bytes

    // ── Cache Lines 6–7 (bytes 384–511) — Reserved for future use ────────────
    pub _reserved: [u8; 128],              // offset 384, size 128
    //                                      total: 512 bytes ✓
}

const _: () = {
    assert!(core::mem::size_of::<SimResult>() == 512);
    assert!(core::mem::align_of::<SimResult>() == 64);
    assert!(core::mem::offset_of!(SimResult, collective_variables) == 128);
    assert!(core::mem::offset_of!(SimResult, reduced_gradient) == 256);
};

3.3 `RingBufferHeader`

The RingBufferHeader is the control structure for a ring buffer, laid out at the beginning of the mmap'd region. It contains the atomic head and tail pointers and the ring's static configuration parameters.

The layout of this struct is the most alignment-critical in the entire codebase. The head and tail fields MUST NOT share a cache line.

// helix-core/src/ring/header.rs

use std::sync::atomic::{AtomicU64, Ordering};

/// Control structure for an SPSC ring buffer. Lives at offset 0 of the mmap region.
///
/// # Layout
/// Total size: 256 bytes (4 cache lines).
///
/// Cache line 0 (bytes   0– 63): Static configuration. Read-only after initialization.
/// Cache line 1 (bytes  64–127): `tail` — written by producer, read by consumer.
/// Cache line 2 (bytes 128–191): `head` — written by consumer, read by producer.
/// Cache line 3 (bytes 192–255): Hazard pointer array (see §6).
///
/// The 64-byte gap between `tail` and `head` ensures they are on separate cache lines,
/// eliminating false sharing between producer and consumer cores.
#[repr(C, align(64))]
pub struct RingBufferHeader {
    // ── Cache Line 0 — Static Configuration (bytes 0–63) ─────────────────────

    /// Magic number for corruption detection. Always `HELIX_RING_MAGIC = 0x48454C49585249_u64`.
    pub magic: u64,                         // offset 0,  size 8

    /// Ring buffer format version. This spec defines version 3.
    pub version: u32,                       // offset 8,  size 4

    /// Slot size in bytes (must be a multiple of 64).
    pub slot_size_bytes: u32,               // offset 12, size 4

    /// Total number of slots. Must be a power of two for efficient modulo via bitmask.
    pub capacity: u32,                      // offset 16, size 4

    /// Bitmask for index wrapping. Always `capacity - 1`.
    pub index_mask: u32,                    // offset 20, size 4

    /// Total size of the data region in bytes (capacity × slot_size_bytes).
    pub data_region_bytes: u64,             // offset 24, size 8

    /// Offset from the start of the mmap region to the first slot (always 4096 — one page).
    pub data_offset: u64,                   // offset 32, size 8

    /// Ring buffer type discriminant.
    /// 1 = TelemetryFrame, 2 = SimResult, 3 = AuditEvent, 255 = raw bytes
    pub ring_type: u8,                      // offset 40, size 1

    /// Number of registered consumers (readers). Currently max 8.
    pub consumer_count: u8,                 // offset 41, size 1

    /// Flags. Bit 0: persistent (mmap-backed to file). Bit 1: GPU-mapped. Bit 2: locked (no new writers).
    pub flags: u8,                          // offset 42, size 1

    pub _pad0: [u8; 21],                    // offset 43, size 21
    //                                       cache line 0 end: 64 bytes ✓

    // ── Cache Line 1 — Tail Pointer (bytes 64–127) ───────────────────────────
    // This entire cache line belongs to the PRODUCER.
    // The consumer reads it but never writes it.

    /// Index of the next slot to write into. Monotonically increasing.
    /// The actual slot index is `tail & index_mask`.
    /// Written by producer with Release ordering after slot data is fully written.
    /// Read by consumer with Acquire ordering before reading the slot.
    ///
    /// NEVER stored as a wrapping u32. Using u64 allows the ring to handle
    /// 2^64 writes without ABA issues.
    pub tail: AtomicU64,                    // offset 64, size 8

    /// Sequence number for the current write epoch (incremented on ring reset).
    pub write_epoch: AtomicU64,             // offset 72, size 8

    /// Producer's cached copy of head (refreshed periodically to check for backpressure).
    /// This is a non-atomic shadow copy — only the producer writes it, only the producer reads it.
    pub _producer_head_cache: u64,          // offset 80, size 8

    pub _pad1: [u8; 40],                    // offset 88, size 40
    //                                       cache line 1 end: 128 bytes ✓

    // ── Cache Line 2 — Head Pointer (bytes 128–191) ──────────────────────────
    // This entire cache line belongs to the CONSUMER.
    // The producer reads it but never writes it.

    /// Index of the next slot to read from. Monotonically increasing.
    /// Written by consumer with Release ordering after processing is complete.
    /// Read by producer with Acquire ordering to check available space.
    pub head: AtomicU64,                    // offset 128, size 8

    /// Consumer's cached copy of tail (refreshed on ring-empty condition).
    pub _consumer_tail_cache: u64,          // offset 136, size 8

    /// Count of dropped frames (incremented when producer finds no space).
    /// Monotonically increasing. An alarm fires when this diverges from last-known value.
    pub dropped_frame_count: AtomicU64,     // offset 144, size 8

    pub _pad2: [u8; 40],                    // offset 152, size 40
    //                                       cache line 2 end: 192 bytes ✓

    // ── Cache Line 3 — Hazard Pointers (bytes 192–255) ───────────────────────
    // See §6 for the full hazard pointer protocol.
    // Each u64 is the slot index currently being read by the corresponding consumer.
    // u64::MAX means "no slot currently being read" (safe to recycle).

    /// Hazard pointers for up to 8 concurrent consumers.
    pub hazard_pointers: [AtomicU64; 8],    // offset 192, size 64
    //                                       cache line 3 end: 256 bytes ✓
}

const _: () = {
    assert!(core::mem::size_of::<RingBufferHeader>() == 256);
    assert!(core::mem::align_of::<RingBufferHeader>() == 64);
    // Verify cache line isolation of head and tail
    assert!(core::mem::offset_of!(RingBufferHeader, tail) == 64);
    assert!(core::mem::offset_of!(RingBufferHeader, head) == 128);
    assert!(core::mem::offset_of!(RingBufferHeader, hazard_pointers) == 192);
    // Verify tail and head are on different cache lines (trivially true given above, but explicit)
    assert!((64 / 64) != (128 / 64));
};

pub const HELIX_RING_MAGIC: u64 = 0x48454C49585249_u64; // "HELIXRI" in ASCII

4. Lock-Free SPSC Ring Buffer

4.1 mmap Layout

The ring buffer occupies a contiguous mmap'd region. The layout is:

Byte Offset     Region
──────────────  ──────────────────────────────────────────────────────────────
0               RingBufferHeader (256 bytes)
256             [unused — header padding to next 4096-byte boundary]
4096            Slot 0  (slot_size_bytes bytes, 64-byte aligned)
4096 + S        Slot 1
4096 + 2S       Slot 2
...
4096 + (C-1)×S  Slot C-1

Where S = slot_size_bytes and C = capacity. The data region starts at a page boundary (offset 4096) for two reasons:

The header must be on its own page to allow the data region pages to be marked MADV_SEQUENTIAL without also marking the header page (which has random-access patterns from multiple threads).
GPU external memory import requires page-aligned base addresses. The GPU is only ever mapped the data region, not the header page.

// helix-core/src/ring/mmap.rs

use std::num::NonZeroUsize;
use rustix::mm::{mmap_anonymous, MapFlags, ProtFlags};

pub const RING_HEADER_SIZE: usize = 256;
pub const DATA_REGION_OFFSET: usize = 4096; // One page — must be >= RING_HEADER_SIZE

pub struct MappedRing {
    pub ptr: *mut u8,
    pub total_size: usize,
    pub header: *mut RingBufferHeader,
    pub data: *mut u8,
}

impl MappedRing {
    /// Allocate a new ring buffer via anonymous mmap.
    ///
    /// # Safety
    /// The returned region must be kept alive for the lifetime of all consumers.
    /// Dropping this struct while consumers hold pointers into the data region
    /// is undefined behavior. Use the hazard pointer protocol in §6 to ensure
    /// all consumers have released before dropping.
    pub unsafe fn allocate(capacity: usize, slot_size: usize) -> Result<Self, RingError> {
        assert!(capacity.is_power_of_two(), "capacity must be a power of two");
        assert!(slot_size % 64 == 0, "slot_size must be a multiple of 64");

        let data_size = capacity * slot_size;
        let total_size = DATA_REGION_OFFSET + data_size;

        // MAP_LOCKED pins the pages in RAM — critical for real-time instrument paths
        // where a page fault would introduce latency spikes.
        // Requires CAP_IPC_LOCK or sufficient RLIMIT_MEMLOCK.
        let ptr = mmap_anonymous(
            None,
            NonZeroUsize::new(total_size).unwrap(),
            ProtFlags::READ | ProtFlags::WRITE,
            MapFlags::PRIVATE | MapFlags::LOCKED | MapFlags::POPULATE,
        )?;

        // Advise the kernel on access patterns for the data region only.
        // MADV_SEQUENTIAL reduces readahead latency for the ledger writer's
        // sequential access pattern.
        rustix::mm::madvise(
            (ptr as *mut u8).add(DATA_REGION_OFFSET) as *mut _,
            data_size,
            rustix::mm::Advice::Sequential,
        )?;

        let header_ptr = ptr as *mut RingBufferHeader;
        let data_ptr = (ptr as *mut u8).add(DATA_REGION_OFFSET);

        Ok(MappedRing {
            ptr: ptr as *mut u8,
            total_size,
            header: header_ptr,
            data: data_ptr,
        })
    }

    /// Return a raw pointer to slot `index`.
    ///
    /// # Safety
    /// Caller must ensure `index < capacity` and that no concurrent writer
    /// is writing to this slot (i.e., the head pointer has advanced past it).
    #[inline(always)]
    pub unsafe fn slot_ptr(&self, index: u64, slot_size: usize) -> *mut u8 {
        let slot_index = (index & (*self.header).index_mask as u64) as usize;
        self.data.add(slot_index * slot_size)
    }
}

4.2 Atomic Head/Tail Protocol

The correctness of the ring buffer rests entirely on the ordering semantics of the head and tail atomic stores and loads. Getting this wrong produces data races that are:

Not detectable by Rust's borrow checker (because the unsafe boundary crosses the mmap region)
Not reliably detectable by cargo miri (which cannot model hardware memory ordering)
Potentially silent — the consumer reads stale or partially-written data with no indication of corruption

The ordering rules are:

Operation	Atomic Ordering	Reason
Producer stores data to slot	`ptr::write_volatile` + fence	Ensure all slot bytes are visible before tail update
Producer increments `tail`	`Release`	All preceding writes are visible to any thread that subsequently loads `tail` with `Acquire`
Consumer loads `tail`	`Acquire`	Synchronizes with producer's `Release` — slot data is now safe to read
Consumer reads slot data	`ptr::read_volatile`	Prevent compiler from caching stale slot reads across the tail check
Consumer increments `head`	`Release`	All preceding reads complete — producer can now see this slot as reclaimable
Producer loads `head`	`Acquire`	Synchronizes with consumer's `Release` — producer's `head` view is up to date

The critical happens-before chain is:

Producer writes slot N data
    │
    │  (sequenced-before, same thread)
    ▼
Producer does atomic::fence(Release)  ← pairs with →  Consumer does tail.load(Acquire)
    │                                                            │
    │  Producer tail.store(N+1, Release)                       │ (synchronizes-with)
    ▼                                                            ▼
[Producer completes write]                          Consumer reads slot N data safely
                                                                │
                                                    Consumer head.store(N+1, Release)
                                                                │  ← pairs with →
                                                    Producer head.load(Acquire)
                                                                ▼
                                                    Producer sees slot N as reclaimable

4.3 Producer Write Path

// helix-core/src/ring/producer.rs

use std::sync::atomic::Ordering::{Acquire, Release};
use std::sync::atomic::fence;
use std::ptr;

impl<T: Copy + 'static> RingProducer<T> {
    /// Write a single item into the ring buffer.
    ///
    /// Returns `Err(RingError::Full)` if no slot is available.
    /// The caller is responsible for backpressure handling (drop, spin, or yield).
    ///
    /// # Timing
    /// This function is designed to execute in <200ns in the common case on a
    /// warmed cache. It must not be called from interrupt context or SCHED_FIFO
    /// threads without careful analysis of the backpressure path.
    #[inline]
    pub fn write(&mut self, item: &T) -> Result<(), RingError> {
        let header = unsafe { &*self.ring.header };

        // Step 1: Load current tail. This is our private copy — only we write tail,
        // so we can use a relaxed load here (no other thread writes tail).
        let current_tail = header.tail.load(Ordering::Relaxed);

        // Step 2: Check if there is space. The ring is full when tail - head == capacity.
        // We refresh our cached head periodically. On a fully warmed SPSC loop,
        // the head advances continuously and this check almost always passes.
        let current_head = header.head.load(Acquire); // Acquire: see head/tail protocol
        let used = current_tail.wrapping_sub(current_head);
        if used >= header.capacity as u64 {
            header.dropped_frame_count.fetch_add(1, Ordering::Relaxed);
            return Err(RingError::Full);
        }

        // Step 3: Write data to the slot. We write directly into the mmap'd region.
        // `ptr::write_volatile` prevents the compiler from reordering or eliding the write.
        let slot = unsafe { self.ring.slot_ptr(current_tail, self.slot_size) as *mut T };
        unsafe { ptr::write_volatile(slot, *item) };

        // Step 4: Memory fence. Ensure ALL bytes of the slot are written to coherent
        // memory before we advance tail. Without this fence, a weakly-ordered CPU
        // (ARM64) could allow the tail store to become visible to another core before
        // the slot data stores, causing the consumer to read a partially-initialized slot.
        fence(Release);

        // Step 5: Advance tail with Release ordering. This is the publication point.
        // After this store, consumers with a subsequent Acquire load of tail will
        // see the complete slot data.
        header.tail.store(current_tail + 1, Release);

        Ok(())
    }
}

4.4 Consumer Read Path

// helix-core/src/ring/consumer.rs

use std::sync::atomic::Ordering::{Acquire, Release, Relaxed};
use std::ptr;

impl<T: Copy + 'static> RingConsumer<T> {
    /// Read a single item from the ring buffer.
    ///
    /// Returns `None` if the ring is empty.
    ///
    /// Before returning the item, this function sets the caller's hazard pointer
    /// to the slot being read. See §6 for the full hazard pointer protocol.
    #[inline]
    pub fn read(&mut self) -> Option<T> {
        let header = unsafe { &*self.ring.header };

        // Step 1: Load our current head position.
        let current_head = header.head.load(Relaxed); // Only we write head.

        // Step 2: Check if data is available.
        // The ring is empty when head == tail.
        // Acquire ordering here synchronizes with the producer's Release store to tail,
        // guaranteeing that the slot data written before tail was incremented is visible.
        let current_tail = header.tail.load(Acquire);
        if current_head == current_tail {
            return None; // Ring is empty.
        }

        // Step 3: Publish hazard pointer BEFORE reading the slot.
        // This must happen before the read — if we set it after, the producer
        // could recycle the slot in the window between our tail check and our read.
        // See §6.2 for the full memory ordering argument.
        let hp_slot = current_head & header.index_mask as u64;
        header.hazard_pointers[self.consumer_id]
            .store(hp_slot, Ordering::SeqCst); // SeqCst: see §6.1 for rationale

        // Step 4: Re-check tail. The producer might have filled and recycled
        // this slot between our tail load in Step 2 and our hazard pointer
        // publication in Step 3. This is a defense-in-depth check.
        // If the ring has wrapped, we retry.
        let tail_recheck = header.tail.load(Acquire);
        let used = tail_recheck.wrapping_sub(current_head);
        if used >= header.capacity as u64 {
            // The slot we want has been overwritten. This indicates a ring-full
            // condition with a concurrent writer — should not happen in SPSC,
            // but is a safety net for debugging.
            header.hazard_pointers[self.consumer_id]
                .store(u64::MAX, Ordering::SeqCst);
            return None;
        }

        // Step 5: Read the slot data. volatile prevents stale-cache reads.
        let slot = unsafe { self.ring.slot_ptr(current_head, self.slot_size) as *const T };
        let item = unsafe { ptr::read_volatile(slot) };

        // Step 6: Clear hazard pointer — we are done reading this slot.
        header.hazard_pointers[self.consumer_id]
            .store(u64::MAX, Ordering::SeqCst);

        // Step 7: Advance head with Release. The producer can now see this slot as free.
        header.head.store(current_head + 1, Release);

        Some(item)
    }
}

5. GPU Interop — Zero-Copy Buffer Views

5.1 wgpu External Memory Import

The goal is to give the GPU render pipeline and compute pipeline read-only access to the ring buffer's data region without copying bytes from CPU memory to a wgpu-managed GPU buffer. The mechanism is platform-specific but the abstraction layer in helix-gpu presents a uniform interface.

// helix-gpu/src/external_buffer.rs

/// A GPU buffer view backed by a HELIX ring buffer data region.
/// No memory is copied. The GPU reads directly from the mmap'd pages.
pub struct ExternalRingView {
    /// Platform-specific buffer handle
    pub inner: PlatformExternalBuffer,
    /// The mmap region this view references (keeps it alive)
    pub _region_ref: Arc<MappedRing>,
    /// Byte offset into the data region (start of the visible window)
    pub byte_offset: u64,
    /// Number of bytes visible to the GPU (must be a multiple of the slot size)
    pub byte_len: u64,
}

pub enum PlatformExternalBuffer {
    Vulkan(VulkanExternalBuffer),
    Metal(MetalExternalBuffer),
}

5.2 Vulkan Path

On Vulkan (Linux, Windows, non-Apple), the data region pages are imported as a VkBuffer backed by VkDeviceMemory created from a POSIX file descriptor (via VK_EXT_external_memory_host or VK_KHR_external_memory_fd).

// helix-gpu/src/platform/vulkan.rs

use ash::vk;

pub unsafe fn import_ring_as_vk_buffer(
    device: &ash::Device,
    ext_host_mem: &ash::extensions::ext::ExternalMemoryHost,
    data_ptr: *mut u8,
    data_size: usize,
    physical_device_props: &vk::PhysicalDeviceExternalMemoryHostPropertiesEXT,
) -> Result<VulkanExternalBuffer, GpuError> {

    // Vulkan requires the host pointer to be aligned to minImportedHostPointerAlignment.
    // For our purposes this is always satisfied because our data region starts at a
    // page boundary (offset 4096 in the mmap region) and page size >= minImportedHostPointerAlignment
    // on all supported platforms.
    let alignment = physical_device_props.min_imported_host_pointer_alignment;
    assert!(
        data_ptr as u64 % alignment == 0,
        "Ring data region is not aligned to VkPhysicalDeviceExternalMemoryHostPropertiesEXT \
         minImportedHostPointerAlignment ({}). This is a bug in MappedRing::allocate.",
        alignment
    );

    // Step 1: Determine memory type index for host-visible, host-coherent memory.
    // For external host memory import, we use VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT.
    let import_info = vk::ImportMemoryHostPointerInfoEXT::builder()
        .handle_type(vk::ExternalMemoryHandleTypeFlags::HOST_ALLOCATION_EXT)
        .host_pointer(data_ptr as *mut std::ffi::c_void)
        .build();

    let alloc_info = vk::MemoryAllocateInfo::builder()
        .allocation_size(data_size as u64)
        .memory_type_index(find_host_coherent_memory_type(device)?)
        .push_next(&mut import_info.clone())
        .build();

    // Step 2: Import the host memory as VkDeviceMemory.
    // This does NOT copy data. The GPU's address translation points directly to
    // the mmap'd pages in host memory.
    let device_memory = device.allocate_memory(&alloc_info, None)
        .map_err(GpuError::VkAllocate)?;

    // Step 3: Create a VkBuffer view over this memory.
    // The buffer is HOST_VISIBLE and DEVICE_LOCAL is NOT required — we are explicitly
    // using host memory. Performance implications: GPU reads incur PCIe round-trips
    // unless the data is in a cache-coherent aperture (AMD GPUs, Apple M-series unified memory).
    // For discrete NVIDIA GPUs, pre-copying hot data into DEVICE_LOCAL memory is recommended
    // for render-critical paths; raw sensor telemetry is acceptable over PCIe.
    let buffer_info = vk::BufferCreateInfo::builder()
        .size(data_size as u64)
        .usage(
            vk::BufferUsageFlags::STORAGE_BUFFER  // for compute shaders (MSM accumulation)
            | vk::BufferUsageFlags::UNIFORM_BUFFER // for render pipeline vertex fetch
        )
        .sharing_mode(vk::SharingMode::EXCLUSIVE)
        .build();

    let buffer = device.create_buffer(&buffer_info, None)?;
    device.bind_buffer_memory(buffer, device_memory, 0)?;

    Ok(VulkanExternalBuffer { buffer, device_memory })
}

5.3 Metal Path

On macOS/iOS (Apple Silicon and Intel Mac), Metal provides makeBuffer(bytesNoCopy:...) which creates an MTLBuffer backed by caller-supplied memory. The memory is not copied. This API was designed precisely for this use case.

// helix-gpu/src/platform/metal.rs

use metal::{Buffer as MTLBuffer, Device as MTLDevice, MTLResourceOptions};
use objc::runtime::Object;

pub unsafe fn import_ring_as_metal_buffer(
    device: &MTLDevice,
    data_ptr: *mut u8,
    data_size: usize,
) -> Result<MetalExternalBuffer, GpuError> {

    // MTLDevice::newBufferWithBytesNoCopy:length:options:deallocator:
    // - Does NOT copy. The MTLBuffer is a view over the provided pointer.
    // - The pointer must be page-aligned. ✓ (guaranteed by our allocator)
    // - The length must be a multiple of the page size. We round up to page size.
    let page_size = 4096usize;
    let aligned_size = (data_size + page_size - 1) & !(page_size - 1);

    // MTLResourceStorageModeShared: buffer is accessible by both CPU and GPU.
    // On Apple Silicon, this is the unified memory architecture — no PCIe transfer,
    // truly zero-copy in the physical sense.
    let buffer: MTLBuffer = device.new_buffer_with_bytes_no_copy(
        data_ptr as *const std::ffi::c_void,
        aligned_size as u64,
        MTLResourceOptions::StorageModeShared,
        None, // No deallocator — we manage the memory via MappedRing's Drop impl
    );

    if buffer.as_ptr().is_null() {
        return Err(GpuError::MetalBufferImportFailed);
    }

    Ok(MetalExternalBuffer { buffer })
}

Apple Silicon Note. On Apple M-series chips, Metal's StorageModeShared combined with newBufferWithBytesNoCopy means the GPU and CPU are literally reading from the same physical DRAM cells. There is no copy, no DMA transfer, and no cache coherency overhead (the unified memory subsystem handles coherency in hardware). The HELIX OS benchmark on M3 Ultra shows GPU read latency of ~120ns for data written by the HAL in the previous millisecond — effectively pipeline-limited, not memory-bandwidth-limited.

5.4 Synchronization Primitives

The GPU cannot participate in the Rust atomic ordering model. The GPU does not know about head and tail pointers. The solution is a double-buffered read window for the render pipeline:

The CPU-side render coordinator determines which slots have valid data (by reading tail - head).
It records a "snapshot tail" value — the index up to which it will read in this frame.
It submits a GPU command buffer that accesses only slots in [head, snapshot_tail).
After the GPU command buffer signals completion (via VkFence / MTLCommandBuffer.waitUntilCompleted()), the CPU render coordinator advances head to snapshot_tail.

This means the GPU never touches a slot that might be concurrently written by the producer. The CPU acts as the arbiter between the GPU's asynchronous execution model and the ring buffer's atomic coordination protocol.

// helix-gpu/src/render_coordinator.rs

impl RenderCoordinator {
    /// Called once per render frame. Returns a GPU buffer descriptor for the
    /// valid telemetry window, suitable for binding to the render pipeline.
    pub fn acquire_frame_window(&mut self) -> Option<GpuBufferWindow> {
        let header = unsafe { &*self.ring.header };

        let head = header.head.load(Acquire);
        let tail = header.tail.load(Acquire);

        if head == tail {
            return None; // No new data this frame.
        }

        // Clamp the window to a maximum of 1024 slots per frame to bound GPU work.
        let snapshot_tail = tail.min(head + 1024);
        let slot_count = (snapshot_tail - head) as u32;

        let byte_offset = (head & header.index_mask as u64) as u64 * self.slot_size as u64;
        let byte_len = slot_count as u64 * self.slot_size as u64;

        self.pending_snapshot_tail = snapshot_tail;

        Some(GpuBufferWindow {
            buffer_handle: self.gpu_buffer.handle(),
            byte_offset,
            byte_len,
            slot_count,
        })
    }

    /// Called after the GPU signals frame completion.
    pub fn release_frame_window(&mut self) {
        let header = unsafe { &*self.ring.header };
        // Safe to advance head — GPU is done with these slots.
        header.head.store(self.pending_snapshot_tail, Release);
    }
}

6. Page Lifecycle & Safety Protocol

6.1 The Hazard Pointer Mechanism

The fundamental tension in the zero-copy architecture is this: the HAL producer wants to recycle ring buffer slots as fast as possible (to avoid dropping frames). The Audit Ledger Writer wants to read from slots that may be near the tail — i.e., recently written slots that the producer has not yet recycled, but which are "almost recyclable." If the producer recycles a slot while the ledger writer is reading it, the ledger writer reads memory that is being concurrently overwritten. This is a data race.

The hazard pointer mechanism prevents this without locking.

Each consumer registers a hazard pointer — a single AtomicU64 in the RingBufferHeader that stores the slot index currently being actively read by that consumer. u64::MAX means "no slot is being read" (safe to recycle all slots).

The producer checks all hazard pointers before determining which slots are safe to recycle.

The safety invariant:

A slot S may only be recycled (overwritten) by the producer if tail - head >= capacity (the ring is full) AND for all consumers C, hazard_pointer[C] != (S & index_mask).

In practice, the SPSC ring buffer doesn't recycle slots explicitly — it overwrites old slots when the ring is full. The hazard pointer mechanism converts the condition "ring is full, may I overwrite slot S?" into "ring is full, is any consumer currently reading slot S?"

Why SeqCst for hazard pointer operations: The correctness of the hazard pointer protocol requires a total order between the consumer's hazard pointer store and the producer's hazard pointer load. Acquire/Release pairs are insufficient because the producer and consumer are not paired — there is no single synchronization edge between them on the hazard pointer path. SeqCst provides the required total store order (TSO) guarantee. The performance cost of SeqCst on x86-64 is zero (x86 is already TSO). On ARM64, it costs one STLR + LDAR pair — measurable but acceptable at the ring's throughput rates.

6.2 Ledger Writer Read-Side Protocol

The Audit Ledger Writer's read path:

// helix-ledger/src/reader.rs

impl LedgerWriter {
    /// Read and persist a telemetry frame from the ring buffer.
    ///
    /// This function implements the hazard pointer read-side protocol.
    /// It must NEVER be called concurrently for the same `consumer_id`.
    pub fn read_and_persist(&self, consumer_id: usize) -> Result<(), LedgerError> {
        let header = unsafe { &*self.ring.header };
        let head = header.head.load(Acquire);
        let tail = header.tail.load(Acquire);

        if head == tail {
            return Ok(()); // Nothing to read.
        }

        let slot_index = head & header.index_mask as u64;

        // STEP 1: Publish hazard pointer BEFORE reading the slot.
        // Order matters: if we publish after reading, the producer could recycle
        // the slot between our read and our hazard pointer publication.
        // With SeqCst, the hazard pointer store is globally visible before any
        // subsequent read by this thread, and the producer's subsequent hazard
        // pointer load with SeqCst sees our store.
        header.hazard_pointers[consumer_id].store(slot_index, Ordering::SeqCst);

        // STEP 2: Memory barrier — ensure the hazard pointer store is complete
        // before we load the slot data.
        std::sync::atomic::fence(Ordering::SeqCst);

        // STEP 3: Re-validate. Between step 1's tail load and our hazard pointer
        // publication, the producer could have advanced tail by a full ring's worth
        // and now be trying to overwrite our slot. Re-checking defends against this.
        let tail_recheck = header.tail.load(Acquire);
        let distance = tail_recheck.wrapping_sub(head);
        if distance >= header.capacity as u64 {
            // The slot has been overwritten. Clear hazard and signal a drop event.
            header.hazard_pointers[consumer_id].store(u64::MAX, Ordering::SeqCst);
            return Err(LedgerError::SlotRecycledUnderRead {
                head,
                tail: tail_recheck,
                capacity: header.capacity,
            });
        }

        // STEP 4: Read the slot. We have a valid hazard pointer; the producer cannot
        // recycle this slot while our hazard pointer is set.
        let frame = unsafe {
            let slot_ptr = self.ring.slot_ptr(head, self.slot_size) as *const TelemetryFrame;
            ptr::read_volatile(slot_ptr)
        };

        // STEP 5: Validate CRC before persisting. A CRC mismatch indicates either
        // a bug in the HAL driver or a genuine memory corruption event.
        // Either way, the ledger must record the anomaly — not silently drop it.
        let computed_crc = crc32c::crc32c_masked(
            unsafe { std::slice::from_raw_parts(&frame as *const _ as *const u8, 56) }
        );
        if computed_crc != frame.crc32c {
            header.hazard_pointers[consumer_id].store(u64::MAX, Ordering::SeqCst);
            return Err(LedgerError::CrcMismatch {
                expected: frame.crc32c,
                computed: computed_crc,
                slot: slot_index,
            });
        }

        // STEP 6: Write to ledger. O_DIRECT, page-aligned, synchronous.
        // The ledger writer uses a separate aligned buffer for O_DIRECT writes —
        // it does NOT write directly from the ring buffer page, because O_DIRECT
        // requires 512-byte sector alignment and our slots may not be sector-aligned.
        // This is the ONE copy in the entire pipeline — from ring slot to O_DIRECT buffer.
        // It is unavoidable for O_DIRECT compliance and is bounded to a single memcpy of
        // sizeof(TelemetryFrame) = 128 bytes.
        self.ledger_write_buffer.write_frame(&frame)?;

        // STEP 7: Clear hazard pointer. We are done with this slot.
        header.hazard_pointers[consumer_id].store(u64::MAX, Ordering::SeqCst);

        Ok(())
    }
}

6.3 HAL Recycle-Side Protocol

When the ring is full, the producer must decide whether to drop the incoming frame or wait. HELIX OS never blocks in the HAL driver — a blocked HAL thread means a missed DMA interrupt, which means a dropped instrument sample, which is worse than a dropped ledger entry. The policy is:

Check if the ring is full.
If full, scan hazard pointers.
For each hazard pointer that is set (consumer currently reading), skip recycling that slot.
If ALL hazard pointers are u64::MAX, the ring is logically empty from a safety perspective — the consumers have drained it and the ring being "full" means the consumer is too slow. Increment dropped_frame_count and overwrite the oldest slot.
If ANY hazard pointer is set AND the ring is full, stall for up to MAX_RECYCLE_STALL_NS (default: 10µs). If the hazard pointer has not cleared within the stall window, emit a RING_STALL alarm and drop the incoming frame.

// helix-hal/src/ring_producer.rs

const MAX_RECYCLE_STALL_NS: u64 = 10_000; // 10 microseconds

impl HalRingProducer {
    fn check_recycle_safety(&self, candidate_slot: u64) -> bool {
        let header = unsafe { &*self.ring.header };
        let slot_index = candidate_slot & header.index_mask as u64;

        // Scan all registered consumer hazard pointers.
        for consumer_id in 0..header.consumer_count as usize {
            let hp = header.hazard_pointers[consumer_id].load(Ordering::SeqCst);
            if hp == slot_index {
                return false; // Consumer is reading this slot. Do not recycle.
            }
        }
        true // All consumers have cleared. Safe to recycle.
    }

    pub fn write_or_drop(&mut self, frame: &TelemetryFrame) {
        let header = unsafe { &*self.ring.header };
        let tail = header.tail.load(Relaxed);
        let head = header.head.load(Acquire);

        if tail.wrapping_sub(head) < header.capacity as u64 {
            // Space available — normal write path.
            self.write(frame);
            return;
        }

        // Ring is full. Check if the oldest slot (at `head`) is safe to recycle.
        let stall_start = crate::time::monotonic_ns();
        loop {
            if self.check_recycle_safety(head) {
                // Force-advance head (overwrite oldest). This is a controlled data loss
                // event — the ledger writer's `dropped_frame_count` monotonic counter
                // will record it.
                header.dropped_frame_count.fetch_add(1, Relaxed);
                // Producer claims the slot and advances head by writing over it.
                // The next consumer read will see the incremented dropped_frame_count
                // and raise an alarm.
                self.write_to_slot(tail, frame);
                header.tail.store(tail + 1, Release);
                return;
            }

            let elapsed = crate::time::monotonic_ns() - stall_start;
            if elapsed > MAX_RECYCLE_STALL_NS {
                // Consumer is holding the hazard pointer too long. This is a bug
                // in the consumer, not the producer. Emit alarm and drop frame.
                self.alarm_bus.emit(AlarmEvent::RingStall {
                    ring_id: self.ring_id,
                    stall_ns: elapsed,
                    blocking_consumer: self.find_blocking_consumer(),
                });
                header.dropped_frame_count.fetch_add(1, Relaxed);
                return; // Drop the incoming frame.
            }

            std::hint::spin_loop();
        }
    }
}

6.4 Safety Invariants

The following invariants are maintained by the protocol above and verified by the helix-core/tests/ring_safety.rs test suite under both cargo test and cargo loom (Loom's exhaustive concurrent execution model explorer):

ID	Invariant	Enforcement Mechanism
INV-01	A slot is never overwritten while any consumer's hazard pointer equals that slot's index.	HAL recycle-side scan + stall
INV-02	`tail` is never stored with Release ordering until all slot bytes are fully written.	`fence(Release)` before `tail.store`
INV-03	A consumer never reads slot data before publishing its hazard pointer.	Hazard pointer store before slot read
INV-04	A consumer never publishes a hazard pointer after the producer has already recycled the target slot.	Re-check of `tail` after hazard pointer publication
INV-05	The ledger writer's single `memcpy` (for O_DIRECT) occurs while its hazard pointer is set.	Hazard pointer cleared only after `ledger_write_buffer.write_frame()` returns
INV-06	`dropped_frame_count` is monotonically increasing. Never decremented, never reset.	`fetch_add(1, Relaxed)` only
INV-07	CRC-32C is validated before ledger persistence. Invalid frames are logged as anomalies, not silently dropped.	`LedgerError::CrcMismatch` variant

7. Error Conditions & Recovery

Error	Cause	Recovery Action
`RingError::Full`	Consumer is too slow; producer is outrunning it	Increment `dropped_frame_count`; emit `RING_BACKPRESSURE` alarm; HAL drops frame
`LedgerError::SlotRecycledUnderRead`	Consumer read-side was too slow between tail check and hazard pointer publication	Log anomaly to ledger with `FRAME_LOST` marker; increment gap counter; continue
`LedgerError::CrcMismatch`	Memory corruption in ring slot or HAL bug	Log anomaly; raise `INTEGRITY_VIOLATION` alarm; trigger diagnostic snapshot
`GpuError::VkAllocate`	Vulkan driver refused external memory import	Fall back to a standard `VkBuffer` with a memcpy stage; raise `GPU_EXTERNAL_MEM_UNAVAILABLE`
`GpuError::MetalBufferImportFailed`	Metal refused `newBufferWithBytesNoCopy` (alignment or size violation)	Check page alignment; raise `GPU_BUFFER_IMPORT_FAILED`
`AlarmEvent::RingStall`	A consumer held a hazard pointer for >10µs	Log consumer thread stack trace; raise `CONSUMER_STALL` alarm; DO NOT drop the frame under normal ops
`RingError::CorruptMagic`	`RingBufferHeader.magic != HELIX_RING_MAGIC`	Ring is corrupted. Shut down the affected HAL channel. Do not continue writing. Alert immediately.

8. Benchmarks & Validation Targets

All targets must be met in the helix bench --suite zero-copy suite before a release is tagged. These are hard gates, not aspirational targets.

Metric	Validation Condition	Target	Gate
Single write latency (SPSC, warm)	`cargo bench --bench ring_write`, 99th percentile	< 80 ns	FAIL if > 120 ns
Single read latency (SPSC, warm)	`cargo bench --bench ring_read`, 99th percentile	< 80 ns	FAIL if > 120 ns
Sustained throughput (TelemetryFrame)	10M frames, measured	> 12M frames/sec	FAIL if < 10M/sec
Hazard pointer check overhead	Incremental vs. no-hazard-pointer baseline	< 15 ns/check	FAIL if > 25 ns
Vulkan external memory import latency	First bind, warm driver	< 500 µs	FAIL if > 1 ms
Metal `newBufferWithBytesNoCopy` latency	First bind, warm driver	< 100 µs	FAIL if > 250 µs
GPU read latency (Apple Silicon M3 Ultra)	Measured end-to-end, HAL write to GPU shader visible	< 500 ns	Informational
Ledger O_DIRECT write latency (NVMe)	99th percentile, `sizeof(TelemetryFrame)` payload	< 50 µs	FAIL if > 100 µs
False sharing regression (tail/head)	Benchmark with `perf stat -e cache-misses` vs. baseline	0 excess L1 evictions	FAIL if regresses

9. Appendix A — Full Struct Reference

Quick-reference table for struct sizes and key field offsets.

Struct	Size (bytes)	Alignment	Key Fields
`TelemetryFrame`	128	64	`value` @ 32, `crc32c` @ 56, `dt_predicted` @ 72
`SimResult`	512	64	`collective_variables` @ 128, `reduced_gradient` @ 256
`RingBufferHeader`	256	64	`tail` @ 64, `head` @ 128, `hazard_pointers` @ 192

10. Appendix B — Memory Map Diagram

Virtual Address Space of a HELIX OS Process (simplified)

0x0000_0000_0000_0000
│
│  [... kernel reserved ...]
│
├── 0x7f00_0000_0000  ← mmap base (kernel-chosen)
│
│   ┌──────────────────────────────────────────────────────────────────┐
│   │  Telemetry Ring Buffer (helix://rings/telemetry/BRX-07-ch0)     │
│   │                                                                  │
│   │  Byte 0:    RingBufferHeader (256 bytes)                        │
│   │    offset 64:  tail  (AtomicU64, PRODUCER cache line)           │
│   │    offset 128: head  (AtomicU64, CONSUMER cache line)           │
│   │    offset 192: hazard_pointers[0..8]                            │
│   │                                                                  │
│   │  Byte 256:  [header padding — 3840 bytes]                       │
│   │                                                                  │
│   │  Byte 4096: Data Region (page-aligned)                          │
│   │    Slot 0:  TelemetryFrame [128 bytes]  ← GPU reads here        │
│   │    Slot 1:  TelemetryFrame [128 bytes]  ← Ledger reads here     │
│   │    Slot 2:  TelemetryFrame [128 bytes]  ← DT solver reads here  │
│   │    ...                                                           │
│   │    Slot N-1: [128 bytes]                                         │
│   │                                                                  │
│   │  TOTAL: 4096 + (capacity × 128) bytes                           │
│   │                                                                  │
│   │  ┌─────────────────────────────────────────────────────────┐    │
│   │  │  VkBuffer / MTLBuffer imported over Data Region         │    │
│   │  │  (same physical pages — no copy)                        │    │
│   │  └─────────────────────────────────────────────────────────┘    │
│   └──────────────────────────────────────────────────────────────────┘
│
│   ┌──────────────────────────────────────────────────────────────────┐
│   │  Simulation Result Ring Buffer (helix://rings/sim-results)      │
│   │  (same layout, slot_size = 512 bytes for SimResult)             │
│   └──────────────────────────────────────────────────────────────────┘
│
│   [... other mmap regions: ledger, config, inter-process shared mem ...]
│
0x7fff_ffff_ffff  ← stack top

End of HELIX-ARCH-0002. All implementations are measured against this document. If the implementation and the document disagree, the document is authoritative and the implementation contains a bug.

Document revision history is maintained in git log --follow docs/arch/HELIX-ARCH-0002.md.

Technical Specification: Binary Interface Between Rust Core, GPU, and Disk​

Table of Contents​

1. Architectural Context​

2. Alignment & Cache-Line Requirements​

2.1 The False Sharing Problem​

2.2 The 64-Byte Cache-Line Rule​

2.3 Rust Enforcement​

3. Core Data Structures​

3.1 TelemetryFrame​

GPU Shader Binding (WGSL)​

3.2 SimResult​

3.3 RingBufferHeader​

4. Lock-Free SPSC Ring Buffer​

4.1 mmap Layout​

4.2 Atomic Head/Tail Protocol​

4.3 Producer Write Path​

4.4 Consumer Read Path​

5. GPU Interop — Zero-Copy Buffer Views​

5.1 wgpu External Memory Import​

5.2 Vulkan Path​

5.3 Metal Path​

5.4 Synchronization Primitives​

6. Page Lifecycle & Safety Protocol​

6.1 The Hazard Pointer Mechanism​

6.2 Ledger Writer Read-Side Protocol​

6.3 HAL Recycle-Side Protocol​

6.4 Safety Invariants​

7. Error Conditions & Recovery​

8. Benchmarks & Validation Targets​

9. Appendix A — Full Struct Reference​

10. Appendix B — Memory Map Diagram​