# Technical Deep Dive
Lock-free data structures, risk management, and memory pool design
## Lock-Free SPSC Ring Buffer
The core communication primitive is a Single-Producer Single-Consumer ring buffer. It provides wait-free push/pop operations using only acquire/release memory ordering, never the heavier `seq_cst`.
```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <type_traits>

constexpr bool is_power_of_two(size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

template<typename T, size_t Capacity>
class LockFreeRingBuffer {
    static_assert(is_power_of_two(Capacity), "Capacity must be a power of 2");
    static_assert(std::is_trivially_copyable_v<T>, "T must be trivially copyable");

public:
    LockFreeRingBuffer()
        : buffer_(new T[Capacity]{}) {}

    /// Producer: attempt to push an item. Returns false if full.
    bool try_push(const T& item) noexcept {
        const size_t current_tail = tail_.load(std::memory_order_relaxed);
        const size_t next_tail = (current_tail + 1) & MASK;
        if (next_tail == head_.load(std::memory_order_acquire)) {
            return false; // Full
        }
        buffer_[current_tail] = item;
        tail_.store(next_tail, std::memory_order_release);
        return true;
    }

    /// Consumer: attempt to pop an item. Returns false if empty.
    bool try_pop(T& item) noexcept {
        const size_t current_head = head_.load(std::memory_order_relaxed);
        if (current_head == tail_.load(std::memory_order_acquire)) {
            return false; // Empty
        }
        item = buffer_[current_head];
        head_.store((current_head + 1) & MASK, std::memory_order_release);
        return true;
    }

private:
    static constexpr size_t MASK = Capacity - 1;

    // Separate cache lines to prevent false sharing
    alignas(64) std::atomic<size_t> head_{0};
    alignas(64) std::atomic<size_t> tail_{0};

    // Heap-allocated buffer (single allocation at construction)
    std::unique_ptr<T[]> buffer_;
};
```
### Key Design Points
- Power-of-2 capacity: Enables bitwise masking (`index & MASK`) instead of modulo for index wrapping, a single AND instruction vs. expensive division
- Relaxed loads for own index: The producer only writes `tail_` and the consumer only writes `head_`. Each thread loads its own index with `memory_order_relaxed` since there's no contention on it
- Acquire/release pairing: The producer does a `release` store on `tail_` after writing data, guaranteeing the consumer sees the data when it does an `acquire` load on `tail_`
- Cache-line separation: `alignas(64)` on `head_` and `tail_` ensures they're on different cache lines, preventing false sharing between producer and consumer cores
- Heap-allocated buffer: Uses `std::unique_ptr<T[]>` instead of `std::array`. A 65536-element array would be ~5 MB on the stack, causing stack overflow
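A minimal usage sketch of the queue above, with one dedicated producer thread and one dedicated consumer thread (the element type and capacity here are illustrative):

```cpp
#include <thread>

int main() {
    LockFreeRingBuffer<int, 1024> queue;

    std::thread producer([&] {
        for (int i = 0; i < 100000; ++i) {
            while (!queue.try_push(i)) {
                // Queue full: spin until the consumer catches up
            }
        }
    });

    std::thread consumer([&] {
        int value = 0;
        for (int received = 0; received < 100000;) {
            if (queue.try_pop(value)) {
                ++received; // value holds the popped item
            }
        }
    });

    producer.join();
    consumer.join();
}
```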
## Risk Manager: 21 ns Pre-Trade Checks
The risk manager runs 6 sequential checks on every order, completing in approximately 21 nanoseconds. The key techniques:
```cpp
#include <atomic>

// OrderRequest, Price, RiskCheckResult, and PositionTracker are
// domain types defined elsewhere in the project.
class RiskManager {
public:
    /// Check an order against all risk limits. O(1), <100 ns target.
    __attribute__((hot))
    RiskCheckResult check_order(
        const OrderRequest& request,
        Price current_market_price
    ) noexcept;

private:
    alignas(64) std::atomic<bool> kill_switch_{false};

    // Precomputed reciprocal to avoid division on hot path
    double price_deviation_threshold_;

    // Flat array position tracking (O(1) by instrument ID)
    PositionTracker positions_;
};
```
### The Six Checks (~21 ns total)
| # | Check | Implementation | Cost |
|---|---|---|---|
| 1 | Kill switch | Atomic load | ~1 ns |
| 2 | Order size | Single comparison | ~1 ns |
| 3 | Position limit | Array lookup + abs | ~3 ns |
| 4 | Capital limit | Multiplication | ~3 ns |
| 5 | Order rate | Counter check | ~2 ns |
| 6 | Fat finger | Multiply, not divide | ~3 ns |
### Why It's Fast
- `[[unlikely]]` on all failure branches: The compiler lays out the approved path as straight-line code with no branches taken. Rejection paths are moved out-of-line (see the sketch after this list)
- No virtual dispatch: Concrete class, no inheritance, direct function calls
- Flat array for positions: `PositionTracker` uses a fixed-size array indexed by `InstrumentId` for O(1) lookup with no hash computation
- Multiplication over division: The fat-finger check uses `abs(order_price - market_price) * reciprocal > market_price` (with `reciprocal = 1/threshold` precomputed) instead of dividing, since multiplication is ~3x faster than division on modern CPUs
- `__attribute__((hot))`: Hints to the compiler to optimize this function aggressively and place it in the hot section of the binary
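A minimal, self-contained sketch of what this hot path can look like. The enum values, field names, and limits below are stand-ins rather than the project's actual types, and only checks 1, 2, and 6 are shown:

```cpp
#include <atomic>
#include <cmath>
#include <cstdint>

// Stand-in types for the sketch; the real project defines richer ones.
enum class RiskCheckResult : uint8_t { Approved, KillSwitch, TooLarge, FatFinger };
struct OrderRequest { double price; uint32_t quantity; };

class RiskManager {
public:
    RiskCheckResult check_order(const OrderRequest& req,
                                double market_price) noexcept {
        // Check 1 -- kill switch: a single atomic load
        if (kill_switch_.load(std::memory_order_relaxed)) [[unlikely]]
            return RiskCheckResult::KillSwitch;

        // Check 2 -- order size: one comparison against a cached limit
        if (req.quantity > max_order_size_) [[unlikely]]
            return RiskCheckResult::TooLarge;

        // Check 6 -- fat finger: |dp| * (1/threshold) > p is equivalent to
        // |dp| / p > threshold, so the precomputed reciprocal avoids a divide
        const double deviation = std::abs(req.price - market_price);
        if (deviation * price_deviation_threshold_ > market_price) [[unlikely]]
            return RiskCheckResult::FatFinger;

        return RiskCheckResult::Approved; // hot path: straight-line code
    }

private:
    alignas(64) std::atomic<bool> kill_switch_{false};
    uint32_t max_order_size_ = 10000;                // illustrative limit
    double price_deviation_threshold_ = 1.0 / 0.05; // reciprocal of a 5% threshold
};
```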
## Memory Pool: Zero-Allocation Order Management
Orders are allocated from a pre-allocated memory pool to avoid `new`/`delete` on the hot path:
```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <type_traits>

template<typename T, size_t PoolSize>
class MemoryPool {
    // Raw, correctly-aligned storage for one T; a free slot reuses its
    // first 4 bytes as the free-list link.
    using StorageType = std::aligned_storage_t<sizeof(T), alignof(T)>;
    static_assert(sizeof(T) >= sizeof(uint32_t),
                  "T must be large enough to hold a free-list index");

public:
    MemoryPool() : storage_(std::make_unique<StorageType[]>(PoolSize)) {
        // Initialize free list: each slot points to the next
        for (uint32_t i = 0; i < PoolSize - 1; ++i) {
            *reinterpret_cast<uint32_t*>(&storage_[i]) = i + 1;
        }
        *reinterpret_cast<uint32_t*>(&storage_[PoolSize - 1]) = INVALID;
    }

    /// O(1) allocate: pop from free list head.
    /// Returns a raw slot; the caller constructs T in place.
    T* allocate() noexcept {
        if (free_head_ == INVALID) return nullptr; // Pool exhausted
        uint32_t idx = free_head_;
        free_head_ = *reinterpret_cast<uint32_t*>(&storage_[idx]);
        ++allocated_count_;
        return reinterpret_cast<T*>(&storage_[idx]);
    }

    /// O(1) deallocate: push to free list head.
    /// The caller destroys the object before returning the slot.
    void deallocate(T* ptr) noexcept {
        if (!ptr) return;
        auto* raw = reinterpret_cast<StorageType*>(ptr);
        size_t idx = static_cast<size_t>(raw - storage_.get());
        *reinterpret_cast<uint32_t*>(&storage_[idx]) = free_head_;
        free_head_ = static_cast<uint32_t>(idx);
        --allocated_count_;
    }

private:
    static constexpr uint32_t INVALID = 0xFFFFFFFF;
    std::unique_ptr<StorageType[]> storage_;
    uint32_t free_head_ = 0;
    size_t allocated_count_ = 0;
};
```
### Design Insights
- Intrusive free list: The free list index is stored inside the unused memory slots themselves. When a slot is free, its first 4 bytes contain the index of the next free slot. This means the free list requires zero additional memory
- Single heap allocation: `std::make_unique<StorageType[]>(PoolSize)` allocates all storage at construction. After that, `allocate()` and `deallocate()` never touch the heap
- No atomics: The pool is single-threaded by design (each component owns its pool), so there's no atomic overhead
- Contiguous memory: All objects are in a contiguous array, maximizing cache locality during iteration
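A usage sketch with a hypothetical `Order` payload: because the pool hands out raw, correctly-aligned slots, the caller constructs in place and destroys explicitly:

```cpp
#include <cstdint>
#include <new> // placement new

struct Order { // hypothetical payload type
    uint64_t id;
    double price;
    uint32_t quantity;
};

int main() {
    MemoryPool<Order, 4096> pool; // one up-front heap allocation

    Order* slot = pool.allocate(); // O(1), no heap traffic
    if (slot) {
        Order* order = new (slot) Order{1, 99.50, 100}; // construct in place
        // ... submit the order ...
        order->~Order();        // trivial here, but shown for completeness
        pool.deallocate(order); // O(1) push back onto the free list
    }
}
```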
## Strategy Engine: Pre-Allocated Signal Output
Strategies return their order signals via `std::span` over pre-allocated member arrays. This avoids allocating a `std::vector` on every tick:
```cpp
#include <span>

// OrderRequest, MarketDataMessage, and OrderBook are domain types
// defined elsewhere in the project.
class StrategyInterface {
public:
    virtual ~StrategyInterface() = default;

    /// Generate orders based on a market data update.
    /// Returns a span over an internally-owned buffer (no allocation).
    virtual std::span<const OrderRequest> on_market_data(
        const MarketDataMessage& msg,
        const OrderBook& book
    ) = 0;
};
```
Each strategy implementation has a `std::array<OrderRequest, MAX_ORDERS>` member and a count, returning a `std::span` view over the filled portion. The caller never allocates, and the span is valid until the next call.
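A sketch of one such implementation. The signal logic, the `bid_volume()`/`ask_volume()` accessors, the `make_buy_order` helper, and the `MAX_ORDERS` value are all hypothetical; only the buffer-and-span pattern comes from the text above:

```cpp
#include <array>
#include <cstddef>
#include <span>

class ImbalanceStrategy final : public StrategyInterface {
public:
    std::span<const OrderRequest> on_market_data(
        const MarketDataMessage& msg,
        const OrderBook& book
    ) override {
        count_ = 0; // reuse the same buffer every tick

        // Hypothetical signal: act when the book is heavily bid-skewed.
        if (book.bid_volume() > 2 * book.ask_volume() && count_ < MAX_ORDERS) {
            orders_[count_++] = make_buy_order(msg); // hypothetical helper
        }

        return {orders_.data(), count_}; // view over filled portion, no allocation
    }

private:
    static constexpr std::size_t MAX_ORDERS = 16;
    std::array<OrderRequest, MAX_ORDERS> orders_{};
    std::size_t count_ = 0;
};
```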
## FIX Parser: Zero-Copy Field Access
The FIX protocol parser achieves ~700 ns per message using zero-copy techniques:

- `string_view` into raw buffer: Field values are views into the original message bytes; no string copies
- Flat array tag lookup: A `std::array<StringView, MAX_TAG>` indexed by FIX tag number provides O(1) field access after parsing
- Single-pass parsing: One scan through the message populates all tag-value pairs
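A simplified, self-contained sketch of the pattern; real FIX framing, checksum validation, and repeating groups are omitted, and `MAX_TAG` is an arbitrary cap chosen for the illustration:

```cpp
#include <array>
#include <charconv>
#include <cstddef>
#include <string_view>

class FixView {
public:
    // Single pass over "tag=value\x01tag=value\x01...": store a view of
    // each value, indexed by tag. No copies, no allocation.
    void parse(std::string_view msg) {
        fields_.fill({});
        std::size_t pos = 0;
        while (pos < msg.size()) {
            const std::size_t eq = msg.find('=', pos);
            if (eq == std::string_view::npos) break;
            const std::size_t soh = msg.find('\x01', eq + 1);
            if (soh == std::string_view::npos) break;
            unsigned tag = 0;
            std::from_chars(msg.data() + pos, msg.data() + eq, tag);
            if (tag < MAX_TAG)
                fields_[tag] = msg.substr(eq + 1, soh - eq - 1);
            pos = soh + 1;
        }
    }

    // O(1) field access by tag number; empty view if the tag is absent.
    std::string_view get(unsigned tag) const noexcept {
        return tag < MAX_TAG ? fields_[tag] : std::string_view{};
    }

private:
    static constexpr std::size_t MAX_TAG = 1024; // arbitrary cap for the sketch
    std::array<std::string_view, MAX_TAG> fields_{};
};
```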
## Lessons Learned
- `std::array<T, 65536>` in class templates caused stack overflow when objects were stack-allocated (~5 MB per instance). Fixed by using `std::unique_ptr<T[]>` with heap allocation at construction. This single allocation at startup doesn't affect hot-path performance
- Using `memory_order_seq_cst` everywhere "to be safe" was costly. Switching to the correct acquire/release pairs on the SPSC queue improved throughput measurably; `seq_cst` emits unnecessary fence instructions on x86
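To make the second lesson concrete: on x86, a release store compiles to a plain `mov`, while a seq_cst store requires a full barrier (an `xchg`, or `mov` plus `mfence`):

```cpp
#include <atomic>
#include <cstddef>

std::atomic<size_t> tail{0};

// Defaults to memory_order_seq_cst: compiled as xchg (or mov + mfence)
// on x86 -- a full barrier on every publish.
void publish_seq_cst(size_t t) { tail.store(t); }

// memory_order_release: a plain mov on x86, since x86 stores already
// have release semantics. This is all the SPSC producer side needs.
void publish_release(size_t t) { tail.store(t, std::memory_order_release); }
```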