
This documentation is part of the "Projects with Books" initiative at zenOSmosis.

The source code for this project is available on GitHub.

Performance Optimization

This page describes techniques and strategies for optimizing throughput, latency, and memory usage in rust-muxio applications. It covers serialization efficiency, chunking strategies, memory allocation patterns, and profiling approaches. For general architectural patterns, see Core Concepts. For transport-specific tuning, see Transport Implementations.


Binary Serialization Efficiency

The system uses bitcode for binary serialization of RPC method parameters and responses. This provides compact encoding with minimal overhead compared to text-based formats like JSON.
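
To illustrate why a compact binary layout beats text formats, here is a minimal std-only sketch of the kind of fixed-header encoding a binary serializer produces. This is not the actual bitcode or muxio code; `encode_request` and `decode_request` are hypothetical stand-ins, and bitcode derives this layout automatically via `#[derive(bitcode::Encode, bitcode::Decode)]`:

```rust
// Hypothetical wire layout: method_id (u32 LE) + payload length (u32 LE) + payload.
// A binary codec spends 8 bytes of overhead here; a JSON envelope with field
// names and quoting would spend several times that.
fn encode_request(method_id: u32, payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(8 + payload.len());
    buf.extend_from_slice(&method_id.to_le_bytes());
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    buf
}

fn decode_request(buf: &[u8]) -> Option<(u32, &[u8])> {
    let method_id = u32::from_le_bytes(buf.get(0..4)?.try_into().ok()?);
    let len = u32::from_le_bytes(buf.get(4..8)?.try_into().ok()?) as usize;
    Some((method_id, buf.get(8..8 + len)?))
}

fn main() {
    let encoded = encode_request(42, b"hello");
    assert_eq!(encoded.len(), 13); // 8 bytes of header + 5 bytes of payload
    let (id, payload) = decode_request(&encoded).unwrap();
    assert_eq!(id, 42);
    assert_eq!(payload, b"hello");
    println!("ok");
}
```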

```mermaid
graph LR
    A["Application\nRust Types"] --> B["encode_request\nbitcode::encode"]
    B --> C["Vec<u8>\nBinary Buffer"]
    C --> D["RpcHeader\nrpc_metadata_bytes"]
    D --> E["Frame Protocol\nLow-Level Transport"]
    E --> F["decode_request\nbitcode::decode"]
    F --> A
```

Serialization Strategy

Optimization Guidelines

| Technique | Impact | Implementation |
|---|---|---|
| Use `#[derive(bitcode::Encode, bitcode::Decode)]` | Automatic optimal encoding | Applied in service definitions |
| Avoid nested `Option<Option<T>>` | Reduces byte overhead | Flatten data structures |
| Prefer fixed-size types over variable-length | Predictable buffer sizes | Use `[u8; N]` instead of `Vec<u8>` when size is known |
| Use `u32` instead of `u64` when range allows | Halves integer encoding size | RPC method IDs use `u32` |

Sources:

  • Service definition patterns in example-muxio-rpc-service-definition
  • Cargo.lock:158-168 (bitcode dependencies)

Chunking Strategy and Throughput

The max_chunk_size parameter controls how large payloads are split into multiple frames. Optimal chunk size balances latency, memory usage, and transport efficiency.
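
The split itself can be sketched with the standard library's `chunks` iterator. This is a simplified illustration of the framing step, not the actual muxio implementation; `chunk_payload` is a hypothetical helper:

```rust
// Split a payload into frames of at most `max_chunk_size` bytes.
// Every frame is full-sized except possibly the last one.
fn chunk_payload(payload: &[u8], max_chunk_size: usize) -> Vec<&[u8]> {
    payload.chunks(max_chunk_size).collect()
}

fn main() {
    let payload = vec![0u8; 20_000];
    // With 8 KB chunks, a 20 KB payload becomes three frames: 8K + 8K + ~3.5K.
    let frames = chunk_payload(&payload, 8 * 1024);
    assert_eq!(frames.len(), 3);
    assert_eq!(frames[2].len(), 20_000 - 2 * 8 * 1024);
    println!("{} frames", frames.len());
}
```

Each frame carries fixed per-frame protocol overhead, which is why halving the chunk size roughly doubles the framing cost for the same payload.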

```mermaid
graph TB
    subgraph "Small Chunks (e.g., 1KB)"
        A1["Lower Memory\nPer Request"] --> A2["More Frames"]
        A2 --> A3["Higher CPU\nFraming Overhead"]
    end
    subgraph "Large Chunks (e.g., 64KB)"
        B1["Higher Memory\nPer Request"] --> B2["Fewer Frames"]
        B2 --> B3["Lower CPU\nFraming Overhead"]
    end
    subgraph "Optimal Range"
        C1["8KB - 16KB"] --> C2["Balance of\nMemory & CPU"]
    end
```

Chunk Size Selection

Performance Characteristics:

| Chunk Size | Latency | Memory | CPU | Best For |
|---|---|---|---|---|
| 1-2 KB | Excellent | Minimal | High overhead | Real-time, WASM |
| 4-8 KB | Very Good | Low | Moderate | Standard RPC |
| 16-32 KB | Good | Moderate | Low | Large payloads |
| 64+ KB | Fair | High | Minimal | Bulk transfers |


Prebuffering vs Streaming

The system supports two payload transmission modes with different performance trade-offs.

```mermaid
graph TB
    subgraph "Prebuffered Mode"
        PB1["RpcRequest\nis_finalized=true"]
        PB2["Single write_bytes"]
        PB3["Immediate end_stream"]
        PB4["Low Latency\nHigh Memory"]
        PB1 --> PB2 --> PB3 --> PB4
    end
    subgraph "Streaming Mode"
        ST1["RpcRequest\nis_finalized=false"]
        ST2["Multiple write_bytes\ncalls"]
        ST3["Delayed end_stream"]
        ST4["Higher Latency\nLow Memory"]
        ST1 --> ST2 --> ST3 --> ST4
    end
```

Mode Comparison

Prebuffered Response Handling

The prebuffer_response flag controls whether response payloads are accumulated before delivery:

| Mode | Memory Usage | Latency | Use Case |
|---|---|---|---|
| `prebuffer_response=true` | Accumulates entire payload | Delivers complete response | Small responses, simpler logic |
| `prebuffer_response=false` | Streams chunks as received | Minimal per-chunk latency | Large responses, progress tracking |

Implementation Details:

  • Prebuffering accumulates chunks in prebuffered_responses HashMap
  • Buffer is stored until RpcStreamEvent::End is received
  • Handler is invoked once with complete payload
  • Buffer is immediately cleared after handler invocation
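
The accumulate-then-deliver pattern above can be sketched as follows. `Prebuffer` and `StreamEvent` are simplified stand-ins for the real dispatcher state and `RpcStreamEvent` variants, shown only to make the buffering lifecycle concrete:

```rust
use std::collections::HashMap;

// Simplified stand-in for RpcStreamEvent: (request id, chunk) or end-of-stream.
enum StreamEvent {
    PayloadChunk(u32, Vec<u8>),
    End(u32),
}

struct Prebuffer {
    prebuffered_responses: HashMap<u32, Vec<u8>>,
}

impl Prebuffer {
    // Accumulate chunks per request id; on End, invoke the handler exactly
    // once with the complete payload and clear the buffer immediately.
    fn on_event(&mut self, ev: StreamEvent, handler: &mut impl FnMut(Vec<u8>)) {
        match ev {
            StreamEvent::PayloadChunk(id, chunk) => {
                self.prebuffered_responses.entry(id).or_default().extend(chunk);
            }
            StreamEvent::End(id) => {
                if let Some(buf) = self.prebuffered_responses.remove(&id) {
                    handler(buf);
                }
            }
        }
    }
}

fn main() {
    let mut pb = Prebuffer { prebuffered_responses: HashMap::new() };
    let mut delivered = Vec::new();
    let mut handler = |buf: Vec<u8>| delivered.push(buf);
    pb.on_event(StreamEvent::PayloadChunk(7, b"he".to_vec()), &mut handler);
    pb.on_event(StreamEvent::PayloadChunk(7, b"llo".to_vec()), &mut handler);
    pb.on_event(StreamEvent::End(7), &mut handler);
    assert_eq!(delivered, vec![b"hello".to_vec()]);
    assert!(pb.prebuffered_responses.is_empty()); // buffer freed on delivery
    println!("ok");
}
```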


Memory Management Patterns

```mermaid
graph LR
    A["Inbound Frames"] --> B["RpcDispatcher\nread_bytes"]
    B --> C["Mutex Lock"]
    C --> D["VecDeque\nrpc_request_queue"]
    D --> E["Push/Update/Delete\nOperations"]
    E --> F["Mutex Unlock"]
    G["Application"] --> H["get_rpc_request"]
    H --> C
```

Request Queue Design

The RpcDispatcher maintains an internal request queue using Arc<Mutex<VecDeque<(u32, RpcRequest)>>>. This design has specific performance implications:

Lock Contention Considerations:

| Operation | Lock Duration | Frequency | Optimization |
|---|---|---|---|
| `read_bytes` | Per-frame decode | High | Minimize work under lock |
| `get_rpc_request` | Read access only | Medium | Returns guard, caller controls lock |
| `delete_rpc_request` | Single element removal | Low | Uses `VecDeque::remove` |

Memory Overhead:

  • Each in-flight request: ~100-200 bytes base + payload size
  • VecDeque capacity grows as needed
  • Payload bytes accumulated until is_finalized=true

Preventing Memory Leaks

The dispatcher must explicitly clean up completed or failed requests:

Critical Pattern:

  1. Request added to queue on Header event
  2. Payload accumulated on PayloadChunk events
  3. Finalized on End event
  4. Application must call delete_rpc_request() to free memory

Failure to delete finalized requests causes unbounded memory growth.
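
The deletion step can be sketched against a queue of the shape described above. `Request` and `delete_rpc_request` here are simplified stand-ins for the real types, shown to make the cleanup obligation concrete:

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Simplified stand-in for RpcRequest.
struct Request {
    payload: Vec<u8>,
    is_finalized: bool,
}

type Queue = Arc<Mutex<VecDeque<(u32, Request)>>>;

// Remove a request by id, returning it so the caller can consume the payload.
// Skipping this step for finalized requests leaves them queued forever.
fn delete_rpc_request(queue: &Queue, id: u32) -> Option<Request> {
    let mut q = queue.lock().unwrap();
    let pos = q.iter().position(|(rid, _)| *rid == id)?;
    q.remove(pos).map(|(_, req)| req)
}

fn main() {
    let queue: Queue = Arc::new(Mutex::new(VecDeque::new()));
    queue.lock().unwrap().push_back((
        1,
        Request { payload: b"done".to_vec(), is_finalized: true },
    ));
    let req = delete_rpc_request(&queue, 1).expect("request present");
    assert!(req.is_finalized);
    assert_eq!(req.payload, b"done");
    assert!(queue.lock().unwrap().is_empty()); // memory reclaimed
    println!("ok");
}
```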


```mermaid
graph LR
    A["next_rpc_request_id\nu32 counter"] --> B["increment_u32_id()"]
    B --> C["Assign to\nRpcHeader"]
    C --> D["Store in\nresponse_handlers"]
    D --> E["Match on\ninbound response"]
```

Request Correlation Overhead

Each outbound request is assigned a unique u32 ID for response correlation. The system uses monotonic ID generation with wraparound.

ID Generation Strategy

Performance Characteristics:

| Aspect | Cost | Justification |
|---|---|---|
| ID generation | Minimal (single addition) | `u32::wrapping_add(1)` |
| HashMap insertion | O(1) average | `response_handlers.insert()` |
| Response lookup | O(1) average | `response_handlers.get_mut()` |
| Memory per handler | ~24 bytes + closure size | `Box<dyn FnMut>` overhead |
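
The monotonic-with-wraparound scheme can be sketched in a few lines. This is a minimal illustration of `wrapping_add`-based ID generation, not the actual dispatcher code:

```rust
// Return the current ID and advance the counter, wrapping at u32::MAX
// instead of panicking on overflow.
fn increment_u32_id(counter: &mut u32) -> u32 {
    let id = *counter;
    *counter = counter.wrapping_add(1);
    id
}

fn main() {
    let mut next_rpc_request_id: u32 = u32::MAX - 1;
    assert_eq!(increment_u32_id(&mut next_rpc_request_id), u32::MAX - 1);
    assert_eq!(increment_u32_id(&mut next_rpc_request_id), u32::MAX);
    assert_eq!(increment_u32_id(&mut next_rpc_request_id), 0); // wraps around
    println!("ok");
}
```

Wraparound is safe in practice because a response correlates against the small set of currently in-flight IDs, not the full history.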

Concurrency Considerations:

  • next_rpc_request_id is NOT thread-safe
  • Each client connection should have its own RpcDispatcher
  • Sharing a dispatcher across threads requires external synchronization


```mermaid
graph TB
    A["Connection\nClosed"] --> B["fail_all_pending_requests"]
    B --> C["std::mem::take\nresponse_handlers"]
    C --> D["For each handler"]
    D --> E["Create\nRpcStreamEvent::Error"]
    E --> F["Invoke handler\nwith error"]
    F --> G["Drop handler\nboxed closure"]
```

Handler Cleanup and Backpressure

Failed Request Handling

When a transport connection drops, all pending response handlers must be notified to prevent resource leaks and hung futures:

Implementation:

The fail_all_pending_requests() method takes ownership of all handlers and invokes them with an error event. This ensures:

  • Awaiting futures are woken with error result
  • Callback memory is freed immediately
  • No handlers remain registered after connection failure

Performance Impact:

  • Invocation cost: O(n) where n = number of pending requests
  • Each handler invocation is synchronous
  • Memory freed immediately after iteration
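
The take-and-drain pattern can be sketched as below. `Event` and the handler map are simplified stand-ins for the real `RpcStreamEvent` and `response_handlers` types; the point is that `std::mem::take` moves every boxed closure out in one step, so they are invoked once and then dropped:

```rust
use std::cell::Cell;
use std::collections::HashMap;
use std::rc::Rc;

// Simplified stand-in for RpcStreamEvent::Error.
enum Event {
    Error(String),
}

type Handler = Box<dyn FnMut(Event)>;

// Take ownership of all pending handlers, invoke each once with an error
// event, then drop them; the map is left empty.
fn fail_all_pending_requests(handlers: &mut HashMap<u32, Handler>) {
    for (_, mut h) in std::mem::take(handlers) {
        h(Event::Error("connection closed".into()));
    }
}

fn main() {
    let mut handlers: HashMap<u32, Handler> = HashMap::new();
    let errors = Rc::new(Cell::new(0));
    for id in 0..3 {
        let errors = Rc::clone(&errors);
        handlers.insert(
            id,
            Box::new(move |ev| {
                if let Event::Error(_) = ev {
                    errors.set(errors.get() + 1);
                }
            }),
        );
    }
    fail_all_pending_requests(&mut handlers);
    assert_eq!(errors.get(), 3); // every pending request saw the error
    assert!(handlers.is_empty()); // no handlers remain registered
    println!("ok");
}
```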


Benchmarking with Criterion

The codebase uses criterion for performance benchmarking; run the benchmark suites with `cargo bench`.

Benchmark Structure

Key Metrics to Track:

| Metric | What It Measures | Target |
|---|---|---|
| Throughput | Bytes/sec processed | Maximize |
| Latency | Time per operation | Minimize |
| Allocation Rate | Heap allocations | Minimize |
| Frame Overhead | Protocol bytes vs payload | < 5% |
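
Criterion adds statistical rigor (warm-up, outlier detection, confidence intervals); the underlying shape of a throughput measurement can be sketched with only `std::time`. The `encode` function here is a hypothetical workload, not a muxio API:

```rust
use std::time::Instant;

// Hypothetical workload: length-prefix a payload, as a serializer might.
fn encode(payload: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(4 + payload.len());
    buf.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    buf.extend_from_slice(payload);
    buf
}

fn main() {
    let payload = vec![0u8; 64 * 1024];
    let iterations = 1_000;
    let start = Instant::now();
    let mut total = 0usize;
    for _ in 0..iterations {
        // Consuming the result prevents the work from being optimized away.
        total += encode(&payload).len();
    }
    let elapsed = start.elapsed();
    assert_eq!(total, iterations * (payload.len() + 4));
    println!("processed {} bytes in {:?}", total, elapsed);
}
```

For real measurements prefer criterion's `bench_function` harness, which repeats this loop enough times to produce stable statistics.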


Platform-Specific Optimizations

Native (Tokio) vs WASM

Platform Tuning:

| Platform | Chunk Size | Buffer Strategy | Concurrency |
|---|---|---|---|
| Native (Tokio) | 8-16 KB | Reuse buffers | Multiple connections |
| WASM (Browser) | 2-4 KB | Small allocations | Single connection |
| Native (Server) | 16-32 KB | Pre-allocated pools | Connection pooling |

WASM-Specific Considerations:

  • JavaScript boundary crossings have cost (~1-5 μs per call)
  • Minimize calls to wasm-bindgen functions
  • Use larger RPC payloads to amortize overhead
  • Prebuffer responses when possible to reduce event callbacks

Sources:

  • muxio-tokio-rpc-client vs muxio-wasm-rpc-client crate comparison
  • Cargo.lock:935-953 (WASM dependencies)

Profiling and Diagnostics

Tracing Integration

The system uses tracing for instrumentation. Enable a subscriber (for example, tracing-subscriber filtered via the `RUST_LOG` environment variable) to surface the trace points below and identify bottlenecks.

Key Trace Points:

| Location | Event | Performance Insight |
|---|---|---|
| `RpcDispatcher::call` | Request initiation | Call frequency, payload sizes |
| `read_bytes` | Frame processing | Decode latency, lock contention |
| Handler callbacks | Response processing | Handler execution time |

Detecting Performance Issues

Tooling:

  • cargo flamegraph — CPU flame graphs to locate hot code paths
  • criterion — detect throughput and latency regressions between changes
  • tracing spans — attribute latency to specific dispatch and handler stages


Best Practices Summary

| Optimization | Technique | Impact |
|---|---|---|
| Minimize allocations | Reuse buffers, use `Vec::with_capacity` | High |
| Choose optimal chunk size | 8-16 KB for typical RPC | Medium |
| Prebuffer small responses | Enable `prebuffer_response` for responses under 64 KB | Medium |
| Clean up completed requests | Call `delete_rpc_request()` promptly | High |
| Use fixed-size types | Prefer `[u8; N]` over `Vec<u8>` in hot paths | Low |
| Profile before optimizing | Use criterion + flamegraph | Critical |
