Performance Considerations

Purpose and Scope

This document covers performance characteristics, optimization strategies, and trade-offs in the muxio RPC framework. Topics include binary protocol efficiency, chunking strategies, payload size management, prebuffering versus streaming patterns, and memory management considerations.

For general architecture and design principles, see Design Philosophy. For detailed information about streaming RPC patterns, see Streaming RPC Calls. For cross-platform deployment strategies, see Cross-Platform Deployment.


Binary Protocol Efficiency

The muxio framework is designed for low-overhead communication through several architectural decisions:

Compact Binary Serialization

The framework uses bitcode for serialization instead of text-based formats like JSON. This provides:

  • Smaller payload sizes: Binary encoding reduces network transfer costs
  • Faster encoding/decoding: No string parsing or formatting overhead
  • Type safety: Compile-time verification of serialized structures
  • Zero schema overhead: No field names transmitted in messages

The serialization occurs at the RPC service definition layer, where RpcMethodPrebuffered::encode_request and RpcMethodPrebuffered::decode_response handle type conversion.
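
As a rough sketch of what such a definition looks like, the following uses the bitcode derive macros; the method and type names here (Add, AddRequest, AddResponse) are hypothetical, and the real RpcMethodPrebuffered trait signatures may differ:

```rust
use bitcode::{Decode, Encode};

// Hypothetical request/response types for an example "Add" method.
#[derive(Encode, Decode)]
pub struct AddRequest {
    pub values: Vec<f64>,
}

#[derive(Encode, Decode)]
pub struct AddResponse {
    pub sum: f64,
}

// Sketch of the encode/decode hooks described above; the actual
// trait in muxio may declare these differently.
pub fn encode_request(req: &AddRequest) -> Vec<u8> {
    bitcode::encode(req) // compact binary: no field names on the wire
}

pub fn decode_response(bytes: &[u8]) -> Result<AddResponse, bitcode::Error> {
    bitcode::decode(bytes)
}
```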

Schemaless Framing Protocol

The underlying framing protocol is schema-agnostic, meaning:

  • No metadata about message structure is transmitted
  • Frame headers contain only essential routing information (stream ID, flags)
  • Method identification uses 64-bit xxhash values computed at compile time
  • Response correlation uses numeric request IDs

This minimalist approach reduces per-message overhead while maintaining full type safety through shared service definitions.
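
The sketch below illustrates the kind of minimal metadata a frame carries, per the list above; field names and widths are illustrative assumptions, not the actual wire format:

```rust
// Illustrative only: the real muxio frame layout may differ.
struct FrameHeader {
    stream_id: u32, // routes the frame to its per-stream decoder
    flags: u8,      // sequence markers (e.g. start / chunk / end / error)
}

// Method identification: a 64-bit xxhash of the method name, computed
// at compile time, so no method-name strings travel on the wire.
type MethodId = u64;
```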


Chunking and Payload Size Management

DEFAULT_SERVICE_MAX_CHUNK_SIZE

The framework defines a constant chunk size used for splitting large payloads:
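
Based on the 64 KB figure used throughout this page, the definition plausibly reads as follows (exact module path assumed):

```rust
// Assumed definition matching the 64 KB value discussed below.
pub const DEFAULT_SERVICE_MAX_CHUNK_SIZE: usize = 64 * 1024;
```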

This value represents the maximum size of a single frame’s payload. Any data exceeding this size is automatically chunked by the RpcDispatcher and RpcSession layers.

Rationale for 64 KB chunks:

| Factor | Consideration |
|---|---|
| WebSocket compatibility | Many WebSocket implementations handle 64 KB frames efficiently |
| Memory footprint | Limits per-stream buffer requirements |
| Latency vs. throughput | Balances sending small chunks quickly against fewer total frames |
| TCP segment alignment | Aligns reasonably with typical TCP maximum segment sizes |

Smart Transport Strategy for Large Payloads

The framework implements an adaptive strategy for transmitting RPC arguments based on their encoded size:

```mermaid
flowchart TB
    EncodeArgs["RpcCallPrebuffered::call\nEncode input arguments"]
    CheckSize{"encoded_args.len() >=\nDEFAULT_SERVICE_MAX_CHUNK_SIZE?"}
    SmallPath["Small payload path:\nSet rpc_param_bytes\nHeader contains full args"]
    LargePath["Large payload path:\nSet rpc_prebuffered_payload_bytes\nStreamed after header"]
    Dispatcher["RpcDispatcher::call\nCreate RpcRequest"]
    Session["RpcSession::write_bytes\nChunk if needed"]
    Transport["WebSocket transport"]

    EncodeArgs --> CheckSize
    CheckSize -->|< 64 KB| SmallPath
    CheckSize -->|>= 64 KB| LargePath
    SmallPath --> Dispatcher
    LargePath --> Dispatcher
    Dispatcher --> Session
    Session --> Transport

    style CheckSize fill:#f9f9f9
    style SmallPath fill:#f0f0f0
    style LargePath fill:#f0f0f0
```

Small Payload Path (< 64 KB):

The encoded arguments fit in the rpc_param_bytes field of the RpcRequest structure. This field is transmitted as part of the initial request header frame, minimizing round-trips.

Large Payload Path (>= 64 KB):

The encoded arguments are placed in rpc_prebuffered_payload_bytes. The RpcDispatcher automatically chunks this data into multiple frames, each with its own stream ID and sequence flags.

This prevents request header frames from exceeding transport limitations while ensuring arguments of any size can be transmitted.
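
A condensed sketch of that decision, using the field names from the prose (the surrounding function and types are assumptions):

```rust
// Sketch of the size-based dispatch in RpcCallPrebuffered::call.
fn place_encoded_args(request: &mut RpcRequest, encoded_args: Vec<u8>) {
    if encoded_args.len() < DEFAULT_SERVICE_MAX_CHUNK_SIZE {
        // Small path: arguments ride inside the request header frame.
        request.rpc_param_bytes = Some(encoded_args);
    } else {
        // Large path: arguments are chunked and streamed after the header.
        request.rpc_prebuffered_payload_bytes = Some(encoded_args);
    }
}
```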


Prebuffering vs Streaming Trade-offs

The framework provides two distinct patterns for RPC calls, each with different performance characteristics:

Prebuffered RPC Pattern

Characteristics:

  • Entire request payload buffered in memory before transmission begins
  • Entire response payload buffered before processing begins
  • Uses RpcCallPrebuffered trait and call_rpc_buffered method
  • Sets is_finalized: true on RpcRequest

Performance implications:

| Aspect | Impact |
|---|---|
| Memory usage | Higher - full payload in memory simultaneously |
| Latency | Higher initial latency - must encode entire payload first |
| Throughput | Optimal for small-to-medium payloads |
| Simplicity | Simpler error handling - all-or-nothing semantics |
| Backpressure | None - sender controls pacing |

Optimal use cases:

  • Small payloads (< 10 MB)
  • Computations requiring full dataset before processing
  • Simple request/response patterns
  • Operations where atomicity is important
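
For orientation, a prebuffered round-trip might look like the following, reusing the hypothetical Add method from the earlier bitcode sketch; the RpcClient type and call signature are assumptions:

```rust
// Hedged sketch, not the verbatim muxio API.
async fn add_remote(client: &RpcClient) -> anyhow::Result<f64> {
    let input = AddRequest { values: vec![1.0, 2.0, 3.0] };
    // Encodes the full request, sends it (chunked if large), and buffers
    // the complete response before decoding: all-or-nothing semantics.
    let response: AddResponse = Add::call(client, input).await?;
    Ok(response.sum)
}
```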

Streaming RPC Pattern

Characteristics:

  • Incremental transmission using dynamic channels
  • Processing begins before entire payload arrives
  • Uses RpcMethodStreaming trait (bounded or unbounded channels)
  • Supports bidirectional streaming

Performance implications:

| Aspect | Impact |
|---|---|
| Memory usage | Lower - processes data incrementally |
| Latency | Lower initial latency - processing begins immediately |
| Throughput | Better for large payloads |
| Complexity | Requires async channel management |
| Backpressure | Supported via bounded channels |

Optimal use cases:

  • Large payloads (> 10 MB)
  • Real-time streaming data
  • Long-running operations
  • File uploads/downloads
  • Bidirectional communication
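
The backpressure point is worth making concrete. The sketch below uses a plain bounded tokio channel to show the mechanism; the channel capacity and helper names are assumptions, not muxio APIs:

```rust
use tokio::sync::mpsc;

// Bounded-channel backpressure: the receiver's pace throttles the sender.
async fn stream_chunks(chunks: Vec<Vec<u8>>) {
    let (tx, mut rx) = mpsc::channel::<Vec<u8>>(16); // at most 16 chunks in flight

    tokio::spawn(async move {
        for chunk in chunks {
            // `send` waits whenever the buffer is full, so memory stays
            // bounded instead of growing with the payload size.
            if tx.send(chunk).await.is_err() {
                break; // receiver dropped
            }
        }
    });

    while let Some(chunk) = rx.recv().await {
        // Process incrementally; peak memory ≈ capacity × chunk size.
        let _ = chunk.len();
    }
}
```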


Memory Management and Buffering

Per-Stream Decoder Allocation

The RpcSession maintains a separate decoder instance for each active stream:
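
A minimal sketch of that bookkeeping, with assumed types (the real map key and decoder internals may differ):

```rust
use std::collections::HashMap;

// Hypothetical per-stream decoder table keyed by stream ID.
// Each decoder owns the reassembly buffer for one in-flight stream.
type ActiveDecoders = HashMap<u32, RpcStreamDecoder>;
```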

Memory characteristics:

  • Per-stream overhead: Each active stream allocates a decoder with internal buffer
  • Buffer growth: Buffers grow dynamically as chunks arrive
  • Cleanup timing: Decoders removed on End or Error events
  • Peak memory: (concurrent_streams × average_payload_size) + overhead

Example calculation for prebuffered calls:

Scenario: 10 concurrent RPC calls, each with 5 MB response
Peak memory ≈ 10 × 5 MB = 50 MB (excluding overhead)

Encoder Lifecycle

The RpcStreamEncoder is created per-request and manages outbound chunking:

  • Created when RpcDispatcher::call initiates a request
  • Holds reference to payload bytes during transmission
  • Automatically chunks data based on DEFAULT_SERVICE_MAX_CHUNK_SIZE
  • Dropped after final chunk transmitted

For prebuffered calls, the encoder is returned to the caller, allowing explicit lifecycle management:
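
A sketch of that pattern, with names following the prose but signatures assumed:

```rust
// `call` returns the stream encoder so the caller controls its lifetime.
let encoder = dispatcher.call(request, on_response)?;

// The encoder holds the payload bytes and emits chunks of at most
// DEFAULT_SERVICE_MAX_CHUNK_SIZE; dropping it after the final chunk
// releases the outbound buffers.
drop(encoder);
```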

Pending Request Tracking

The RpcDispatcher maintains a HashMap of pending requests:
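
A hypothetical shape for that table (the real key and payload types may differ):

```rust
use std::collections::HashMap;
use tokio::sync::oneshot;

// Key = numeric request ID; value = completion channel that delivers
// the encoded response bytes back to the awaiting caller.
type PendingRequests = HashMap<u64, oneshot::Sender<Vec<u8>>>;
```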

Entry lifecycle:

  1. Inserted when call or call_rpc_buffered is invoked
  2. Maintained until a response is received or the request times out
  3. Removed on successful response, error, or explicit cleanup
  4. Each entry holds a oneshot::Sender or callback for result delivery

Memory impact: Proportional to the number of in-flight requests. Each entry carries minimal overhead (a sender channel plus metadata).


Non-Async Callback Model Performance

The framework’s non-async, callback-driven architecture provides specific performance characteristics:

Runtime Overhead Comparison

```mermaid
graph LR
    subgraph "Async/Await Model"
        A1["Task spawn overhead"]
        A2["Future state machine"]
        A3["Runtime scheduler"]
        A4["Context switching"]
    end

    subgraph "muxio Callback Model"
        M1["Direct function calls"]
        M2["No state machines"]
        M3["No runtime dependency"]
        M4["Deterministic execution"]
    end

    A1 -.higher overhead.-> M1
    A2 -.higher overhead.-> M2
    A3 -.higher overhead.-> M3
    A4 -.higher overhead.-> M4
```

Performance advantages:

| Factor | Benefit |
|---|---|
| No async runtime | Eliminates scheduler overhead |
| Direct callbacks | No future polling or waker mechanisms |
| Deterministic flow | Predictable execution timing |
| WASM compatible | Works in single-threaded browser contexts |
| Memory efficiency | No per-task stack allocation |

Performance limitations:

| Factor | Impact |
|---|---|
| Synchronous processing | Long-running callbacks block progress |
| No implicit parallelism | Concurrency must be managed explicitly |
| Callback complexity | Deep callback chains increase stack usage |

Read/Write Operation Flow

This synchronous model means:

  • Low latency: No context switching between read and callback invocation
  • Predictable timing: Callback invoked immediately when data is complete
  • Stack-based execution: Entire chain executes on a single thread/stack
  • No allocations: No heap allocation for task state
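
As a rough illustration of that flow (the read_bytes entry point and event shape are assumptions based on the prose):

```rust
// The closure runs synchronously while read_bytes is still on the stack:
// no executor, no polling, no context switch.
session.read_bytes(&incoming_frame, |event| {
    handle_event(event); // invoked immediately when a frame decodes
});
```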


Connection and Stream Multiplexing Efficiency

Stream ID Allocation Strategy

The RpcSession allocates stream IDs sequentially:
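
One plausible implementation of sequential allocation, matching the characteristics listed below; the odd/even split is an assumed detail, not confirmed from the source:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Illustrative allocator. Stepping by 2 from different starting parities
// is one way client and server keep disjoint ID spaces.
static NEXT_STREAM_ID: AtomicU32 = AtomicU32::new(1);

fn next_stream_id() -> u32 {
    // O(1): one atomic add, wrapping naturally within the u32 range.
    NEXT_STREAM_ID.fetch_add(2, Ordering::Relaxed)
}
```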

Efficiency characteristics:

  • O(1) allocation: No data structure lookup required
  • Collision-free: Client/server use separate number spaces
  • Reuse strategy: IDs wrap after exhaustion (u32 range)
  • No cleanup needed: Decoders removed, IDs naturally recycled
Concurrent Request Handling

The framework supports concurrent requests over a single connection through stream multiplexing:

```mermaid
graph TB
    SingleConnection["Single WebSocket Connection"]
    Multiplexer["RpcSession Multiplexer"]

    subgraph "Interleaved Streams"
        S1["Stream 1\nLarge file upload\n1000 chunks"]
        S2["Stream 3\nQuick query\n1 chunk"]
        S3["Stream 5\nMedium response\n50 chunks"]
    end

    SingleConnection --> Multiplexer
    Multiplexer --> S1
    Multiplexer --> S2
    Multiplexer --> S3

    Timeline["Frame sequence: [1,3,1,1,5,3,1,5,1,...]"]
    Multiplexer -.-> Timeline

    Note1["Stream 3 completes quickly\ndespite Stream 1 still transmitting"]
    S2 -.-> Note1
```

Performance benefits:

  1. Head-of-line blocking avoidance: Small requests don’t wait behind large transfers
  2. Resource efficiency: A single connection handles all operations
  3. Lower latency: No connection establishment overhead per request
  4. Fairness: Chunks from different streams interleave naturally

Example throughput:

Scenario: 1 large transfer (100 MB) + 10 small queries (10 KB each)
Without multiplexing: Small queries wait ~seconds for large transfer
With multiplexing: Small queries complete in ~milliseconds

Sources:


Best Practices and Recommendations

Payload Size Guidelines

| Payload Size | Recommended Pattern | Rationale |
|---|---|---|
| < 64 KB | Prebuffered, inline params | Single frame, no chunking overhead |
| 64 KB - 10 MB | Prebuffered, payload_bytes | Automatic chunking, simple semantics |
| 10 MB - 100 MB | Streaming (bounded channels) | Backpressure control, lower memory |
| > 100 MB | Streaming (bounded channels) | Essential for memory constraints |

Concurrent Request Optimization

For high-throughput scenarios:

Maximum concurrent requests = min(
    server_handler_capacity,
    client_memory_budget / average_payload_size
)

Example calculation:

Server: 100 concurrent handlers
Client memory budget: 500 MB
Average response size: 2 MB

Optimal concurrency = min(100, 500/2) = min(100, 250) = 100 requests

Chunking Strategy Selection

When DEFAULT_SERVICE_MAX_CHUNK_SIZE (64 KB) is optimal:

  • General-purpose RPC with mixed payload sizes
  • WebSocket transport (browser or native)
  • Balanced latency/throughput requirements

When to consider smaller chunks (e.g., 16 KB):

  • Real-time streaming with low-latency requirements
  • Bandwidth-constrained networks
  • Interactive applications requiring immediate feedback

When to consider larger chunks (e.g., 256 KB):

  • High-bandwidth, low-latency networks
  • Bulk data transfer scenarios
  • When minimizing frame overhead is critical

Note: Chunk size is currently a compile-time constant. Custom chunk sizes require modifying DEFAULT_SERVICE_MAX_CHUNK_SIZE and recompiling.

Memory Optimization Patterns

Pattern 1: Limit concurrent streams
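
One application-level way to enforce such a limit (illustrative; not a muxio API), shown with a tokio semaphore:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

// Caps the number of in-flight calls at the semaphore's capacity.
async fn bounded_call(limiter: Arc<Semaphore>) {
    // Waits here whenever all permits are already outstanding.
    let _permit = limiter.acquire_owned().await.expect("semaphore closed");
    // ... issue the RPC while holding the permit ...
} // permit drops here, freeing a slot

// Setup: let limiter = Arc::new(Semaphore::new(32));
```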

Pattern 2: Streaming for large data

Use streaming RPC methods instead of prebuffered when dealing with large datasets to process data incrementally.

Pattern 3: Connection pooling

For client-heavy scenarios, consider connection pooling to distribute load across multiple connections, avoiding single-connection bottlenecks.

Monitoring and Profiling

The framework uses tracing for observability. Key metrics to monitor:

  • RpcDispatcher::call: Request initiation timing
  • RpcSession::write_bytes: Frame transmission timing
  • RpcStreamDecoder: Chunk reassembly timing
  • Pending request count: Memory pressure indicator
  • Active stream count: Multiplexing efficiency indicator
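
Since these are standard tracing instrumentation points, a conventional subscriber is enough to surface them; the snippet below is ordinary tracing_subscriber setup, not muxio-specific code:

```rust
// Emit structured logs for spans/events at DEBUG and above.
tracing_subscriber::fmt()
    .with_max_level(tracing::Level::DEBUG)
    .init();
```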


Performance Testing Results

The integration test suite includes performance validation scenarios:

Large Payload Test (200x Chunk Size)

Test configuration:

  • Payload size: 200 × 64 KB = 12.8 MB
  • Pattern: Prebuffered echo (round-trip)
  • Transport: WebSocket over TCP
  • Client: WASM client with bridge

Results demonstrate:

  • Successful transmission of 12.8 MB payload
  • Automatic chunking into 200 frames
  • Correct reassembly and verification
  • No memory leaks or decoder issues

This validates the framework’s ability to handle multi-megabyte payloads using the prebuffered pattern with automatic chunking.
