This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Performance Considerations
Relevant source files
- DRAFT.md
- extensions/muxio-rpc-service-caller/src/prebuffered/traits.rs
- extensions/muxio-wasm-rpc-client/tests/prebuffered_integration_tests.rs
Purpose and Scope
This document covers performance characteristics, optimization strategies, and trade-offs in the muxio RPC framework. Topics include binary protocol efficiency, chunking strategies, payload size management, prebuffering versus streaming patterns, and memory management considerations.
For general architecture and design principles, see Design Philosophy. For detailed information about streaming RPC patterns, see Streaming RPC Calls. For cross-platform deployment strategies, see Cross-Platform Deployment.
Binary Protocol Efficiency
The muxio framework is designed for low-overhead communication through several architectural decisions:
Compact Binary Serialization
The framework uses bitcode for serialization instead of text-based formats like JSON. This provides:
- Smaller payload sizes: Binary encoding reduces network transfer costs
- Faster encoding/decoding: No string parsing or formatting overhead
- Type safety: Compile-time verification of serialized structures
- Zero schema overhead: No field names transmitted in messages
The serialization occurs at the RPC service definition layer, where RpcMethodPrebuffered::encode_request and RpcMethodPrebuffered::decode_response handle type conversion.
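As a rough sketch of what this looks like at the service definition layer, a request type can derive bitcode's traits and be encoded without any schema or field names on the wire. The struct and free functions below are illustrative stand-ins, not muxio's actual trait implementation:

```rust
use bitcode::{Decode, Encode};

// Illustrative request type for a prebuffered method; the real types live in
// the shared service definition crate.
#[derive(Encode, Decode, PartialEq, Debug)]
struct AddRequest {
    numbers: Vec<f64>,
}

fn encode_request(req: &AddRequest) -> Vec<u8> {
    // Compact binary encoding: no field names, no schema, just the values.
    bitcode::encode(req)
}

fn decode_request(bytes: &[u8]) -> Result<AddRequest, bitcode::Error> {
    bitcode::decode(bytes)
}
```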
Schemaless Framing Protocol
The underlying framing protocol is schema-agnostic, meaning:
- No metadata about message structure is transmitted
- Frame headers contain only essential routing information (stream ID, flags)
- Method identification uses 64-bit xxhash values computed at compile time
- Response correlation uses numeric request IDs
This minimalist approach reduces per-message overhead while maintaining full type safety through shared service definitions.
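A conceptual illustration of compile-time method identification follows. The xxhash_rust crate, the xxh3 variant, and the constant name are assumptions made for the sketch; muxio's service definition machinery may compute the ID differently:

```rust
// Hash the method name once at compile time and send only the resulting u64
// on the wire, so no method-name strings are ever transmitted.
use xxhash_rust::const_xxh3::xxh3_64;

pub const ADD_METHOD_ID: u64 = xxh3_64(b"Add");

fn main() {
    // Client and server derive the same ID from the shared definition.
    println!("method id: {:#018x}", ADD_METHOD_ID);
}
```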
Chunking and Payload Size Management
DEFAULT_SERVICE_MAX_CHUNK_SIZE
The framework defines a constant chunk size used for splitting large payloads:
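Its declaration looks roughly like the following; the 64 KB value comes from the rationale below, while the exact declaration lives in the muxio service crates:

```rust
// Illustrative form of the constant; consult the source for the authoritative definition.
pub const DEFAULT_SERVICE_MAX_CHUNK_SIZE: usize = 64 * 1024; // 64 KB
```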
This value represents the maximum size of a single frame’s payload. Any data exceeding this size is automatically chunked by the RpcDispatcher and RpcSession layers.
Rationale for 64 KB chunks:
| Factor | Consideration |
|---|---|
| WebSocket compatibility | Many WebSocket implementations handle 64 KB frames efficiently |
| Memory footprint | Limits per-stream buffer requirements |
| Latency vs throughput | Balances sending small chunks quickly vs fewer total frames |
| TCP segment alignment | Aligns reasonably with typical TCP maximum segment sizes |
Smart Transport Strategy for Large Payloads
The framework implements an adaptive strategy for transmitting RPC arguments based on their encoded size:
flowchart TB
EncodeArgs["RpcCallPrebuffered::call\nEncode input arguments"]
CheckSize{"encoded_args.len() >=\nDEFAULT_SERVICE_MAX_CHUNK_SIZE?"}
SmallPath["Small payload path:\nSet rpc_param_bytes\nHeader contains full args"]
LargePath["Large payload path:\nSet rpc_prebuffered_payload_bytes\nStreamed after header"]
Dispatcher["RpcDispatcher::call\nCreate RpcRequest"]
Session["RpcSession::write_bytes\nChunk if needed"]
Transport["WebSocket transport"]
EncodeArgs --> CheckSize
CheckSize -->|< 64 KB| SmallPath
CheckSize -->|>= 64 KB| LargePath
SmallPath --> Dispatcher
LargePath --> Dispatcher
Dispatcher --> Session
Session --> Transport
style CheckSize fill:#f9f9f9
style SmallPath fill:#f0f0f0
style LargePath fill:#f0f0f0
Small Payload Path (< 64 KB):
The encoded arguments fit in the rpc_param_bytes field of the RpcRequest structure. This field is transmitted as part of the initial request header frame, minimizing round-trips.
Large Payload Path (>= 64 KB):
The encoded arguments are placed in rpc_prebuffered_payload_bytes. The RpcDispatcher automatically chunks this data into multiple frames, each with its own stream ID and sequence flags.
This prevents request header frames from exceeding transport limitations while ensuring arguments of any size can be transmitted.
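The branching can be sketched as follows. The function name and tuple layout are illustrative stand-ins for how RpcCallPrebuffered::call populates the two RpcRequest fields described above:

```rust
const DEFAULT_SERVICE_MAX_CHUNK_SIZE: usize = 64 * 1024;

/// Hedged sketch of the size-based placement of encoded arguments; the real
/// implementation builds an RpcRequest rather than returning a tuple.
fn place_args(encoded_args: Vec<u8>) -> (Option<Vec<u8>>, Option<Vec<u8>>) {
    if encoded_args.len() >= DEFAULT_SERVICE_MAX_CHUNK_SIZE {
        // Large path: arguments are streamed after the header as chunked frames.
        (None, Some(encoded_args))
    } else {
        // Small path: arguments ride inside the request header frame.
        (Some(encoded_args), None)
    }
}
```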
Sources:
- extensions/muxio-rpc-service-caller/src/prebuffered/traits.rs:30-48
- extensions/muxio-rpc-service-caller/src/prebuffered/traits.rs:58-72
- extensions/muxio-wasm-rpc-client/tests/prebuffered_integration_tests.rs:297-300
Prebuffering vs Streaming Trade-offs
The framework provides two distinct patterns for RPC calls, each with different performance characteristics:
Prebuffered RPC Pattern
Characteristics:
- Entire request payload buffered in memory before transmission begins
- Entire response payload buffered before processing begins
- Uses the RpcCallPrebuffered trait and the call_rpc_buffered method
- Sets is_finalized: true on the RpcRequest
Performance implications:
| Aspect | Impact |
|---|---|
| Memory usage | Higher - full payload in memory simultaneously |
| Latency | Higher initial latency - must encode entire payload first |
| Throughput | Optimal for small-to-medium payloads |
| Simplicity | Simpler error handling - all-or-nothing semantics |
| Backpressure | None - sender controls pacing |
Optimal use cases:
- Small payloads (< 10 MB)
- Computations requiring full dataset before processing
- Simple request/response patterns
- Operations where atomicity is important
Streaming RPC Pattern
Characteristics:
- Incremental transmission using dynamic channels
- Processing begins before entire payload arrives
- Uses the RpcMethodStreaming trait (bounded or unbounded channels)
- Supports bidirectional streaming
Performance implications:
| Aspect | Impact |
|---|---|
| Memory usage | Lower - processes data incrementally |
| Latency | Lower initial latency - processing begins immediately |
| Throughput | Better for large payloads |
| Complexity | Requires async channel management |
| Backpressure | Supported via bounded channels |
Optimal use cases:
- Large payloads (> 10 MB)
- Real-time streaming data
- Long-running operations
- File uploads/downloads
- Bidirectional communication
Sources:
- extensions/muxio-rpc-service-caller/src/prebuffered/traits.rs:11-21
- extensions/muxio-wasm-rpc-client/tests/prebuffered_integration_tests.rs:229-312
Memory Management and Buffering
Per-Stream Decoder Allocation
The RpcSession maintains a separate decoder instance for each active stream:
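Conceptually, the per-stream decode state looks like the sketch below; the type and field names are illustrative and not the exact internals of RpcSession:

```rust
use std::collections::HashMap;

// One decoder per active stream, keyed by stream ID; entries are removed
// when the stream ends or errors.
struct StreamDecoder {
    buffer: Vec<u8>, // grows dynamically as chunks arrive
}

struct SessionDecoders {
    by_stream_id: HashMap<u32, StreamDecoder>,
}
```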
Memory characteristics:
- Per-stream overhead: Each active stream allocates a decoder with an internal buffer
- Buffer growth: Buffers grow dynamically as chunks arrive
- Cleanup timing: Decoders are removed on End or Error events
- Peak memory: (concurrent_streams × average_payload_size) + overhead
Example calculation for prebuffered calls:
Scenario: 10 concurrent RPC calls, each with 5 MB response
Peak memory ≈ 10 × 5 MB = 50 MB (excluding overhead)
Encoder Lifecycle
The RpcStreamEncoder is created per-request and manages outbound chunking:
- Created when RpcDispatcher::call initiates a request
- Holds a reference to the payload bytes during transmission
- Automatically chunks data based on DEFAULT_SERVICE_MAX_CHUNK_SIZE
- Dropped after the final chunk is transmitted
For prebuffered calls, the encoder is returned to the caller, allowing explicit lifecycle management:
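A minimal conceptual sketch of that lifecycle is shown below; the type and its fields are illustrative, not muxio's exact API, and the real encoder releases outbound stream state rather than printing:

```rust
struct RpcStreamEncoder {
    stream_id: u32,
}

impl Drop for RpcStreamEncoder {
    fn drop(&mut self) {
        // In the real framework, outbound stream state is released here,
        // after the final chunk has been written to the transport.
        println!("stream {} encoder dropped", self.stream_id);
    }
}

fn main() {
    let encoder = RpcStreamEncoder { stream_id: 1 };
    // ... chunks flushed to the transport while `encoder` is alive ...
    drop(encoder); // explicit lifecycle management by the caller
}
```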
Pending Request Tracking
The RpcDispatcher maintains a HashMap of pending requests:
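Its shape is roughly the following; the request-ID width and the delivery enum are assumptions based on the entry lifecycle described below:

```rust
use std::collections::HashMap;

// Conceptual shape of the pending-request table, not the dispatcher's exact fields.
enum ResultDelivery {
    Oneshot(tokio::sync::oneshot::Sender<Vec<u8>>),
    Callback(Box<dyn FnOnce(Vec<u8>) + Send>),
}

struct PendingRequests {
    by_request_id: HashMap<u32, ResultDelivery>,
}
```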
Entry lifecycle:
- Inserted when call or call_rpc_buffered is invoked
- Maintained until a response is received or a timeout occurs
- Removed on successful response, error, or explicit cleanup
- Each entry holds a oneshot::Sender or callback for result delivery
Memory impact: Proportional to the number of in-flight requests. Each entry contains minimal overhead (sender channel + metadata).
Non-Async Callback Model Performance
The framework’s non-async, callback-driven architecture provides specific performance characteristics:
Runtime Overhead Comparison
graph LR
subgraph "Async/Await Model"
A1["Task spawn overhead"]
A2["Future state machine"]
A3["Runtime scheduler"]
A4["Context switching"]
end
subgraph "muxio Callback Model"
M1["Direct function calls"]
M2["No state machines"]
M3["No runtime dependency"]
M4["Deterministic execution"]
end
A1 -.higher overhead.-> M1
A2 -.higher overhead.-> M2
A3 -.higher overhead.-> M3
A4 -.higher overhead.-> M4
Performance advantages:
| Factor | Benefit |
|---|---|
| No async runtime | Eliminates scheduler overhead |
| Direct callbacks | No future polling or waker mechanisms |
| Deterministic flow | Predictable execution timing |
| WASM compatible | Works in single-threaded browser contexts |
| Memory efficiency | No per-task stack allocation |
Performance limitations:
| Factor | Impact |
|---|---|
| Synchronous processing | Long-running callbacks block progress |
| No implicit parallelism | Concurrency must be managed explicitly |
| Callback complexity | Deep callback chains increase stack usage |
Read/Write Operation Flow
This synchronous model means:
- Low latency: No context switching between read and callback invocation
- Predictable timing: Callbacks are invoked immediately when data is complete
- Stack-based execution: The entire chain executes on a single thread/stack
- No allocations: No heap allocation for task state
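The flow can be illustrated with a minimal, self-contained sketch: a decoder accumulates chunks and invokes a completion callback on the same call stack, with no async runtime involved. All names here are illustrative:

```rust
struct ChunkDecoder {
    buffer: Vec<u8>,
}

impl ChunkDecoder {
    fn read_bytes(&mut self, chunk: &[u8], is_final: bool, on_complete: impl FnOnce(&[u8])) {
        self.buffer.extend_from_slice(chunk);
        if is_final {
            // Invoked immediately: no future polling, no scheduler hop.
            on_complete(&self.buffer);
        }
    }
}

fn main() {
    let mut decoder = ChunkDecoder { buffer: Vec::new() };
    decoder.read_bytes(b"hello ", false, |_| {});
    decoder.read_bytes(b"world", true, |payload| {
        println!("complete payload: {} bytes", payload.len());
    });
}
```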
Connection and Stream Multiplexing Efficiency
Stream ID Allocation Strategy
The RpcSession allocates stream IDs sequentially:
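An illustrative sketch of sequential allocation with wrap-around is shown below; the real RpcSession may differ in detail (for example, in how client and server number spaces are kept separate):

```rust
struct StreamIdAllocator {
    next: u32,
}

impl StreamIdAllocator {
    fn allocate(&mut self) -> u32 {
        let id = self.next;
        self.next = self.next.wrapping_add(1); // O(1), wraps after u32::MAX
        id
    }
}
```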
Efficiency characteristics:
- O(1) allocation: No data structure lookup required
- Collision-free: Client and server use separate number spaces
- Reuse strategy: IDs wrap after exhaustion (u32 range)
- No cleanup needed: Decoders are removed and IDs are naturally recycled
Concurrent Request Handling
The framework supports concurrent requests over a single connection through stream multiplexing:
graph TB
SingleConnection["Single WebSocket Connection"]
Multiplexer["RpcSession Multiplexer"]
subgraph "Interleaved Streams"
S1["Stream 1\nLarge file upload\n1000 chunks"]
S2["Stream 3\nQuick query\n1 chunk"]
S3["Stream 5\nMedium response\n50 chunks"]
end
SingleConnection --> Multiplexer
Multiplexer --> S1
Multiplexer --> S2
Multiplexer --> S3
Timeline["Frame sequence: [1,3,1,1,5,3,1,5,1,...]"]
Multiplexer -.-> Timeline
Note1["Stream 3 completes quickly\ndespite Stream 1 still transmitting"]
S2 -.-> Note1
Performance benefits:
- Head-of-line avoidance: Small requests don’t wait for large transfers
- Resource efficiency: A single connection handles all operations
- Lower latency: No connection establishment overhead per request
- Fairness: Chunks from different streams interleave naturally
Example throughput:
Scenario: 1 large transfer (100 MB) + 10 small queries (10 KB each)
Without multiplexing: Small queries wait ~seconds for large transfer
With multiplexing: Small queries complete in ~milliseconds
Best Practices and Recommendations
Payload Size Guidelines
| Payload Size | Recommended Pattern | Rationale |
|---|---|---|
| < 64 KB | Prebuffered, inline params | Single frame, no chunking overhead |
| 64 KB - 10 MB | Prebuffered, payload_bytes | Automatic chunking, simple semantics |
| 10 MB - 100 MB | Streaming (bounded channels) | Backpressure control, lower memory |
| > 100 MB | Streaming (bounded channels) | Essential for memory constraints |
Concurrent Request Optimization
For high-throughput scenarios:
Maximum concurrent requests = min(
server_handler_capacity,
client_memory_budget / average_payload_size
)
Example calculation:
Server: 100 concurrent handlers
Client memory budget: 500 MB
Average response size: 2 MB
Optimal concurrency = min(100, 500/2) = min(100, 250) = 100 requests
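The same rule, written as code with the illustrative numbers from the example above:

```rust
// Direct translation of the sizing rule; inputs are illustrative.
fn optimal_concurrency(
    server_handler_capacity: usize,
    client_memory_budget_mb: usize,
    average_payload_mb: usize,
) -> usize {
    server_handler_capacity.min(client_memory_budget_mb / average_payload_mb)
}

fn main() {
    // min(100, 500 / 2) = min(100, 250) = 100
    assert_eq!(optimal_concurrency(100, 500, 2), 100);
}
```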
Chunking Strategy Selection
When DEFAULT_SERVICE_MAX_CHUNK_SIZE (64 KB) is optimal:
- General-purpose RPC with mixed payload sizes
- WebSocket transport (browser or native)
- Balanced latency/throughput requirements
When to consider smaller chunks (e.g., 16 KB):
- Real-time streaming with low-latency requirements
- Bandwidth-constrained networks
- Interactive applications requiring immediate feedback
When to consider larger chunks (e.g., 256 KB):
- High-bandwidth, low-latency networks
- Bulk data transfer scenarios
- When minimizing frame overhead is critical
Note: Chunk size is currently a compile-time constant. Custom chunk sizes require modifying DEFAULT_SERVICE_MAX_CHUNK_SIZE and recompiling.
Memory Optimization Patterns
Pattern 1: Limit concurrent streams
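A hedged sketch of this pattern, assuming a Tokio-based host application: a semaphore caps in-flight prebuffered calls so peak memory stays near (max_concurrent × average_payload_size). The RPC call itself is elided:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn bounded_rpc_call(
    limiter: Arc<Semaphore>,
    encoded_request: Vec<u8>,
) -> Result<Vec<u8>, Box<dyn std::error::Error + Send + Sync>> {
    // Wait for a free slot; the permit is released when it goes out of scope.
    let _permit = limiter.acquire_owned().await?;
    // ... issue the prebuffered RPC here and await its response ...
    Ok(encoded_request) // placeholder echo standing in for the real response
}

fn main() {
    // Allow at most 10 concurrent prebuffered calls.
    let _limiter = Arc::new(Semaphore::new(10));
}
```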
Pattern 2: Streaming for large data
Use streaming RPC methods instead of prebuffered when dealing with large datasets to process data incrementally.
Pattern 3: Connection pooling
For client-heavy scenarios, consider connection pooling to distribute load across multiple connections, avoiding single-connection bottlenecks.
Monitoring and Profiling
The framework uses tracing for observability. Key metrics to monitor:
- RpcDispatcher::call: Request initiation timing
- RpcSession::write_bytes: Frame transmission timing
- RpcStreamDecoder: Chunk reassembly timing
- Pending request count: Memory pressure indicator
- Active stream count: Multiplexing efficiency indicator
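To make these spans visible, the embedding application needs a tracing subscriber. The snippet below is an assumption about the host app (using tracing-subscriber), not a muxio requirement:

```rust
// Enable console output for tracing spans and events emitted by the framework.
fn init_observability() {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();
}
```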
Performance Testing Results
The integration test suite includes performance validation scenarios:
Large Payload Test (200x Chunk Size)
Test configuration:
- Payload size: 200 × 64 KB = 12.8 MB
- Pattern: Prebuffered echo (round-trip)
- Transport: WebSocket over TCP
- Client: WASM client with bridge
Results demonstrate:
- Successful transmission of 12.8 MB payload
- Automatic chunking into 200 frames
- Correct reassembly and verification
- No memory leaks or decoder issues
This validates the framework’s ability to handle multi-megabyte payloads using the prebuffered pattern with automatic chunking.