This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Performance Considerations
Relevant source files
- DRAFT.md
- extensions/muxio-rpc-service-caller/src/prebuffered/traits.rs
- extensions/muxio-wasm-rpc-client/tests/prebuffered_integration_tests.rs
Purpose and Scope
This document covers performance characteristics, optimization strategies, and trade-offs in the muxio RPC framework. Topics include binary protocol efficiency, chunking strategies, payload size management, prebuffering versus streaming patterns, and memory management considerations.
For general architecture and design principles, see Design Philosophy. For detailed information about streaming RPC patterns, see Streaming RPC Calls. For cross-platform deployment strategies, see Cross-Platform Deployment.
Binary Protocol Efficiency
The muxio framework is designed for low-overhead communication through several architectural decisions:
Compact Binary Serialization
The framework uses bitcode for serialization instead of text-based formats like JSON. This provides:
- Smaller payload sizes: Binary encoding reduces network transfer costs
- Faster encoding/decoding: No string parsing or formatting overhead
- Type safety: Compile-time verification of serialized structures
- Zero schema overhead: No field names transmitted in messages
The serialization occurs at the RPC service definition layer, where RpcMethodPrebuffered::encode_request and RpcMethodPrebuffered::decode_response handle type conversion.
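As a rough sketch of what this looks like at the service definition layer, a request type can derive bitcode's traits and be encoded without any schema or field names on the wire. The struct and free functions below are illustrative stand-ins, not muxio's actual trait implementation:

```rust
use bitcode::{Decode, Encode};

// Illustrative request type for a prebuffered method; the real types live in
// the shared service definition crate.
#[derive(Encode, Decode, PartialEq, Debug)]
struct AddRequest {
    numbers: Vec<f64>,
}

fn encode_request(req: &AddRequest) -> Vec<u8> {
    // Compact binary encoding: no field names, no schema, just the values.
    bitcode::encode(req)
}

fn decode_request(bytes: &[u8]) -> Result<AddRequest, bitcode::Error> {
    bitcode::decode(bytes)
}
```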
Schemaless Framing Protocol
The underlying framing protocol is schema-agnostic, meaning:
- No metadata about message structure is transmitted
- Frame headers contain only essential routing information (stream ID, flags)
- Method identification uses 64-bit xxhash values computed at compile time
- Response correlation uses numeric request IDs
This minimalist approach reduces per-message overhead while maintaining full type safety through shared service definitions.
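A conceptual illustration of compile-time method identification follows. The xxhash_rust crate, the xxh3 variant, and the constant name are assumptions made for the sketch; muxio's service definition machinery may compute the ID differently:

```rust
// Hash the method name once at compile time and send only the resulting u64
// on the wire, so no method-name strings are ever transmitted.
use xxhash_rust::const_xxh3::xxh3_64;

pub const ADD_METHOD_ID: u64 = xxh3_64(b"Add");

fn main() {
    // Client and server derive the same ID from the shared definition.
    println!("method id: {:#018x}", ADD_METHOD_ID);
}
```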
Chunking and Payload Size Management
DEFAULT_SERVICE_MAX_CHUNK_SIZE
The framework defines a constant chunk size used for splitting large payloads:
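Its declaration looks roughly like the following; the 64 KB value comes from the rationale below, while the exact declaration lives in the muxio service crates:

```rust
// Illustrative form of the constant; consult the source for the authoritative definition.
pub const DEFAULT_SERVICE_MAX_CHUNK_SIZE: usize = 64 * 1024; // 64 KB
```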
This value represents the maximum size of a single frame’s payload. Any data exceeding this size is automatically chunked by the RpcDispatcher and RpcSession layers.
Rationale for 64 KB chunks:
| Factor | Consideration |
|---|---|
| WebSocket compatibility | Many WebSocket implementations handle 64 KB frames efficiently |
| Memory footprint | Limits per-stream buffer requirements |
| Latency vs throughput | Balances sending small chunks quickly vs fewer total frames |
| TCP segment alignment | Aligns reasonably with typical TCP maximum segment sizes |
Smart Transport Strategy for Large Payloads
The framework implements an adaptive strategy for transmitting RPC arguments based on their encoded size:
flowchart TB
EncodeArgs["RpcCallPrebuffered::call\nEncode input arguments"]
CheckSize{"encoded_args.len() >=\nDEFAULT_SERVICE_MAX_CHUNK_SIZE?"}
SmallPath["Small payload path:\nSet rpc_param_bytes\nHeader contains full args"]
LargePath["Large payload path:\nSet rpc_prebuffered_payload_bytes\nStreamed after header"]
Dispatcher["RpcDispatcher::call\nCreate RpcRequest"]
Session["RpcSession::write_bytes\nChunk if needed"]
Transport["WebSocket transport"]
EncodeArgs --> CheckSize
CheckSize -->|< 64 KB| SmallPath
CheckSize -->|>= 64 KB| LargePath
SmallPath --> Dispatcher
LargePath --> Dispatcher
Dispatcher --> Session
Session --> Transport
style CheckSize fill:#f9f9f9
style SmallPath fill:#f0f0f0
style LargePath fill:#f0f0f0
Small Payload Path (< 64 KB):
The encoded arguments fit in the rpc_param_bytes field of the RpcRequest structure. This field is transmitted as part of the initial request header frame, minimizing round-trips.
Large Payload Path (>= 64 KB):
The encoded arguments are placed in rpc_prebuffered_payload_bytes. The RpcDispatcher automatically chunks this data into multiple frames, each with its own stream ID and sequence flags.
This prevents request header frames from exceeding transport limitations while ensuring arguments of any size can be transmitted.
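The branching can be sketched as follows. The function name and tuple layout are illustrative stand-ins for how RpcCallPrebuffered::call populates the two RpcRequest fields described above:

```rust
const DEFAULT_SERVICE_MAX_CHUNK_SIZE: usize = 64 * 1024;

/// Hedged sketch of the size-based placement of encoded arguments; the real
/// implementation builds an RpcRequest rather than returning a tuple.
fn place_args(encoded_args: Vec<u8>) -> (Option<Vec<u8>>, Option<Vec<u8>>) {
    if encoded_args.len() >= DEFAULT_SERVICE_MAX_CHUNK_SIZE {
        // Large path: arguments are streamed after the header as chunked frames.
        (None, Some(encoded_args))
    } else {
        // Small path: arguments ride inside the request header frame.
        (Some(encoded_args), None)
    }
}
```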
Sources:
- extensions/muxio-rpc-service-caller/src/prebuffered/traits.rs:30-48
- extensions/muxio-rpc-service-caller/src/prebuffered/traits.rs:58-72
- extensions/muxio-wasm-rpc-client/tests/prebuffered_integration_tests.rs:297-300
Prebuffering vs Streaming Trade-offs
The framework provides two distinct patterns for RPC calls, each with different performance characteristics:
Prebuffered RPC Pattern
Characteristics:
- Entire request payload buffered in memory before transmission begins
- Entire response payload buffered before processing begins
- Uses the RpcCallPrebuffered trait and the call_rpc_buffered method
- Sets is_finalized: true on the RpcRequest
Performance implications:
| Aspect | Impact |
|---|---|
| Memory usage | Higher - full payload in memory simultaneously |
| Latency | Higher initial latency - must encode entire payload first |
| Throughput | Optimal for small-to-medium payloads |
| Simplicity | Simpler error handling - all-or-nothing semantics |
| Backpressure | None - sender controls pacing |
Optimal use cases:
- Small payloads (< 10 MB)
- Computations requiring full dataset before processing
- Simple request/response patterns
- Operations where atomicity is important
Streaming RPC Pattern
Characteristics:
- Incremental transmission using dynamic channels
- Processing begins before entire payload arrives
- Uses the RpcMethodStreaming trait (bounded or unbounded channels)
- Supports bidirectional streaming
Performance implications:
| Aspect | Impact |
|---|---|
| Memory usage | Lower - processes data incrementally |
| Latency | Lower initial latency - processing begins immediately |
| Throughput | Better for large payloads |
| Complexity | Requires async channel management |
| Backpressure | Supported via bounded channels |
Optimal use cases:
- Large payloads (> 10 MB)
- Real-time streaming data
- Long-running operations
- File uploads/downloads
- Bidirectional communication
Sources:
- extensions/muxio-rpc-service-caller/src/prebuffered/traits.rs:11-21
- extensions/muxio-wasm-rpc-client/tests/prebuffered_integration_tests.rs:229-312
Memory Management and Buffering
Per-Stream Decoder Allocation
The RpcSession maintains a separate decoder instance for each active stream:
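Conceptually, the per-stream decode state looks like the sketch below; the type and field names are illustrative and not the exact internals of RpcSession:

```rust
use std::collections::HashMap;

// One decoder per active stream, keyed by stream ID; entries are removed
// when the stream ends or errors.
struct StreamDecoder {
    buffer: Vec<u8>, // grows dynamically as chunks arrive
}

struct SessionDecoders {
    by_stream_id: HashMap<u32, StreamDecoder>,
}
```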
Memory characteristics:
- Per-stream overhead: Each active stream allocates a decoder with an internal buffer
- Buffer growth: Buffers grow dynamically as chunks arrive
- Cleanup timing: Decoders are removed on End or Error events
- Peak memory: (concurrent_streams × average_payload_size) + overhead
Example calculation for prebuffered calls:
Scenario: 10 concurrent RPC calls, each with 5 MB response
Peak memory ≈ 10 × 5 MB = 50 MB (excluding overhead)
Encoder Lifecycle
The RpcStreamEncoder is created per-request and manages outbound chunking:
- Created when RpcDispatcher::call initiates a request
- Holds a reference to the payload bytes during transmission
- Automatically chunks data based on DEFAULT_SERVICE_MAX_CHUNK_SIZE
- Dropped after the final chunk is transmitted
For prebuffered calls, the encoder is returned to the caller, allowing explicit lifecycle management:
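A minimal conceptual sketch of that lifecycle is shown below; the type and its fields are illustrative, not muxio's exact API, and the real encoder releases outbound stream state rather than printing:

```rust
struct RpcStreamEncoder {
    stream_id: u32,
}

impl Drop for RpcStreamEncoder {
    fn drop(&mut self) {
        // In the real framework, outbound stream state is released here,
        // after the final chunk has been written to the transport.
        println!("stream {} encoder dropped", self.stream_id);
    }
}

fn main() {
    let encoder = RpcStreamEncoder { stream_id: 1 };
    // ... chunks flushed to the transport while `encoder` is alive ...
    drop(encoder); // explicit lifecycle management by the caller
}
```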
Pending Request Tracking
The RpcDispatcher maintains a HashMap of pending requests:
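Its shape is roughly the following; the request-ID width and the delivery enum are assumptions based on the entry lifecycle described below:

```rust
use std::collections::HashMap;

// Conceptual shape of the pending-request table, not the dispatcher's exact fields.
enum ResultDelivery {
    Oneshot(tokio::sync::oneshot::Sender<Vec<u8>>),
    Callback(Box<dyn FnOnce(Vec<u8>) + Send>),
}

struct PendingRequests {
    by_request_id: HashMap<u32, ResultDelivery>,
}
```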
Entry lifecycle:
- Inserted when call or call_rpc_buffered is invoked
- Maintained until a response is received or a timeout occurs
- Removed on successful response, error, or explicit cleanup
- Each entry holds a oneshot::Sender or callback for result delivery
Memory impact: Proportional to the number of in-flight requests. Each entry contains minimal overhead (sender channel + metadata).
Non-Async Callback Model Performance
The framework’s non-async, callback-driven architecture provides specific performance characteristics:
Runtime Overhead Comparison
graph LR
subgraph "Async/Await Model"
A1["Task spawn overhead"]
A2["Future state machine"]
A3["Runtime scheduler"]
A4["Context switching"]
end
subgraph "muxio Callback Model"
M1["Direct function calls"]
M2["No state machines"]
M3["No runtime dependency"]
M4["Deterministic execution"]
end
A1 -.higher overhead.-> M1
A2 -.higher overhead.-> M2
A3 -.higher overhead.-> M3
A4 -.higher overhead.-> M4
Performance advantages:
| Factor | Benefit |
|---|---|
| No async runtime | Eliminates scheduler overhead |
| Direct callbacks | No future polling or waker mechanisms |
| Deterministic flow | Predictable execution timing |
| WASM compatible | Works in single-threaded browser contexts |
| Memory efficiency | No per-task stack allocation |
Performance limitations:
| Factor | Impact |
|---|---|
| Synchronous processing | Long-running callbacks block progress |
| No implicit parallelism | Concurrency must be managed explicitly |
| Callback complexity | Deep callback chains increase stack usage |
Read/Write Operation Flow
This synchronous model means:
- Low latency: No context switching between read and callback invocation
- Predictable timing: Callbacks are invoked immediately when data is complete
- Stack-based execution: The entire chain executes on a single thread/stack
- No allocations: No heap allocation for task state
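The flow can be illustrated with a minimal, self-contained sketch: a decoder accumulates chunks and invokes a completion callback on the same call stack, with no async runtime involved. All names here are illustrative:

```rust
struct ChunkDecoder {
    buffer: Vec<u8>,
}

impl ChunkDecoder {
    fn read_bytes(&mut self, chunk: &[u8], is_final: bool, on_complete: impl FnOnce(&[u8])) {
        self.buffer.extend_from_slice(chunk);
        if is_final {
            // Invoked immediately: no future polling, no scheduler hop.
            on_complete(&self.buffer);
        }
    }
}

fn main() {
    let mut decoder = ChunkDecoder { buffer: Vec::new() };
    decoder.read_bytes(b"hello ", false, |_| {});
    decoder.read_bytes(b"world", true, |payload| {
        println!("complete payload: {} bytes", payload.len());
    });
}
```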
Connection and Stream Multiplexing Efficiency
Stream ID Allocation Strategy
The RpcSession allocates stream IDs sequentially:
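An illustrative sketch of sequential allocation with wrap-around is shown below; the real RpcSession may differ in detail (for example, in how client and server number spaces are kept separate):

```rust
struct StreamIdAllocator {
    next: u32,
}

impl StreamIdAllocator {
    fn allocate(&mut self) -> u32 {
        let id = self.next;
        self.next = self.next.wrapping_add(1); // O(1), wraps after u32::MAX
        id
    }
}
```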
Efficiency characteristics:
- O(1) allocation: No data structure lookup required
- Collision-free: Client and server use separate number spaces
- Reuse strategy: IDs wrap after exhaustion (u32 range)
- No cleanup needed: Decoders are removed and IDs are naturally recycled
Concurrent Request Handling
The framework supports concurrent requests over a single connection through stream multiplexing:
graph TB
SingleConnection["Single WebSocket Connection"]
Multiplexer["RpcSession Multiplexer"]
subgraph "Interleaved Streams"
S1["Stream 1\nLarge file upload\n1000 chunks"]
S2["Stream 3\nQuick query\n1 chunk"]
S3["Stream 5\nMedium response\n50 chunks"]
end
SingleConnection --> Multiplexer
Multiplexer --> S1
Multiplexer --> S2
Multiplexer --> S3
Timeline["Frame sequence: [1,3,1,1,5,3,1,5,1,...]"]
Multiplexer -.-> Timeline
Note1["Stream 3 completes quickly\ndespite Stream 1 still transmitting"]
S2 -.-> Note1
Performance benefits:
- Head-of-line avoidance: Small requests don’t wait for large transfers
- Resource efficiency: A single connection handles all operations
- Lower latency: No connection establishment overhead per request
- Fairness: Chunks from different streams interleave naturally
Example throughput:
Scenario: 1 large transfer (100 MB) + 10 small queries (10 KB each)
Without multiplexing: Small queries wait ~seconds for large transfer
With multiplexing: Small queries complete in ~milliseconds
Best Practices and Recommendations
Payload Size Guidelines
| Payload Size | Recommended Pattern | Rationale |
|---|---|---|
| < 64 KB | Prebuffered, inline params | Single frame, no chunking overhead |
| 64 KB - 10 MB | Prebuffered, payload_bytes | Automatic chunking, simple semantics |
| 10 MB - 100 MB | Streaming (bounded channels) | Backpressure control, lower memory |
| > 100 MB | Streaming (bounded channels) | Essential for memory constraints |
Concurrent Request Optimization
For high-throughput scenarios:
Maximum concurrent requests = min(
server_handler_capacity,
client_memory_budget / average_payload_size
)
Example calculation:
Server: 100 concurrent handlers
Client memory budget: 500 MB
Average response size: 2 MB
Optimal concurrency = min(100, 500/2) = min(100, 250) = 100 requests
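The same rule, written as code with the illustrative numbers from the example above:

```rust
// Direct translation of the sizing rule; inputs are illustrative.
fn optimal_concurrency(
    server_handler_capacity: usize,
    client_memory_budget_mb: usize,
    average_payload_mb: usize,
) -> usize {
    server_handler_capacity.min(client_memory_budget_mb / average_payload_mb)
}

fn main() {
    // min(100, 500 / 2) = min(100, 250) = 100
    assert_eq!(optimal_concurrency(100, 500, 2), 100);
}
```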
Chunking Strategy Selection
When DEFAULT_SERVICE_MAX_CHUNK_SIZE (64 KB) is optimal:
- General-purpose RPC with mixed payload sizes
- WebSocket transport (browser or native)
- Balanced latency/throughput requirements
When to consider smaller chunks (e.g., 16 KB):
- Real-time streaming with low-latency requirements
- Bandwidth-constrained networks
- Interactive applications requiring immediate feedback
When to consider larger chunks (e.g., 256 KB):
- High-bandwidth, low-latency networks
- Bulk data transfer scenarios
- When minimizing frame overhead is critical
Note: Chunk size is currently a compile-time constant. Custom chunk sizes require modifying DEFAULT_SERVICE_MAX_CHUNK_SIZE and recompiling.
Memory Optimization Patterns
Pattern 1: Limit concurrent streams
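A hedged sketch of this pattern, assuming a Tokio-based host application: a semaphore caps in-flight prebuffered calls so peak memory stays near (max_concurrent × average_payload_size). The RPC call itself is elided:

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn bounded_rpc_call(
    limiter: Arc<Semaphore>,
    encoded_request: Vec<u8>,
) -> Result<Vec<u8>, Box<dyn std::error::Error + Send + Sync>> {
    // Wait for a free slot; the permit is released when it goes out of scope.
    let _permit = limiter.acquire_owned().await?;
    // ... issue the prebuffered RPC here and await its response ...
    Ok(encoded_request) // placeholder echo standing in for the real response
}

fn main() {
    // Allow at most 10 concurrent prebuffered calls.
    let _limiter = Arc::new(Semaphore::new(10));
}
```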
Pattern 2: Streaming for large data
Use streaming RPC methods instead of prebuffered when dealing with large datasets to process data incrementally.
Pattern 3: Connection pooling
For client-heavy scenarios, consider connection pooling to distribute load across multiple connections, avoiding single-connection bottlenecks.
Monitoring and Profiling
The framework uses tracing for observability. Key metrics to monitor:
- RpcDispatcher::call: Request initiation timing
- RpcSession::write_bytes: Frame transmission timing
- RpcStreamDecoder: Chunk reassembly timing
- Pending request count: Memory pressure indicator
- Active stream count: Multiplexing efficiency indicator
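To make these spans visible, the embedding application needs a tracing subscriber. The snippet below is an assumption about the host app (using tracing-subscriber), not a muxio requirement:

```rust
// Enable console output for tracing spans and events emitted by the framework.
fn init_observability() {
    tracing_subscriber::fmt()
        .with_max_level(tracing::Level::DEBUG)
        .init();
}
```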
Performance Testing Results
The integration test suite includes performance validation scenarios:
Large Payload Test (200x Chunk Size)
Test configuration:
- Payload size: 200 × 64 KB = 12.8 MB
- Pattern: Prebuffered echo (round-trip)
- Transport: WebSocket over TCP
- Client: WASM client with bridge
Results demonstrate:
- Successful transmission of 12.8 MB payload
- Automatic chunking into 200 frames
- Correct reassembly and verification
- No memory leaks or decoder issues
This validates the framework’s ability to handle multi-megabyte payloads using the prebuffered pattern with automatic chunking.