This documentation is part of the "Projects with Books" initiative at zenOSmosis.
The source code for this project is available on GitHub.
Performance Optimization
This page describes techniques and strategies for optimizing throughput, latency, and memory usage in rust-muxio applications. It covers serialization efficiency, chunking strategies, memory allocation patterns, and profiling approaches. For general architectural patterns, see Core Concepts. For transport-specific tuning, see Transport Implementations.
Binary Serialization Efficiency
The system uses bitcode for binary serialization of RPC method parameters and responses. This provides compact encoding with minimal overhead compared to text-based formats like JSON.
```mermaid
graph LR
    A["Application\nRust Types"] --> B["encode_request\nbitcode::encode"]
    B --> C["Vec<u8>\nBinary Buffer"]
    C --> D["RpcHeader\nrpc_metadata_bytes"]
    D --> E["Frame Protocol\nLow-Level Transport"]
    E --> F["decode_request\nbitcode::decode"]
    F --> A
```
Serialization Strategy
Sources:
- src/rpc/rpc_dispatcher.rs:249-251
- Diagram 6 in high-level architecture
Optimization Guidelines
| Technique | Impact | Implementation |
|---|---|---|
| Use #[derive(bitcode::Encode, bitcode::Decode)] | Automatic optimal encoding | Applied in service definitions |
| Avoid nested Option<Option<T>> | Reduces byte overhead | Flatten data structures |
| Prefer fixed-size types over variable-length | Predictable buffer sizes | Use [u8; N] instead of Vec<u8> when size is known |
| Use u32 instead of u64 when range allows | Halves integer encoding size | RPC method IDs use u32 |
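As an illustration of the derive-based pattern in the table above, the sketch below defines a hypothetical parameter type and round-trips it through bitcode. The type and field names are invented for the example and are not taken from this repository's service definitions.

```rust
// Hypothetical parameter type following the derive guideline above.
#[derive(bitcode::Encode, bitcode::Decode, Debug, PartialEq)]
struct AddParams {
    // u32 is preferred over u64 where the range allows (halves encoding size).
    lhs: u32,
    rhs: u32,
}

fn main() {
    let params = AddParams { lhs: 2, rhs: 3 };

    // Compact binary encoding ...
    let bytes: Vec<u8> = bitcode::encode(&params);

    // ... and symmetric decoding on the other side of the transport.
    let decoded: AddParams = bitcode::decode(&bytes).expect("valid bitcode payload");
    assert_eq!(decoded, params);
}
```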
Sources:
- Service definition patterns in example-muxio-rpc-service-definition
- Cargo.lock:158-168 (bitcode dependencies)
Chunking Strategy and Throughput
The max_chunk_size parameter controls how large payloads are split into multiple frames. Optimal chunk size balances latency, memory usage, and transport efficiency.
```mermaid
graph TB
    subgraph "Small Chunks (e.g., 1KB)"
        A1["Lower Memory\nPer Request"] --> A2["More Frames"]
        A2 --> A3["Higher CPU\nFraming Overhead"]
    end
    subgraph "Large Chunks (e.g., 64KB)"
        B1["Higher Memory\nPer Request"] --> B2["Fewer Frames"]
        B2 --> B3["Lower CPU\nFraming Overhead"]
    end
    subgraph "Optimal Range"
        C1["8KB - 16KB"] --> C2["Balance of\nMemory & CPU"]
    end
```
Chunk Size Selection
Performance Characteristics:
| Chunk Size | Latency | Memory | CPU | Best For |
|---|---|---|---|---|
| 1-2 KB | Excellent | Minimal | High overhead | Real-time, WASM |
| 4-8 KB | Very Good | Low | Moderate | Standard RPC |
| 16-32 KB | Good | Moderate | Low | Large payloads |
| 64+ KB | Fair | High | Minimal | Bulk transfers |
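The arithmetic behind the table is simple: a payload of N bytes is split into ceil(N / max_chunk_size) frames, plus per-frame protocol overhead. The helper below is an illustrative back-of-the-envelope calculation, not part of the crate's API.

```rust
// Illustrative only: how max_chunk_size determines the number of frames a
// payload is split into.
fn frame_count(payload_len: usize, max_chunk_size: usize) -> usize {
    payload_len.div_ceil(max_chunk_size)
}

fn main() {
    let payload_len = 1_000_000; // a 1 MB response
    for chunk_size in [1024, 8 * 1024, 64 * 1024] {
        println!(
            "chunk_size = {:>5} bytes -> {} frames",
            chunk_size,
            frame_count(payload_len, chunk_size)
        );
    }
}
```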
Sources:
- src/rpc/rpc_dispatcher.rs:230 (max_chunk_size parameter)
- src/rpc/rpc_dispatcher.rs:260-266 (encoder initialization with chunking)
Prebuffering vs Streaming
The system supports two payload transmission modes with different performance trade-offs.
```mermaid
graph TB
    subgraph "Prebuffered Mode"
        PB1["RpcRequest\nis_finalized=true"]
        PB2["Single write_bytes"]
        PB3["Immediate end_stream"]
        PB4["Low Latency\nHigh Memory"]
        PB1 --> PB2
        PB2 --> PB3
        PB3 --> PB4
    end
    subgraph "Streaming Mode"
        ST1["RpcRequest\nis_finalized=false"]
        ST2["Multiple write_bytes\ncalls"]
        ST3["Delayed end_stream"]
        ST4["Higher Latency\nLow Memory"]
        ST1 --> ST2
        ST2 --> ST3
        ST3 --> ST4
    end
```
Mode Comparison
Prebuffered Response Handling
The prebuffer_response flag controls whether response payloads are accumulated before delivery:
| Mode | Memory Usage | Latency | Use Case |
|---|---|---|---|
| prebuffer_response=true | Accumulates entire payload | Delivers complete response | Small responses, simpler logic |
| prebuffer_response=false | Streams chunks as received | Minimal per-chunk latency | Large responses, progress tracking |
Implementation Details:
- Prebuffering accumulates chunks in the prebuffered_responses HashMap
- The buffer is stored until RpcStreamEvent::End is received
- The handler is invoked once with the complete payload
- The buffer is immediately cleared after handler invocation
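A simplified model of this accumulation logic is sketched below. The event and buffer types are stand-ins for the crate's RpcStreamEvent and internal state, not its actual implementation.

```rust
use std::collections::HashMap;

// Stand-in for the subset of stream events relevant to prebuffering.
enum Event<'a> {
    PayloadChunk(u32, &'a [u8]),
    End(u32),
}

fn main() {
    // Per-request buffers keyed by request ID, mirroring prebuffered_responses.
    let mut prebuffered_responses: HashMap<u32, Vec<u8>> = HashMap::new();

    let events = [
        Event::PayloadChunk(7, b"hello ".as_slice()),
        Event::PayloadChunk(7, b"world".as_slice()),
        Event::End(7),
    ];

    for event in events {
        match event {
            Event::PayloadChunk(id, bytes) => {
                // Accumulate until the stream is finalized.
                prebuffered_responses.entry(id).or_default().extend_from_slice(bytes);
            }
            Event::End(id) => {
                // The handler runs once with the complete payload; removing the
                // entry clears the buffer immediately afterwards.
                if let Some(payload) = prebuffered_responses.remove(&id) {
                    println!("request {id}: delivered {} bytes in one handler call", payload.len());
                }
            }
        }
    }
}
```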
Sources:
- src/rpc/rpc_dispatcher.rs:233 (prebuffer_response parameter)
- src/rpc/rpc_dispatcher.rs:269-283 (prebuffered payload handling)
- src/rpc/rpc_internals/rpc_respondable_session.rs:26-27 (prebuffering state)
- src/rpc/rpc_internals/rpc_respondable_session.rs:115-147 (prebuffering logic)
Memory Management Patterns
```mermaid
graph LR
    A["Inbound Frames"] --> B["RpcDispatcher\nread_bytes"]
    B --> C["Mutex Lock"]
    C --> D["VecDeque\nrpc_request_queue"]
    D --> E["Push/Update/Delete\nOperations"]
    E --> F["Mutex Unlock"]
    G["Application"] --> H["get_rpc_request"]
    H --> C
```
Request Queue Design
The RpcDispatcher maintains an internal request queue using Arc<Mutex<VecDeque<(u32, RpcRequest)>>>. This design has specific performance implications:
Lock Contention Considerations:
| Operation | Lock Duration | Frequency | Optimization |
|---|---|---|---|
| read_bytes | Per-frame decode | High | Minimize work under lock |
| get_rpc_request | Read access only | Medium | Returns guard, caller controls lock |
| delete_rpc_request | Single element removal | Low | Uses VecDeque::remove |
Memory Overhead:
- Each in-flight request: ~100-200 bytes base + payload size
- VecDeque capacity grows as needed
- Payload bytes are accumulated until is_finalized=true
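The sketch below illustrates the "minimize work under lock" guidance from the table, using a stand-in request type in place of the crate's RpcRequest. It is a simplified model, not the dispatcher's actual implementation.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

// Simplified stand-in for the queue entry; the real queue stores
// (u32, RpcRequest) tuples.
struct RequestStub {
    payload: Vec<u8>,
}

fn main() {
    let rpc_request_queue: Arc<Mutex<VecDeque<(u32, RequestStub)>>> =
        Arc::new(Mutex::new(VecDeque::new()));

    // Do expensive work (e.g. frame decoding) outside the lock, then hold the
    // mutex only long enough to push the entry.
    let decoded = RequestStub { payload: vec![0u8; 512] };
    {
        let mut queue = rpc_request_queue.lock().expect("queue mutex poisoned");
        queue.push_back((1, decoded));
    } // guard dropped here, before any further processing

    // Reads follow the same pattern: copy out the minimum needed and release
    // the guard before doing anything expensive with it.
    let payload_len = {
        let queue = rpc_request_queue.lock().expect("queue mutex poisoned");
        queue.iter().find(|(id, _)| *id == 1).map(|(_, r)| r.payload.len())
    };
    println!("request 1 payload length: {payload_len:?}");
}
```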
Sources:
- src/rpc/rpc_dispatcher.rs:50 (rpc_request_queue declaration)
- src/rpc/rpc_dispatcher.rs:362-374 (read_bytes implementation)
- src/rpc/rpc_dispatcher.rs:381-394 (get_rpc_request with lock guard)
- src/rpc/rpc_dispatcher.rs:411-420 (delete_rpc_request)
Preventing Memory Leaks
The dispatcher must explicitly clean up completed or failed requests:
Critical Pattern:
- Request is added to the queue on the Header event
- Payload is accumulated on PayloadChunk events
- The request is finalized on the End event
- The application must call delete_rpc_request() to free memory
Failure to delete finalized requests causes unbounded memory growth.
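The self-contained model below shows why that cleanup step matters: removal after finalization keeps the queue bounded, while skipping it leaves every payload resident. It uses a stand-in request type rather than the crate's RpcRequest.

```rust
use std::collections::VecDeque;

// Stand-in for a finalized queue entry.
struct FinalizedRequest {
    id: u32,
    payload: Vec<u8>,
}

fn main() {
    let mut queue: VecDeque<FinalizedRequest> = VecDeque::new();

    for id in 0..1_000u32 {
        // Header event: an entry is created in the queue.
        queue.push_back(FinalizedRequest { id, payload: Vec::new() });

        // PayloadChunk events: bytes accumulate on the entry.
        if let Some(req) = queue.back_mut() {
            req.payload.extend_from_slice(&[0u8; 1024]);
        }

        // End event: the request is finalized and handled by the application.
        // Without the removal below (the analogue of delete_rpc_request()),
        // every 1 KB payload would stay resident and the queue would grow
        // without bound.
        if let Some(pos) = queue.iter().position(|req| req.id == id) {
            let _ = queue.remove(pos); // mirrors the VecDeque::remove noted above
        }
    }

    assert!(queue.is_empty(), "all finalized requests were cleaned up");
}
```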
Sources:
- src/rpc/rpc_dispatcher.rs:121-141 (request creation)
- src/rpc/rpc_dispatcher.rs:144-169 (payload accumulation)
- src/rpc/rpc_dispatcher.rs:171-185 (finalization)
- src/rpc/rpc_dispatcher.rs:411-420 (cleanup)
```mermaid
graph LR
    A["next_rpc_request_id\nu32 counter"] --> B["increment_u32_id()"]
    B --> C["Assign to\nRpcHeader"]
    C --> D["Store in\nresponse_handlers"]
    D --> E["Match on\ninbound response"]
```
Request Correlation Overhead
Each outbound request is assigned a unique u32 ID for response correlation. The system uses monotonic ID generation with wraparound.
ID Generation Strategy
Performance Characteristics:
| Aspect | Cost | Justification |
|---|---|---|
| ID generation | Minimal (single addition) | u32::wrapping_add(1) |
| HashMap insertion | O(1) average | response_handlers.insert() |
| Response lookup | O(1) average | response_handlers.get_mut() |
| Memory per handler | ~24 bytes + closure size | Box<dyn FnMut> overhead |
Concurrency Considerations:
- next_rpc_request_id is NOT thread-safe
- Each client connection should have its own RpcDispatcher
- Sharing a dispatcher across threads requires external synchronization
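A minimal sketch of wrapping u32 ID generation, reflecting the single-addition cost noted in the table above. increment_id here is an illustrative stand-in for the crate's increment_u32_id helper, which may differ in detail.

```rust
// Monotonic ID generation with u32 wraparound.
fn increment_id(current: &mut u32) -> u32 {
    *current = current.wrapping_add(1);
    *current
}

fn main() {
    let mut next_rpc_request_id: u32 = u32::MAX - 1;
    assert_eq!(increment_id(&mut next_rpc_request_id), u32::MAX);
    assert_eq!(increment_id(&mut next_rpc_request_id), 0); // wraps instead of panicking
}
```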
Sources:
- src/rpc/rpc_dispatcher.rs:42 (next_rpc_request_id field)
- src/rpc/rpc_dispatcher.rs:241-242 (ID assignment)
- src/rpc/rpc_internals/rpc_respondable_session.rs:24 (response_handlers HashMap)
```mermaid
graph TB
    A["Connection\nClosed"] --> B["fail_all_pending_requests"]
    B --> C["std::mem::take\nresponse_handlers"]
    C --> D["For each handler"]
    D --> E["Create\nRpcStreamEvent::Error"]
    E --> F["Invoke handler\nwith error"]
    F --> G["Drop handler\nboxed closure"]
```
Handler Cleanup and Backpressure
Failed Request Handling
When a transport connection drops, all pending response handlers must be notified to prevent resource leaks and hung futures:
Implementation:
The fail_all_pending_requests() method takes ownership of all handlers and invokes them with an error event. This ensures:
- Awaiting futures are woken with error result
- Callback memory is freed immediately
- No handlers remain registered after connection failure
Performance Impact:
- Invocation cost: O(n) where n = number of pending requests
- Each handler invocation is synchronous
- Memory freed immediately after iteration
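The self-contained sketch below models this drain-and-invoke pattern with std::mem::take. The handler signature and error type are simplified stand-ins for the crate's boxed callbacks and RpcStreamEvent::Error.

```rust
use std::collections::HashMap;

// Simplified handler type; the real callbacks receive RPC stream events.
type ResponseHandler = Box<dyn FnMut(Result<Vec<u8>, String>)>;

struct Session {
    response_handlers: HashMap<u32, ResponseHandler>,
}

impl Session {
    fn fail_all_pending_requests(&mut self, reason: &str) {
        // Take ownership of every registered handler in one move, leaving an
        // empty map behind so no handlers remain registered afterwards.
        let handlers = std::mem::take(&mut self.response_handlers);

        // O(n) over pending requests: each handler is invoked synchronously
        // with an error, then its boxed closure is dropped, freeing memory.
        for (_request_id, mut handler) in handlers {
            handler(Err(reason.to_string()));
        }
    }
}

fn main() {
    let mut session = Session { response_handlers: HashMap::new() };
    session.response_handlers.insert(
        1,
        Box::new(|result: Result<Vec<u8>, String>| {
            println!("request 1 resolved with {result:?}");
        }),
    );

    session.fail_all_pending_requests("transport connection closed");
    assert!(session.response_handlers.is_empty());
}
```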
Sources:
- src/rpc/rpc_dispatcher.rs:428-456 (fail_all_pending_requests implementation)
Benchmarking with Criterion
The codebase uses criterion for performance benchmarking; benchmarks are run with cargo bench.
Benchmark Structure
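A typical criterion benchmark has the shape shown below. The benchmark name and the measured closure are illustrative placeholders rather than the repository's actual benchmark suite.

```rust
// Minimal criterion benchmark sketch; run with `cargo bench`.
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

fn bench_payload_processing(c: &mut Criterion) {
    // 8 KB payload, inside the recommended chunk-size range above.
    let payload = vec![0u8; 8 * 1024];

    c.bench_function("process_8kb_payload", |b| {
        b.iter(|| {
            // Substitute the real encode/dispatch path being measured here.
            black_box(payload.iter().fold(0u64, |acc, &byte| acc + byte as u64))
        })
    });
}

criterion_group!(benches, bench_payload_processing);
criterion_main!(benches);
```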
Key Metrics to Track:
| Metric | What It Measures | Target |
|---|---|---|
| Throughput | Bytes/sec processed | Maximize |
| Latency | Time per operation | Minimize |
| Allocation Rate | Heap allocations | Minimize |
| Frame Overhead | Protocol bytes vs payload | < 5% |
Sources:
- Cargo.lock:317-338 (criterion dependency)
- example-muxio-ws-rpc-app benchmarks
Platform-Specific Optimizations
Native (Tokio) vs WASM
Platform Tuning:
| Platform | Chunk Size | Buffer Strategy | Concurrency |
|---|---|---|---|
| Native (Tokio) | 8-16 KB | Reuse buffers | Multiple connections |
| WASM (Browser) | 2-4 KB | Small allocations | Single connection |
| Native (Server) | 16-32 KB | Pre-allocated pools | Connection pooling |
WASM-Specific Considerations:
- JavaScript boundary crossings have a cost (~1-5 μs per call)
- Minimize calls to wasm-bindgen functions
- Use larger RPC payloads to amortize overhead
- Prebuffer responses when possible to reduce event callbacks
Sources:
- muxio-tokio-rpc-client vs muxio-wasm-rpc-client crate comparison
- Cargo.lock:935-953 (WASM dependencies)
Profiling and Diagnostics
Tracing Integration
The system uses tracing for instrumentation. Enable logging to identify bottlenecks:
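For example, on native targets a tracing-subscriber setup along the following lines surfaces the trace points listed below. This assumes the tracing-subscriber crate with its env-filter feature; the exact configuration used by the examples may differ.

```rust
use tracing_subscriber::EnvFilter;

fn main() {
    tracing_subscriber::fmt()
        // e.g. RUST_LOG=debug (or a more targeted directive) enables the
        // dispatcher trace points; fall back to "info" if RUST_LOG is unset.
        .with_env_filter(
            EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info")),
        )
        .init();

    tracing::info!("tracing initialized; RPC calls will now emit spans and events");
}
```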
Key Trace Points:
| Location | Event | Performance Insight |
|---|---|---|
| RpcDispatcher::call | Request initiation | Call frequency, payload sizes |
| read_bytes | Frame processing | Decode latency, lock contention |
| Handler callbacks | Response processing | Handler execution time |
Sources:
- src/rpc/rpc_dispatcher.rs:12 (tracing import)
- src/rpc/rpc_dispatcher.rs:98 (#[instrument] macro usage)
Detecting Performance Issues
Tooling: see the coverage and module-analysis references below.
Sources:
- DRAFT.md:29-31 (coverage tooling)
- DRAFT.md:34-40 (module analysis)
Best Practices Summary
| Optimization | Technique | Impact |
|---|---|---|
| Minimize allocations | Reuse buffers, use Vec::with_capacity | High |
| Choose optimal chunk size | 8-16 KB for typical RPC | Medium |
| Prebuffer small responses | Enable prebuffer_response for responses under ~64 KB | Medium |
| Clean up completed requests | Call delete_rpc_request() promptly | High |
| Use fixed-size types | Prefer [u8; N] over Vec<u8> in hot paths | Low |
| Profile before optimizing | Use criterion + flamegraph | Critical |
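As an example of the "minimize allocations" row, a scratch buffer can be allocated once with Vec::with_capacity and cleared between frames. The snippet below is an illustrative pattern, not code from the crate.

```rust
// Reuse one buffer across a hot encode loop instead of allocating per frame.
fn main() {
    let mut scratch: Vec<u8> = Vec::with_capacity(16 * 1024);

    for frame_no in 0u32..4 {
        scratch.clear(); // keeps the existing allocation
        scratch.extend_from_slice(&frame_no.to_le_bytes());
        scratch.extend_from_slice(&[0u8; 1024]); // stand-in payload bytes

        // Hand `scratch` to the transport here; capacity stays at 16 KB.
        println!(
            "frame {frame_no}: {} bytes written, capacity {}",
            scratch.len(),
            scratch.capacity()
        );
    }
}
```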
Sources:
- Performance patterns observed across src/rpc/rpc_dispatcher.rs
- Memory management in src/rpc/rpc_internals/rpc_respondable_session.rs