4. Architecture Decision Records¶
4.1. ADR-001: Use mimalloc as Default Memory Allocator¶
4.1.1. Status¶
Accepted
Date: 2026-03-06
4.1.2. Context¶
The Classic Diagnostic Adapter requires efficient memory allocation to handle diagnostic communication workloads. The choice of memory allocator significantly impacts both runtime performance and memory footprint. Two options were evaluated:
mimalloc: A performance-oriented general-purpose allocator developed by Microsoft Research, which uses arena-based pooling strategies
System allocator: The native platform allocator (macOS system allocator in this evaluation)
Profiling was conducted on macOS / Apple Silicon (arm64) using Xcode Instruments to compare the two allocators under realistic workload conditions.
4.1.3. Decision¶
We will use mimalloc as the default memory allocator for the Classic Diagnostic Adapter.
The performance advantages of mimalloc outweigh the increased memory overhead. While the system allocator is more memory-efficient, mimalloc's ~29% speed advantage in CPU time (3.35 s vs 4.33 s) is critical for diagnostic operations, where response time directly affects user experience and system throughput.
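For illustration, a minimal sketch of how a Rust binary swaps its default allocator. To stay dependency-free, it registers a hypothetical `CountingAlloc` wrapper around the system allocator; the assumption is that the CDA would register the allocator type exported by the `mimalloc` crate in the same `#[global_allocator]` slot. The counter and the helper function are illustrative names, not part of the CDA.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards to the system allocator and counts calls.
/// With the `mimalloc` crate, its allocator type would be registered instead.
pub struct CountingAlloc;

/// Number of `alloc` calls observed since process start.
pub static ALLOC_CALLS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// A single `#[global_allocator]` per binary routes every heap allocation
// (Box, Vec, String, ...) through the registered allocator.
#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Allocates `n` boxed values and returns how many `alloc` calls were
/// observed while doing so (other threads may add to the count).
pub fn count_allocations(n: usize) -> usize {
    let before = ALLOC_CALLS.load(Ordering::Relaxed);
    let boxes: Vec<Box<u64>> = (0..n as u64).map(Box::new).collect();
    let after = ALLOC_CALLS.load(Ordering::Relaxed);
    drop(boxes);
    after - before
}
```

Because the allocator is selected in one place, switching back to the system allocator for memory-constrained builds only requires changing or feature-gating that single static.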
4.1.4. Rationale¶
4.1.4.1. Performance Comparison¶
Comprehensive profiling revealed the following metrics:
| Metric | mimalloc | System (macOS) | Winner |
|---|---|---|---|
| CPU Time | 3.35 s | 4.33 s | mimalloc (29% faster) |
| Real Memory | 400.22 MiB | 329.78 MiB | System (17% less) |
| Heap & Anon VM | 1.24 GiB | 434.01 MiB | System (65% less) |
| Total Allocated | 4.29 GiB | 3.19 GiB | System (26% less) |
| Allocation Count | 7,896 | 2,273,227 | mimalloc (287x fewer) |
| Persistent Allocs | 1,084 | 202,138 | mimalloc (186x fewer) |
| Dirty Memory | 60.73 MiB | 60.01 MiB | ~ Tie |
| Thread Count | 15 | 15 | Tie |
| Fragmentation Ratio | 0.34% / 0.83% | 0.16% / 1.02% | Mixed |
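As a worked check of the comparison figures, the following sketch recomputes them from the raw measurements. The two helper names are illustrative. Note that "29% faster" corresponds to the ratio 4.33 / 3.35 ≈ 1.29, while the same measurements expressed as a reduction in CPU time come to roughly 23%.

```rust
/// Relative reduction of `smaller` with respect to `larger`, in percent.
pub fn percent_less(smaller: f64, larger: f64) -> f64 {
    (larger - smaller) / larger * 100.0
}

/// How many times `larger` exceeds `smaller`.
pub fn ratio(larger: f64, smaller: f64) -> f64 {
    larger / smaller
}

/// CPU time: the system allocator needs about 1.29x the time of mimalloc.
pub fn cpu_time_ratio() -> f64 {
    ratio(4.33, 3.35)
}

/// Real memory: the system allocator uses about 17.6% less than mimalloc.
pub fn real_memory_saving() -> f64 {
    percent_less(329.78, 400.22)
}

/// Allocation count: the system allocator records about 288x as many.
pub fn allocation_count_ratio() -> f64 {
    ratio(2_273_227.0, 7_896.0)
}
```

Keeping "x% less" and "y times fewer" as separate helpers avoids mixing the two conventions when the numbers are quoted elsewhere.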
4.1.4.2. mimalloc Advantages¶
Execution Speed: ~29% faster execution time (3.35s vs 4.33s)
Fewer, larger allocations with arena-based pooling reduce per-allocation overhead
Critical for diagnostic operations requiring low latency
Reduced System Calls: Drastically fewer allocation syscalls (~8K vs ~2.3M)
Batches allocations into large arenas, reducing kernel interaction
Approximately 287x fewer allocations significantly reduces context switching overhead
186x fewer persistent allocations simplify memory management
4.1.4.3. System Allocator Advantages¶
Virtual Memory Usage: ~65% less virtual memory (434 MiB vs 1.24 GiB)
Does not pre-reserve large arenas; allocates only what is needed
More conservative approach to address space usage
Resident Memory: ~17% lower physical memory footprint (329.78 MiB vs 400.22 MiB)
Tighter memory utilization for current working set
Total Allocation Efficiency: ~26% less total bytes allocated (3.19 GiB vs 4.29 GiB)
Tighter lifetime tracking and faster release to OS
4.1.4.4. Trade-offs¶
mimalloc trades memory for speed by pre-allocating large memory areas (arenas). This leads to faster throughput but higher memory overhead. The system allocator is more memory-efficient but pays for it with more frequent, fine-grained allocations and ~1 second slower total runtime.
For the Classic Diagnostic Adapter use case:
Performance is prioritized over memory efficiency in typical deployment scenarios
Diagnostic operations are latency-sensitive
Alternative memory optimization strategies exist (e.g., mmap for mdd files and other large data structures)
4.1.5. Consequences¶
4.1.5.1. Positive¶
Improved Response Times: 29% faster execution directly improves diagnostic operation latency
Reduced System Overhead: 287x fewer allocation calls minimize kernel involvement and context switching
Better Throughput: Arena-based pooling enables handling of concurrent diagnostic sessions more efficiently
Predictable Performance: Pre-allocated arenas provide more consistent allocation times
4.1.5.2. Negative¶
Higher Memory Footprint: ~2.9x the virtual memory (1.24 GiB vs 434 MiB) and ~21% more physical memory (400.22 MiB vs 329.78 MiB)
Increased Total Allocations: ~34% more bytes allocated over time (4.29 GiB vs 3.19 GiB) due to the arena pre-allocation strategy
4.1.5.3. Mitigation Strategies¶
The memory overhead can be mitigated through:
mmap Usage: Large data structures can use memory-mapped files to reduce heap pressure. Further optimizations, such as pooling memory for the databases, could bring mimalloc's memory overhead closer to that of the system allocator while preserving its performance benefits.
Memory Pooling and LRU Caching: Custom pooling strategies for frequently used data structures can further reduce memory usage while retaining mimalloc’s performance advantages.
4.1.6. Alternatives Considered¶
4.1.6.1. System Allocator¶
The native platform allocator was evaluated as the primary alternative. While it offers superior memory efficiency (17-65% less memory usage), the performance penalty (~29% slower execution and 287x more allocation syscalls) makes it unsuitable as the default choice.
The system allocator remains a viable option for:
Extremely memory-constrained embedded deployments
Scenarios where memory footprint is more critical than latency
Development/debugging when allocator-specific behavior needs to be isolated
Further memory optimizations in the CDA may make this fallback unnecessary, since they are expected to save more memory than a switch to the system allocator would, without incurring its performance cost.
4.1.6.2. Other Allocators¶
Other allocators such as jemalloc or tcmalloc were not formally evaluated in this decision.
4.1.7. References¶
Profiling conducted using Xcode Instruments on macOS / Apple Silicon (arm64)
4.2. ADR-002: mbedTLS as Alternative TLS Backend for DoIP¶
4.2.1. Status¶
Experimental
Date: 2026-03-23
4.2.2. Context¶
The Classic Diagnostic Adapter (CDA) uses TLS-secured DoIP (Diagnostics over IP) connections to communicate with ECUs. The default TLS backend is OpenSSL via the openssl and tokio-openssl Rust crates.
OpenSSL does not implement the record_size_limit TLS extension (RFC 8449). Some ECUs require this extension and close the connection when the peer sends records exceeding the limit they were trying to negotiate.
Since OpenSSL neither advertises nor honours the extension, a stable connection with such ECUs cannot be guaranteed. See openssl/openssl#27916 and auroralabs-loci/openssl#342 for the upstream status.
mbedTLS 4.0.0 supports record_size_limit, but only for TLS 1.3. Since DoIP connections commonly use TLS 1.2, a patch is needed to extend this support.
Additionally, mbedTLS 4.0.0 has no Ed25519 (PureEdDSA) signature support, which is required by some ECUs. An additional patch adds this capability via a custom PSA accelerator driver.
Upstreaming the Ed25519 changes is not feasible at present: the patch in this repository covers only the parts required for ECU communication and would most likely need to be extended to be acceptable upstream. The mbedTLS maintainers have Ed25519 support on their roadmap, but with no concrete date so far. Once upstream supports it natively, the patch can simply be dropped.
4.2.3. Decision¶
Add mbedTLS 4.0.0 as an optional, feature-gated TLS backend in the comm-mbedtls module. The module is structured as two Rust crates:
mbedtls-sys – downloads, patches, and compiles mbedTLS from source; generates FFI bindings via bindgen.
mbedtls-rs – safe Rust wrapper exposing synchronous and asynchronous (Tokio) TLS streams, X.509 certificate handling, and TLS configuration via a builder API.
Two patches are applied to upstream mbedTLS 4.0.0 at build time (see comm-mbedtls/mbedtls-sys/patches/README.md for details):
record-size-limit-tls12.patch – extends the existing TLS 1.3 record_size_limit implementation to TLS 1.2 (RFC 8449).
ed25519-psa-driver.patch – adds Ed25519 support to mbedTLS across the PSA crypto, PK, X.509, and TLS 1.2 layers, which upstream mbedTLS 4.0.0 does not provide. The actual Ed25519 cryptographic operations are performed in Rust (ed25519-dalek) and called from C via a PSA accelerator driver FFI bridge.
4.2.3.1. Backend Selection¶
The TLS backend is selected at compile time via Cargo feature flags:
| Feature | Default | Effect |
|---|---|---|
| | yes | Use OpenSSL |
| | no | Use mbedTLS (experimental) |
If both features are enabled, OpenSSL takes precedence. The selection is enforced through #[cfg] guards in cda-comm-doip; there is no runtime dispatch or shared TLS provider trait. Both backends produce streams implementing Tokio’s AsyncRead + AsyncWrite, which is sufficient for the generic DoIPConnection<T> transport layer.
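The precedence rule can be sketched as follows. The feature names `hypothetical-openssl` and `hypothetical-mbedtls` are placeholders, since the actual flag names are not given here, and the sketch uses the `cfg!` macro so the rule can be expressed as an ordinary function. The CDA itself is described as using `#[cfg]` attribute guards, which remove the unselected backend from the build entirely.

```rust
/// TLS backends named in this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TlsBackend {
    OpenSsl,
    MbedTls,
}

/// Precedence rule from the text: if both backends are enabled, OpenSSL wins.
/// Returns `None` when neither backend is enabled.
pub const fn select_backend(openssl: bool, mbedtls: bool) -> Option<TlsBackend> {
    if openssl {
        Some(TlsBackend::OpenSsl)
    } else if mbedtls {
        Some(TlsBackend::MbedTls)
    } else {
        None
    }
}

/// Backend resolved at compile time.
/// Both feature names are hypothetical placeholders, not the CDA's real flags.
pub const COMPILED_BACKEND: Option<TlsBackend> = select_backend(
    cfg!(feature = "hypothetical-openssl"),
    cfg!(feature = "hypothetical-mbedtls"),
);
```

Because the choice is a compile-time constant, no runtime dispatch is needed; code that only requires a stream type can stay generic over whatever the selected backend produces.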
4.2.4. Consequences¶
4.2.4.1. Positive¶
ECUs requiring the record_size_limit extension can be supported without potentially unstable workarounds (such as limiting the maximum transfer size to a value smaller than the limit reported by the extension).
mbedTLS is statically linked with no system-level dependency, simplifying cross-compilation and embedded deployment.
The module is self-contained and designed to be extractable into its own repository once deemed stable.
4.2.4.2. Risks and Tradeoffs¶
Experimental status. The mbedTLS backend has not undergone a dedicated security review.
Custom patches. Two patches against upstream mbedTLS must be maintained. Future mbedTLS releases may incorporate these features natively, at which point the patches can be dropped.
Build complexity. The mbedtls-sys build script downloads a source tarball at build time (unless MBEDTLS_DIR is set), applies patches, and invokes CMake. Offline builds require pre-fetching the source.
Additional licenses. Building mbedtls-sys from source adds ureq and bzip2 as build dependencies. These bring in additional licenses, which were explicitly allowed for those crates. This can lead to additional maintenance effort when updating dependencies.
4.2.4.3. Future Direction¶
Promote to production-ready, potentially after a security review and broader ECU testing.
Extract comm-mbedtls into a standalone crate/repository.
Unify TLS configuration so cipher suites, curves, and signature algorithms are driven by the CDA config file rather than compile-time constants.
Drop patches if/when upstream mbedTLS gains native Ed25519 and TLS 1.2 record_size_limit support, or when OpenSSL adds record_size_limit.
4.2.5. References¶
RFC 8032 – Edwards-Curve Digital Signature Algorithm (EdDSA)
comm-mbedtls/mbedtls-sys/patches/README.md – detailed patch documentation
4.3. ADR-003: Memory-Map Uncompressed MDD Files for FlatBuffers Access¶
4.3.1. Status¶
Accepted
Date: 2026-03-10
4.3.2. Context¶
The Classic Diagnostic Adapter loads ECU diagnostic databases stored as MDD files. Each MDD file is a protobuf container whose chunks hold FlatBuffers data compressed with LZMA. At startup every MDD file must be read, the protobuf parsed, the FlatBuffers payload decompressed, and the resulting data kept available for the lifetime of the process.
The main target platform is Linux (onboard automotive ECUs), where RAM is limited and the kernel may reclaim memory aggressively under pressure, including file-backed pages held in the page cache.
Three strategies were evaluated:
Heap – decompress into heap-allocated Vec<u8> buffers.
MmapSidecar – decompress into separate .fb sidecar files next to the MDD files, then memory-map those sidecar files.
MmapMdd (in-place) – decompress the MDD files themselves once (e.g. during a software update), rewriting them with uncompressed chunk data, then memory-map the MDD files directly with zero-copy protobuf decoding.
4.3.3. Decision¶
We will use the MmapMdd (in-place) strategy: MDD files are decompressed
once and are subsequently used read-only via
mmap. The protobuf layer uses prost’s Bytes support
(Bytes::from_owner(mmap)) so that chunk data fields are zero-copy slices
into the memory-mapped file – no heap allocation is required for the
FlatBuffers payload.
Before the atomic rename of a rewritten MDD file, the written data is verified by re-parsing the temporary file and comparing SHA-512 checksums of every chunk against the expected values.
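A minimal sketch of the verify-then-rename pattern, under stated assumptions: the digest is a 64-bit FNV-1a stand-in rather than SHA-512 (to keep the example dependency-free), verification covers the whole file rather than individual chunks, and the function name and `.tmp` naming are hypothetical.

```rust
use std::fs::{self, File};
use std::io::{self, Read, Write};
use std::path::Path;

/// Stand-in digest (64-bit FNV-1a). The ADR specifies SHA-512; this is a
/// placeholder only and is not suitable for real integrity checks.
pub fn digest(data: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &byte in data {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Writes `data` to a sibling temporary file, re-reads and verifies it,
/// and only then renames it over `target`.
pub fn replace_verified(target: &Path, data: &[u8]) -> io::Result<()> {
    // Hypothetical naming: same path with the extension replaced by `.tmp`.
    let tmp = target.with_extension("tmp");
    let expected = digest(data);

    let mut file = File::create(&tmp)?;
    file.write_all(data)?;
    file.sync_all()?; // make the temporary file durable before verifying
    drop(file);

    // Verify what actually landed on disk, not the in-memory buffer.
    let mut written = Vec::new();
    File::open(&tmp)?.read_to_end(&mut written)?;
    if digest(&written) != expected {
        fs::remove_file(&tmp)?;
        return Err(io::Error::new(io::ErrorKind::InvalidData, "checksum mismatch"));
    }

    // On POSIX systems, a rename within one filesystem replaces the target atomically.
    fs::rename(&tmp, target)
}
```

If the process is interrupted before the rename, the original file is untouched and only a stray temporary file remains to be cleaned up.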
4.3.4. Rationale¶
4.3.4.1. Performance Comparison¶
Benchmarking was conducted on Linux 6.18.2-arch2-1 (x86_64, i5-7200U CPU)
with 32 GB RAM using 68 MDD files (~47 MB compressed, 242 MB uncompressed),
Rust 1.92.0, --release profile, ~3 minutes idle warm-up, and swap disabled.
| Strategy | Idle RSS (kB) | RSS Under Pressure (kB) | Disk Usage |
|---|---|---|---|
| Heap (baseline) | 486,900 | 469,320 | 47 MB |
| MmapSidecar | 307,552 | 171,904 | ~282 MB |
| MmapMdd (in-place) | 152,780 | 118,988 | 242 MB |
Note
The MmapMdd implementation uses memmap2::Advice::Random
(MADV_RANDOM) immediately after mmap() to disable read-ahead for
the sparse FlatBuffers vtable lookups that dominate runtime access. This
avoids a libc dependency – the hint is set directly via memmap2
before ownership is transferred to Bytes::from_owner().
4.3.4.2. MmapMdd Advantages over Heap¶
RSS under pressure: -75 % (119 MB vs 469 MB)
All FlatBuffers data is backed by the MDD file on disk. Under memory pressure the kernel cleanly drops those pages and re-reads them on demand; no swap I/O is required. With the heap strategy, anonymous pages can only be compressed or swapped, which incurs significant I/O overhead for a modest -3.7 % reduction.
Idle RSS: -69 % (153 MB vs 487 MB)
The zero-copy protobuf decode (Bytes::from_owner(mmap)) avoids copying every bytes field to the heap. Chunk data fields are slices into the mmap, so there is no second copy of the decompressed data in memory.
Setting MADV_RANDOM via memmap2 prevents the kernel from prefetching adjacent pages during random-access FlatBuffers queries, keeping idle RSS well below the heap baseline.
4.3.4.3. MmapMdd Advantages over MmapSidecar¶
Simpler file management
No additional .fb sidecar files to create, track, or clean up. The MDD files are the single source of truth. This eliminates an entire class of consistency bugs (stale sidecar, missing sidecar, partial write).
Lower RSS under pressure (119 MB vs 172 MB)
The in-place strategy benefits from zero-copy protobuf decoding (Bytes::from_owner) which the sidecar approach did not use. All data – protobuf metadata and FlatBuffers payloads – lives in the single mmap, giving the kernel a unified region to evict.
Lower idle RSS (153 MB vs 308 MB)
Zero-copy decoding avoids duplicating chunk data on the heap, resulting in 50 % lower idle RSS than the sidecar approach.
Less extra disk space (+195 MB vs +235 MB)
Sidecar files duplicated the FlatBuffers payload alongside the original compressed MDD. In-place rewriting replaces the compressed data, so the growth is only the difference between compressed and uncompressed sizes.
4.3.4.4. Runtime CPU Performance (perf Profiling)¶
In addition to the RSS benchmarks above, perf profiling was conducted under
a realistic end-to-end workload on the target Linux system to compare the
MmapMdd implementation against the main (heap/compressed) branch.
Test setup
Platform: Linux target (i5-7200U), Release build
Workload: CDA started, 20s idle warm-up, then perf attached for profiling, followed by filling ~54 GB of memory with garbage data (two 27 GB bytearray allocations in parallel to induce memory pressure), then a full ECU flash session via DoIP.
Tool: perf stat attached to the running process (after warm-up) with events: cycles, instructions, faults, cache-references, cache-misses.
| Metric | Main (compressed) | MmapMdd (decompressed) | Delta |
|---|---|---|---|
| cycles | 448,473,942 | 437,102,422 | -2.5 % |
| instructions | 194,934,031 | 194,316,925 | -0.3 % |
| IPC (insn/cycle) | 0.43 | 0.44 | +2.3 % |
| page faults | 340 | 1,206 | +255 % (see note) |
| cache-references | 25,838,457 | 26,031,415 | +0.7 % |
| cache-misses | 18,142,040 (70.21 %) | 18,478,329 (70.98 %) | +0.77 pp |
| wall time | 42.19 s | 42.19 s | negligible |
Note
The higher page-fault count in MmapMdd (1,206 vs 340) reflects the kernel mapping mmap pages on first access, rather than heap pages that were already faulted in at startup. The absolute numbers are negligible (< 1,500 faults over ~42 s) and have no measurable impact on wall time.
Runtime profiling conclusions
Under realistic ECU-flash load with concurrent memory pressure both implementations are effectively equivalent in CPU efficiency (within 2.5 % of each other) and identical in wall time. The workload is dominated by network I/O (DoIP) rather than database access, so the expected RSS savings of MmapMdd (-75 % under pressure) are realized without any runtime CPU regression.
perf report call-graph analysis confirmed that the top hotspots
(alloc::vec::in_place_collect, flatbuffers::vtable::VTable::get,
cda_database::datatypes::DiagService::find_request, mimalloc
internals) are present in both branches with similar weights, confirming
that no new hot paths were introduced by the MmapMdd implementation.
4.3.4.5. Trade-offs¶
Disk usage increases: MDD files grow from ~47 MB to 242 MB (~5.1x). This is a one-time cost during the software update and is acceptable on the target platform where storage is less constrained than RAM.
MDD files are modified: The original compressed MDD files are replaced with uncompressed versions. This is acceptable because:
Decompression is intended to happen once during a controlled update step, not at runtime. The update plugin will implement this at a later point; for now, the CDA performs the decompression itself at runtime.
SHA-512 verification ensures data integrity before the atomic rename.
4.3.5. Consequences¶
4.3.5.1. Positive¶
75 % RSS reduction under memory pressure compared to the heap baseline (119 MB vs 469 MB), critical for embedded Linux targets with limited RAM.
69 % lower idle RSS (153 MB vs 487 MB) due to zero-copy protobuf decoding and MADV_RANDOM via memmap2 to suppress wasteful read-ahead during sparse FlatBuffers lookups.
Zero-copy data path: mmap -> Bytes -> FlatBuffers – no intermediate heap allocations for the diagnostic payload.
Single file, single source of truth: no sidecar files to manage, eliminating consistency and cleanup issues.
Atomic, verified writes: SHA-512 checksums and temp-file + rename ensure data integrity even if the update is interrupted.
Read-only at runtime: after the initial update, MDD files are opened read-only, compatible with read-only filesystems or integrity-checked partitions.
No libc dependency: MADV_RANDOM is set via memmap2 before ownership transfer, avoiding the need for direct libc::madvise() calls.
4.3.5.2. Negative¶
5.1x disk usage increase for the MDD database directory.
One-time decompression cost, e.g. during a software update or at first startup.
Platform dependency: relies on OS-level mmap, page cache behaviour, and madvise(2) support (POSIX systems). The madvise call is guarded by a cfg flag, so the CDA still works on platforms without MADV_RANDOM support (e.g. Windows), possibly with higher idle RSS.
4.3.6. Alternatives Considered¶
4.3.6.1. Heap (Baseline)¶
Decompress FlatBuffers data into heap-allocated Vec<u8> buffers. Simplest
implementation but RSS remains high (~487 MB idle, ~469 MB under pressure).
Anonymous heap pages cannot be cleanly evicted by the kernel – they must be
compressed or swapped, incurring I/O overhead. Unsuitable for
memory-constrained targets.
4.3.6.2. Separate Flatbuffer file (Sidecar)¶
Decompress into separate .fb files and memory-map those. Achieves good
pressure behaviour (~172 MB) but introduces additional file management
complexity: sidecar files must be created, kept in sync with MDD files, and
cleaned up on updates. Uses more disk space (+235 MB) because both compressed
MDD and uncompressed sidecar exist side by side. The sidecar approach was
prototyped and benchmarked but rejected in favour of the simpler in-place
strategy.
4.3.7. References¶
4.4. ADR-004: Binary WAL Format for Crash-Safe Storage Transactions¶
4.4.1. Status¶
Accepted
Date: 2026-05-19
4.4.2. Context¶
The cda-storage crate provides a crash-safe, transactional storage backend
for diagnostic data (MDD files, configuration). Mutations are journaled to a
Write-Ahead Log (WAL) before being applied to the filesystem. The WAL must:
Record operations durably before they are applied.
Support crash recovery: detect incomplete transactions and roll back partial commits.
Minimize I/O overhead on flash-based storage where write amplification matters.
The key decision is the WAL’s on-disk encoding: binary (e.g., rkyv, wincode) vs text-based (e.g., JSON, TOML).
4.4.3. Decision¶
Use a binary WAL format with rkyv (zero-copy deserialization) for operation payloads and CRC32 checksums per entry.
4.4.3.1. Commit Strategy¶
The WAL uses a one-phase commit with checksums (1PC+C) strategy:
Operations are appended to the WAL during the transaction without fsync.
Before applying, the header status is flipped from RECORDING to COMMITTING via an in-place write to the file header.
A single fsync makes both the status change and all entries durable.
Operations are applied to the filesystem.
The WAL file is deleted (point of no return).
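The ordering of these steps can be sketched as below. This is a simplified illustration, not the cda-storage implementation: `run_transaction` and its `apply` callback are hypothetical names, the header and entry checksums are left out (zeros stand in for them), and entries are raw byte slices rather than rkyv-serialized operations.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

const MAGIC: u8 = 0xCA;
const RECORDING: u8 = 0x00;
const COMMITTING: u8 = 0x01;

/// Runs one transaction in the order described above.
/// `apply` stands in for the filesystem mutations.
pub fn run_transaction<F>(wal_path: &Path, entries: &[&[u8]], apply: F) -> io::Result<()>
where
    F: FnOnce() -> io::Result<()>,
{
    let mut wal = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(wal_path)?;

    // 8-byte header with status RECORDING; reserved bytes and CRC are zeroed here.
    wal.write_all(&[MAGIC, RECORDING, 0, 0, 0, 0, 0, 0])?;

    // Step 1: append entries without fsync (checksum field zeroed in this sketch).
    for entry in entries {
        wal.write_all(&[0u8; 4])?;
        wal.write_all(&(entry.len() as u32).to_le_bytes())?;
        wal.write_all(entry)?;
    }

    // Step 2: flip the status byte in place.
    wal.seek(SeekFrom::Start(1))?;
    wal.write_all(&[COMMITTING])?;

    // Step 3: a single fsync covers both the entries and the status change.
    wal.sync_all()?;

    // Step 4: apply the operations to the filesystem.
    apply()?;

    // Step 5: delete the WAL; the transaction is complete.
    drop(wal);
    fs::remove_file(wal_path)
}
```

If the process dies before step 3 completes, recovery may find a RECORDING header or a truncated COMMITTING log, which is why per-entry checksums matter in the real format.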
4.4.3.2. On-Disk Format¶
[u8 magic][u8 status][u16 reserved][u32 header_crc32] [u32 crc32][u32 len][payload] ...
|-------------- 8-byte file header -----------------| |------ per-entry data -----|
File header (8 bytes): magic 0xCA, status byte (0x00 = recording, 0x01 = committing), 2 bytes reserved padding, CRC32 over the header fields.
Entry envelope (8 + N bytes): CRC32 checksum of the payload, u32 payload length, followed by the rkyv-serialized Operation enum.
All fields are little-endian. The 8-byte header and 8-byte entry headers maintain 4-byte alignment as required by rkyv deserialization.
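A minimal encoding sketch of this layout follows. Several details are assumptions: the CRC variant is taken to be the common IEEE CRC-32, the header CRC is taken to cover the first four header bytes, and the payload is an opaque byte slice standing in for the rkyv-serialized Operation. The function names are illustrative.

```rust
/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
/// Assumption: the WAL's exact CRC32 variant is not specified in this ADR.
pub fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg(); // all ones if the low bit is set
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

pub const MAGIC: u8 = 0xCA;
pub const STATUS_RECORDING: u8 = 0x00;
pub const STATUS_COMMITTING: u8 = 0x01;

/// Builds the 8-byte file header: magic, status, 2 reserved bytes, then a
/// little-endian CRC32 over the first four bytes (assumed coverage).
pub fn encode_header(status: u8) -> [u8; 8] {
    let mut header = [0u8; 8];
    header[0] = MAGIC;
    header[1] = status;
    let crc = crc32(&header[..4]);
    header[4..].copy_from_slice(&crc.to_le_bytes());
    header
}

/// Wraps a payload in the entry envelope: CRC32, u32 length, payload.
pub fn encode_entry(payload: &[u8]) -> Vec<u8> {
    assert!(payload.len() <= u32::MAX as usize, "payload exceeds u32::MAX");
    let mut entry = Vec::with_capacity(8 + payload.len());
    entry.extend_from_slice(&crc32(payload).to_le_bytes());
    entry.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    entry.extend_from_slice(payload);
    entry
}

/// Reads one entry from the front of `buf` and returns the payload plus the
/// remaining bytes. `None` means the entry is truncated or its CRC is wrong.
pub fn decode_entry(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    if buf.len() < 8 {
        return None;
    }
    let crc = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let len = u32::from_le_bytes([buf[4], buf[5], buf[6], buf[7]]) as usize;
    let end = 8usize.checked_add(len)?;
    let payload = buf.get(8..end)?;
    if crc32(payload) != crc {
        return None;
    }
    Some((payload, &buf[end..]))
}
```

Returning `None` for both truncation and checksum mismatch mirrors how a reader can stop at the first entry that was not written completely.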
4.4.3.3. Recovery¶
On startup, LocalStorage::new() inspects the WAL:
No WAL: clean state, nothing to do.
RECORDING: the transaction never reached commit. Discard the WAL and staging.
COMMITTING: a commit was in progress. Read the entries and undo applied operations via .bak file restoration and new-artifact removal.
Truncated COMMITTING WAL with no evidence of application: discard the WAL; no operations were applied.
Truncated COMMITTING WAL with evidence of partial application: return StorageError::Corruption so the caller can decide how to handle it.
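The recovery cases above amount to a small decision table, sketched here as a pure function. All type, field, and variant names are hypothetical; in particular, `Corruption` merely stands in for StorageError::Corruption, and how "evidence of application" is detected is outside the scope of the sketch.

```rust
/// Status byte read from the WAL header.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WalStatus {
    Recording,
    Committing,
}

/// What startup inspection found (hypothetical summary type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct WalInspection {
    pub status: WalStatus,
    /// The entry stream ends early or fails its checksum.
    pub truncated: bool,
    /// Some operations were evidently already applied to the filesystem.
    pub partially_applied: bool,
}

/// Action the recovery code should take.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryAction {
    Nothing,
    DiscardWalAndStaging,
    UndoAppliedOperations,
    DiscardWal,
}

/// Stand-in for `StorageError::Corruption`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Corruption;

/// Maps the inspection result to an action; `None` means no WAL file exists.
pub fn plan_recovery(wal: Option<WalInspection>) -> Result<RecoveryAction, Corruption> {
    match wal {
        None => Ok(RecoveryAction::Nothing),
        Some(WalInspection { status: WalStatus::Recording, .. }) => {
            Ok(RecoveryAction::DiscardWalAndStaging)
        }
        // From here on the status is COMMITTING.
        Some(WalInspection { truncated: false, .. }) => Ok(RecoveryAction::UndoAppliedOperations),
        Some(WalInspection { partially_applied: false, .. }) => Ok(RecoveryAction::DiscardWal),
        Some(_) => Err(Corruption),
    }
}
```

Separating the decision from the side effects keeps each case easy to unit-test without touching the filesystem.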
4.4.4. Rationale¶
4.4.4.1. Why Binary over Text¶
Zero-copy deserialization. rkyv deserializes directly from the memory-mapped / read buffer without parsing or allocating. Text formats (JSON, TOML) require a full parse pass and allocate intermediate structures.
Simple deserialization. The serialized bytes are the payload directly. No text encoding layer (escaping, quoting, base64) sits between the raw operation data and its on-disk representation.
Compact. Typical Operation payloads are < 1 KiB. A JSON equivalent with escaped strings, keys, and formatting would be 2-5x larger, increasing flash write amplification for no benefit.
Checksumming is simpler. CRC32 over raw bytes. With text, checksums would need to account for encoding differences (line endings, whitespace normalization).
4.4.4.2. Why u32 Payload Length¶
The payload length field is u32:
The WAL is an on-disk format. Using usize would make files non-portable between 32-bit and 64-bit targets.
u32 fits the 4-byte alignment rkyv requires. u16 would need 2 bytes of padding for no benefit.
u64 (and usize on 64-bit) would waste 4 bytes per entry: with the current operation sizes, the length never even reaches u16::MAX.
Actual payloads are well under 1 KiB (the largest variant, Operation::Write, contains three short strings bounded by filesystem NAME_MAX). u32 provides ~6 orders of magnitude of headroom.
4.4.4.3. Why rkyv over Other Binary Formats¶
Zero-copy: unlike bincode or postcard, rkyv does not need a deserialization pass. The archived data is accessed in-place. Bincode was a contender, but is discontinued and should not be used for new projects.
Derive-based: #[derive(rkyv::Archive, rkyv::Serialize, rkyv::Deserialize)] on the Operation enum. No manual codec.
Deterministic layout: same input always produces the same bytes, making CRC32 checksums reliable.
Alignment-aware: produces 4-byte aligned output, matching the WAL entry header layout without additional padding logic.
4.4.5. Consequences¶
4.4.5.1. Positive¶
Single fsync per transaction commit minimizes flash wear.
CRC32 per entry detects partial writes and corruption during recovery.
Zero-copy deserialization keeps recovery fast even with many entries.
Fixed-size headers simplify sequential reading and offset arithmetic.
4.4.5.2. Negative¶
The WAL is not human-readable. Debugging requires tooling (e.g., a wal-dump utility or logging during recovery).
rkyv’s archive format is not stable across major versions. A rkyv version upgrade may require a WAL migration or version field in the header. The reserved header bytes can be used for this purpose.
4.4.6. References¶
cda-storage/src/wal.rs: WAL implementation
cda-storage/src/recovery.rs: startup recovery logic