4. Architecture Decision Records

4.1. ADR-001: Use mimalloc as Default Memory Allocator

4.1.1. Status

Accepted

Date: 2026-03-06

4.1.2. Context

The Classic Diagnostic Adapter requires efficient memory allocation to handle diagnostic communication workloads. The choice of memory allocator significantly impacts both runtime performance and memory footprint. Two options were evaluated:

  1. mimalloc: A performance-oriented general-purpose allocator developed by Microsoft Research, which uses arena-based pooling strategies

  2. System allocator: The native platform allocator (macOS system allocator in this evaluation)

Profiling was conducted on macOS / Apple Silicon (arm64) using Xcode Instruments to compare the two allocators under realistic workload conditions.

4.1.3. Decision

We will use mimalloc as the default memory allocator for the Classic Diagnostic Adapter.

The performance advantages of mimalloc outweigh the increased memory overhead. While the system allocator demonstrates better memory efficiency, the ~29% execution time improvement provided by mimalloc is critical for diagnostic operations where response time directly impacts user experience and system throughput.
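In Rust, the process-wide allocator is replaced through the `#[global_allocator]` attribute. With the mimalloc crate this is typically a single static (`static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;`); how the CDA wires this up is not shown in this record. The stdlib-only sketch below uses the same hook to wrap the platform allocator and count allocation calls, which is one simple way to compare allocation behaviour between builds. `CountingAllocator`, `ALLOCATIONS`, and `allocation_count` are hypothetical names introduced for illustration.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical wrapper: forwards every request to the platform allocator
/// and counts how many times `alloc` was called.
struct CountingAllocator;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        // SAFETY: the caller upholds the `GlobalAlloc::alloc` contract.
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: `ptr` was returned by `System.alloc` with the same layout.
        unsafe { System.dealloc(ptr, layout) }
    }
}

// The same attribute is used to install any other allocator as the default.
#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Number of allocation calls observed so far in this process.
fn allocation_count() -> usize {
    ALLOCATIONS.load(Ordering::Relaxed)
}
```

Because the allocator is selected by a single static, switching back to the system allocator for a memory-constrained build only requires changing (or feature-gating) that one item.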

4.1.4. Rationale

4.1.4.1. Performance Comparison

Comprehensive profiling revealed the following metrics:

Allocator Performance Comparison

| Metric              | mimalloc      | System (macOS) | Winner                |
|---------------------|---------------|----------------|-----------------------|
| CPU Time            | 3.35 s        | 4.33 s         | mimalloc (29% faster) |
| Real Memory         | 400.22 MiB    | 329.78 MiB     | System (17% less)     |
| Heap & Anon VM      | 1.24 GiB      | 434.01 MiB     | System (65% less)     |
| Total Allocated     | 4.29 GiB      | 3.19 GiB       | System (26% less)     |
| Allocation Count    | 7,896         | 2,273,227      | mimalloc (287x fewer) |
| Persistent Allocs   | 1,084         | 202,138        | mimalloc (186x fewer) |
| Dirty Memory        | 60.73 MiB     | 60.01 MiB      | ~ Tie                 |
| Thread Count        | 15            | 15             | Tie                   |
| Fragmentation Ratio | 0.34% / 0.83% | 0.16% / 1.02%  | Mixed                 |
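The "Winner" column mixes two kinds of figures: ratios ("x fewer", "faster") and relative savings ("% less"). As a sanity check, the sketch below recomputes them from the raw measurements. The helper names and the `Derived` struct are illustrative only.

```rust
/// Percentage by which `smaller` is below `larger`.
fn percent_less(smaller: f64, larger: f64) -> f64 {
    (1.0 - smaller / larger) * 100.0
}

/// Factor by which `larger` exceeds `smaller`.
fn times_more(larger: f64, smaller: f64) -> f64 {
    larger / smaller
}

/// Figures derived from the measurements in the comparison table.
struct Derived {
    speedup: f64,                  // system CPU time / mimalloc CPU time
    cpu_time_saved_pct: f64,       // CPU time saved by mimalloc
    real_memory_saved_pct: f64,    // real memory saved by the system allocator
    virtual_memory_saved_pct: f64, // heap & anon VM saved by the system allocator
    allocation_ratio: f64,         // system allocation count / mimalloc count
}

fn derive() -> Derived {
    const MIB_PER_GIB: f64 = 1024.0;
    Derived {
        speedup: times_more(4.33, 3.35),
        cpu_time_saved_pct: percent_less(3.35, 4.33),
        real_memory_saved_pct: percent_less(329.78, 400.22),
        virtual_memory_saved_pct: percent_less(434.01, 1.24 * MIB_PER_GIB),
        allocation_ratio: times_more(2_273_227.0, 7_896.0),
    }
}
```

Note that "29% faster" is the speedup ratio (4.33 s / 3.35 s ≈ 1.29); expressed as saved CPU time, the same measurement is a reduction of roughly 23%.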

4.1.4.2. mimalloc Advantages

  1. Execution Speed: ~29% faster execution (3.35 s vs 4.33 s, i.e. ~23% less CPU time)

    • Fewer, larger allocations with arena-based pooling reduce per-allocation overhead

    • Critical for diagnostic operations requiring low latency

  2. Reduced System Calls: Drastically fewer allocation syscalls (~8K vs ~2.3M)

    • Batches allocations into large arenas, reducing kernel interaction

    • Approximately 287x fewer allocations significantly reduces context switching overhead

    • 186x fewer persistent allocations simplify memory management

4.1.4.3. System Allocator Advantages

  1. Virtual Memory Usage: ~65% less virtual memory (434 MiB vs 1.24 GiB)

    • Does not pre-reserve large arenas; allocates only what is needed

    • More conservative approach to address space usage

  2. Resident Memory: ~17% lower physical memory footprint (329.78 MiB vs 400.22 MiB)

    • Tighter memory utilization for current working set

  3. Total Allocation Efficiency: ~26% less total bytes allocated (3.19 GiB vs 4.29 GiB)

    • Tighter lifetime tracking and faster release to OS

4.1.4.4. Trade-offs

mimalloc trades memory for speed by pre-allocating large memory areas (arenas). This leads to faster throughput but higher memory overhead. The system allocator is more memory-efficient but pays for it with more frequent, fine-grained allocations and ~1 second slower total runtime.

For the Classic Diagnostic Adapter use case:

  • Performance is prioritized over memory efficiency in typical deployment scenarios

  • Diagnostic operations are latency-sensitive

  • Alternative memory optimization strategies exist (e.g., mmap for mdd files and other large data structures)

4.1.5. Consequences

4.1.5.1. Positive

  • Improved Response Times: 29% faster execution directly improves diagnostic operation latency

  • Reduced System Overhead: 287x fewer allocation calls minimize kernel involvement and context switching

  • Better Throughput: Arena-based pooling enables handling of concurrent diagnostic sessions more efficiently

  • Predictable Performance: Pre-allocated arenas provide more consistent allocation times

4.1.5.2. Negative

  • Higher Memory Footprint: ~2.9x the virtual memory (1.24 GiB vs 434 MiB) and ~21% more physical memory (400.22 MiB vs 329.78 MiB) than the system allocator

  • Increased Total Allocations: ~34% more bytes allocated over time (4.29 GiB vs 3.19 GiB) due to the arena pre-allocation strategy

4.1.5.3. Mitigation Strategies

The memory overhead can be mitigated through:

  1. mmap Usage: Large data structures can use memory-mapped files to reduce heap pressure. Other optimizations, such as pooling memory for the databases, could further reduce mimalloc’s memory overhead and bring it closer to the system allocator while keeping its performance benefits.

  2. Memory Pooling and LRU Caching: Custom pooling strategies for frequently used data structures can further reduce memory usage while retaining mimalloc’s performance advantages.

4.1.6. Alternatives Considered

4.1.6.1. System Allocator

The native platform allocator was evaluated as the primary alternative. While it offers superior memory efficiency (17-65% less memory usage), the performance penalty (~29% slower execution and 287x more allocation syscalls) makes it unsuitable as the default choice.

The system allocator remains a viable option for:

  • Extremely memory-constrained embedded deployments

  • Scenarios where memory footprint is more critical than latency

  • Development/debugging when allocator-specific behavior needs to be isolated

Further optimization of the CDA itself may make this fallback unnecessary, since such optimizations are expected to yield larger memory savings than switching to the system allocator and accepting its performance cost.

4.1.6.2. Other Allocators

Other allocators such as jemalloc or tcmalloc were not formally evaluated in this decision.

4.1.7. References

4.2. ADR-002: mbedTLS as Alternative TLS Backend for DoIP

4.2.1. Status

Experimental

Date: 2026-03-23

4.2.2. Context

The Classic Diagnostic Adapter (CDA) uses TLS-secured DoIP (Diagnostics over IP) connections to communicate with ECUs. The default TLS backend is OpenSSL via the openssl and tokio-openssl Rust crates.

OpenSSL does not implement the record_size_limit TLS extension (RFC 8449). Some ECUs require this extension and close the connection when the peer sends records exceeding the limit they were trying to negotiate. Since OpenSSL neither advertises nor honours the extension, a stable connection with such ECUs cannot be guaranteed. See openssl/openssl#27916 and auroralabs-loci/openssl#342 for the upstream status.

mbedTLS 4.0.0 supports record_size_limit, but only for TLS 1.3. Since DoIP connections commonly use TLS 1.2, a patch is needed to extend this support. Additionally, mbedTLS 4.0.0 has no Ed25519 (PureEdDSA) signature support, which is required by some ECUs. An additional patch adds this capability via a custom PSA accelerator driver.

Upstreaming the Ed25519 changes is not feasible at this time: the patch in this repository covers only the parts required for ECU communication and would most likely need to be extended to be accepted upstream. The mbedTLS maintainers have Ed25519 support on their roadmap, but without a concrete date. Once upstream supports it natively, the patch can be dropped.

4.2.3. Decision

Add mbedTLS 4.0.0 as an optional, feature-gated TLS backend in the comm-mbedtls module. The module is structured as two Rust crates:

  • mbedtls-sys – downloads, patches, and compiles mbedTLS from source; generates FFI bindings via bindgen.

  • mbedtls-rs – safe Rust wrapper exposing synchronous and asynchronous (Tokio) TLS streams, X.509 certificate handling, and TLS configuration via a builder API.

Two patches are applied to upstream mbedTLS 4.0.0 at build time (see comm-mbedtls/mbedtls-sys/patches/README.md for details):

  • record-size-limit-tls12.patch – extends the existing TLS 1.3 record_size_limit implementation to TLS 1.2 (RFC 8449).

  • ed25519-psa-driver.patch – adds Ed25519 support to mbedTLS across the PSA crypto, PK, X.509, and TLS 1.2 layers, which upstream mbedTLS 4.0.0 does not provide. The actual Ed25519 cryptographic operations are performed in Rust (ed25519-dalek) and called from C via a PSA accelerator driver FFI bridge.

4.2.3.1. Backend Selection

The TLS backend is selected at compile time via Cargo feature flags:

| Feature | Default | Effect                     |
|---------|---------|----------------------------|
| openssl | yes     | Use OpenSSL                |
| mbedtls | no      | Use mbedTLS (experimental) |

If both features are enabled, OpenSSL takes precedence. The selection is enforced through #[cfg] guards in cda-comm-doip; there is no runtime dispatch or shared TLS provider trait. Both backends produce streams implementing Tokio’s AsyncRead + AsyncWrite, which is sufficient for the generic DoIPConnection<T> transport layer.
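The precedence rule can be sketched as a small compile-time function. This is an illustration only, not the code in cda-comm-doip: `TlsBackend`, `resolve_backend`, and `SELECTED_BACKEND` are hypothetical names, and the behaviour when neither feature is enabled (`None` here) is an assumption, since the record does not describe that case.

```rust
/// Hypothetical marker for the two backends named in the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TlsBackend {
    OpenSsl,
    MbedTls,
}

/// Encodes the precedence rule: OpenSSL wins whenever it is enabled,
/// mbedTLS is used only if OpenSSL is off.
/// Returning `None` when neither is enabled is an assumption of this sketch.
pub const fn resolve_backend(openssl: bool, mbedtls: bool) -> Option<TlsBackend> {
    if openssl {
        Some(TlsBackend::OpenSsl)
    } else if mbedtls {
        Some(TlsBackend::MbedTls)
    } else {
        None
    }
}

/// Evaluated at compile time from the Cargo features of the current build,
/// so no runtime dispatch is involved.
#[allow(unexpected_cfgs)]
pub const SELECTED_BACKEND: Option<TlsBackend> =
    resolve_backend(cfg!(feature = "openssl"), cfg!(feature = "mbedtls"));
```

In real code the same condition would typically appear as `#[cfg(feature = "openssl")]` and `#[cfg(all(feature = "mbedtls", not(feature = "openssl")))]` attributes on the backend-specific items, so that only one backend is compiled in.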

4.2.4. Consequences

4.2.4.1. Positive

  • ECUs requiring record_size_limit extension can be supported without any potentially unstable workarounds (such as limiting the max transfer size to a size smaller than reported by the extension).

  • mbedTLS is statically linked with no system-level dependency, simplifying cross-compilation and embedded deployment.

  • The module is self-contained and designed to be extractable into its own repository once deemed stable.

4.2.4.2. Risks and Tradeoffs

  • Experimental status. The mbedTLS backend has not undergone a dedicated security review.

  • Custom patches. Two patches against upstream mbedTLS must be maintained. Future mbedTLS releases may incorporate these features natively, at which point the patches can be dropped.

  • Build complexity. The mbedtls-sys build script downloads a source tarball at build time (unless MBEDTLS_DIR is set), applies patches, and invokes CMake. Offline builds require pre-fetching the source.

  • Additional licenses. Building mbedtls-sys from source adds ureq and bzip2 as build dependencies. These bring in additional licenses, which were explicitly allowed for those crates. This can lead to additional maintenance effort when updating dependencies.

4.2.4.3. Future Direction

  • Promote to production-ready after broader ECU testing and, potentially, a security review.

  • Extract comm-mbedtls into a standalone crate/repository.

  • Unify TLS configuration so cipher suites, curves, and signature algorithms are driven by the CDA config file rather than compile-time constants.

  • Drop patches if/when upstream mbedTLS gains native Ed25519 and TLS 1.2 record_size_limit support, or when OpenSSL adds record_size_limit.

4.2.5. References

4.3. ADR-003: Memory-Map Uncompressed MDD Files for FlatBuffers Access

4.3.1. Status

Accepted

Date: 2026-03-10

4.3.2. Context

The Classic Diagnostic Adapter loads ECU diagnostic databases stored as MDD files. Each MDD file is a protobuf container whose chunks hold FlatBuffers data compressed with LZMA. At startup every MDD file must be read, the protobuf parsed, the FlatBuffers payload decompressed, and the resulting data kept available for the lifetime of the process.

The main target platform is Linux (onboard automotive ECUs), where RAM is limited and the system may reclaim memory aggressively under pressure via the kernel page cache.

Three strategies were evaluated:

  1. Heap – decompress into heap-allocated Vec<u8> buffers.

  2. MmapSidecar – decompress into separate .fb sidecar files next to the MDD files, then memory-map those sidecar files.

  3. MmapMdd (in-place) – decompress the MDD files themselves once (e.g. during a software update), rewriting them with uncompressed chunk data, then memory-map the MDD files directly with zero-copy protobuf decoding.

4.3.3. Decision

We will use the MmapMdd (in-place) strategy: MDD files are decompressed once and are subsequently used read-only via mmap. The protobuf layer uses prost’s Bytes support (Bytes::from_owner(mmap)) so that chunk data fields are zero-copy slices into the memory-mapped file – no heap allocation is required for the FlatBuffers payload.
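The idea behind the zero-copy slices can be illustrated without any external crates. The sketch below is a stdlib-only analogy, not the prost/bytes/memmap2 code path: an `Arc<[u8]>` stands in for the memory-mapped file, and the hypothetical `ChunkView` type plays the role of a `Bytes` slice that keeps the owner alive while pointing into it.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical read-only view into a shared buffer.
/// Creating or cloning a view bumps a reference count; no bytes are copied.
#[derive(Clone, Debug)]
struct ChunkView {
    owner: Arc<[u8]>,
    range: Range<usize>,
}

impl ChunkView {
    /// Returns `None` if `range` does not lie inside the buffer.
    fn new(owner: &Arc<[u8]>, range: Range<usize>) -> Option<Self> {
        owner.get(range.clone())?;
        Some(Self {
            owner: Arc::clone(owner),
            range,
        })
    }

    /// The chunk bytes, borrowed directly from the shared buffer.
    fn as_slice(&self) -> &[u8] {
        &self.owner[self.range.clone()]
    }
}
```

With a real memory map as the owner, the bytes behind such a view are file-backed pages, which is what allows the kernel to drop and re-read them under memory pressure.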

Before the atomic rename of a rewritten MDD file, the written data is verified by re-parsing the temporary file and comparing SHA-512 checksums of every chunk against the expected values.
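A minimal sketch of this write, verify, rename sequence using only the standard library is shown below. It is not the CDA implementation: `replace_verified` and the `.tmp` suffix are hypothetical, and the integrity check is passed in as a closure so the sketch stays dependency-free. In the scheme described above, that closure would re-parse the temporary file and compare the SHA-512 checksum of every chunk.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Writes `new_contents` to a temporary file next to `target`, verifies what
/// actually landed on disk, and only then renames it over `target`.
/// `verify` is a stand-in for the per-chunk checksum comparison.
fn replace_verified(
    target: &Path,
    new_contents: &[u8],
    verify: impl Fn(&[u8]) -> bool,
) -> io::Result<()> {
    // Hypothetical naming scheme: "<target>.tmp" in the same directory, so
    // the final rename stays on one filesystem.
    let mut tmp_name = target.as_os_str().to_owned();
    tmp_name.push(".tmp");
    let tmp = PathBuf::from(tmp_name);

    let mut file = File::create(&tmp)?;
    file.write_all(new_contents)?;
    file.sync_all()?; // make the data durable before it can replace the original
    drop(file);

    // Verify by reading the temporary file back, not the in-memory buffer.
    let written = fs::read(&tmp)?;
    if !verify(&written) {
        fs::remove_file(&tmp)?;
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "verification of rewritten file failed",
        ));
    }

    // On POSIX filesystems, rename replaces the target atomically.
    fs::rename(&tmp, target)
}
```

If verification fails or the process is interrupted before the rename, the original file is left untouched, so a later attempt can start from a known-good state.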

4.3.4. Rationale

4.3.4.1. Performance Comparison

Benchmarking was conducted on Linux 6.18.2-arch2-1 (x86_64, i5-7200U CPU) with 32 GB RAM using 68 MDD files (~47 MB compressed, 242 MB uncompressed), Rust 1.92.0, --release profile, ~3 minutes idle warm-up, and swap disabled.

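RSS Comparison (KB)

| Strategy           | Idle (KB) | Under Pressure (KB) | Disk Usage |
|--------------------|-----------|---------------------|------------|
| Heap (baseline)    | 486,900   | 469,320             | 47 MB      |
| MmapSidecar        | 307,552   | 171,904             | ~282 MB    |
| MmapMdd (in-place) | 152,780   | 118,988             | 242 MB     |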

Note

The MmapMdd implementation uses memmap2::Advice::Random (MADV_RANDOM) immediately after mmap() to disable read-ahead for the sparse FlatBuffers vtable lookups that dominate runtime access. This avoids a libc dependency – the hint is set directly via memmap2 before ownership is transferred to Bytes::from_owner().

4.3.4.2. MmapMdd Advantages over Heap

  1. RSS under pressure: -75 % (119 MB vs 469 MB)

    All FlatBuffers data is backed by the MDD file on disk. Under memory pressure the kernel cleanly drops those pages and re-reads them on demand – no swap I/O required. With the heap strategy, anonymous pages can only be compressed or swapped, which incurs significant I/O overhead and yields only a modest ~3.6 % reduction (487 MB to 469 MB).

  2. Idle RSS: -69 % (153 MB vs 487 MB)

    The zero-copy protobuf decode (Bytes::from_owner(mmap)) avoids copying every bytes field to the heap. Chunk data fields are slices into the mmap, so there is no second copy of the decompressed data in memory.

    Setting MADV_RANDOM via memmap2 prevents the kernel from prefetching adjacent pages during random-access FlatBuffers queries, keeping idle RSS well below the heap baseline.

4.3.4.3. MmapMdd Advantages over MmapSidecar

  1. Simpler file management

    No additional .fb sidecar files to create, track, or clean up. The MDD files are the single source of truth. This eliminates an entire class of consistency bugs (stale sidecar, missing sidecar, partial write).

  2. Lower RSS under pressure (119 MB vs 172 MB)

    The in-place strategy benefits from zero-copy protobuf decoding (Bytes::from_owner) which the sidecar approach did not use. All data – protobuf metadata and FlatBuffers payloads – lives in the single mmap, giving the kernel a unified region to evict.

  3. Lower idle RSS (153 MB vs 308 MB)

    Zero-copy decoding avoids duplicating chunk data on the heap, resulting in 50 % lower idle RSS than the sidecar approach.

  4. Less extra disk space (+195 MB vs +235 MB)

    Sidecar files duplicated the FlatBuffers payload alongside the original compressed MDD. In-place rewriting replaces the compressed data, so the growth is only the difference between compressed and uncompressed sizes.

4.3.4.4. Runtime CPU Performance (perf Profiling)

In addition to the RSS benchmarks above, perf profiling was conducted under a realistic end-to-end workload on the target Linux system to compare the MmapMdd implementation against the main (heap/compressed) branch.

Test setup

  • Platform: Linux target (i5-7200U), Release build

  • Workload: CDA started, 20s idle warm-up, then perf attached for profiling, followed by filling ~54 GB of memory with garbage data (two 27 GB bytearray allocations in parallel to induce memory pressure), then a full ECU flash session via DoIP.

  • Tool: perf stat attached to the running process (after warm-up) with events: cycles, instructions, faults, cache-references, cache-misses.

perf stat Results – Main vs MmapMdd (under load)

| Metric           | Main (compressed)    | MmapMdd (decompressed) | Delta             |
|------------------|----------------------|------------------------|-------------------|
| cycles           | 448,473,942          | 437,102,422            | -2.5 %            |
| instructions     | 194,934,031          | 194,316,925            | -0.3 %            |
| IPC (insn/cycle) | 0.43                 | 0.44                   | +2.3 %            |
| page faults      | 340                  | 1,206                  | +255 % (see note) |
| cache-references | 25,838,457           | 26,031,415             | +0.7 %            |
| cache-misses     | 18,142,040 (70.21 %) | 18,478,329 (70.98 %)   | +0.77 pp          |
| wall time        | 42.19 s              | 42.19 s                | negligible        |
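The derived rows (IPC, cache-miss rate, and the Delta column) follow directly from the raw counters. The sketch below recomputes them; the function names are illustrative. The cache-miss delta is a difference between two percentages, so it is expressed in percentage points (pp) rather than as a relative change.

```rust
/// Instructions retired per CPU cycle.
fn ipc(instructions: u64, cycles: u64) -> f64 {
    instructions as f64 / cycles as f64
}

/// Cache misses as a percentage of cache references.
fn miss_rate_pct(misses: u64, references: u64) -> f64 {
    misses as f64 / references as f64 * 100.0
}

/// Relative change from `before` to `after`, in percent.
fn percent_change(before: u64, after: u64) -> f64 {
    (after as f64 - before as f64) / before as f64 * 100.0
}

/// Difference of the two miss rates in percentage points
/// (MmapMdd minus Main), using the counters from the table.
fn miss_rate_delta_pp() -> f64 {
    miss_rate_pct(18_478_329, 26_031_415) - miss_rate_pct(18_142_040, 25_838_457)
}
```

For example, `percent_change(448_473_942, 437_102_422)` gives the cycles delta of about -2.5 %, and `miss_rate_delta_pp()` gives about 0.77 pp.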

Note

The higher page-fault count in MmapMdd (1,206 vs 340) reflects the kernel mapping mmap pages on first access, rather than heap pages that were already faulted in at startup. The absolute numbers are negligible (< 1,500 faults over ~42 s) and have no measurable impact on wall time.

Runtime profiling conclusions

Under realistic ECU-flash load with concurrent memory pressure both implementations are effectively equivalent in CPU efficiency (within 2.5 % of each other) and identical in wall time. The workload is dominated by network I/O (DoIP) rather than database access, so the expected RSS savings of MmapMdd (-75 % under pressure) are realized without any runtime CPU regression.

perf report call-graph analysis confirmed that the top hotspots (alloc::vec::in_place_collect, flatbuffers::vtable::VTable::get, cda_database::datatypes::DiagService::find_request, mimalloc internals) are present in both branches with similar weights, confirming that no new hot paths were introduced by the MmapMdd implementation.

4.3.4.5. Trade-offs

  • Disk usage increases: MDD files grow from ~47 MB to 242 MB (~5.1x). This is a one-time cost during the software update and is acceptable on the target platform where storage is less constrained than RAM.

  • MDD files are modified: The original compressed MDD files are replaced with uncompressed versions. This is acceptable because:

    • Decompression happens once during a controlled update step, not at runtime. This step will be implemented in the update plugin at a later point; for now, the CDA performs the decompression at runtime.

    • SHA-512 verification ensures data integrity before the atomic rename.

4.3.5. Consequences

4.3.5.1. Positive

  • 75 % RSS reduction under memory pressure compared to the heap baseline (119 MB vs 469 MB), critical for embedded Linux targets with limited RAM.

  • 69 % lower idle RSS (153 MB vs 487 MB) due to zero-copy protobuf decoding and MADV_RANDOM via memmap2 to suppress wasteful read-ahead during sparse FlatBuffers lookups.

  • Zero-copy data path: mmap –> Bytes –> FlatBuffers – no intermediate heap allocations for the diagnostic payload.

  • Single file, single source of truth: no sidecar files to manage, eliminating consistency and cleanup issues.

  • Atomic, verified writes: SHA-512 checksums and temp-file + rename ensure data integrity even if the update is interrupted.

  • Read-only at runtime: after the initial update, MDD files are opened read-only, compatible with read-only filesystems or integrity-checked partitions.

  • No libc dependency: MADV_RANDOM is set via memmap2 before ownership transfer, avoiding the need for direct libc::madvise() calls.

4.3.5.2. Negative

  • 5.1x disk usage increase for the MDD database directory.

  • One-time decompression cost, incurred during a software update or at first startup.

  • Platform dependency: relies on OS-level mmap, page cache behaviour, and madvise(2) support (POSIX systems). The madvise call is guarded by a cfg flag, so the CDA still works on platforms without MADV_RANDOM support (e.g. Windows), possibly with higher idle RSS.

4.3.6. Alternatives Considered

4.3.6.1. Heap (Baseline)

Decompress FlatBuffers data into heap-allocated Vec<u8> buffers. Simplest implementation but RSS remains high (~487 MB idle, ~469 MB under pressure). Anonymous heap pages cannot be cleanly evicted by the kernel – they must be compressed or swapped, incurring I/O overhead. Unsuitable for memory-constrained targets.

4.3.6.2. Separate FlatBuffers File (Sidecar)

Decompress into separate .fb files and memory-map those. Achieves good pressure behaviour (~172 MB) but introduces additional file management complexity: sidecar files must be created, kept in sync with MDD files, and cleaned up on updates. Uses more disk space (+235 MB) because both compressed MDD and uncompressed sidecar exist side by side. The sidecar approach was prototyped and benchmarked but rejected in favour of the simpler in-place strategy.

4.3.7. References

4.4. ADR-004: Binary WAL Format for Crash-Safe Storage Transactions

4.4.1. Status

Accepted

Date: 2026-05-19

4.4.2. Context

The cda-storage crate provides a crash-safe, transactional storage backend for diagnostic data (MDD files, configuration). Mutations are journaled to a Write-Ahead Log (WAL) before being applied to the filesystem. The WAL must:

  • Record operations durably before they are applied.

  • Support crash recovery: detect incomplete transactions and roll back partial commits.

  • Minimize I/O overhead on flash-based storage where write amplification matters.

The key decision is the WAL’s on-disk encoding: binary (e.g., rkyv, wincode) vs text-based (e.g., JSON, TOML).

4.4.3. Decision

Use a binary WAL format with rkyv (zero-copy deserialization) for operation payloads and CRC32 checksums per entry.

4.4.3.1. Commit Strategy

The WAL uses a one-phase commit with checksums (1PC+C) strategy:

  1. Operations are appended to the WAL during the transaction without fsync.

  2. Before applying, the header status is flipped from RECORDING to COMMITTING via an in-place write to the file header.

  3. A single fsync makes both the status change and all entries durable.

  4. Operations are applied to the filesystem.

  5. The WAL file is deleted (point of no return).
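The five steps can be sketched with plain file operations. This is an illustration under stated assumptions, not the cda-storage code: `run_transaction` is a hypothetical function, entries are passed in already encoded, and `encode_header` and `apply` are hooks supplied by the caller so the sketch does not depend on a particular header encoding or operation type.

```rust
use std::fs::{self, OpenOptions};
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

const STATUS_RECORDING: u8 = 0x00;
const STATUS_COMMITTING: u8 = 0x01;

/// Runs one transaction through the commit sequence described above.
/// `encode_header` builds the 8-byte file header for a given status byte;
/// `apply` performs the journaled operations on the filesystem.
fn run_transaction(
    wal_path: &Path,
    entries: &[Vec<u8>],
    encode_header: impl Fn(u8) -> [u8; 8],
    apply: impl FnOnce() -> io::Result<()>,
) -> io::Result<()> {
    let mut wal = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(wal_path)?;

    // Step 1: header in RECORDING state plus all entries, without fsync.
    wal.write_all(&encode_header(STATUS_RECORDING))?;
    for entry in entries {
        wal.write_all(entry)?;
    }

    // Step 2: flip the status by rewriting the header in place at offset 0.
    wal.seek(SeekFrom::Start(0))?;
    wal.write_all(&encode_header(STATUS_COMMITTING))?;

    // Step 3: a single fsync covers the status change and every entry.
    wal.sync_all()?;

    // Step 4: apply the operations to the filesystem.
    apply()?;

    // Step 5: delete the WAL; the transaction can no longer be rolled back.
    drop(wal);
    fs::remove_file(wal_path)
}
```

If `apply` fails or the process crashes during step 4, the WAL is still on disk in the COMMITTING state, which is what startup recovery relies on.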

4.4.3.2. On-Disk Format

[u8 magic][u8 status][u16 reserved][u32 header_crc32] [u32 crc32][u32 len][payload] ...
|-------------- 8-byte file header -----------------| |------ per-entry data -----|

  • File header (8 bytes): magic 0xCA, status byte (0x00 = recording, 0x01 = committing), 2 bytes reserved padding, CRC32 over the header fields.

  • Entry envelope (8 + N bytes): CRC32 checksum of the payload, u32 payload length, followed by the rkyv-serialized Operation enum.

  • All fields are little-endian. The 8-byte header and 8-byte entry headers maintain 4-byte alignment as required by rkyv deserialization.
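The layout can be sketched as a pair of encode/decode helpers. Several details below are assumptions rather than facts from this record: the CRC variant (the common reflected CRC-32 with polynomial 0xEDB88320), the header checksum covering exactly the first four header bytes, and all function names. The payload is treated as opaque bytes; in the real format it would be the rkyv-serialized Operation.

```rust
use std::convert::{TryFrom, TryInto};

const MAGIC: u8 = 0xCA;

/// Bitwise CRC-32 (reflected, polynomial 0xEDB88320). Assumed variant.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// 8-byte file header: magic, status, two reserved bytes, little-endian CRC32.
/// Assumption: the checksum covers the first four bytes.
fn encode_header(status: u8) -> [u8; 8] {
    let mut header = [MAGIC, status, 0, 0, 0, 0, 0, 0];
    let crc = crc32(&header[..4]);
    header[4..].copy_from_slice(&crc.to_le_bytes());
    header
}

/// Entry envelope: CRC32 of the payload, u32 length, then the payload.
fn encode_entry(payload: &[u8]) -> Vec<u8> {
    let len = u32::try_from(payload.len()).expect("payload exceeds u32::MAX");
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&crc32(payload).to_le_bytes());
    out.extend_from_slice(&len.to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Returns the payload and the remaining bytes, or `None` if the entry is
/// truncated or its checksum does not match.
fn decode_entry(buf: &[u8]) -> Option<(&[u8], &[u8])> {
    let crc = u32::from_le_bytes(buf.get(0..4)?.try_into().ok()?);
    let len = u32::from_le_bytes(buf.get(4..8)?.try_into().ok()?) as usize;
    let end = 8usize.checked_add(len)?;
    let payload = buf.get(8..end)?;
    if crc32(payload) != crc {
        return None;
    }
    Some((payload, &buf[end..]))
}
```

A reader can call `decode_entry` in a loop on the bytes after the header; the first `None` marks either a clean end of file or a truncated or corrupted tail.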

4.4.3.3. Recovery

On startup, LocalStorage::new() inspects the WAL:

  • No WAL: clean state, nothing to do.

  • RECORDING: transaction never reached commit; discard WAL and staging.

  • COMMITTING: commit was in progress; read entries and undo applied operations via .bak file restoration and new-artifact removal.

  • Truncated COMMITTING WAL with no evidence of application: discard WAL, no operations were applied.

  • Truncated COMMITTING WAL with evidence of partial application: return StorageError::Corruption so the caller can decide how to handle it.
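The five cases form a small decision table, sketched below as a pure function. The types and names are hypothetical, and how "evidence of application" is detected is not specified here, so it is modelled as a plain boolean input.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum WalStatus {
    Recording,
    Committing,
}

/// What startup inspection found out about an existing WAL file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct WalSnapshot {
    status: WalStatus,
    /// The entry stream ends in a truncated or corrupt entry.
    truncated: bool,
    /// Evidence that some operations were already applied.
    partially_applied: bool,
}

#[derive(Debug, PartialEq, Eq)]
enum RecoveryAction {
    Nothing,
    DiscardWalAndStaging,
    UndoAppliedOperations,
    DiscardWal,
    ReportCorruption,
}

/// Maps the inspected WAL state to the recovery step listed above.
/// `None` means no WAL file exists.
fn plan_recovery(wal: Option<WalSnapshot>) -> RecoveryAction {
    match wal {
        None => RecoveryAction::Nothing,
        Some(w) => match (w.status, w.truncated, w.partially_applied) {
            (WalStatus::Recording, _, _) => RecoveryAction::DiscardWalAndStaging,
            (WalStatus::Committing, false, _) => RecoveryAction::UndoAppliedOperations,
            (WalStatus::Committing, true, false) => RecoveryAction::DiscardWal,
            (WalStatus::Committing, true, true) => RecoveryAction::ReportCorruption,
        },
    }
}
```

Keeping the decision separate from the filesystem work makes each case easy to unit-test without constructing WAL files on disk.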

4.4.4. Rationale

4.4.4.1. Why Binary over Text

  1. Zero-copy deserialization. rkyv deserializes directly from the memory-mapped / read buffer without parsing or allocating. Text formats (JSON, TOML) require a full parse pass and allocate intermediate structures.

  2. Simple deserialization. The serialized bytes are the payload directly. No text encoding layer (escaping, quoting, base64) sits between the raw operation data and its on-disk representation.

  3. Compact. Typical Operation payloads are < 1 KiB. A JSON equivalent with escaped strings, keys, and formatting would be 2-5x larger, increasing flash write amplification for no benefit.

  4. Checksumming is simpler. CRC32 over raw bytes. With text, checksums would need to account for encoding differences (line endings, whitespace normalization).

4.4.4.2. Why u32 Payload Length

The payload length field is u32:

  • The WAL is an on-disk format. Using usize would make files non-portable between 32-bit and 64-bit targets.

  • u32 fits the 4-byte alignment rkyv requires. u16 would need 2 bytes of padding for no benefit.

  • u64 (and usize on 64-bit targets) would waste 4 bytes per entry: with the current operation sizes, the length never even reaches u16::MAX.

  • Actual payloads are well under 1 KiB (the largest variant, Operation::Write, contains three short strings bounded by filesystem NAME_MAX). u32 provides ~6 orders of magnitude of headroom.

4.4.4.3. Why rkyv over Other Binary Formats

  • Zero-copy: unlike bincode or postcard, rkyv does not need a deserialization pass. The archived data is accessed in-place. Bincode was a contender, but is discontinued and should not be used for new projects.

  • Derive-based: #[derive(rkyv::Archive, rkyv::Serialize, rkyv::Deserialize)] on the Operation enum. No manual codec.

  • Deterministic layout: same input always produces the same bytes, making CRC32 checksums reliable.

  • Alignment-aware: produces 4-byte aligned output, matching the WAL entry header layout without additional padding logic.

4.4.5. Consequences

4.4.5.1. Positive

  • Single fsync per transaction commit minimizes flash wear.

  • CRC32 per entry detects partial writes and corruption during recovery.

  • Zero-copy deserialization keeps recovery fast even with many entries.

  • Fixed-size headers simplify sequential reading and offset arithmetic.

4.4.5.2. Negative

  • The WAL is not human-readable. Debugging requires tooling (e.g., a wal-dump utility or logging during recovery).

  • rkyv’s archive format is not stable across major versions. An rkyv version upgrade may require a WAL migration or a version field in the header. The reserved header bytes can be used for this purpose.

4.4.6. References

  • rkyv documentation

  • cda-storage/src/wal.rs: WAL implementation

  • cda-storage/src/recovery.rs: startup recovery logic