Performance-investigation tooling for miniextendr: how to run the bench suite, how to capture a baseline, and what the published numbers actually mean.

miniextendr-bench embeds R through miniextendr-engine, exercises every FFI path the runtime cares about (conversions, ALTREP, ExternalPtr, worker routing, trait ABI, unwind protection, GC protection, class dispatch, …), and prints numbers that engineers can compare against an earlier baseline before landing a change.

The crate is publish = false and is never shipped in a CRAN tarball. It is for maintainer and contributor use only.

🔗Why benchmarks exist

miniextendr’s selling points are safety, small FFI overhead, and ALTREP throughput. Those are empirical claims. The only way to keep them true across refactors is to measure them. Every non-trivial change to a hot path (conversions, ALTREP callbacks, unwind protect, worker dispatch, class-system codegen) should be accompanied by at least one just bench/just bench-save run and, ideally, a just bench-drift check against the prior baseline.

🔗The FFI model being measured

miniextendr exposes Rust to R the same way cpp11 exposes C++: a shared library (.so / .dylib / .dll) is loaded into the R session, and R calls into it via registered .Call() entry points. R hands the callee an array of SEXP arguments and receives a SEXP back. No marshalling, no serialisation, no IPC. Most of what these benchmarks measure is the thin work that happens on top of that already-fast call: argument conversion, GC protection, unwind guarding, ALTREP materialisation, and the occasional hop to the worker thread. If a benchmark moves, it is one of those layers that moved, not the .Call() dispatch itself.

🔗Running benchmarks

Prerequisites:

  • R installed and reachable on PATH
  • rustup with the workspace-pinned toolchain
  • cargo-limit (optional, via just dev-tools-install)

🔗Everyday recipes

# Full Rust suite (all targets in miniextendr-bench/benches/)
just bench

# A single target (e.g. trait_abi dispatch microbenchmarks)
just bench --bench trait_abi

# Core high-signal subset (ffi_calls, into_r, from_r, translate,
# strings, externalptr, worker, unwind_protect)
just bench-core

# Feature-gated (connections, rayon, refcount-fast-hash)
just bench-features

# Full matrix: core + feature targets
just bench-full

🔗Specialised targets

# Lint-scan benchmarks (miniextendr-lint/benches/lint_scan.rs)
just bench-lint

# Macro compile-time benchmark (synthetic crates, cold/warm/incr)
just bench-compile

# R-side benchmarks (requires rpkg installed + the R `bench` package)
just bench-r

🔗Filtering inside a target

Everything after -- is forwarded to cargo bench, and everything after a second -- is forwarded to the divan harness. Divan supports regex filtering and baseline save/compare:

# Only run the `altrep_int_dataptr` benchmarks inside the altrep target
just bench --bench altrep -- altrep_int_dataptr

# Save a divan baseline named "main" for the worker target
cargo bench --manifest-path=miniextendr-bench/Cargo.toml \
  --bench worker -- --save-baseline main

# …make changes, then compare against that baseline
cargo bench --manifest-path=miniextendr-bench/Cargo.toml \
  --bench worker -- --baseline main

🔗Structured baselines and drift detection

The repository ships a small wrapper around divan that stores raw output, a machine-readable CSV, and git metadata under miniextendr-bench/baselines/.

RecipeWhat it does
just bench-saveRun the suite, parse output, write bench-<timestamp>.txt + bench-<timestamp>.csv + bench-<timestamp>.meta.json to miniextendr-bench/baselines/
just bench-compare [file.csv]Print the slowest/fastest benchmarks from the most recent (or specified) baseline
just bench-drift [--threshold=20]Compare the two most recent baselines; fail if any median regressed beyond the threshold percentage
just bench-infoList saved baselines with git-SHA, R version, CPU, date

The CSV schema is:

timestamp,bench_target,group,name,args,median_ns,unit

Drift detection uses median_ns so unit-changes (ns ↔ us ↔ ms) never produce false positives. The implementation lives in tests/perf/bench_baseline.sh.

🔗Environment setup for reproducible runs

Noise is the enemy. Before a baseline run:

  1. Close heavy background workloads (editors with LSP servers, Docker, video calls). A 10% jitter floor is normal; 30%+ usually means something else is hitting the CPU.
  2. Disable CPU turbo and frequency scaling if possible.
  3. On macOS, disable App Nap for the shell running the bench.
  4. On Linux, pin to specific cores via taskset -c <N> and disable hyper-thread siblings. DIVAN_SKIP_SLOW=1 skips the 50 K-size end of the size matrix for faster iteration while developing.

Before saving a baseline, capture the environment so a later reader can reproduce it:

# System / OS
uname -a
sysctl -n machdep.cpu.brand_string            # macOS
cat /proc/cpuinfo | grep -m1 'model name'     # Linux

# R
R --version | head -1
Rscript -e 'sessionInfo()'

# Rust
rustc --version
cargo --version

# miniextendr
git describe --always --dirty

just bench-save captures most of this automatically into the .meta.json sidecar.

🔗Environment variables

VariableEffect
R_HOMESelects the R installation the embedder uses
DIVAN_SKIP_SLOW=1Skips the largest size in each size matrix
RAYON_NUM_THREADSThread count for the rayon benchmarks
CARGO_INCREMENTAL=0Matches CI; avoids per-run hash noise in sccache

🔗What each target measures

Each file under miniextendr-bench/benches/ focuses on one subsystem so iterations stay fast and failures point at a single owner. The high-level plan lives in miniextendr-bench/src/bench_plan.rs.

TargetMeasures
allocatorR allocator vs system allocator across size matrix
altrep / altrep_advanced / altrep_iterALTREP element access, DATAPTR materialisation, guard modes, string ALTREP, zero-alloc patterns
coerceR-side Rf_coerceVector vs Rust-side coercion
connectionsCustom R connections: build/open, read, write, burst write (feature connections)
dataframeVec<Struct> → R data.frame transpose + full pipelines
externalptrCreation, access, downcast, N-ptr churn; vs plain Box baseline
factorCached vs uncached level lookup; Vec<Factor> throughput
ffi_callsRaw Rf_* calls (ScalarInteger, allocVector, protect/unprotect, install)
from_rTryFromSexp scalar/slice/map/set paths
gc_protectOwnedProtect, ProtectScope, manual PROTECT/UNPROTECT
gc_protection_compareSeven protection mechanisms head-to-head (stack, vec pool, slotmap, precious list, DLL preserve, reprotect-slot, DLL reinsert)
into_rIntoR scalar and vector (incl. 1 M-element scale benchmarks)
listList construction, named lookup, derive-based builders
native_vs_coerceRNative memcpy path vs element-wise coercion
panic_telemetryPanic-hook RwLock read cost, fire cost
preservePreserve list insert/release vs PROTECT/UNPROTECT
raw_accessr_slice, unchecked slice, raw pointer access across INTSXP/REALSXP
rarrayRArray/RMatrix row/col access patterns
rayonrayon_bridge parallel helpers (feature rayon)
refcount_protectRefCountedArena vs ProtectScope vs raw preserve
rffi_checked#[r_ffi_checked] overhead vs *_unchecked variant
sexp_extSexpExt helpers vs raw pointer deref
strictStrict-mode scalar/vector conversions vs normal
stringsUTF-8/Latin-1 extraction, mkCharLen, translateCharUTF8
trait_abimx_erased vtable query + dispatch (single-method and 5-method)
translateR_CHAR vs translateCharUTF8 extraction costs
typed_listtyped_list! validation across field count
unwind_protectwith_r_unwind_protect overhead; nested layers; panic path
workerWorker-thread round-trip, channel saturation, batching, payload size
wrappersGenerated R wrapper dispatch (plain, env, R6, S3, S4, S7); requires rpkg installed

Each benchmark uses the shared miniextendr_bench::init() harness from miniextendr-bench/src/lib.rs, which embeds R on the init thread and asserts thread affinity on every call. The canonical size matrix is [1, 16, 256, 4096, 65536]; named-list benchmarks use [16, 256, 4096].

🔗Interpreting results

Divan prints four numbers per case:

  • fastest: the single fastest sample. Useful for latency-bound comparisons but noisy across runs.
  • slowest: the slowest sample. A sudden slowest spike hints at GC pauses, OS scheduling, or an allocator interaction worth chasing.
  • median: the canonical number for drift detection. Not skewed by a single spike.
  • mean / std-dev: use when comparing two allocation-heavy paths where allocator variance matters.

Rules of thumb:

  • For anything under ~50 ns, absolute deltas are almost always below measurement noise. Compare ratios across a size matrix instead.
  • For ALTREP and worker paths, always report both the cold path (first call, includes class/channel init) and the hot path.
  • String and CHARSXP allocation dominates large-vector conversion timings. Changes in strings/into_r that don’t touch mkCharLenCE should not move those numbers.

🔗Reference baseline: 2026-04-20 (Apple M3 Max)

Full raw data, all tables, and methodology notes are in miniextendr-bench/BENCH_RESULTS_2026-04-20.md. The headline numbers below are a subset for quick reference.

Environment: rustc 1.95.0, R 4.5.2, macOS 25.4.0 arm64, 36 GB RAM. Commit 00df79d4 on main.

🔗Quick reference

SubsystemOperationMedianNotes
Worker threadrun_on_worker_no_r0.46 nscompiles to inline call
Worker threadwith_r_thread (on main)11.6 nsnear free
Unwind protectwith_r_unwind_protect32.6 nsoverhead vs direct
Unwind protectnested 5 layers161 nslinear
catch_unwindsuccess path0.46 ns
catch_unwindpanic caught5.4 µspanic + catch
ExternalPtrcreate (8 B)209 nsBox baseline 18 ns (~12×)
ExternalPtrcreate (64 KB)209 nsBox baseline 667 ns (allocator fast)
Trait ABImx_query_vtable2.3 nscache-hot
Trait ABIsingle-method dispatch55–60 nsview + call
Trait ABIall-5-method dispatch417 nsmulti-method hot loop
R allocatorsmall (8 B)18.0 nssystem 16.5 ns
R allocatorlarge (64 KB)797 nssystem 82 ns

🔗Type conversions (64 K elements unless noted)

Typeinto_sexp mediantry_from_sexp median
i32 scalar11.7 ns23.8 ns
f64 scalar12.0 ns21.9 ns
Vec<i32>9.0 µs-
Vec<f64>16.6 µs-
Vec<String>3.87 ms-
Vec<Option<i32>> (50% NA)29.2 µs-
slice_i32 / slice_f64-21 ns (zero-copy)
HashSet<i32> from 64 K-1.49 ms

1 M element scale: i32 106 µs; f64 233 µs; String 94.6 ms (CHARSXP allocation dominates); Option<i32> 440 µs.

🔗ALTREP (64 K elements)

OperationALTREPPlain INTSXP
element access (elt)15.7 µs9.1 ns
DATAPTR materialisation17.1 µs9.3 ns
full scan via elt loop4.79 ms-
full scan via DATAPTR21.5 µs9.4 ns

Guard modes (64 K full scan): unsafe 1.07 ms • rust_unwind (default) 4.71 ms • r_unwind 4.70 ms • plain INTSXP 256 µs. unsafe is ~4× faster than the default guard; r_unwind and default are similar (both use catch_unwind internally; r_unwind adds R_UnwindProtect per callback).

String ALTREP (64 K strings): create 2.51 ms • elt access 2.81 ms • elt with NA 2.72 ms • force materialise 6.22 ms • plain STRSXP elt 4.47 ms.

🔗GC protection (per-op)

MechanismSingle opNotes
Protect stack1.9 nsarray write + counter decrement
Vec pool (VECSXP)13.0 ns+ free list
Slotmap pool14.4 ns+ generational safety check
Precious list15.3 nsCONS alloc + prepend
DLL preserve29.1 nsCONS alloc + doubly-linked splice

Replace-in-loop (10 K iterations): Vec pool 1.14 ms • DLL preserve 1.30 ms • Precious churn 74 ms (O(n²); do not use for reassignment).

🔗Lint (miniextendr-lint)

BenchmarkScaleMedian
full_scan10 modules2.2 ms
full_scan100 modules19.8 ms
full_scan500 modules114 ms
impl_scan10 types2.4 ms
impl_scan100 types22.2 ms
scaling500 fns / 10 files8.6 ms
scaling500 fns / 500 files76.8 ms

Linear in both module count and file count.

🔗Connections (feature connections)

OperationSizeMedian
connection_build + open-625 ns
connection_write-28 ns
connection_read64 B24 ns
connection_read16 KB273 ns
connection_write_sized16 KB922 ns
connection_burst_write50 items1.04 µs

🔗Publishing new baselines

When a PR changes numbers that show up in this document, attach a fresh baseline to the PR and point BENCH_RESULTS_<date>.md at it. The workflow:

  1. just bench-save on a quiet machine.
  2. Commit the new raw results file under miniextendr-bench/BENCH_RESULTS_<YYYY-MM-DD>.md (re-using the template from the 2026-02-18 file).
  3. Update the “Reference baseline” section in this document to point at the new file.
  4. Mention in the PR description which subsystem moved and by how much. If the reference baseline’s machine differs from the one used for the run, note it. Don’t silently mix hardware.

Old baselines stay in the tree. miniextendr-bench/baselines/ plus the dated BENCH_RESULTS_*.md files are the historical record used by just bench-drift.

🔗Skipped / known issues

  • wrappers: Requires rpkg installed to provide R class methods. Skipped on a cold checkout; run just rcmdinstall && just bench --bench wrappers.
  • R-side benchmarks: Need rpkg installed and the R bench package. Run just bench-r.
  • Macro compile-time: Generates synthetic crates on a temp-dir. Run just bench-compile; the results are written to the temp-dir and summarised on stdout.
  • Feature combinations that conflict with --all-features: default-r6 and default-s7 are mutually exclusive, so --all-features fails. Use the explicit feature list from CLAUDE.md’s “Reproducing CI clippy before PR” section when you need the CI-equivalent set.

🔗Reference