StereumLabs introduced: the stack behind our Ethereum client measurements

May 20, 2026 · 21 min read

Founder RockLogic

Artificial Intelligence

Ethereum runs on two layers: an execution client (EC) handles the EVM, transactions, and world state, and a consensus client (CC) handles Proof-of-Stake fork choice and validator duties. Six EC implementations and seven CC implementations exist in production today, paired in dozens of combinations across the network. StereumLabs runs all of them, side by side, on identical hardware in our Vienna data center, and publishes the numbers.

This post is the technical introduction to that platform: the bare-metal fleet, the metrics and logs pipeline, the label conventions that make the pairings comparable, and the in-house AI workflow that turns the resulting telemetry into the blog posts you are reading.

RockLogic publishes a separate case study on the business side of this workflow: how the same "AI on own data" pattern keeps customer telemetry inside the perimeter while still producing useful answers. This post is the technical view from the other side of the same workflow.

Inside the StereumLabs stack: how we measure Ethereum clients from bare metal up

The measurement question

Every Ethereum client team publishes its own benchmarks. Every cloud provider publishes its own reference architectures. Every node operator has a folder of opinions. What is rarer is a fleet where all six execution clients and all seven consensus clients run on identical hardware, under identical scenarios, with identical scrape cadence and label conventions, and with the raw data kept long enough to ask new questions about old runs.

StereumLabs exists to fill that gap. The technical choices below are not the only ones that could work, but they are the ones that keep comparability cheap, reproducibility verifiable, and post-hoc questions answerable without re-running anything.

The hardware base

The bare-metal anchor lives in our Vienna NDC2 data center, run end-to-end by RockLogic. Bare metal cuts out three moving parts that contaminate cloud measurements: instance families that change underneath you, noisy neighbours that blur tail latencies, and IOPS guarantees that come with footnotes. For a measurement product, all three are problems we prefer not to have.

Component	Detail	Why it matters
CPU	AMD EPYC 9654P (96 cores, single socket)	96 cores let us isolate one client per VM with substantial headroom, so a busy client cannot crowd out its co-tenants
Memory	ECC DDR5	Bit-flip detection is a measurement-quality concern, not a paranoia tax. Silent corruption skews log-derived percentiles
Storage	ZFS raidz1 on Solidigm D7-P5520 NVMe	Predictable IOPS, durable through single-disk failure, ARC behavior that we can describe rather than guess at
Network	2x Broadcom P225P (each dual-port 25 GbE)	Four 25 GbE ports per host. Enough headroom to separate measurement traffic from client P2P without crosstalk
Hypervisor	Proxmox VE 9.1	Open source, scriptable, lets us pin CPU sets per VM. Snapshots make scenario rollbacks cheap
OS in guests	Ubuntu 22.04 LTS	Boring, well-understood, matches what most operators run
Time sync	NTP	Clocks synchronised across the fleet so `head updated` timestamps are comparable across hosts

The bare-metal fleet is paired with a smaller cloud footprint in GCP. The cloud cohort exists for one specific reason: to surface, at indicator level, how much of what we measure is the client and how much is the environment. Without any cloud comparator, every measurement we publish is implicitly "what Erigon does on a bare-metal NVMe box in Vienna". With one, we can flag when a result changes meaningfully across environments. The GCP cohort is small and currently focused on Geth and a subset of pairings. It is sized to catch direction-of-effect differences, not to characterise the full performance surface of any commercial cloud. Where a finding turns on the comparator (such as the inbound-firewall observation discussed later in this post), we say so explicitly.

A typical EC plus CC pairing occupies two VMs: one for the execution client, one for the consensus client. Splitting the roles across two hosts is deliberate. It means node_exporter on each VM gives us clean per-layer resource consumption, with no need to attribute CPU or RAM to one or the other after the fact.

The math is straightforward: six EC families times six CC families is 36 pairings, plus the standalone Caplin host that runs both layers in one binary, plus the GCP cohort that mirrors a subset of those pairings. The full fleet sits at roughly 90 hosts across both deployments. The 36-pairing matrix below is the bare-metal NDC2 core; the GCP comparator widens the picture but does not change its shape.

Fleet coverage matrix: every consensus client paired with every execution client, plus standalone Caplin
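The pairing arithmetic can be sketched in a few lines. The client names are the ones listed in the label vocabulary later in this post; the variable names are illustrative only and not part of any StereumLabs tooling.

```python
from itertools import product

# Client names as they appear in the StereumLabs label vocabulary.
EC_CLIENTS = ["besu", "erigon", "ethrex", "geth", "nethermind", "reth"]
CC_CLIENTS = ["grandine", "lighthouse", "lodestar", "nimbus", "prysm", "teku"]

# Every execution client paired with every consensus client.
pairings = list(product(EC_CLIENTS, CC_CLIENTS))

# Caplin runs both layers in one binary, so it is counted separately.
standalone = ["caplin"]

print(len(pairings))                    # 36
print(len(pairings) + len(standalone))  # 37 bare-metal configurations
```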

The data pipeline

The pipeline has three rails: metrics, logs, and dashboards. Each rail is boring on its own. The interesting part is how they are wired together so that a question asked today can be answered against data captured weeks ago.

Metrics

Data pipeline: sources to consumers across the StereumLabs fleet

Every EC and CC process exposes a /metrics endpoint in Prometheus text format. Every host runs node_exporter for OS-level counters. Scrape cadence is 15 seconds, which is a deliberate compromise: short enough to catch most state transitions, long enough that cardinality stays manageable.

We run two Prometheus tiers:

prometheus-free: short retention, mirrored subset of metrics, available to Free-tier subscribers with a 7-day data delay.
prometheus-cold: full metric set, no retention cap. This is where we run avg_over_time(...[7d:1h]) queries when we write a blog post.

Cold storage is what makes most of the StereumLabs blog series possible. The Caplin standalone analysis from last week pulled metrics that were captured before Caplin v3.3.10 had been added to the fleet, and ran the same queries against the new host. No re-runs, no warm-up periods, no "we'll publish in two weeks once we have data".
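To give a feel for the volumes involved, here is a back-of-the-envelope sketch of what a `[7d:1h]` subquery touches for a single series. It assumes an uninterrupted 15-second scrape and ignores step alignment, so the counts are approximate.

```python
SCRAPE_INTERVAL_S = 15        # scrape cadence stated above
WINDOW_S = 7 * 24 * 3600      # the 7d range in [7d:1h]
STEP_S = 3600                 # the 1h resolution in [7d:1h]

# Raw samples one series accumulates over the window.
raw_samples = WINDOW_S // SCRAPE_INTERVAL_S

# Approximate number of points the subquery evaluates before averaging.
subquery_points = WINDOW_S // STEP_S

print(raw_samples)      # 40320
print(subquery_points)  # 168
```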

Logs

Logs travel a parallel path on the same diagram above: Filebeat streams logs from every container into Elasticsearch, where Kibana and the MCP server (Model Context Protocol, the read-only query gateway our AI workflow uses; covered later in this post) consume them in the same way Grafana consumes Prometheus. Logs are the second half of what we measure, and frequently the half that surfaces things metrics cannot. The classic example: the per-block head updated line in Erigon contains four fields (execution time, commit time, age, mgas/s) that no /metrics endpoint exposes. Without log capture, the Caplin standalone vs classic comparison would have been a far shallower piece.

Logs land in Elasticsearch with structured fields for container_name, host, ec_client, cc_client, and the rest of the StereumLabs label vocabulary. Fleet-wide ingest currently runs around 30,000 log lines per second with debug logging enabled across every container.

A side effect of capturing everything is that we can write log-derived percentile tables that ask questions the original log format was not designed for. We do not need to instrument the client. We just need it to log enough.
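A minimal sketch of a log-derived percentile, assuming a simplified key=value rendering of the head updated line. The sample lines and field names below are hypothetical; the real Erigon format may differ, and a production parser would need to handle more units and edge cases.

```python
import math
import re

# Hypothetical key=value lines; not copied from a real Erigon log.
SAMPLE_LINES = [
    "head updated execution=41ms commit=12ms age=310ms mgas/s=410.5",
    "head updated execution=55ms commit=9ms age=295ms mgas/s=388.0",
    "head updated execution=38ms commit=30ms age=402ms mgas/s=451.2",
    "head updated execution=120ms commit=11ms age=350ms mgas/s=190.7",
]

EXEC_RE = re.compile(r"execution=(\d+(?:\.\d+)?)ms")

def execution_ms(lines):
    """Extract the execution time in milliseconds from each matching line."""
    return [float(m.group(1)) for line in lines if (m := EXEC_RE.search(line))]

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

samples = execution_ms(SAMPLE_LINES)
print(percentile(samples, 50))  # 41.0
print(percentile(samples, 99))  # 120.0
```

Nearest-rank is used here because it always returns a value that was actually observed, which keeps a published percentile traceable to a specific log line.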

Dashboards

Grafana sits on top of both Prometheus and Elasticsearch. Dashboards are the public face of the platform: panels with info icons that link back to the definitions page, time-aligned charts for cross-client comparison, and lifecycle badges (active, experimental, legacy) so you know whether to cite a result. Access is gated by plan tier, which determines how much retention and how many Grafana users you get.

The naming convention is Category – Metric (Scope), units are SI, and per-core normalization is always labeled. The reason we are strict about it: the only way cross-client comparisons stay legible is if a panel titled "CPU – Utilization (EC process)" means the same thing on every dashboard regardless of which client it is plotting.

Custom labels: the comparability layer

Every metric series in prometheus-cold carries a fixed set of StereumLabs labels:

ec_client: execution client name (besu, erigon, ethrex, geth, nethermind, reth).
ec_version: pinned version string.
cc_client: consensus client name (grandine, lighthouse, lodestar, nimbus, prysm, teku, plus caplin when running standalone).
cc_version: pinned version string.
role: ec or cc, so a query can target the layer regardless of which process the metric came from.
deployment: provider code (NDC2 for bare-metal Vienna, GCP for cloud).
location: provider-specific region or zone identifier.

The result is that a query like

avg_over_time(
  sum by (instance) (
    rate(node_cpu_seconds_total{role="ec", ec_client="erigon", deployment="NDC2", mode!="idle"}[1h])
  )[7d:1h]
)

returns exactly the slice you would ask a human for in plain English: "the 7-day Erigon EC CPU average on bare metal". No prior knowledge of the underlying scrape topology needed.
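Because the label set is fixed, such queries can be generated rather than hand-written. The sketch below is one way to do that; the helper names are hypothetical, and it uses the node_exporter gauge `node_memory_MemAvailable_bytes` purely as an example metric.

```python
def selector(metric, **labels):
    """Build a PromQL selector from label key/value pairs (sorted for stable output)."""
    parts = ", ".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{parts}}}"

def weekly_average(expr):
    """Wrap an expression in a 7-day average sampled at 1-hour resolution."""
    return f"avg_over_time(({expr})[7d:1h])"

sel = selector(
    "node_memory_MemAvailable_bytes",
    role="ec", ec_client="erigon", deployment="NDC2",
)
print(weekly_average(sel))
```

Swapping `ec_client="erigon"` for another client name yields the equivalent query for that client, which is the point of keeping the vocabulary fixed.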

Most of the unglamorous engineering effort goes into keeping this label vocabulary consistent. It is also what made the EC P2P peering deep dive feasible: that post compared roughly 90 hosts across six EC families and two deployment classes in one chart. Without consistent labels, the same comparison would have required a custom ETL job for each section.

Neutrality and how we keep runs comparable

Neutral measurement is a methodological commitment, not a marketing claim. The concrete rules we enforce inside the fleet:

Start from defaults. Every client runs vendor defaults unless a documented, client-team-recommended deviation is required for stability or scenario correctness. Each deviation is annotated in the run manifest.
Equalize budgets. Every VM in a comparison cohort gets the same vCPU, RAM, and disk allocation. The Caplin standalone post explicitly called out the asymmetry where it existed, because pretending it did not exist is the fastest way to publish noise as signal.
Pin versions. Versions are explicit in the labels, in the manifest, and in the dashboard headers. When a client team ships a new release, we add a new host. We do not silently upgrade the existing one underneath the data.
Reject skew. NTP is enforced across the fleet. Runs with material clock drift during the measurement window are excluded from comparisons and flagged in the run notes.
Document trade-offs. Pruning modes, cache sizes, peer caps: anything that affects interpretation is in the run notes, not the appendix of an internal wiki.

A run manifest is the concrete artifact that carries all of this. It records the pinned client versions, the host specs (vCPU, RAM, disk, hypervisor), the start and end of the measurement window, any deviations from defaults with their justifications, and the acceptance checks that gate publication. Together with the panel-definition links from the dashboard each chart was drawn against, the manifest is what makes a published result re-runnable later. Enterprise subscribers get the full manifest attached to every dashboard view; public posts cite the relevant subset in their methodology notes.
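As an illustration only, a manifest with the fields listed above could be modelled like this. The class names, field names, units, and example values are assumptions made for the sketch; the actual StereumLabs manifest schema is not shown in this post.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class HostSpec:
    vcpu: int
    ram_gib: int
    disk_gib: int
    hypervisor: str

@dataclass
class RunManifest:
    ec_client: str
    ec_version: str
    cc_client: str
    cc_version: str
    host: HostSpec
    window_start: datetime
    window_end: datetime
    # Deviations from vendor defaults: setting -> justification.
    deviations: dict = field(default_factory=dict)

def same_budget(a: RunManifest, b: RunManifest) -> bool:
    """Acceptance check: runs are comparable only if their resource budgets match."""
    return (a.host.vcpu, a.host.ram_gib, a.host.disk_gib) == (
        b.host.vcpu, b.host.ram_gib, b.host.disk_gib)

# Placeholder values; versions and sizes are not real fleet data.
spec = HostSpec(vcpu=8, ram_gib=32, disk_gib=2000, hypervisor="Proxmox VE")
start = datetime(2026, 5, 1, tzinfo=timezone.utc)
end = datetime(2026, 5, 8, tzinfo=timezone.utc)
run_a = RunManifest("geth", "x.y.z", "prysm", "x.y.z", spec, start, end)
run_b = RunManifest("reth", "x.y.z", "prysm", "x.y.z", spec, start, end)
print(same_budget(run_a, run_b))  # True
```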

The hard part of this work is rarely the infrastructure itself. Most of the effort goes into the rules that keep runs comparable: label vocabulary, version pinning, deviation logging, scenario manifests, deciding what counts as "enough" data. The RockLogic AI case study makes a parallel observation: hooking a model up to the data was an afternoon; defining what a useful answer looks like took weeks. Spinning up Prometheus and Grafana is well-documented and quick. Turning that into a substrate where a researcher in 2027 can re-run today's query against the same definitions and get a comparable result is the ongoing commitment, and it is what subscribers are paying for.

How this fits alongside other measurement work

StereumLabs is not the only public effort to characterize Ethereum client behavior, and not the first. A non-exhaustive list of work that we read, learn from, and consider complementary:

MigaLabs' Armiarma: a libp2p crawler focused on Ethereum's CL network, producing peer-set and gossipsub data. MigaLabs also publishes broader network analysis at migalabs.es. Their network-wide view is wider than ours; our per-pairing resource view is more granular.
The EF's ethpandaops: Kurtosis-based reproducible Ethereum devnets (ethereum-package), the ethereum-metrics-exporter, checkpointz, mainnet monitoring tooling, and related infrastructure. Their toolset covers a much wider operator surface than ours; our addition is the controlled cross-client comparison on identical hardware.
Probe-Lab: P2P network measurement across libp2p ecosystems (Ethereum, IPFS, Filecoin and others) using their own tooling stack (Parsec, Nebula, Hermes, Ukla). Different protocol layers, complementary insights.
MEV-Boost relay observability via projects like relayscan.io and mevboost.pics: payload-flow and relay-bid visibility that sits one level above what we measure.
Client-team published benchmarks: Nethermind, Sigma Prime, the Erigon team and others publish their own perf work. Our value-add is running their releases head-to-head on neutral hardware rather than each team optimising their own demo setup.

Our niche is the cross-product: every EC and CC implementation, paired with every other, on identical hardware under identical scenarios, with the raw telemetry retained long enough to ask new questions of old runs. Where another effort goes wider or deeper on one axis, we cite it; where our data complements theirs, we say so.

What this stack has produced

What matters is what the platform makes visible. A non-exhaustive recap of recent output:

The Caplin standalone vs classic Erigon comparison found that the monolithic node executes EVM blocks 12 to 31% faster at the median but pays a 2x penalty on MDBX commits, ending up roughly tied at end-to-end p50 across 50,000 sampled blocks.
The EC P2P peering deep dive found that Reth maintains over 5x the peer count of Besu under stock defaults on our fleet, and that in our GCP cohort the default cloud firewall rules silently blocked DevP2P inbound on Geth. An operator using the same provider defaults would inherit that artifact unless they opened the port explicitly.
The EC sync speed comparison quantified the gap between the fastest and slowest initial sync at over an order of magnitude on identical hardware.
The Nimbus v26.3.1 block-building post characterised how each EC behaves when Nimbus drives block-building duties, using a shadow setup that mirrors 1,000 validator pubkeys across five EC pairings over a 48-hour window.
The Teku cross-version analysis traced how resource consumption shifted across three Teku releases.
A two-week AI-assisted security-audit campaign (May 4 to May 18, 2026) used the fleet as a live test-bed, producing 54 finding documents, 6 Ethereum Foundation Bug Bounty submissions, 13 paste-ready upstream PR drafts across 7 client projects, and 24 Kurtosis devnet PoCs. Several findings (gossip-rejection spikes, engine-API timeout storms, P99 latency divergences) surfaced first as anomalies in the Prometheus and Elasticsearch stack. A dedicated post will cover the campaign.

None of these required new instrumentation. They required asking new questions of data that was already in prometheus-cold and Elasticsearch. That is the payoff of the upfront effort to keep everything labelled, retained, and reproducible.

The AI workflow: closing the loop

This is where the post connects back to the RockLogic case study.

Every StereumLabs blog post you read carries stereumlabs-ai as a co-author. The label is literal. Anthropic Claude has read-only MCP access to our Prometheus and Elasticsearch instances, drives the queries, reads the results, drafts the prose, and proposes the tables. A human editor reviews, corrects, and approves before publication. The case study describes that workflow from a business angle: data stays in-house, output is cited and reproducible, weeks of analysis collapse into hours. This post is the technical view of the same loop from the other side.

The wiring looks like this:

AI workflow: telemetry to blog post, with raw data confined to the EU perimeter

A few properties of that loop are worth being explicit about:

Raw telemetry stays on EU infrastructure. The MCP (Model Context Protocol) server runs inside the same perimeter as Prometheus and Elasticsearch. What crosses the boundary to the Claude API is a curated query slice and its result, not the underlying logs or metric series.
Citations are mandatory and auditable. Every number that lands in a published post is required to come back from a query the model can name and a result the editor can re-execute. The editor can re-run the cited PromQL or Elasticsearch query before publication to confirm the number, and reject the claim if it does not reproduce. "The system claims" is not a publishable sentence.
Context is trimmed aggressively. A 7-day avg_over_time query returns one number, not a histogram. A head updated log extraction returns a summary table, not 50,000 lines. The model is given the smallest input that answers the question. The trimming is a structural mitigation, not a guarantee of correctness; it makes hallucinated numbers much harder to slip past a citation check, but it does not eliminate them.
Human review is the final filter. Stefan Kobrc (Founder, RockLogic) is the named editor on every StereumLabs post. The editor's job is to re-run a sample of the cited queries, sanity-check the conclusions against domain knowledge, and reject claims that the data does not support. When a correction lands after publication, it is logged in the post's revision history rather than silently rewritten.
Format compliance is enforced. The post you are reading uses the same frontmatter, the same table style, and the same link conventions as every other StereumLabs post because the workflow checks for it before output.
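The citation rule lends itself to a mechanical pre-check before the human review. A minimal sketch, assuming each drafted number arrives with the query it cites and that the editor can supply a function that re-executes a query; the `Claim` type, the tolerance, and the stand-in results are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Claim:
    text: str     # sentence as drafted
    query: str    # PromQL or Elasticsearch query the model cites
    value: float  # number the draft reports

def reproduces(claim: Claim, rerun_value: float, rel_tol: float = 0.01) -> bool:
    """True if the re-executed result matches the drafted number within tolerance."""
    return math.isclose(claim.value, rerun_value, rel_tol=rel_tol)

def review(claims, rerun):
    """Split claims into accepted and rejected; rerun is a callable query -> float."""
    accepted, rejected = [], []
    for c in claims:
        ok = bool(c.query) and reproduces(c, rerun(c.query))
        (accepted if ok else rejected).append(c)
    return accepted, rejected

# Stand-in for a real query backend, used only to exercise the check.
fake_results = {"q1": 41.2, "q2": 60.0}
claims = [
    Claim("median is 41 ms", "q1", 41.0),
    Claim("median is 50 ms", "q2", 50.0),
    Claim("the system claims 7", "", 7.0),  # no citation: never publishable
]
accepted, rejected = review(claims, fake_results.__getitem__)
print(len(accepted), len(rejected))  # 1 2
```

A check like this only confirms that a number reproduces; whether the query measures what the sentence says it measures remains the editor's call.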

The same pattern is what we offer to customers in the RockLogic case study. We run it on ourselves first, on our own infrastructure with our own data, before pointing it at anyone else's.

What is harder than it looks, and what we cannot fully control

Five observations from running this platform. The first three are operational costs we accept. The last two are biases worth flagging so that readers can weigh them against their own deployment.

Metric heterogeneity across clients is the first operational cost. Caplin v3.3.10 ships a comprehensive libp2p_rcmgr_* family but no gauge for the current libp2p peer count. Prysm exposes connected_libp2p_peers{agent="..."} with per-implementation breakdowns. Lighthouse, Lodestar, Nimbus, Teku, and Grandine each use different metric names for what is conceptually the same thing. Every cross-client query is implicitly a translation layer. We maintain that layer as part of the dashboard panel definitions and as part of the AI workflow's instruction set.
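A translation layer of this kind can be as simple as a lookup from client name to query template. The sketch below uses only the two facts stated above (the Prysm gauge name and the missing Caplin gauge); the third entry is a placeholder with an invented metric name, included to show the shape, and the function name is hypothetical.

```python
# Query templates per consensus client; {sel} is filled with the label selector.
PEER_COUNT_EXPR = {
    # Per the text: Prysm exposes a per-agent gauge, summed here into one number.
    "prysm": "sum(connected_libp2p_peers{{{sel}}})",
    # Per the text: Caplin v3.3.10 has no current-peers gauge.
    "caplin": None,
    # Placeholder, NOT a real client or metric name.
    "exampleclient": "example_peer_count{{{sel}}}",
}

def peer_count_query(cc_client, deployment="NDC2"):
    """Return the PromQL for a client's peer count, or None if no metric is known."""
    template = PEER_COUNT_EXPR.get(cc_client)
    if template is None:
        return None
    sel = f'cc_client="{cc_client}", deployment="{deployment}"'
    return template.format(sel=sel)

print(peer_count_query("prysm"))
print(peer_count_query("caplin"))  # None
```

Returning `None` rather than a guessed query makes the gap explicit, so a dashboard or report can say "not exposed" instead of plotting zero.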

Log format chaos is the second cost. Some clients log structured JSON, some log key=value, some log unstructured prose, some log all three depending on subsystem. Stripping ANSI color codes is a normal preprocessing step. Writing a parser like the one for Erigon's head updated line is the easy part; validating it against edge cases is what takes time.
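The preprocessing step can be sketched as follows: strip ANSI color sequences, then try JSON, then key=value, and fall back to plain text. The regular expressions and the sample lines in the usage are illustrative assumptions, not the patterns used in the StereumLabs pipeline.

```python
import json
import re

ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")    # SGR color sequences only
KV_RE = re.compile(r"(\w[\w./-]*)=(\S+)")  # key=value tokens, keys like mgas/s allowed

def strip_ansi(line):
    """Remove terminal color codes so downstream parsers see plain text."""
    return ANSI_RE.sub("", line)

def parse_line(raw):
    """Classify a log line as json, kv, or text and return (kind, fields)."""
    line = strip_ansi(raw).strip()
    if line.startswith("{"):
        try:
            return "json", json.loads(line)
        except json.JSONDecodeError:
            pass  # looked like JSON but was not; fall through
    pairs = dict(KV_RE.findall(line))
    if pairs:
        return "kv", pairs
    return "text", {"message": line}

print(parse_line("\x1b[32mINFO\x1b[0m head updated age=310ms mgas/s=410.5"))
print(parse_line('{"level": "info", "msg": "ok"}'))
print(parse_line("peer disconnected"))
```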

Keeping runs comparable is the third cost, and the largest. The AI case study reaches the same conclusion about its own workflow: the hard part is not the technology but the editorial spine. For us, that means resisting the temptation to publish numbers from runs that were "almost equivalent". Almost is where bias lives.

Single-site bias. All bare-metal hosts sit in one Vienna data center (NDC2) and share an upstream BGP path and ASN-level peer reachability. That means peer-set composition, geographic latency to other mainnet nodes, and inbound discoverability are correlated across the fleet. Findings about peer count, peer diversity, or P2P churn carry an implicit "as seen from one network position". We add the GCP cohort to break that correlation in one direction, but two environments are an indicator, not a characterisation. Reproducing a finding from a different site or ASN is one good way to test how site-bound it is.

Scrape cadence has a tail-latency floor. 15-second Prometheus scrapes are sufficient for resource averages and steady-state behaviour, but they are too coarse for tail-latency analysis. GC pauses, individual slot timings (Ethereum slots are 12 seconds, so aliasing is possible on slot-aligned metrics), and sub-second spikes are visible only through log-derived percentiles, not through the metrics graph. Where we publish p99 numbers, they come from log parsing, not from histogram_quantile over series scraped every 15 seconds. The relevant posts call this out where it affects interpretation.
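The aliasing between 12-second slots and 15-second scrapes can be worked out directly. The sketch assumes, for simplicity, that a scrape lands exactly on a slot boundary at t = 0; with a different phase the offsets shift but the pattern stays the same.

```python
from math import gcd

SLOT_S = 12    # Ethereum slot duration
SCRAPE_S = 15  # Prometheus scrape interval

# The scrape/slot pattern repeats every least common multiple of the two periods.
cycle_s = SLOT_S * SCRAPE_S // gcd(SLOT_S, SCRAPE_S)

# For each scrape in one cycle: which slot it falls in, and how far into that slot.
scrapes = range(0, cycle_s, SCRAPE_S)
offsets = {t // SLOT_S: t % SLOT_S for t in scrapes}

# Slots in the cycle that no scrape lands in.
slots_per_cycle = cycle_s // SLOT_S
unsampled = [s for s in range(slots_per_cycle) if s not in offsets]

print(cycle_s)    # 60
print(offsets)    # {0: 0, 1: 3, 2: 6, 3: 9}
print(unsampled)  # [4]
```

Under this assumption, four scrapes cover five slots in every 60-second cycle, each at a different point in its slot, and one slot in five is never sampled at all.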

Plans, access, and what is public

Three subscription tiers, structured around what we can responsibly support at each price point:

Plan	Price	Data delay	Retention	Grafana users	Custom runs	Exports
Free	EUR 0	7 days	90 days	1	no	no
Pro	EUR 3,200 / month, billed annually	near real-time	unlimited	5	no	no
Enterprise	custom	near real-time	unlimited	custom	yes	yes

The tiers map to different reader profiles. Free is the right level for ecosystem observers, students, and curious operators who want to read the published numbers and verify our claims against a delayed but otherwise full dataset. Pro is sized for client teams, research groups, consultants, competitive-intelligence shops, and press who need real-time access and unlimited retention across the fleet but do not need their own cohorts. Enterprise is for institutional operators and infrastructure providers who want to commission custom runs against their own questions, ingest data into their own pipelines via exports, and get a contracted scope of work with named contacts. Prices are EUR, excluding VAT, per the public plans page.

The Free tier exists because we think the public Ethereum ecosystem benefits from having neutral measurements visible to anyone. Paid tiers fund the bare-metal fleet, datacenter, hardware refresh, on-call rotation, and ongoing engineering effort that keeps the data reproducible week over week.

The blog and the public dashboards are open without subscription. Every post links back to the definitions, the methodology, and the scope page, so a critical reader can trace any claim back to the query that produced it.

Who this is for

Client teams comparing their own release against the rest of the field on identical hardware, without having to stand up the rest of the field themselves.
Institutional node operators sizing infrastructure decisions against measurements taken at production-equivalent scale.
Researchers running cross-client studies who need a reproducible substrate that is documented down to the firmware version.
Security researchers and bug-bounty hunters using cross-client production observability to surface anomalies that source review alone misses.
Builders of derivative tooling (indexers, RPC providers, MEV searchers) who want to know how the client they depend on behaves under load.

If you are in one of the above categories and want a custom dimension of the fleet measured against your specific question, Enterprise-scope custom runs are designed for that. Start with the question, not with the SKU. Write to contact@stereumlabs.com describing what you want to learn, and we will scope it together, either inside the current scenario matrix or by spinning up a new VM cohort.

Summary

Dimension	What StereumLabs does
Hardware anchor	🟢 Bare-metal EPYC 9654P fleet in Vienna NDC2, with GCP cloud comparator
Coverage	🟢 6 EC × 6 CC pairings on bare-metal (36 pairings, two VMs each) plus 1 standalone Caplin plus a GCP comparator cohort; ~90 hosts total
Metrics retention	🟢 `prometheus-cold` keeps everything, scrape every 15s
Log capture	🟢 Filebeat to Elasticsearch from every container
Comparability	🟢 Fixed label vocabulary across all series
Neutrality	🟢 Vendor defaults except for documented, justified deviations
Reproducibility	🟢 Definitions, manifests, and procedures published per dashboard
AI workflow	🟢 Claude via MCP against own data; raw telemetry stays in EU; mandatory citations
Public access	🟢 Free tier with 7-day delay; Pro and Enterprise for full retention
Source of insight	🟢 Weekly blog series on EC and CC behavior, archive growing since launch
Closed surface	🟡 Custom run authoring is Enterprise-only
Coverage gap	🟡 Pre-merge clients and L2 sequencers out of current scope

Where to go next

If you read one thing next, make it the Caplin standalone vs classic Erigon comparison. It is the most recent end-to-end example of this stack answering a concrete question, and it shows what the manifest, the labels, and the AI workflow combined produce.

After that:

Browse the blog archive for previous analyses across the EC and CC fleet.
Follow RockLogic on X for new posts as they ship.
For the methodology in depth, read Purpose and Scope and the Client Metrics List.
For the business-side companion to this post, see the RockLogic AI on own data case study.

If you want to talk about a custom measurement, a Pro or Enterprise account, or a question you would like the fleet pointed at, write to contact@stereumlabs.com. Tell us the question, not the SKU, and we will scope it with you.
