Tracing a Besu memory leak to a one-line method

June 10, 2026 · 9 min read

Founder RockLogic

Artificial Intelligence

Six Besu nodes, same version, same hardware, same config. Five held a flat JVM heap around 1.0 to 1.3 GB. The sixth climbed about 10 GB a day and was on track to be OOM-killed roughly 30 hours after a restart. The one thing different about it was the consensus client on the other side of the engine API.

This is a walkthrough of how StereumLabs AI, reading our fleet's metrics and logs, took that one anomalous node, traced it to a single method, and filed it upstream. Besu shipped a round of mitigations and closed the issue. A later devnet reproduction showed the underlying layers still pile up, the issue was reopened, and the fix that followed is now in review. The bug is operational: recoverable by a restart, no consensus impact, no double-sign, no state-root divergence. It is also the kind of cross-client interaction a single-node test will never surface, because it only appears when a live pairing lands in a specific state.

Figure: six identical Besu nodes over time. Five hold a flat JVM heap near 1 GB while the Prysm-paired node climbs about 10 GB per day toward an out-of-memory kill.

One odd node out of six is a signal, not noise

We run every major execution client paired with every major consensus client, on identical bare metal, with one telemetry pipeline across the whole set. The point of that layout is differential: when six Besu nodes share a version, a host spec, and a config, and differ only in their paired consensus client, any divergence between them isolates the variable for you. So one of six otherwise-identical Besu nodes leaking heap while its five siblings stay flat is not a mystery to explain away. It is a starting point.

StereumLabs AI sits on top of that telemetry. It read the divergence, formed a root-cause hypothesis, and walked it down to a single method. The steps below are ordinary observability: fleet Prometheus, JVM GC metrics, container logs, and one flight-recorder window. Nothing exotic.

The hunt

Isolate. A cross-pair comparison over fleet Prometheus showed five flat Besu nodes and one climbing. The leaking node was the Besu paired with Prysm v7.1.3. That alone narrowed the cause to something about that pairing, not Besu in general.
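The cross-pair comparison reduces to a small check: take one metric per sibling and flag whatever sits far from the group median. The sketch below is a minimal illustration of that idea, not the query StereumLabs AI ran. The class name, the node labels a caller passes in, and the threshold factor are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class SiblingOutliers {
  /**
   * Returns the labels whose value is more than `factor` times the sibling median.
   * Hypothetical helper: the metric (for example JVM heap in GB) is whatever the caller supplies.
   */
  static List<String> aboveMedian(Map<String, Double> valueByNode, double factor) {
    List<String> flagged = new ArrayList<>();
    int n = valueByNode.size();
    if (n == 0) {
      return flagged; // nothing to compare
    }
    double[] sorted =
        valueByNode.values().stream().mapToDouble(Double::doubleValue).sorted().toArray();
    double median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    for (Map.Entry<String, Double> entry : valueByNode.entrySet()) {
      if (entry.getValue() > median * factor) {
        flagged.add(entry.getKey());
      }
    }
    return flagged;
  }
}
```

Fed five values between 1.0 and 1.3 and one far above them, with a factor of 2, it returns only the outlier's label. The median is used rather than the mean so that the leaking node does not drag the baseline toward itself.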

Figure: JVM heap over 30 hours for the six Besu nodes. Five stay flat near 1 GB; the Prysm-paired one rises linearly at about 10 GB per day toward the OOM ceiling.

Characterise. The garbage-collection signature said live leak, not allocation pressure. After every collection, Old Gen occupancy still climbed about 0.4 GB/h, and the leaking node had the lowest allocation rate of the six. What remained in Old Gen after each collection was live data: the collector had nothing left to reclaim. Heap was being retained, not churned.
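The retained-heap reading can be turned into a rate and a deadline with a straight-line fit over post-collection Old Gen samples. This is a minimal sketch under assumptions: the class and method names are hypothetical, the samples are supplied by the caller, and the heap ceiling is a parameter because the configured limit is not stated here.

```java
final class LeakProjection {
  /** Least-squares slope of post-GC Old Gen occupancy (GB) against time (hours). */
  static double slopeGbPerHour(double[] hours, double[] oldGenAfterGcGb) {
    int n = hours.length;
    if (n < 2 || n != oldGenAfterGcGb.length) {
      throw new IllegalArgumentException("need at least two paired samples");
    }
    double meanX = 0;
    double meanY = 0;
    for (int i = 0; i < n; i++) {
      meanX += hours[i] / n;
      meanY += oldGenAfterGcGb[i] / n;
    }
    double sxy = 0;
    double sxx = 0;
    for (int i = 0; i < n; i++) {
      sxy += (hours[i] - meanX) * (oldGenAfterGcGb[i] - meanY);
      sxx += (hours[i] - meanX) * (hours[i] - meanX);
    }
    if (sxx == 0) {
      throw new IllegalArgumentException("all samples share one timestamp");
    }
    return sxy / sxx;
  }

  /** Hours until the ceiling is reached at a constant slope; infinite if the heap is not growing. */
  static double hoursUntil(double ceilingGb, double currentGb, double slopeGbPerHour) {
    if (slopeGbPerHour <= 0) {
      return Double.POSITIVE_INFINITY;
    }
    return Math.max(0, (ceilingGb - currentGb) / slopeGbPerHour);
  }
}
```

A slope of 0.4 GB/h works out to 9.6 GB per day, which matches the roughly 10 GB a day seen across the fleet comparison. A flat sibling yields a slope near zero and an infinite projection.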

Corroborate. The leaking node emitted roughly seven times the log volume of the median Besu pair, dominated by one engine new-payload hot path: "block already present". In 17.5 hours it logged 53,756 engine_newPayloadV4 calls for blocks it already had, against a one-per-slot baseline near 5,250 (17.5 hours of 12-second slots).
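The baseline is plain arithmetic on Ethereum's 12-second slot time. A minimal sketch, with hypothetical class and method names, that derives the expected call count for a window and the ratio of observed to expected:

```java
final class NewPayloadBaseline {
  /** Ethereum mainnet slot duration in seconds. */
  static final int SECONDS_PER_SLOT = 12;

  /** Expected new-payload calls in a window, assuming one call per slot. */
  static long expectedCalls(double hours) {
    return Math.round(hours * 3600.0 / SECONDS_PER_SLOT);
  }

  /** How many times the observed count exceeds the one-per-slot expectation. */
  static double ratio(long observedCalls, double hours) {
    return (double) observedCalls / expectedCalls(hours);
  }
}
```

For a 17.5-hour window this gives 5,250 expected calls, so 53,756 observed calls is a little over ten times the baseline.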

Confirm. A Java Flight Recorder window settled it. About 84 percent of all CPU samples sat inside one method, LayeredKeyValueStorage.isClosed(), with most of the rest in the layered store's comparator. The chain head had not advanced in over 18 hours, despite tens of thousands of new-payload calls.

The root cause, in plain terms

The cause sat in the seam between the two clients; the symptom showed up only on Besu.

On the consensus side, the paired Prysm had fallen behind and was catching up. It fed Besu a run of blocks through engine_newPayload without sending a forkchoiceUpdated between them. Many of those calls timed out on Prysm's HTTP client, so Prysm retried them, which is where the loud "block already present" log volume came from. The Prysm-side issue that drops it out of sync, and into this catch-up, is tracked in OffchainLabs/prysm#16096. Consensus clients that send a forkchoice update first, such as Lighthouse and Nimbus, take a different path on Besu and do not trigger it.

On the Besu side sat the part that had not been seen before. For each new block applied this way, Besu freezes a Bonsai LayeredKeyValueStorage and, on the no-forkchoice path, never closes it, so eviction falls to the garbage collector. Under sustained catch-up the layers pile up faster than the collector frees them. The repeat "block already present" calls do not add layers themselves, but every state access now has to check whether its layer is closed, and that check delegates straight up the parent chain:

public boolean isClosed() {
  return parent.isClosed();
}

That one line runs on every state access. With a flat layer stack it is free. With a thousand-plus accumulated layers it becomes a walk up the entire parent chain on every access, which is why it rose to about 84 percent of CPU. Block validation could then no longer finish inside the engine timeout, so the chain head stalled, which made Prysm time out and retry more, which deepened the stack further. A self-reinforcing loop, with the heap curve in the chart above as its outward sign.
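The cost of that delegation is easy to see in a toy model. The sketch below is not Besu's code: the Store interface, the Base and Layer classes, and the hop counter are hypothetical stand-ins that keep only the delegating closed check from the snippet above.

```java
final class LayerChainDemo {
  /** Minimal stand-in for a storage layer; only the closed check is modelled. */
  interface Store {
    boolean isClosed();
  }

  /** Bottom of the stack: answers the check itself (always open in this toy). */
  static final class Base implements Store {
    private final long[] hops;

    Base(long[] hops) {
      this.hops = hops;
    }

    @Override
    public boolean isClosed() {
      hops[0]++;
      return false;
    }
  }

  /** A layer that delegates the check to its parent, like the one-line method. */
  static final class Layer implements Store {
    private final Store parent;
    private final long[] hops;

    Layer(Store parent, long[] hops) {
      this.parent = parent;
      this.hops = hops;
    }

    @Override
    public boolean isClosed() {
      hops[0]++;
      return parent.isClosed();
    }
  }

  /** Stacks `depth` layers over one base and counts the calls made by a single check. */
  static long hopsForOneCheck(int depth) {
    long[] hops = {0};
    Store top = new Base(hops);
    for (int i = 0; i < depth; i++) {
      top = new Layer(top, hops);
    }
    top.isClosed();
    return hops[0];
  }
}
```

With no layers a check costs one call. With a thousand layers it costs 1,001, and that price is paid again on every state access.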

What shipped, what reopened it, and the fix

We filed this as besu-eth/besu#10498. Besu triaged it P2 and landed a set of mitigations:

PR #10508 caches a layer's closed-state so the parent-chain walk can short-circuit.
Three allocation cleanups (#10523, #10526, #10527) replace per-call exception objects with stackless singletons, cutting the GC pressure the stall generated.
PR #10559 caches the validated engine JWT so a consensus-client reconnect storm no longer forces a Besu restart.

Those merged and the issue was closed on 2026-05-28. The direct structural fix, PR #10509, which would have capped the layer stack and returned SYNCING past a bound, was closed on 2026-05-27 without merging.

It came back. On a mainnet nightly node running 26.6-develop with Prysm 7.1.4, the heap climb returned, and Besu's maintainer reopened the issue on 2026-06-05. The isClosed() cache from #10508 only short-circuits once a layer is closed, and these layers are never closed, so a deep open chain still recurses fully on every call. The caches relieve the surrounding pressure. They do not stop the layers from accumulating.
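Why a closed-state cache leaves an open chain untouched can be shown with a toy model. This is a sketch under assumptions, not the code from #10508: the Node class, its fields, and the hop counter are hypothetical, and the only behaviour modelled is "remember the answer once it is true".

```java
final class ClosedCacheDemo {
  /** Hypothetical layer: the root holds the real flag, every node may cache a true answer. */
  static final class Node {
    final Node parent;
    boolean closed; // meaningful on the root only
    boolean cachedClosed; // set once a check has seen "closed"

    Node(Node parent) {
      this.parent = parent;
    }
  }

  /** Number of nodes visited by the most recent check. */
  static long hops;

  static boolean isClosed(Node node) {
    hops++;
    if (node.cachedClosed) {
      return true; // short-circuit: only possible after a close has been observed
    }
    boolean result = (node.parent == null) ? node.closed : isClosed(node.parent);
    if (result) {
      node.cachedClosed = true;
    }
    return result;
  }

  /** Builds a root plus `depth` layers and returns {root, top}. */
  static Node[] chain(int depth) {
    Node root = new Node(null);
    Node top = root;
    for (int i = 0; i < depth; i++) {
      top = new Node(top);
    }
    return new Node[] {root, top};
  }

  /** Runs one check from `top` and reports how many nodes it visited. */
  static long hopsFor(Node top) {
    hops = 0;
    isClosed(top);
    return hops;
  }
}
```

While the root stays open, every check from the top of a 1,000-layer chain visits all 1,001 nodes, however many times it is repeated. Only after the root is closed does the next check populate the cache, and only then do later checks return in one hop.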

To pin the mechanism down we built a Kurtosis devnet reproduction: a Besu 26.6.0 target paired with Prysm, plus two controls: the same Besu paired with Lighthouse, and a Prysm paired with Geth. Stopping the target's Prysm, letting the network run ahead, then feeding the new blocks back through engine_newPayloadV4 with no forkchoice update between calls drove the target from 58 to 1,362 retained layers while the control Besu stayed at 0. Replays of already-present blocks stayed flat over 5,000 calls, which isolated the growth to new blocks applied without an intervening forkchoice update. isClosed() was again the dominant cost, and engine_newPayload latency tracked the depth: about 3.5 ms against an already-present block near the top of the stack, rising to a median around 245 ms and up to 1.4 s at depth 1,300 to 1,500. On that evidence the issue was reopened, and this time the fix came quickly.

Besu opened PR #10600, which returns SYNCING when the parent world state is not immediately cached. That routes a run of no-forkchoice new-payload calls into backward sync, the same path Lighthouse and Nimbus already take, instead of the trie-log replay that builds layers. Besu validated it against our Kurtosis file: on the unpatched 26.6.0 image, 30 new-payload calls in 437 ms with no forkchoice update between them left 30 unflushable Bonsai layers; on the patched build the same test left one layer, answered 27 of the calls with SYNCING, and recovered through backward sync in about two minutes. A companion change, PR #10603, makes LayeredKeyValueStorage.isClosed() itself O(1), so the CPU hot path is gone even if a deep chain ever forms. Both are in review as of this writing, so the issue stays open until they land, but the fix is written and confirmed against the same reproduction that reopened it.
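One generic way to make a delegating closed check constant-time is to have every layer in a stack share a single flag instead of asking its parent. The sketch below illustrates that technique only. It is not a description of PR #10603, whose implementation is not covered here, and the Layer class and its method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

final class SharedFlagDemo {
  /** Hypothetical layer that shares one closed flag with the rest of its stack. */
  static final class Layer {
    private final AtomicBoolean stackClosed;

    /** Creates the bottom layer, which owns a fresh flag. */
    Layer() {
      this.stackClosed = new AtomicBoolean(false);
    }

    private Layer(Layer parent) {
      this.stackClosed = parent.stackClosed; // same object, not a copy
    }

    /** Adds a layer on top of this one. */
    Layer child() {
      return new Layer(this);
    }

    /** Constant-time: reads the shared flag, no walk up the parent chain. */
    boolean isClosed() {
      return stackClosed.get();
    }

    /** Marks the whole stack closed. */
    void closeStack() {
      stackClosed.set(true);
    }
  }
}
```

Because the delegating version always returns the bottom layer's answer, sharing that one flag gives the same result at any depth, and the cost of a check no longer depends on how many layers have piled up.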

Credit to both teams. The Prysm-side sync loss is a known issue with a fix in progress, and Besu has been fast and precise on every round, turning our reproduction into a validated fix. Neither client is at fault in isolation; the bug lives in the seam between them, on a path that only a specific live pairing reaches.

Why this kind of bug needs a fleet

A single Besu node, tested on its own with a healthy consensus client, will not show this. It needs a paired client that has fallen behind and is feeding new blocks through new-payload without forkchoice updates, sustained long enough for the layer stack to grow. It does not need an attacker: one restart that drops a consensus client into catch-up is enough. The operational bite is serious for the operator who hits it: a Besu that OOMs in about a day, with the paired client churning peers and missing duties around it. But the path to seeing it at all runs through fleet-scale, differential observation.

That is what StereumLabs AI does on our fleet: it reads the telemetry across every client pairing, notices when one node stops matching its siblings, and turns the anomaly into something a client team can act on. If you run Ethereum clients at scale, you can point it at your own nodes and logs. Reach us at stereumlabs.com or contact@stereumlabs.com.
