Architecture & data flow

How a measurement run becomes a queryable dashboard panel. Useful if you want to write your own queries, validate our numbers, or understand the lineage of a specific metric.

Deployment topology

StereumLabs runs across two deployments, both centrally aggregated.

┌──────────────────────────────┐    ┌────────────────────────┐
│ NDC2 (Vienna, Austria)       │    │ GCP (europe-west3-a)   │
│ Bare-metal, RockLogic        │    │ Cloud, supernodes      │
│ 144 hosts                    │    │ 24 hosts               │
│                              │    │                        │
│ 36 EC×CC pairings            │    │ 6 supernode pairings   │
│ + 1 Caplin standalone        │    │ (all paired with Geth) │
└──────────────┬───────────────┘    └────────────┬───────────┘
               │                                 │
               │            federate             │
               ▼                                 ▼
          ┌────────────────────────────────────────────┐
          │       Central infrastructure (NDC2)        │
          │   ┌──────────────┐   ┌──────────────────┐  │
          │   │  Prometheus  │   │  Elasticsearch   │  │
          │   │              │   │      (logs)      │  │
          │   └──────┬───────┘   └──────────┬───────┘  │
          │          │                      │          │
          │   ┌──────▼──────────────────────▼──────┐   │
          │   │              Grafana               │   │
          │   └────────────────────────────────────┘   │
          └────────────────────┬───────────────────────┘
                               │
                               ▼
                        ┌──────────────┐
                        │    Users     │
                        │     Free     │
                        │ Pro/Enterp.  │
                        └──────────────┘

End-to-end latency from exporter scrape to query is typically 25 to 30 seconds on the full Prometheus tier.

Per-host instrumentation

Every measured host has three telemetry sources: the client under test, plus two exporter/agent processes running alongside it:

| Process | What it ships | Destination |
|---|---|---|
| Client process (EC or CC) | Native Prometheus metrics on the client's HTTP endpoint | Local Prometheus scrape |
| Node Exporter (github.com/prometheus/node_exporter) | OS-level metrics: CPU, memory, disk, network, filesystem, NIC | Local Prometheus scrape |
| Filebeat | Container logs (Docker JSON logs, INFO + DEBUG) | Elasticsearch |

The up{job="<client-name>"} time series records whether each scrape target answered its last scrape: 1 for a successful scrape, 0 for a failed one.
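
As a rough illustration of checking that series programmatically, the sketch below builds an instant-query URL for the standard Prometheus HTTP API (`/api/v1/query`) and interprets an `up` sample. The base URL and the job name `geth` are placeholders, not values taken from this page; substitute whatever your Grafana datasource points at.

```python
from urllib.parse import urlencode


def up_selector(job: str) -> str:
    """Return the PromQL selector for the `up` series of one scrape job."""
    return f'up{{job="{job}"}}'


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus HTTP API instant-query URL (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def is_up(sample_value: str) -> bool:
    """Prometheus returns sample values as strings; `up` is 1 when the
    last scrape succeeded and 0 when it failed."""
    return float(sample_value) == 1.0


# Hypothetical endpoint and job name, for illustration only.
url = instant_query_url("http://prometheus.example:9090", up_selector("geth"))
```

The resulting URL can be fetched with any HTTP client; each result's `value` field is a `[timestamp, "0" | "1"]` pair that `is_up` can interpret.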

Custom labels

We rewrite incoming metrics to attach our custom labels at scrape time:

  • ec_client, ec_version, cc_client, cc_version: what's running on this host.
  • role: ec, cc, or standalone (Caplin).
  • deployment, location: where this host runs.

These labels are consistent across all client metrics, so a single PromQL filter works regardless of implementation.
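
To show how one filter can be reused across implementations, here is a minimal sketch of a selector builder restricted to the labels listed above. The label names come from this page; the metric name and label values in the example call (`process_cpu_seconds_total`, `geth`, `ec`) are assumptions chosen for illustration, so check the actual values in your datasource.

```python
# Label names documented on this page.
ALLOWED_LABELS = {
    "ec_client", "ec_version", "cc_client", "cc_version",
    "role", "deployment", "location",
}


def _escape(value: str) -> str:
    """Escape backslashes and double quotes for a PromQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def build_selector(metric: str, **labels: str) -> str:
    """Build `metric{label="value",...}` using only the documented labels."""
    unknown = set(labels) - ALLOWED_LABELS
    if unknown:
        raise ValueError(f"unknown label(s): {sorted(unknown)}")
    matchers = ",".join(
        f'{name}="{_escape(value)}"' for name, value in sorted(labels.items())
    )
    return f"{metric}{{{matchers}}}"


# Hypothetical metric name and label values, for illustration only.
query = build_selector("process_cpu_seconds_total", role="ec", ec_client="geth")
```

Sorting the matchers keeps the generated query text stable, which is convenient when queries are stored in notes or compared in reviews.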

Prometheus tiers

| Tier | Retention | Delay | Plan access |
|---|---|---|---|
| Public Prometheus (delayed rollups) | 90 days | 7 days | All plans |
| Full Prometheus (real-time) | Unlimited | None (near real-time) | Pro, Enterprise |

The public tier exposes recording rules, which are aggregated rollups rather than the raw metric set. The full tier exposes the full scrape, every label.
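
When writing queries against the public tier, it helps to know which time range can actually return data. The sketch below assumes that both the 7-day delay and the 90-day retention are measured back from the current wall-clock time; that interpretation is an assumption, not something this page states explicitly.

```python
from datetime import datetime, timedelta, timezone

PUBLIC_DELAY = timedelta(days=7)       # newest visible sample is this old
PUBLIC_RETENTION = timedelta(days=90)  # oldest visible sample is this old


def public_window(now: datetime) -> tuple[datetime, datetime]:
    """Return (oldest, newest) timestamps queryable on the public tier,
    assuming both limits are counted back from `now`."""
    return now - PUBLIC_RETENTION, now - PUBLIC_DELAY


oldest, newest = public_window(datetime(2025, 4, 1, tzinfo=timezone.utc))
```

Under that assumption, a query issued on 1 April 2025 would see data from 1 January 2025 through 25 March 2025.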

Why two tiers?

  • The full tier is the authoritative source: everything we scraped, queryable as raw series. It powers all standard dashboards.
  • The public tier is a curated, delayed surface for public access. Recording rules are pre-aggregated, so high-cardinality queries don't burden the public surface.

Internally, a federation job pulls recent samples from each deployment's local Prometheus into the central full tier so the archive is complete.
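
For readers unfamiliar with Prometheus federation: a central server scrapes the `/federate` endpoint of another Prometheus, passing one or more `match[]` selectors that pick the series to pull. The sketch below only builds such a URL. The host name and the `deployment="gcp"` value are hypothetical, and the selectors our federation job actually uses are not documented here.

```python
from urllib.parse import urlencode


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build a Prometheus /federate URL with one match[] parameter
    per series selector."""
    params = [("match[]", selector) for selector in selectors]
    return f"{base_url.rstrip('/')}/federate?{urlencode(params)}"


# Hypothetical local Prometheus and label value, for illustration only.
url = federate_url("http://local-prometheus.example:9090", ['{deployment="gcp"}'])
```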

Logs (Elasticsearch)

Filebeat ships container logs from every EC and CC process to Elasticsearch. Logs are queryable from Grafana for Pro and Enterprise users.

Two log levels are retained with different retention windows:

  • INFO and above (warnings, errors): retained for ~30 days. The standard source for fleet-wide pattern analysis and historical investigations.
  • DEBUG: retained for a shorter window (typically a few days). Useful for in-flight investigations and reproducing recent issues. Not suitable for long-term trend analysis.

Common uses:

  • Block-processing latency extraction (e.g., the head updated regex in the Caplin standalone post).
  • Peer-count extraction for clients that don't expose a current gauge (e.g., Caplin's P2P app=caplin peers=N).
  • Error and warning pattern analysis across the fleet.

Run-manifest metadata

VM profiles, run manifests, and lifecycle metadata are kept in an internal store and surfaced to Enterprise customers on request. The user-facing surface is the List of VMs.

Datasource access by plan

| Datasource | Free | Pro | Enterprise |
|---|---|---|---|
| Public Prometheus (delayed rollups) | | | |
| Full Prometheus (real-time) | | | |
| Elasticsearch (INFO logs) | | | |
| Elasticsearch (DEBUG logs) | | | |
| Run-manifest metadata | | | |
| Direct API access | | | |

Pro and Enterprise customers each receive a private Grafana organization with these datasources provisioned. Dashboard and datasource UIDs are unique per org, so refer to the user-facing names rather than UIDs in shared notes or runbooks.

For details on what each plan unlocks, see Plans & Prices.

Reliability boundaries

  • No co-tenancy on bare-metal hosts. Each NDC2 host runs exactly one EC, one CC, or one standalone, with no unrelated workloads. See Neutral methodology.
  • Time sync enforced. NTP or Chrony on every host. Runs with material clock skew are rejected from publication.
  • Scrape interval pinned at 15 seconds across all jobs in our Prometheus configuration.
  • Federation lag. The federation job adds a small ingestion delay. Total end-to-end latency typically stays under 30 seconds. Consider this when comparing samples to wall-clock events.
  • Cloud hosts (GCP) are subject to noisy-neighbor and burst-credit effects. These effects are documented per Cloud profile, and cross-environment comparisons label the resulting variance.
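
When lining samples up against wall-clock events, the scrape interval and the end-to-end latency combine into a simple bound. The sketch below assumes the worst case for both: the event happens just after a scrape, so the next scrape is up to 15 seconds later, and that sample takes up to 30 seconds from scrape to query. The constants come from this page; treating them as hard upper bounds is an assumption, since the latency figure is described as typical rather than guaranteed.

```python
from datetime import datetime, timedelta, timezone

SCRAPE_INTERVAL = timedelta(seconds=15)       # pinned across all jobs
SCRAPE_TO_QUERY_LATENCY = timedelta(seconds=30)  # typical upper figure


def visibility_bounds(event_time: datetime) -> tuple[datetime, datetime]:
    """Return (latest_scrape, queryable_by) for an event at `event_time`:
    the latest moment the first post-event scrape can occur, and the
    latest moment that sample is typically queryable."""
    latest_scrape = event_time + SCRAPE_INTERVAL
    return latest_scrape, latest_scrape + SCRAPE_TO_QUERY_LATENCY


event = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
latest_scrape, queryable_by = visibility_bounds(event)
```

In this example, an event at 12:00:00 is reflected in a sample scraped no later than 12:00:15 and typically queryable by 12:00:45.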

Change control for this page: material edits will be logged in the global Changelog with a short rationale and effective date.