Architecture & data flow

How a measurement run becomes a queryable dashboard panel. Useful if you want to write your own queries, validate our numbers, or understand the lineage of a specific metric.

Deployment topology

StereumLabs runs in two deployments, and telemetry from both is aggregated centrally.

                ┌──────────────────────────────┐    ┌────────────────────────┐
                │  NDC2 (Vienna, Austria)      │    │  GCP (europe-west3-a)  │
                │  Bare-metal, RockLogic       │    │  Cloud, supernodes     │
                │  144 hosts                   │    │   24 hosts             │
                │                              │    │                        │
                │  36 EC×CC pairings           │    │  6 supernode pairings  │
                │  + 1 Caplin standalone       │    │  (all paired with Geth)│
                └──────────────┬───────────────┘    └────────────┬───────────┘
                               │                                  │
                               │   federate                       │
                               ▼                                  ▼
                       ┌────────────────────────────────────────────┐
                       │  Central infrastructure (NDC2)             │
                       │  ┌──────────────┐   ┌──────────────────┐   │
                       │  │ Prometheus   │   │ Elasticsearch    │   │
                       │  │              │   │ (logs)           │   │
                       │  └──────┬───────┘   └──────────┬───────┘   │
                       │         │                      │           │
                       │  ┌──────▼──────────────────────▼──────┐    │
                       │  │             Grafana                │    │
                       │  └────────────────────────────────────┘    │
                       └────────────────────┬───────────────────────┘
                                            │
                                            ▼
                                    ┌──────────────┐
                                    │  Users       │
                                    │  Free        │
                                    │  Pro/Enterp. │
                                    └──────────────┘

End-to-end latency from exporter scrape to query is typically 25 to 30 seconds on the full Prometheus tier.

Per-host instrumentation

Every measured host runs three processes that ship telemetry: the client under test plus two exporter/agent processes.

Process	What it ships	Destination
`client process` (EC or CC)	Native Prometheus metrics on the client's HTTP endpoint	local Prometheus scrape
Node Exporter (github.com/prometheus/node_exporter)	OS-level metrics: CPU, memory, disk, network, filesystem, NIC	local Prometheus scrape
Filebeat	Container logs (Docker JSON logs, INFO + DEBUG)	Elasticsearch

The `up{job="<client-name>"}` time series reflects whether that job's target answered its most recent scrape: 1 for success, 0 for failure.
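As a rough illustration of reading `up`, the sketch below parses the JSON that the standard Prometheus HTTP API returns for an instant query (`/api/v1/query`) and lists the targets whose last scrape failed. The response payload, the job value `geth`, and the instance names are hypothetical sample data, not taken from a real StereumLabs host.

```python
import json


def down_targets(response_text):
    """Return the label sets of targets whose most recent scrape failed (up == 0)."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    down = []
    for sample in payload["data"]["result"]:
        _timestamp, value = sample["value"]  # the API encodes the value as a string
        if float(value) == 0.0:
            down.append(sample["metric"])
    return down


# Hypothetical instant-query response for: up{job="geth"}
SAMPLE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "geth", "instance": "host-a:6060"},
             "value": [1700000000.0, "1"]},
            {"metric": {"__name__": "up", "job": "geth", "instance": "host-b:6060"},
             "value": [1700000000.0, "0"]},
        ],
    },
})

print([metric["instance"] for metric in down_targets(SAMPLE)])
```

Keeping the parser separate from the HTTP call lets the same function be checked against saved responses.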

Custom labels

We rewrite incoming metrics to attach our custom labels at scrape time:

`ec_client`, `ec_version`, `cc_client`, `cc_version`: what's running on this host.
`role`: `ec`, `cc`, or `standalone` (Caplin).
`deployment`, `location`: where this host runs.

These labels are consistent across all client metrics, so a single PromQL filter works regardless of implementation.
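To make the "single filter" idea concrete, here is a minimal sketch of a helper that renders a PromQL selector from the custom labels. The helper name `build_selector` and the label value `geth` are illustrative assumptions; check the label values on a dashboard before relying on them.

```python
def build_selector(metric, **labels):
    """Render a PromQL instant-vector selector with exact-match label filters."""
    parts = []
    for name in sorted(labels):  # sorted so the output is deterministic
        # Escape backslashes and double quotes inside the label value.
        value = str(labels[name]).replace("\\", "\\\\").replace('"', '\\"')
        parts.append('%s="%s"' % (name, value))
    if not parts:
        return metric
    return "%s{%s}" % (metric, ",".join(parts))


# The same filter shape applies to any client metric that carries these labels.
print(build_selector("up", role="ec", ec_client="geth"))
```

Because the labels are attached at scrape time, the filter does not need to know which client implementation produced the metric.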

Prometheus tiers

Tier	Retention	Delay	Plan access
Public Prometheus (delayed rollups)	90 days	7 days	All tiers
Full Prometheus (real-time)	Unlimited	None (near real-time)	Pro, Enterprise

The public tier exposes the output of recording rules: aggregated rollups rather than the raw metric set. The full tier exposes the complete scrape with every label intact.
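The retention and delay figures in the table imply a bounded time window of data on the public tier. The sketch below computes that window, assuming that both the 7-day delay and the 90-day retention are counted back from the current time; that interpretation is an assumption, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

PUBLIC_DELAY = timedelta(days=7)       # from the tier table
PUBLIC_RETENTION = timedelta(days=90)  # from the tier table


def public_window(now):
    """Return (oldest, newest) sample timestamps expected on the public tier."""
    return now - PUBLIC_RETENTION, now - PUBLIC_DELAY


def visible_on_public(sample_time, now):
    """True if a sample's timestamp falls inside the public window."""
    oldest, newest = public_window(now)
    return oldest <= sample_time <= newest


now = datetime(2025, 1, 31, tzinfo=timezone.utc)
oldest, newest = public_window(now)
print(oldest.date(), newest.date())
```

Under these assumptions, a query against the public tier for anything newer than seven days comes back empty rather than failing.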

Why two tiers?

The full tier is the authoritative source: everything we scraped, queryable as raw series. It powers all standard dashboards.
The public tier is a curated, delayed surface for public access. Recording rules are pre-aggregated, so high-cardinality queries don't burden the public surface.

Internally, a federation job pulls recent samples from each deployment's local Prometheus into the central full tier so the archive is complete.
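For readers unfamiliar with Prometheus federation: a central server typically scrapes the `/federate` endpoint of another Prometheus and selects series with one or more `match[]` parameters. The sketch below builds such a URL. The host name and the selectors are hypothetical, and whether the internal job uses exactly this mechanism is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build a /federate URL that requests the series matching each selector."""
    # A list of pairs lets the match[] parameter repeat once per selector.
    query = urlencode([("match[]", selector) for selector in selectors])
    return "%s/federate?%s" % (base_url.rstrip("/"), query)


# Hypothetical local Prometheus address and selectors.
url = federate_url("http://prometheus.example:9090/", ['{role="ec"}', '{role="cc"}'])
print(url)
print(parse_qs(urlsplit(url).query)["match[]"])
```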

Logs (Elasticsearch)

Filebeat ships container logs from every EC and CC process to Elasticsearch. Logs are queryable from Grafana for Pro and Enterprise users.

Two log levels are retained with different retention windows:

INFO and above (warnings, errors): retained for ~30 days. The standard source for fleet-wide pattern analysis and historical investigations.
DEBUG: retained for a shorter window (typically a few days). Useful for in-flight investigations and reproducing recent issues. Not suitable for long-term trend analysis.

Common uses:

Block-processing latency extraction (e.g., the `head updated` regex in the Caplin standalone post).
Peer-count extraction for clients that don't expose a current gauge (e.g., Caplin's `P2P app=caplin peers=N`).
Error and warning pattern analysis across the fleet.
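As an illustration of the peer-count case, the sketch below pulls `N` out of a log line containing `P2P app=caplin peers=N`. Only that fragment comes from the text above; the surrounding line layout in the samples and the helper name `peer_count` are assumptions.

```python
import re

# Matches the fragment quoted above; tolerant of variable whitespace between fields.
PEERS_RE = re.compile(r"P2P\s+app=caplin\s+peers=(\d+)")


def peer_count(line):
    """Return the peer count from a Caplin P2P log line, or None if absent."""
    match = PEERS_RE.search(line)
    return int(match.group(1)) if match else None


# Hypothetical log lines; real lines may carry other prefixes and fields.
for line in ("[INFO] P2P app=caplin peers=87", "[INFO] head updated"):
    print(peer_count(line))
```

Returning `None` for non-matching lines lets a caller filter a mixed log stream without a separate pre-check.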

Run-manifest metadata

VM profiles, run manifests, and lifecycle metadata are kept in an internal store and surfaced to Enterprise customers on request. The user-facing surface is the List of VMs.

Datasource access by plan

Datasource	Free	Pro	Enterprise
Public Prometheus (delayed rollups)	✅	✅	✅
Full Prometheus (real-time)	⛔	✅	✅
Elasticsearch (INFO logs)	⛔	✅	✅
Elasticsearch (DEBUG logs)	⛔	✅	✅
Run-manifest metadata	⛔	⛔	✅
Direct API access	⛔	⛔	✅

Pro and Enterprise customers each receive a private Grafana organization with these datasources provisioned. Dashboard and datasource UIDs are unique per org, so refer to the user-facing names rather than UIDs in shared notes or runbooks.
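The access matrix above can be encoded as a small lookup, for example to check in a script which datasources a plan should see. This is a sketch that mirrors the table; the function name is illustrative, and the datasource strings are the table's row labels rather than Grafana datasource names or UIDs.

```python
# Row labels and plan access copied from the table above.
ACCESS = {
    "Public Prometheus (delayed rollups)": {"Free", "Pro", "Enterprise"},
    "Full Prometheus (real-time)": {"Pro", "Enterprise"},
    "Elasticsearch (INFO logs)": {"Pro", "Enterprise"},
    "Elasticsearch (DEBUG logs)": {"Pro", "Enterprise"},
    "Run-manifest metadata": {"Enterprise"},
    "Direct API access": {"Enterprise"},
}


def datasources_for(plan):
    """List the datasources available to a plan, in table order."""
    if plan not in {"Free", "Pro", "Enterprise"}:
        raise ValueError("unknown plan: %r" % plan)
    return [name for name, plans in ACCESS.items() if plan in plans]


print(datasources_for("Free"))
```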

For details on what each plan unlocks, see Plans & Prices.

Reliability boundaries

No co-tenancy on bare-metal hosts. Each NDC2 host runs exactly one EC, one CC, or one standalone, with no unrelated workloads. See Neutral methodology.
Time sync enforced. NTP or Chrony on every host. Runs with material clock skew are rejected from publication.
Scrape interval pinned at 15 seconds across all jobs in our Prometheus configuration.
Federation lag. The federation job adds a small ingestion delay. Total end-to-end latency typically stays under 30 seconds. Consider this when comparing samples to wall-clock events.
Cloud hosts (GCP) are subject to noisy-neighbor and burst-credit effects. These effects are documented per Cloud profile, and cross-environment comparisons label the resulting variance.
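Two pieces of arithmetic follow from the 15-second scrape interval and the roughly 30-second end-to-end latency. The sketch below shows how many samples a query window is expected to contain, and when a wall-clock event should be visible in a query. The 30-second figure is treated here as a typical upper bound, not a guarantee, and the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

SCRAPE_INTERVAL = timedelta(seconds=15)      # pinned across all jobs
TYPICAL_MAX_LATENCY = timedelta(seconds=30)  # typical bound, not a guarantee


def expected_samples(window):
    """Approximate number of scrapes that fit into a query window."""
    return window // SCRAPE_INTERVAL


def earliest_reliable_query(event_time):
    """Time after which samples covering the event are typically queryable."""
    return event_time + TYPICAL_MAX_LATENCY


event = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(expected_samples(timedelta(minutes=5)))
print(earliest_reliable_query(event).isoformat())
```

A practical consequence is that range windows shorter than 30 seconds hold at most one sample per series, which is too few for functions such as `rate()` that need at least two.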
