Architecture & data flow
How a measurement run becomes a queryable dashboard panel. Useful if you want to write your own queries, validate our numbers, or understand the lineage of a specific metric.
Deployment topology
StereumLabs runs across two deployments, both centrally aggregated.
┌──────────────────────────────┐    ┌────────────────────────┐
│  NDC2 (Vienna, Austria)      │    │  GCP (europe-west3-a)  │
│  Bare-metal, RockLogic       │    │  Cloud, supernodes     │
│  144 hosts                   │    │  24 hosts              │
│                              │    │                        │
│  36 EC×CC pairings           │    │  6 supernode pairings  │
│  + 1 Caplin standalone       │    │  (all paired with Geth)│
└──────────────┬───────────────┘    └────────────┬───────────┘
               │                                 │
               │            federate             │
               ▼                                 ▼
         ┌────────────────────────────────────────────┐
         │  Central infrastructure (NDC2)             │
         │  ┌──────────────┐   ┌──────────────────┐   │
         │  │  Prometheus  │   │  Elasticsearch   │   │
         │  │              │   │  (logs)          │   │
         │  └──────┬───────┘   └──────────┬───────┘   │
         │         │                      │           │
         │  ┌──────▼──────────────────────▼──────┐    │
         │  │              Grafana               │    │
         │  └────────────────────────────────────┘    │
         └────────────────────┬───────────────────────┘
                              │
                              ▼
                       ┌──────────────┐
                       │  Users       │
                       │  Free        │
                       │  Pro/Enterp. │
                       └──────────────┘
End-to-end latency from exporter scrape to query is typically 25 to 30 seconds on the full Prometheus tier.
Per-host instrumentation
Every measured host ships telemetry from three processes: the client under test and two agents that run alongside it.
| Process | What it ships | Destination |
|---|---|---|
| Client process (EC or CC) | Native Prometheus metrics on the client's HTTP endpoint | local Prometheus scrape |
| Node Exporter (github.com/prometheus/node_exporter) | OS-level metrics: CPU, memory, disk, network, filesystem, NIC | local Prometheus scrape |
| Filebeat | Container logs (Docker JSON logs, INFO + DEBUG) | Elasticsearch |
The up{job="<client-name>"} time series shows whether each scrape target answered its last scrape: 1 if it did, 0 if it did not.
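As a rough illustration of checking scrape health, the sketch below builds an instant query for up against the standard Prometheus HTTP API (/api/v1/query) and picks out targets whose last scrape failed. The base URL and the job name geth are placeholders, not values taken from our configuration; use the datasource and job names visible in your own Grafana organization.

```python
from urllib.parse import urlencode

# Hypothetical base URL: replace with the Prometheus datasource you can reach.
PROMETHEUS_URL = "https://prometheus.example.invalid"


def up_query_url(job: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the `up` series of one scrape job."""
    promql = f'up{{job="{job}"}}'
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def down_instances(response: dict) -> list:
    """Return the instance labels whose last scrape failed (sample value "0").

    `response` is the decoded JSON body of an instant query; Prometheus
    returns each sample as [timestamp, "<value as string>"].
    """
    results = response["data"]["result"]
    return [r["metric"].get("instance", "?") for r in results if r["value"][1] == "0"]


# "geth" is an assumed job name used only for illustration.
print(up_query_url("geth"))
```

Fetching the URL and decoding the JSON is left out so the sketch stays free of network and authentication details, which differ per plan.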
Custom labels
We rewrite incoming metrics to attach our custom labels at scrape time:
- ec_client, ec_version, cc_client, cc_version: what's running on this host.
- role: ec, cc, or standalone (Caplin).
- deployment, location: where this host runs.
These labels are consistent across all client metrics, so a single PromQL filter works regardless of implementation.
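To show how one filter can be reused across clients, here is a minimal sketch of a helper that renders a PromQL selector from these labels. The helper name selector and the label values in the example (ndc2, ec) are illustrative assumptions; check the actual label values in your datasource before relying on them.

```python
def _escape(value: str) -> str:
    """Escape backslashes and double quotes for a PromQL string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def selector(metric: str, **labels: str) -> str:
    """Render `metric{label="value",...}` with exact-match label filters.

    Labels are sorted so the same inputs always produce the same string.
    """
    if not labels:
        return metric
    matchers = ",".join(f'{k}="{_escape(v)}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{matchers}}}"


# Assumed label values, for illustration only.
print(selector("up", role="ec", deployment="ndc2"))
```

Because the label names are the same for every client, only the values change between queries; the metric name itself can still differ per implementation.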
Prometheus tiers
| Tier | Retention | Delay | Plan access |
|---|---|---|---|
| Public Prometheus (delayed rollups) | 90 days | 7 days | All tiers |
| Full Prometheus (real-time) | Unlimited | None (near real-time) | Pro, Enterprise |
The public tier exposes recording rules, which are aggregated rollups rather than the raw metric set. The full tier exposes every scraped series with all of its labels.
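When planning a query against the public tier, it can help to compute which time range is likely to return data. The sketch below assumes that both the 7-day delay and the 90-day retention are measured back from the current time; the table does not spell this out, so treat the result as an estimate rather than a guarantee.

```python
from datetime import datetime, timedelta, timezone

# Values from the tier table above.
PUBLIC_DELAY = timedelta(days=7)
PUBLIC_RETENTION = timedelta(days=90)


def public_window(now: datetime) -> tuple:
    """Return (oldest, newest) timestamps assumed queryable on the public tier.

    Assumption: delay and retention are both counted back from `now`.
    """
    newest = now - PUBLIC_DELAY
    oldest = now - PUBLIC_RETENTION
    return oldest, newest


oldest, newest = public_window(datetime.now(timezone.utc))
print(f"query between {oldest.isoformat()} and {newest.isoformat()}")
```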
Why two tiers?
- The full tier is the authoritative source: everything we scraped, queryable as raw series. It powers all standard dashboards.
- The public tier is a curated, delayed surface for public access. Recording rules are pre-aggregated, so high-cardinality queries don't burden the public surface.
Internally, a federation job pulls recent samples from each deployment's local Prometheus into the central full tier so the archive is complete.
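For readers unfamiliar with federation, the sketch below shows the shape of a request to Prometheus's built-in /federate endpoint, which returns the current samples of every series matching one or more match[] selectors. Whether our internal job uses exactly this endpoint, and with which selectors, is an assumption here; the host name is a placeholder.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url: str, selectors: list) -> str:
    """Build a /federate URL requesting all series that match any selector."""
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url}/federate?{query}"


# Hypothetical host; '{job=~".+"}' matches every series that has a job label.
url = federate_url("http://prometheus.gcp.example.invalid:9090", ['{job=~".+"}'])
print(url)

# Decoding the query string recovers the original selector.
print(parse_qs(urlsplit(url).query)["match[]"])
```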
Logs (Elasticsearch)
Filebeat ships container logs from every EC and CC process to Elasticsearch. Logs are queryable from Grafana for Pro and Enterprise users.
Two log levels are retained with different retention windows:
- INFO and above (warnings, errors): retained for ~30 days. The standard source for fleet-wide pattern analysis and historical investigations.
- DEBUG: retained for a shorter window (typically a few days). Useful for in-flight investigations and reproducing recent issues. Not suitable for long-term trend analysis.
Common uses:
- Block-processing latency extraction (e.g., the head updated regex in the Caplin standalone post).
- Peer-count extraction for clients that don't expose a current gauge (e.g., Caplin's P2P app=caplin peers=N).
- Error and warning pattern analysis across the fleet.
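The peer-count case can be sketched with a small regular expression. Only the P2P app=caplin peers=N fragment comes from the text above; the surrounding log-line layout in the example is invented for illustration and may not match real Caplin output.

```python
import re
from typing import Optional

# Matches the "P2P app=caplin peers=N" fragment anywhere in a log line.
PEERS_RE = re.compile(r"P2P\s+app=caplin\s+peers=(\d+)")


def peer_count(line: str) -> Optional[int]:
    """Return the peer count from a log line, or None if the line has none."""
    match = PEERS_RE.search(line)
    return int(match.group(1)) if match else None


# Hypothetical log lines; the prefix format is an assumption.
lines = [
    "[INFO] [01-01|12:00:00.000] P2P app=caplin peers=42",
    "[INFO] [01-01|12:00:01.000] head updated",
]
print([peer_count(line) for line in lines])
```

Lines without the fragment yield None, so the result can be filtered before plotting or averaging.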
Run-manifest metadata
VM profiles, run manifests, and lifecycle metadata are kept in an internal store and surfaced to Enterprise customers on request. The user-facing surface is the List of VMs.
Datasource access by plan
| Datasource | Free | Pro | Enterprise |
|---|---|---|---|
| Public Prometheus (delayed rollups) | ✅ | ✅ | ✅ |
| Full Prometheus (real-time) | ⛔ | ✅ | ✅ |
| Elasticsearch (INFO logs) | ⛔ | ✅ | ✅ |
| Elasticsearch (DEBUG logs) | ⛔ | ✅ | ✅ |
| Run-manifest metadata | ⛔ | ⛔ | ✅ |
| Direct API access | ⛔ | ⛔ | ✅ |
Pro and Enterprise customers each receive a private Grafana organization with these datasources provisioned. Dashboard and datasource UIDs are unique per org, so refer to the user-facing names rather than UIDs in shared notes or runbooks.
For details on what each plan unlocks, see Plans & Prices.
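If you reference datasources from scripts or runbooks, the access table can be encoded as a small lookup. This is only a convenience sketch that mirrors the table on this page; the plan pages remain the source of truth, and the function name is made up for the example.

```python
# Datasource name -> plans that can use it, copied from the table above.
ACCESS = {
    "Public Prometheus (delayed rollups)": {"Free", "Pro", "Enterprise"},
    "Full Prometheus (real-time)": {"Pro", "Enterprise"},
    "Elasticsearch (INFO logs)": {"Pro", "Enterprise"},
    "Elasticsearch (DEBUG logs)": {"Pro", "Enterprise"},
    "Run-manifest metadata": {"Enterprise"},
    "Direct API access": {"Enterprise"},
}


def datasources_for(plan: str) -> list:
    """Return the datasource names available to `plan`, sorted by name."""
    return sorted(name for name, plans in ACCESS.items() if plan in plans)


print(datasources_for("Free"))
```

Keying on the user-facing names rather than UIDs keeps the lookup valid across Grafana organizations.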
Reliability boundaries
- No co-tenancy on bare-metal hosts. Each NDC2 host runs exactly one EC, one CC, or one standalone, with no unrelated workloads. See Neutral methodology.
- Time sync enforced. NTP or Chrony on every host. Runs with material clock skew are rejected from publication.
- Scrape interval pinned at 15 seconds across all jobs in our Prometheus configuration.
- Federation lag. The federation job adds a small ingestion delay. Total end-to-end latency typically stays under 30 seconds. Consider this when comparing samples to wall-clock events.
- Cloud hosts (GCP) are subject to noisy-neighbor and burst-credit effects. These effects are documented per Cloud profile, and cross-environment comparisons label the resulting variance.
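The scrape interval and latency figures above translate into two simple rules of thumb, sketched below: how many samples to expect in a window, and how long to wait after a wall-clock event before querying for it. The 30-second figure is the typical upper bound quoted on this page, not a hard guarantee, and the function names are illustrative.

```python
SCRAPE_INTERVAL_S = 15   # pinned scrape interval from the list above
TYPICAL_MAX_LAG_S = 30   # typical end-to-end latency bound, not a guarantee


def expected_samples(window_s: int, interval_s: int = SCRAPE_INTERVAL_S) -> int:
    """Approximate number of samples per series in a window of `window_s` seconds."""
    return window_s // interval_s


def earliest_query_time(event_ts: float, lag_s: float = TYPICAL_MAX_LAG_S) -> float:
    """Unix time after which an event at `event_ts` is typically visible."""
    return event_ts + lag_s


print(expected_samples(300))          # samples in a 5-minute window
print(earliest_query_time(1_000.0))   # event time plus the typical lag
```

A series with noticeably fewer samples than expected in a window is a hint that scrapes were missed, which is worth ruling out before interpreting a dip in a panel.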
See also
- Purpose & Scope: methodology.
- List of VMs: per-host hardware and lifecycle.
- Build your own dashboards: datasource names, labels, PromQL examples.
- Glossary: definitions for the terms used above.
Change control for this page: material edits will be logged in the global Changelog with a short rationale and effective date.