Know what your server is doing at all times — CPU, RAM, disk I/O, network, log anomalies, and service health. Build a full observability stack on Rocky Linux 9.
These are the critical metrics every Rocky Linux server administrator should track — with alert thresholds.
Zero-config, real-time per-second metrics. Runs as a lightweight daemon. Best for instant visibility on any Rocky server.
Pull-based metrics collection. Node Exporter exposes 600+ OS metrics. Store in Prometheus TSDB, visualise in Grafana.
Beautiful dashboards over Prometheus data. Pre-built Rocky Linux / Node Exporter dashboards are available from the community dashboard library on grafana.com.
Systemd journal: all service logs centralised. Query by time, unit, priority. Forward to Loki or Elasticsearch for aggregation.
Route Prometheus alerts to email, Slack, PagerDuty, or webhook. Define inhibition rules to prevent alert storms.
Grafana Loki is like Prometheus but for logs. Promtail ships journal and file logs. Query with LogQL alongside your metrics.
Netdata is a zero-config monitoring agent that starts collecting per-second metrics immediately on install. It has built-in dashboards, anomaly detection, and plugin support for MySQL, PostgreSQL, nginx, and 600+ more applications.
Prometheus scrapes metrics from Node Exporter (running on each Rocky Linux server) every 15 seconds and stores them in its time-series database. Query with PromQL and alert via Alertmanager.
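To make the scrape step concrete, here is a minimal Python sketch of reading the plain-text format that Node Exporter serves and Prometheus collects. The sample text and its values are invented for illustration, and `parse_metrics` is a hypothetical helper, not part of Prometheus or Node Exporter. It handles only simple `name{labels} value` lines and skips anything with a trailing timestamp.

```python
import re

# Invented excerpt resembling Node Exporter's /metrics output (values are made up).
SAMPLE = """\
# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/"} 2.5e+10
node_filesystem_size_bytes{device="/dev/sda1",mountpoint="/"} 1e+11
"""

# metric name, optional {label block}, then a single value token
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
# key="value" pairs inside the label block (escaped characters are kept as-is)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the # HELP / # TYPE comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # unsupported line shape, e.g. one carrying a timestamp
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


samples = parse_metrics(SAMPLE)
for name, labels, value in samples:
    print(name, labels, value)
```

In a real deployment Prometheus does this parsing itself on every scrape; the sketch is only meant to show what a "metric" looks like on the wire before it reaches the time-series database.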
Connect Grafana to your Prometheus instance and import community dashboards for instant visibility.
Systemd's journald collects all service logs, kernel messages, and boot events into a structured binary journal. On Rocky Linux this replaces scattered /var/log/ text files for most services.
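Querying the journal by unit, time, and priority can also be scripted. The sketch below builds a `journalctl` argument list with JSON output and parses one entry. The helper names (`journalctl_command`, `parse_entry`) and the sample line are hypothetical, and the fallback priority of 6 (info) when the field is absent is an assumption of this example.

```python
import json
from datetime import datetime, timezone


def journalctl_command(unit=None, since=None, priority=None):
    """Build a journalctl argument list that emits one JSON object per line."""
    cmd = ["journalctl", "--no-pager", "-o", "json"]
    if unit:
        cmd += ["-u", unit]          # filter by systemd unit
    if since:
        cmd += ["--since", since]    # e.g. "1 hour ago" or "2024-01-01"
    if priority:
        cmd += ["-p", priority]      # e.g. "err" or "3"
    return cmd


def parse_entry(line):
    """Extract a few common fields from one line of `journalctl -o json`."""
    entry = json.loads(line)
    # __REALTIME_TIMESTAMP is microseconds since the epoch, encoded as a string.
    micros = int(entry["__REALTIME_TIMESTAMP"])
    return {
        "time": datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc),
        "unit": entry.get("_SYSTEMD_UNIT"),
        # Assumption: treat a missing PRIORITY as 6 (info).
        "priority": int(entry.get("PRIORITY", 6)),
        # MESSAGE may be a byte array for non-UTF-8 text; not handled here.
        "message": entry.get("MESSAGE"),
    }


# Invented sample entry, shaped like journalctl's JSON output.
SAMPLE_LINE = (
    '{"__REALTIME_TIMESTAMP":"1700000000000000",'
    '"_SYSTEMD_UNIT":"sshd.service","PRIORITY":"3","MESSAGE":"example error"}'
)

cmd = journalctl_command(unit="sshd.service", since="1 hour ago", priority="err")
parsed = parse_entry(SAMPLE_LINE)
print(cmd)
print(parsed)
```

On a live server the argument list could be passed to `subprocess.run(cmd, capture_output=True, text=True)` and each line of the output fed to `parse_entry`.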
Use Promtail to ship journal entries to Grafana Loki, then query logs with LogQL directly alongside your Prometheus metrics in the same Grafana dashboard.
Define Prometheus alert rules that trigger Alertmanager to notify your on-call team via email, Slack, or PagerDuty.
| Alert Rule | Condition (PromQL) | Severity | Action |
|---|---|---|---|
| High CPU Load | `node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) > 0.9` | WARNING | Check top, ps auxf |
| Disk > 85% | `(1 - node_filesystem_avail_bytes/node_filesystem_size_bytes)*100 > 85` | WARNING | Clean logs, expand LV |
| Disk > 95% | `(1 - node_filesystem_avail_bytes/node_filesystem_size_bytes)*100 > 95` | CRITICAL | URGENT: free space now |
| Service Down | `up == 0` | CRITICAL | Check systemctl status |
| High Memory | `node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes < 0.1` | WARNING | Check memory leaks |
| OOM Killer Active | `increase(node_vmstat_oom_kill[5m]) > 0` | CRITICAL | Process killed; check logs |
| RAID Degraded | `node_md_disks_active < node_md_disks` | CRITICAL | Replace failed disk |
| SSH Failed Logins > 20 | `increase(fail2ban_banned_ip_total[5m]) > 20` | WARNING | Check fail2ban, firewall |
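The arithmetic behind the disk, memory, and load thresholds in the table can be checked by hand. The following Python sketch applies the same formulas to plain numbers; the function names are hypothetical and the inputs are invented. It mirrors the conditions only and is not how Prometheus evaluates rules.

```python
def disk_used_pct(avail_bytes, size_bytes):
    """Same formula as the disk rows: (1 - avail/size) * 100."""
    return (1 - avail_bytes / size_bytes) * 100


def disk_severity(avail_bytes, size_bytes):
    """Return 'CRITICAL' above 95% used, 'WARNING' above 85%, else None."""
    used = disk_used_pct(avail_bytes, size_bytes)
    if used > 95:
        return "CRITICAL"
    if used > 85:
        return "WARNING"
    return None


def memory_severity(mem_available_bytes, mem_total_bytes):
    """WARNING when less than 10% of memory is available."""
    if mem_available_bytes / mem_total_bytes < 0.1:
        return "WARNING"
    return None


def load_severity(load1, cpu_count):
    """WARNING when the 1-minute load per CPU exceeds 0.9."""
    if load1 / cpu_count > 0.9:
        return "WARNING"
    return None


GIB = 1024 ** 3

# Invented readings for a hypothetical 4-CPU server with a 100 GiB disk.
print(disk_severity(10 * GIB, 100 * GIB))    # 90% used
print(disk_severity(4 * GIB, 100 * GIB))     # 96% used
print(memory_severity(1 * GIB, 16 * GIB))    # 6.25% available
print(load_severity(3.8, 4))                 # 0.95 load per CPU
```

Running numbers through the formulas like this is a quick way to sanity-check a threshold before committing it to an alert rule.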
We deploy and configure the full observability stack — Netdata, Prometheus, Grafana, Loki — and tune alert rules for your specific infrastructure.