Open Source Monitoring for Linux Servers
What to Monitor on a Linux Server
Effective Linux server monitoring covers several layers of the system stack, from hardware health through operating system metrics to application-level indicators. Each layer provides different signals about system health, and monitoring all of them ensures that problems are detected early regardless of where they originate.
CPU metrics include overall utilization, per-core utilization, the breakdown of time spent in user, system, iowait, and steal states, load averages over 1-, 5-, and 15-minute windows, and context switch rates. High iowait indicates that processes are blocked waiting for disk I/O, which suggests a storage bottleneck rather than a CPU shortage. High steal time on virtual machines indicates that the hypervisor is not allocating enough CPU time to the guest, a common problem in overcommitted virtualization environments.
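The iowait and steal breakdown described above can be derived from two readings of the aggregate "cpu" line in /proc/stat. The following is a minimal sketch, not a production collector; the two sample lines are made-up values standing in for readings taken a few seconds apart.

```python
# Column order of the aggregate "cpu" line in /proc/stat (values are clock ticks).
# guest and guest_nice are skipped because the kernel already counts them
# inside user and nice.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Return the share of elapsed CPU time spent in each state, in percent."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


# Hypothetical samples; on a real host, read the first line of /proc/stat twice.
sample_1 = "cpu  1000 0 500 8000 400 0 0 100 0 0"
sample_2 = "cpu  1100 0 550 8500 700 0 0 150 0 0"

pct = cpu_percentages(parse_cpu_line(sample_1), parse_cpu_line(sample_2))
print(f"iowait {pct['iowait']:.1f}%  steal {pct['steal']:.1f}%")
```

The counters are cumulative since boot, so only the difference between two samples says anything about current behavior.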
Memory metrics cover total, used, available, and free memory, swap usage, buffer and cache allocations, and the activity of the OOM (Out of Memory) killer. The distinction between "free" and "available" memory is important on Linux. Free memory is genuinely unused, while available memory also includes buffer/cache allocations that the kernel can reclaim under pressure. A server showing 95% memory utilization (with buffers and cache counted as used) but 40% available memory is healthy, while a server with low available memory and active swap usage is under pressure.
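The free versus available distinction can be read directly from /proc/meminfo. The sketch below parses a made-up sample of that file (the numbers are illustrative, chosen to match the 95% used, 40% available case above) and assumes a kernel recent enough to report MemAvailable.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly KiB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(meminfo):
    """Summarize free vs. available memory and swap in use."""
    total = meminfo["MemTotal"]
    return {
        "free_pct": 100.0 * meminfo["MemFree"] / total,
        "available_pct": 100.0 * meminfo["MemAvailable"] / total,
        "swap_used_kib": meminfo["SwapTotal"] - meminfo["SwapFree"],
    }


# Hypothetical sample; on a real host, read the contents of /proc/meminfo.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:          800000 kB
MemAvailable:    6400000 kB
Buffers:          400000 kB
Cached:          5600000 kB
SwapTotal:       2000000 kB
SwapFree:        2000000 kB
"""

summary = memory_summary(parse_meminfo(SAMPLE))
print(summary)
```

Alerting on the available percentage together with swap usage, rather than on free memory alone, avoids false alarms on hosts that are simply using spare RAM for cache.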
Disk metrics include filesystem utilization (percent full), inode utilization, disk I/O throughput (read/write bytes per second), IOPS (I/O operations per second), I/O latency, and I/O queue depth. SMART health data from physical drives provides early warning of hardware failures. Running out of inodes while disk space remains available is a common surprise that monitoring should catch before it causes service failures.
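Space and inode utilization can both be read with a single statvfs call. The sketch below uses Python's os.statvfs; the helper names are illustrative, and the percentages are computed from total and free counts, which can differ slightly from what df prints because df accounts for blocks reserved for root.

```python
import os


def utilization(total, free):
    """Percent used given total and free counts.

    Returns None when total is 0, which some filesystems report for inodes
    because they allocate them dynamically.
    """
    if total == 0:
        return None
    return 100.0 * (total - free) / total


def filesystem_usage(path):
    """Return space and inode utilization for the filesystem containing path."""
    st = os.statvfs(path)
    return {
        "space_pct": utilization(st.f_blocks, st.f_bfree),
        "inode_pct": utilization(st.f_files, st.f_ffree),
    }


# Hypothetical counters: plenty of space left, but every inode is in use.
space_pct = utilization(total=1_000_000, free=600_000)
inode_pct = utilization(total=65_536, free=0)
print(f"space {space_pct:.0f}% used, inodes {inode_pct:.0f}% used")
```

On a live system, filesystem_usage("/") returns both figures for the root filesystem, so one check can alert on whichever limit is reached first.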
Network metrics cover interface throughput (bytes in/out), packet rates, error rates, dropped packets, TCP connection states, TCP retransmissions, and socket buffer overflows. High retransmission rates indicate packet loss or congestion somewhere on the network path. A growing number of sockets in TIME_WAIT state may indicate connection management problems in the application layer, such as opening a new connection for every request.
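TCP connection states, including TIME_WAIT, are visible in /proc/net/tcp, where the fourth column holds the state as a hexadecimal code. The sketch below counts sockets per state from a made-up excerpt of that file; the rows are shortened for readability and the addresses are illustrative.

```python
from collections import Counter

# State codes used by the kernel in /proc/net/tcp and /proc/net/tcp6.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}


def count_tcp_states(text):
    """Count sockets per TCP state from /proc/net/tcp-style text."""
    counts = Counter()
    for line in text.splitlines()[1:]:  # the first line is a column header
        fields = line.split()
        if len(fields) < 4:
            continue
        counts[TCP_STATES.get(fields[3].upper(), "UNKNOWN")] += 1
    return counts


# Hypothetical excerpt; on a real host, read /proc/net/tcp (and tcp6).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:0019 00000000:0000 0A 00000000:00000000
   1: 0100007F:1F90 0100007F:C350 01 00000000:00000000
   2: 0100007F:1F90 0100007F:C351 06 00000000:00000000
   3: 0100007F:1F90 0100007F:C352 06 00000000:00000000
"""

states = count_tcp_states(SAMPLE)
print(dict(states))
```

Tracking the TIME_WAIT count over time, rather than its absolute value at one moment, is what reveals the growth pattern mentioned above.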
Process and service monitoring verifies that critical services are running, consuming expected resources, and responding correctly. This includes checking that systemd services are in the active state, monitoring process counts and resource consumption for key daemons, and verifying that service ports are listening and accepting connections. A web server whose process is running but which is not accepting connections is a failure that port and response checks catch, while a simple process-alive check would miss it.
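A connection-level check of the kind described above can be as small as a TCP connect with a timeout. This is a minimal sketch; the function name is illustrative, and a fuller check would also send a request and validate the response.

```python
import socket


def port_accepts_connections(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example call (host and port are placeholders):
#   port_accepts_connections("127.0.0.1", 80)
```

Because the check completes a real TCP handshake, it fails when the listener is gone or its accept queue is full, even if the process itself still appears in the process table.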
Prometheus Node Exporter
Node Exporter is the standard Prometheus exporter for Linux system metrics. It reads directly from /proc and /sys to collect detailed metrics on CPU, memory, disk, network, filesystems, and hardware health, exposing them as Prometheus-format metrics on an HTTP endpoint (default port 9100). Node Exporter generates between 500 and 1,500 time series per host depending on the number of disks, network interfaces, and enabled collectors.
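To get a feel for how many time series a host produces, the text exposed on the metrics endpoint can be counted per metric name. The sketch below works on a short made-up excerpt in the Prometheus text format and assumes samples carry no trailing timestamp; the values are illustrative.

```python
from collections import Counter


def count_series(exposition_text):
    """Count samples per metric name in Prometheus text-format output."""
    per_metric = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        sample, _value = line.rsplit(None, 1)  # the value is the last token
        name = sample.split("{", 1)[0]         # drop the label set, if any
        per_metric[name] += 1
    return per_metric


# Hypothetical excerpt of what a scrape of port 9100 might return.
SAMPLE = """\
# HELP node_cpu_seconds_total Seconds the CPUs spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 8000.5
node_cpu_seconds_total{cpu="0",mode="iowait"} 40.25
# TYPE node_load1 gauge
node_load1 0.42
"""

series = count_series(SAMPLE)
print(sum(series.values()), "series across", len(series), "metric names")
```

Each distinct combination of metric name and labels is one time series, which is why hosts with many cores, disks, or interfaces land at the upper end of the range.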
Node Exporter's collector architecture is modular. Individual collectors for specific metric categories can be enabled or disabled at startup. The default set covers the most commonly needed system metrics. Other collectors, some of which must be enabled explicitly, provide data from sources like systemd unit states, filesystem mount statistics, hardware thermal sensors, and textfile-based custom metrics. IPMI readings are not collected by Node Exporter itself and are usually gathered with a separate IPMI exporter. The textfile collector is particularly useful because it allows any script or application to write metrics in Prometheus format to a text file, which Node Exporter then exposes alongside its built-in metrics.
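A script feeding the textfile collector only needs to write a file ending in .prom into the directory that Node Exporter was started with (the --collector.textfile.directory flag). The sketch below writes one gauge and renames the file into place so that a scrape never sees a half-written file; the metric name, file name, and directory are assumptions for illustration.

```python
import os
import tempfile


def write_textfile_metric(directory, filename, name, value, help_text):
    """Atomically write a single gauge in Prometheus text format."""
    body = (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} gauge\n"
        f"{name} {value}\n"
    )
    # Write to a temporary file in the same directory, then rename it.
    # Only *.prom files are read, so the temporary file is ignored.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        handle.write(body)
    os.chmod(tmp_path, 0o644)  # mkstemp creates 0600; the exporter must be able to read it
    os.replace(tmp_path, os.path.join(directory, filename))
    return body


# Demonstration against a throwaway directory instead of a real collector directory.
with tempfile.TemporaryDirectory() as demo_dir:
    write_textfile_metric(
        demo_dir,
        "backup.prom",                            # hypothetical file name
        "backup_last_success_timestamp_seconds",  # hypothetical metric name
        1700000000,
        "Unix time of the last successful backup.",
    )
    with open(os.path.join(demo_dir, "backup.prom")) as handle:
        written = handle.read()

print(written, end="")
```

A cron job or systemd timer that runs such a script is a common way to expose results of batch work, such as backups, that no long-running process could report.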
When paired with Grafana, Node Exporter data powers the widely used "Node Exporter Full" dashboard (ID 1860), which provides comprehensive visualizations for all major Linux system metrics. This combination of Node Exporter for data collection and Grafana for visualization is a very common approach to Linux server monitoring in cloud-native environments, especially those running Kubernetes where Prometheus is already deployed for cluster monitoring.
Zabbix Agent
The Zabbix agent provides deep Linux system monitoring as part of the Zabbix monitoring platform. Unlike Node Exporter's pull-only model, the Zabbix agent supports both passive mode (the server polls the agent) and active mode (the agent pushes data to the server), making it flexible for environments with different network topologies and firewall constraints. Active mode is particularly useful when monitored hosts are behind NAT or firewalls that prevent inbound connections from the monitoring server.
Zabbix's Linux monitoring templates configure hundreds of monitoring items automatically when applied to a host, covering CPU, memory, disk, network, filesystem, process, and service metrics. The agent also supports UserParameter definitions, which are custom check commands defined in the agent configuration that extend monitoring to application-specific metrics. A UserParameter can run any shell command and return the output as a metric value, making it straightforward to monitor custom application health indicators, log file patterns, or business metrics.
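A UserParameter entry has the form UserParameter=<key>,<command>, and the command only has to print a single value. The sketch below shows the kind of script such an entry might call: it counts log lines matching a pattern. The item key (for example app.log.errors), the script path, and the log format are assumptions for illustration.

```python
import re


def count_matching_lines(lines, pattern):
    """Return how many lines contain a match for the regular expression."""
    regex = re.compile(pattern)
    return sum(1 for line in lines if regex.search(line))


# Hypothetical log excerpt; a real script would read the application's log file.
SAMPLE_LOG = [
    "2024-01-01 12:00:01 INFO request served",
    "2024-01-01 12:00:02 ERROR upstream timeout",
    "2024-01-01 12:00:03 ERROR database unavailable",
]

error_count = count_matching_lines(SAMPLE_LOG, r"\bERROR\b")
# The agent uses whatever the command prints as the item value.
print(error_count)
```

Keeping the output to one bare number keeps the item easy to type as numeric on the server side, which in turn makes it usable in triggers and graphs.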
Low-level discovery (LLD) is a Zabbix feature that automatically detects monitoring targets on a host, such as filesystem mount points, network interfaces, block devices, and systemd services, and creates monitoring items for each discovered entity. When a new disk is added or a new network interface appears, LLD detects it and begins monitoring it without manual intervention. This auto-discovery capability is valuable for environments where server configurations change frequently.
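A custom discovery rule receives its list of entities as JSON in which each object maps LLD macros, written as {#NAME}, to values. The sketch below builds such a list from /proc/mounts-style text. It assumes the plain-array JSON form accepted by recent Zabbix versions, and the sample mounts and the set of skipped filesystem types are illustrative.

```python
import json

# Pseudo-filesystems that are rarely worth monitoring (an illustrative choice).
SKIP_TYPES = ("proc", "sysfs", "tmpfs", "devtmpfs")


def lld_filesystems(mounts_text, skip_types=SKIP_TYPES):
    """Build low-level discovery JSON from /proc/mounts-style text."""
    rows = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        _device, mount_point, fs_type = fields[:3]
        if fs_type in skip_types:
            continue
        rows.append({"{#FSNAME}": mount_point, "{#FSTYPE}": fs_type})
    return json.dumps(rows)


# Hypothetical excerpt; on a real host, read /proc/mounts.
SAMPLE_MOUNTS = """\
/dev/sda1 / ext4 rw,relatime 0 0
proc /proc proc rw,nosuid 0 0
/dev/sdb1 /data xfs rw,noatime 0 0
tmpfs /run tmpfs rw,nosuid 0 0
"""

discovery_json = lld_filesystems(SAMPLE_MOUNTS)
print(discovery_json)
```

Item prototypes on the server then reference the macros, so one prototype such as a free-space check expands into one item per discovered mount point.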
Netdata
Netdata takes a different approach to Linux monitoring by providing per-second metrics with zero configuration. The Netdata agent auto-detects and monitors over 800 types of data sources, including system metrics, containers, databases, web servers, message brokers, and application frameworks. It begins collecting data immediately upon installation without requiring any template application, target configuration, or dashboard creation.
The per-second collection interval sets Netdata apart from most monitoring tools, which collect at 10- to 60-second intervals. This high resolution reveals transient issues like brief CPU spikes, momentary disk I/O saturation, or short-lived network errors that lower-resolution monitoring averages away or misses entirely. Despite this granularity, Netdata's resource consumption is low, typically cited as around 1% of a CPU core and a few hundred megabytes of RAM, though the actual figures depend on the number of metrics collected and the retention configured. This efficiency comes from its custom time-series database engine, which stores data compactly using adaptive compression.
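The effect of collection interval on short spikes is easy to show with arithmetic. The sketch below takes one minute of made-up per-second CPU readings containing a three-second spike and compares the per-second peak with the single value a 60-second average would record.

```python
def downsample(samples, window):
    """Average consecutive groups of `window` samples."""
    return [
        sum(samples[i:i + window]) / len(samples[i:i + window])
        for i in range(0, len(samples), window)
    ]


# Hypothetical data: CPU at 10% for a minute, except a 3-second burst at 100%.
per_second = [10.0] * 60
per_second[20:23] = [100.0, 100.0, 100.0]

per_minute = downsample(per_second, 60)

print("per-second peak:", max(per_second))  # 100.0
print("60-second average:", per_minute[0])  # 14.5
```

The minute-level value of 14.5% looks unremarkable, while the per-second series shows the CPU fully saturated for three seconds.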
Netdata's built-in web dashboard provides interactive real-time visualizations that update every second. Each chart supports zooming, panning, and highlighting, making it useful for live troubleshooting sessions where an engineer is investigating a problem as it happens. The machine-learning-based anomaly detection feature identifies metrics that behave unusually compared to their historical patterns, highlighting potential issues without requiring manually configured thresholds for every metric.
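The idea of flagging a value against a metric's own history, instead of a fixed threshold, can be illustrated with a simple z-score test. This is a deliberately simpler stand-in and not Netdata's actual algorithm; the threshold of three standard deviations and the sample history are assumptions.

```python
from statistics import mean, pstdev


def is_anomalous(history, value, threshold=3.0):
    """Flag value if it lies more than `threshold` standard deviations from the mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # A perfectly flat history: any different value is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical recent readings of some metric, e.g. requests per second.
history = [10, 11, 9, 10, 12, 10, 11, 9]

print(is_anomalous(history, 11))  # within normal variation
print(is_anomalous(history, 30))  # far outside the historical range
```

Because the baseline comes from the data, the same function works unchanged for metrics with very different scales, which is what removes the need for per-metric thresholds.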
Checkmk Agent
The Checkmk agent collects all system data in a single invocation, sending a complete snapshot of the host's state to the Checkmk server in one network transaction. This architecture is more efficient than approaches that require a separate network round-trip for each check, reducing both network overhead and agent CPU consumption. The agent output includes CPU, memory, disk, network, process, and service data along with auto-discovered application metrics from plug-ins.
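The agent's single snapshot is plain text divided into sections, each introduced by a header of the form <<<name>>>. The sketch below splits such output into a dictionary of sections. The sample output is a shortened, made-up illustration and not a verbatim capture from a real agent.

```python
import re

# A section header looks like <<<name>>> and may carry options after a colon,
# for example <<<name:sep(58)>>>; only the name is kept here.
HEADER = re.compile(r"^<<<([^>:]+)(?::[^>]*)?>>>$")


def split_sections(agent_output):
    """Split agent output into {section_name: [lines]} in order of appearance."""
    sections = {}
    current = None
    for line in agent_output.splitlines():
        match = HEADER.match(line.strip())
        if match:
            current = match.group(1)
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections


# Hypothetical, shortened agent output.
SAMPLE_OUTPUT = """\
<<<check_mk>>>
AgentOS: linux
<<<cpu>>>
0.42 0.35 0.30 1/512 12345 8
<<<df>>>
/dev/sda1 ext4 41152736 12345678 26700000 32% /
"""

sections = split_sections(SAMPLE_OUTPUT)
print(list(sections))
```

Because every section arrives in the same response, the server can evaluate all checks for a host from one fetch instead of contacting the host once per check.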
Checkmk's auto-discovery analyzes the agent output and proposes monitoring items based on what it finds on the host. When a new service is installed, a new filesystem is mounted, or a new network interface is configured, Checkmk detects the change during its next discovery run and offers to add monitoring for the new component. The rule-based configuration system then applies monitoring thresholds, notification routing, and grouping rules based on the host's properties, ensuring consistent monitoring without per-host manual configuration.
Command-Line Monitoring Tools
Linux provides several built-in and widely available command-line tools for interactive monitoring and troubleshooting. While these are not replacements for a monitoring platform, they are essential for real-time investigation when an alert fires or when an engineer needs to understand current system behavior.
top and htop show real-time process activity with CPU and memory utilization broken down by individual process. htop provides a more interactive and visually clear interface with mouse support, tree views, and process filtering. iotop shows per-process disk I/O activity, revealing which processes are generating the most read and write traffic. vmstat provides a summary of system-wide memory, swap, I/O, and CPU statistics over time. sar (from the sysstat package) collects and reports historical system activity data, making it useful for reviewing what happened before the current moment. dstat combines the output of vmstat, iostat, and ifstat into a single color-coded display with per-second updates.
These tools read from the same /proc and /sys sources that monitoring agents use, so there is no conflict between running interactive tools and having a monitoring agent installed. Engineers often use these tools to drill deeper into a specific issue after a monitoring dashboard or alert has identified the general area of the problem.
Linux servers provide exceptionally detailed monitoring data through kernel interfaces. The choice of monitoring tool matters less than ensuring consistent agent deployment, appropriate metric coverage, and alerting rules focused on conditions that require human attention.