The USE method was developed by Brendan Gregg to study performance problems in a systematic way [1]. It provides a simple, top-down approach to identify bottlenecks quickly and reliably. The gathering of performance metrics is usually done using a myriad of different system tools (sar, iostat, vmstat) and tracers (dtrace, perf, ebpf). Circonus has now added the most relevant USE metrics to its monitoring agent and conveniently presents them in the form of a USE Dashboard (Figure 1, live demo) that allows a full USE analysis at a single glance.

System monitoring is concerned with monitoring basic system resources: CPU, memory, network, and disks. These computational resources are not consumed directly by the application. Instead, the operating system abstracts and manages these resources and provides a consistent API to the application (“process environment”, system calls, etc.). Figure 2 illustrates the high-level architecture. A more detailed version can be found in Gregg [2, 3].

Figure 2: High-level System Overview

One critical objective of system monitoring is to check how the available resources are utilized. Typical questions are: Is my CPU fully utilized? Is my application running out of memory? Do we have enough disk capacity left?

While a fully utilized resource is an indication of a performance bottleneck, it might not be a problem at all. A fully utilized CPU means only that we are making good use of the system. It starts causing problems only when incoming requests start queuing up or producing errors, and hence the performance of the application is impacted. But queuing does not only occur in the application layer. Modern software stacks use queuing in all system components to improve performance and distribute load. The degree to which a resource has extra work that it cannot service is called saturation [3, p42], and it is another important indicator for performance bottlenecks.

The USE method, by Gregg, is an excellent way to identify performance problems quickly. It uses a top-down approach to summarize the system resources, which ensures that every resource is covered. Other approaches suffer from a “street light syndrome,” in that the focus lies on parts of the system where metrics are readily available. In other cases, random changes are applied in the hope that the problems go away.

The USE method can be summarized as follows: For each resource, check utilization, saturation, and errors.

The USE analysis is started by creating an exhaustive list of the resources that are consumed by the application. The four resource types mentioned above are the most important ones, but there are more resources, like IO bus, memory bus, and network controllers, that should be included in a thorough analysis. For each resource, errors should be investigated first, since they impact performance and might not be noticed immediately when the failure is recoverable. Then, utilization and saturation are checked.

For more details about the USE method and its application to system performance analysis, the reader is referred to the excellent book by Gregg.

It’s not immediately clear how utilization, saturation, and errors can be quantified for different system resources. Fortunately, Gregg has compiled a sensible list of indicators [2] that are available on Linux systems. We have taken this list as our starting point to define a set of USE metrics for monitoring systems with Circonus. In this section, we will go over each of them and explain their significance.

CPU utilization is described by the following metrics:

* cpu`system + cpu`intr - time spent executing OS-kernel code (dark blue).
* cpu`user - time spent executing user code (blue).
* cpu`wait_io - time spent in the idle thread while waiting on IO (dark yellow).
* cpu`idle - time spent in the idle thread and not waiting on IO (yellow).

Those metrics should give you a rough idea of what the CPU was doing during the last reporting period (1M). Blue colors represent time that the system spent doing work; yellow colors represent time that the system spent waiting.

Like all other monitoring tools we are aware of, the values are derived from `/proc/stat` [4]. However, they should be taken with the necessary precaution [5, 6]: Those counters do not usually add up to 100%, which is already a bad sign. Also, accounting takes place in units of full CPU time slices (jiffies) at the time of the clock interrupt. There are many voluntary context switches within a single time slice that are missed this way. Tickless kernels and varying clock speeds (Intel Turbo Boost/Turbo Step) further blur the meaning of these metrics. For the subtleties in accounting wait_io on multiprocessor systems see A.

Nevertheless, these metrics are used everywhere to measure CPU utilization and have proven to be a valuable first source of information.
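To make the `/proc/stat` accounting discussed above more concrete, here is a minimal Python sketch of how utilization percentages can be derived from two snapshots of the aggregate `cpu` line. It is not the Circonus agent's implementation. The function names (`parse_cpu_line`, `utilization`, `read_proc_stat`) are hypothetical, and the mapping of kernel counters to the four buckets (for example, counting `nice` as user time and `irq` plus `softirq` as interrupt time) is an assumption made for illustration.

```python
from typing import Dict

# Column order of the aggregate "cpu" line in /proc/stat, as documented in proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(stat_text: str) -> Dict[str, int]:
    """Return the jiffy counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            # Older kernels report fewer columns; zip stops at the shorter side.
            return dict(zip(FIELDS, (int(v) for v in parts[1:])))
    raise ValueError("no aggregate 'cpu' line found")


def utilization(prev: Dict[str, int], curr: Dict[str, int]) -> Dict[str, float]:
    """Percentages per bucket for the interval between two snapshots."""
    delta = {k: curr.get(k, 0) - prev.get(k, 0) for k in FIELDS}
    # Guest time is already contained in user/nice, so it is left out of the total.
    total = sum(v for k, v in delta.items() if k not in ("guest", "guest_nice"))
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")

    def pct(jiffies: int) -> float:
        return 100.0 * jiffies / total

    # Assumed bucket mapping (illustrative only). Steal time is part of the
    # total but of no bucket, so the four values can sum to less than 100.
    return {
        "user": pct(delta["user"] + delta["nice"]),
        "system+intr": pct(delta["system"] + delta["irq"] + delta["softirq"]),
        "wait_io": pct(delta["iowait"]),
        "idle": pct(delta["idle"]),
    }


def read_proc_stat(path: str = "/proc/stat") -> Dict[str, int]:
    """Take one snapshot on a Linux host."""
    with open(path) as handle:
        return parse_cpu_line(handle.read())
```

On a Linux host, calling `read_proc_stat()` twice with a pause in between (for example 60 seconds, to match a one-minute reporting period) and passing both results to `utilization` yields the four percentages. Because the counters only advance in whole jiffies at clock interrupts, the caveats described above apply to this sketch as well.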