Diagnosing CPU and memory related performance problems

Performance problems can be fun and interesting to identify and resolve. Several system monitoring programs provide the information needed to identify the source of a problem. These include the venerable top, as well as atop, htop, and glances. These programs all perform essentially the same function, but compared to top they may display additional information or use alternate formats that make the problem easier to spot.

This article focuses on top because it is the oldest and most familiar of these programs and is almost always present, while the other programs may not be installed. However, the steps are the same and, unless otherwise noted, the information used in these examples is provided by all of the listed monitoring programs.

The top program is a very important and powerful tool to observe memory and CPU usage as well as load averages in a dynamic setting. The information provided by top can be instrumental in helping diagnose an extant problem; it is usually the first tool I use when troubleshooting a new problem.

Understanding the information that top is presenting is key to using it to greatest effect. Let’s look at some of the data which can alert us to performance problems and explore their meanings in more depth. Much of this information also pertains to the other system monitors.

CPU Usage

CPU usage is a fairly simple measure of how much CPU time is being used by executing instructions. These numbers are displayed as percentages and represent the amount of time that a CPU is being used during the defined time period.

The default time interval is usually 3 seconds, although this can be changed using the “s” key; I normally use 1 second. Fractional seconds are also accepted, down to 0.01 seconds. I do not recommend intervals of less than 1 second, as they add load to the system and make the data difficult to read. However, as with everything in Linux, the flexibility is there, and it may occasionally be useful to set the interval to less than one second.

A CPU is a discrete device in that it is either in use or not at any given point in time. It is either on or off; it cannot be used at less than 100% capacity at any instant in time, and "at any instant" is the key phrase. The CPU usage measurements are the percentage of the time that the CPU is being utilized during the defined time period. So if the CPU usage shows 24% and the time period is 1 second, the CPU was being utilized for 24% of that 1 second time frame.
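The arithmetic behind that percentage can be sketched in a few lines of Python. This is only an illustration of the idea, not how top itself is implemented; the function name and the tick counts are made up for the example, and I am assuming a counter that advances 100 ticks per second.

```python
def cpu_percent(busy_start, total_start, busy_end, total_end):
    """Return the percentage of the sampling interval the CPU spent busy.

    All four arguments are cumulative tick counts (hypothetical units)
    taken at the start and at the end of the interval.
    """
    elapsed = total_end - total_start
    if elapsed <= 0:
        # No time has passed between the samples, so nothing to report.
        return 0.0
    return 100.0 * (busy_end - busy_start) / elapsed


# One second at an assumed 100 ticks per second, with 24 of those ticks busy.
usage = cpu_percent(1000, 5000, 1024, 5100)
print(f"CPU usage over the interval: {usage}%")
```

At each tick the CPU was either busy or not; the percentage only appears once the busy ticks are averaged over the interval.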

top - 09:47:38 up 13 days, 24 min,  6 users,  load average: 0.13, 0.04, 0.01
Tasks: 180 total,   1 running, 179 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.9%us,  0.9%sy,  0.0%ni, 98.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   2056456k total,   797768k used,  1258688k free,    92028k buffers
Swap:  4095992k total,       88k used,  4095904k free,   336252k cached

Several fields break the CPU usage down in more detail. The us, sy, ni, and wa fields subdivide the CPU usage into categories that can provide more insight into what is using CPU time in the system, and the id field shows the time that is left over.

us – User space is CPU time spent performing tasks in user space as opposed to system, or kernel space. This is where user level programs run.
sy – System is CPU time spent performing system tasks. These are mostly kernel tasks such as memory management, task dispatching and all the other tasks performed by the kernel.
ni – This is “nice” time: CPU time spent on tasks that have a positive nice number. A positive nice number makes a task nicer; that is, it is less demanding of CPU time and other tasks may get priority over it. More on “nice” in another document.
wa – Wait time is the amount of time that a CPU is waiting on some I/O such as a disk read or write to occur. The program running on that CPU is waiting for the result of that I/O operation before it can continue and is blocked until then.
id – Idle time is any time that the CPU is free and is not performing any processing or waiting for I/O to occur.

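These five times, together with the hi, si, and st fields that top also displays (all zero in the example above), should add up to 100% for each CPU, give or take a couple of tenths of rounding error.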
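If you want to check that sum yourself, the following Python sketch parses one Cpu line copied from the example above and adds up its fields. The function name and the regular expression are my own, and they assume the older "0.9%us" layout shown in this article; other versions of top may lay the line out differently and would need a different pattern.

```python
import re

# Matches pairs such as "98.1%id": a number, a percent sign, a field name.
FIELD_RE = re.compile(r"(\d+(?:\.\d+)?)%(\w+)")


def parse_cpu_line(line):
    """Map each field name (us, sy, ni, id, ...) to its percentage."""
    return {name: float(value) for value, name in FIELD_RE.findall(line)}


line = "Cpu1  :  0.9%us,  0.9%sy,  0.0%ni, 98.1%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st"
fields = parse_cpu_line(line)
total = sum(fields.values())

print(fields)
print(f"total: {total:.1f}%")
```

For this line the fields come to about 99.9%, which is the kind of rounding error mentioned above.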

Things to look for with CPU Usage

You should check a couple of things with CPU usage when you are troubleshooting a problem. Look for one or more CPUs that have 0% idle time for extended periods. You especially have a problem if all CPUs have very low idle time. You should then look to the Task area of the top display to determine which process is using the CPU time.

Be careful to understand whether the high CPU usage might be normal for a particular environment or program, and whether you might be seeing normal or transient behavior. The load averages discussed below can help determine whether the system is overloaded or just very busy.

Load Averages

The first line of the output from top contains the current load averages. The three numbers are the 1, 5, and 15 minute load averages for the system. In the example below, the load averages are 2.49, 1.37 and 0.60, respectively.

top - 12:21:44 up 1 day,  3:25,  7 users,  load average: 2.49, 1.37, 0.60
Tasks: 257 total,   5 running, 252 sleeping,   0 stopped,   0 zombie
Cpu0  : 33.2%us, 32.3%sy,  0.0%ni, 34.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  : 51.7%us, 24.0%sy,  0.0%ni, 24.2%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 24.6%us, 48.5%sy,  0.0%ni, 27.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  : 67.1%us, 21.6%sy,  0.0%ni, 11.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   6122964k total,  3582032k used,  2540932k free,   358752k buffers
Swap:  8191996k total,        0k used,  8191996k free,  2596520k cached

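But what does it really mean when I say that the 1 (or 5 or 15) minute load average is 2.49? Load average can be considered a measure of demand for the CPU; it is a number that represents the average number of processes that are running or waiting for CPU time. On Linux the count also includes processes in uninterruptible sleep, which are usually waiting for disk I/O.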

Thus in a single processor system a fully utilized CPU would have a load average of 1. This means that the CPU is keeping up exactly with the demand; in other words it has perfect utilization. A load average of less than 1 means that the CPU is underutilized, and a load average of greater than 1 means that the CPU is overutilized and that there is pent-up, unsatisfied demand. For example, a load average of 1.5 in a single CPU system indicates that some processes are forced to wait for CPU time until the ones ahead of them have had their turn.

This is also true for multiple processors. If a 4 CPU system has a load average of 4 then it has perfect utilization. If it has a load average of 3.24, for example, that is the equivalent of three fully utilized processors and a fourth that is only about 24% utilized, leaving it about 76% idle. In the example above, a 4 CPU system has a 1 minute load average of 2.49, meaning that there is still significant capacity available among the 4 CPUs.
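A quick way to apply this rule of thumb is to divide the load average by the number of CPUs. The Python sketch below does that; the function names and the wording of the labels are my own, and the simple comparison against 1 follows the reasoning in this article rather than any official threshold.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Load average divided by the number of CPUs in the system."""
    return load_average / cpu_count


def describe(load_average, cpu_count):
    """Label a load average the way this article reasons about it."""
    ratio = load_per_cpu(load_average, cpu_count)
    if ratio < 1:
        return "underutilized"
    if ratio > 1:
        return "overutilized"
    return "fully utilized"


def current_status():
    """Describe the live 1 minute load average (Unix-like systems only)."""
    one_minute = os.getloadavg()[0]
    # os.cpu_count() can return None, so fall back to a single CPU.
    return describe(one_minute, os.cpu_count() or 1)


# The 4 CPU example from the top output above.
print(load_per_cpu(2.49, 4), describe(2.49, 4))
```

Calling current_status() on the machine you are troubleshooting gives the same label for its live 1 minute load average.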

The optimum condition for load average in commercial or scientific environments is for it to equal the total number of CPUs in a system. That would mean that every CPU is fully utilized and yet no process is forced to wait.

Also note that the longer-term load averages provide an indication of the overall utilization trend. In the above example, the 1 minute load average appears to reflect a short-term peak in utilization, and there is still plenty of capacity available.
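Reading the trend amounts to comparing the three numbers from left to right. Here is a minimal Python sketch of that comparison; the function name, parameter names, and labels are hypothetical, and a real system would deserve more than one reading before you draw a conclusion.

```python
def load_trend(short, medium, long):
    """Compare the three load averages in the order top prints them.

    short is the most recent (shortest period) average and long covers
    the longest period.
    """
    if short > medium > long:
        return "rising"
    if short < medium < long:
        return "falling"
    return "steady or mixed"


# The load averages from the example output above.
print(load_trend(2.49, 1.37, 0.60))
```

A "rising" result here only says that demand has been higher recently than over the longer periods; whether that is a problem still depends on the number of CPUs.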

Linux Journal has an excellent article describing load averages, the theory and the math behind them and how to interpret them in the December 1, 2006 issue. This link will take you directly to that article: http://www.linuxjournal.com/article/9001?page=0,0

Memory Usage

Performance problems can also be caused by lack of memory. Without sufficient memory in which to run all the active programs, the kernel memory management subsystems will spend time moving the contents of memory between swap space on the disk and RAM in order to keep all processes running. This swapping takes CPU time and I/O bandwidth, so it slows down the progress of productive work. Ultimately a state known as “thrashing” can occur in which the majority of the computer’s time is spent on moving memory contents between disk and RAM and little or no time is available to spend on productive work.
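One simple number to watch is how much swap space is in use. The Python sketch below computes it from text in the format of /proc/meminfo, using the SwapTotal and SwapFree entries. The function name is my own, and the sample values are taken from the Swap line of the first top display in this article rather than from a live system.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use, in kB, from text in /proc/meminfo format."""
    values = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value.
            values[name.strip()] = int(parts[0])
    return values["SwapTotal"] - values["SwapFree"]


# Sample text using the swap figures from the first top display above.
sample = """SwapTotal:       4095992 kB
SwapFree:        4095904 kB"""

print(swap_used_kb(sample), "kB of swap in use")
```

On a running Linux system you could pass in the contents of /proc/meminfo instead of the sample. A small, stable number is usually nothing to worry about; a number that keeps growing from one reading to the next suggests the system is running short of RAM.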

Part of the problem with trying to diagnose this condition is that the computer will be barely responsive at best. So, if you are experiencing problems, get into a state – perhaps by rebooting – that will allow you to launch top, htop, glances or any of the real-time monitoring programs, and then start the programs that you suspect may be causing the problem. Hopefully the monitor will show you which program is the source of the problem so that it can be fixed.

The Task List

The top task list provides a view of the tasks consuming the most of a particular resource. As discussed in the document top in this DataBook, the task list can be sorted by any of the displayed columns including CPU and memory usage. By default top is sorted by CPU usage from high to low. This provides a quick way to view the processes consuming the most CPU cycles. If one stands out, such as by sucking up 90% or more of the available CPU cycles, this could be indicative of a problem. That is not always the case; some applications just gobble huge amounts of CPU time. Again, it is imperative that you observe a correctly running system to understand what is normal so that you will recognize abnormal behavior when you see it.
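The same sort-and-scan approach can be sketched in Python. Everything in this example is hypothetical: the task records, the command names, and the 90% cutoff are made up for illustration and are not output from a real system.

```python
def top_consumers(tasks, key="cpu", limit=3):
    """Return the tasks using the most of one resource, highest first."""
    return sorted(tasks, key=lambda task: task[key], reverse=True)[:limit]


# Hypothetical task records; cpu and mem are percentages.
tasks = [
    {"pid": 101, "command": "hypothetical-a", "cpu": 3.0, "mem": 12.5},
    {"pid": 202, "command": "hypothetical-b", "cpu": 92.5, "mem": 4.0},
    {"pid": 303, "command": "hypothetical-c", "cpu": 0.7, "mem": 30.2},
    {"pid": 404, "command": "hypothetical-d", "cpu": 12.1, "mem": 1.1},
]

# Tasks at or above an assumed 90% CPU cutoff deserve a closer look.
suspects = [task for task in tasks if task["cpu"] >= 90.0]

for task in top_consumers(tasks, key="cpu", limit=2):
    print(task["pid"], task["command"], task["cpu"])
```

Passing key="mem" sorts the same list by memory usage instead, much like changing the sort column in top.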