Temperature monitoring tools to keep your Linux computer cool

Linux and light bulbs

Have you ever noticed that light bulbs – the incandescent ones especially – seem to burn out most frequently at the instant when they are turned on? Or that electronic components like home theater systems or TVs worked fine yesterday but don’t today when you turn them on? I have, too.

Have you ever wondered why that happens?

Thermal Stress

There are many factors that affect the longevity of electronic equipment. One of the most ubiquitous sources of failure is heat. In fact, the heat generated by most electronic devices as they perform their assigned tasks is the very heat that shortens their electronic lives.

When I worked at IBM in Boca Raton at the dawn of the PC era, I worked as part of a group that was responsible for the maintainability of computers and other hardware of all types. One task of the labs in Boca Raton was to ensure that hardware broke very infrequently and that, when it did, it was easy to repair. I learned some interesting things about the effects of heat on the life of computers while I was there.

Let’s go back to the light bulb because it is an easily visible if somewhat infrequent example.

Every time a light bulb is turned on, electric current surges into the filament and heats it very rapidly from room temperature to more than 4,000° F; the specific temperature depends upon the wattage of the bulb and the ambient temperature. This thermal shock stresses the filament both through vaporization of the metal of which it is made and through the rapid expansion of that metal as it heats. When a light bulb is turned off the thermal shock is repeated, though less severely, during the cooling phase as the filament shrinks. The more times a bulb is cycled on and off, the more the effects of this thermal shock accumulate.

The primary effect of thermal shock is that some small parts of the filament – usually due to minute manufacturing variances – tend to become hotter than the other parts and this causes the metal at those points to evaporate faster. This makes the filament even weaker at that point and more susceptible to rapid overheating in subsequent power-on cycles. Eventually, the last of the metal evaporates when the bulb is turned on and the filament dies in a very bright flash.

The electrical circuitry in computers is much the same as the filament in a light bulb. Repeated heating and cooling cycles can damage the computer’s internal electronic components just as the filament of the light bulb was damaged over time. Over many years of testing, IBM and other companies that build hardware of all kinds have discovered that more damage is done by repeated power on and off cycles than by leaving the devices on all the time. The cost of a computer that is damaged by thermal shock includes the energy cost to build a new one or to replace the damaged parts.

Cooling is Essential

Keeping computers cool is essential for helping to ensure that they have a long life. Large data centers spend a great deal of energy to keep the computers in them cool. Without going into the details, designers need to ensure that the flow of cool air is directed into the data center and specifically into the racks of computers to keep them cool. It is even better if they can be kept at a fairly constant temperature.

Proper cooling is essential even in a home or office environment. In fact, it is even more essential in those environments because the ambient temperature is usually much higher than in a data center, since it is set primarily for the comfort of the people who work there.

Temperature Monitoring

One can measure the temperature of many different points in a data center as well as within individual racks. But how can the temperature of the internals of a computer be measured?

Fortunately, modern computers have many sensors built into various components to enable monitoring of temperatures, fan speeds, and voltages. If you have ever looked at the data available when a computer is in BIOS configuration mode, you have seen many of these values. But this does not show what is happening inside the computer under real-world conditions, when it is running loads of various types.

Linux has some software tools available to allow system administrators to monitor those internal sensors. Those tools are all based on the lm_sensors, SMART, and hddtemp library modules, which are available on all Red Hat based distributions and most others as well.

The simplest tool is the sensors command. Before the sensors command can be used, the sensors-detect command must be run to detect as many of the sensors installed on the host system as possible. The sensors command then produces output including motherboard and CPU temperatures, voltages at various points on the motherboard, and fan speeds. The sensors command also displays the temperature ranges considered to be normal, high, and critical.
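To give a feel for where readings like these come from, here is a minimal Python sketch. It assumes the kernel exposes its hardware-monitoring drivers under /sys/class/hwmon, with each temp*_input file holding a reading in thousandths of a degree Celsius; the layout on a particular system may differ. The function name read_hwmon_temperatures and the format of the dictionary keys are hypothetical, chosen only for illustration, and are not part of lm_sensors.

```python
from pathlib import Path


def read_hwmon_temperatures(root="/sys/class/hwmon"):
    """Return {"hwmonN chip/label": degrees_celsius} for each readable sensor."""
    readings = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        # Each chip directory normally carries a "name" file such as "coretemp".
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        for input_file in sorted(chip_dir.glob("temp*_input")):
            prefix = input_file.name[: -len("_input")]  # e.g. "temp1"
            # A matching tempN_label file is optional; fall back to the prefix.
            label_file = chip_dir / f"{prefix}_label"
            label = label_file.read_text().strip() if label_file.exists() else prefix
            try:
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that cannot be read or parsed
            readings[f"{chip_dir.name} {chip}/{label}"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for sensor, celsius in read_hwmon_temperatures().items():
        print(f"{sensor}: {celsius:.1f} C")
```

Run as a script, this prints one line per temperature sensor it can read. It does not show the high and critical limits that the sensors command reports, so it is a complement to that command rather than a replacement.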

The hddtemp command displays temperatures for a specified hard drive. The smartctl command shows the current temperature of the hard drive, various measurements that indicate the potential for hard drive failure, and, in some cases, an ASCII text history graph of the hard drive temperatures. This last output can be especially helpful in diagnosing some types of problems.
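If you want to collect the drive temperature from a script, something like the following Python sketch may be a starting point. It assumes an ATA drive whose "smartctl -A" attribute table includes a row named Temperature_Celsius with the temperature at the start of the raw value column; not every drive reports it that way, and NVMe drives use a different output format. The function names are hypothetical and chosen for illustration.

```python
import subprocess


def parse_smart_temperature(smartctl_output):
    """Return the drive temperature in Celsius from `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value begins at the tenth.
        if len(fields) >= 10 and fields[1] == "Temperature_Celsius":
            try:
                return int(fields[9])
            except ValueError:
                return None  # raw value was not a plain number
    return None  # the drive did not report this attribute


def read_drive_temperature(device):
    """Run smartctl against a device such as /dev/sda (usually needs root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_smart_temperature(result.stdout)
```

Keeping the parsing separate from the call to smartctl makes it possible to check the parser against saved output without touching a real drive.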

When used with the appropriate library modules, the glances command can display hard drive temperatures as well as all of the same temperatures provided by the sensors command. glances is a top-like command that provides a lot of information about a running system including CPU and memory usage, I/O information about the network devices and hard drive partitions, as well as a list of the processes using the highest amounts of various system resources.

There are also a number of good graphical monitoring tools that can be used to monitor the thermal status of your computers. I like GKrellM for my desktop. There are plenty of others available for you to choose from.

I suggest installing these tools and monitoring the outputs on every newly installed system. That way you can learn what temperatures are normal for your computers. Using a tool like glances allows you to monitor the temperatures in real time and understand how added loads of various types affect those temperatures. The other tools can be used to take snapshot looks at your computers.
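Learning what is normal can be as simple as writing down a handful of readings and comparing later ones against them. The Python sketch below shows one possible way to do that. The function names and the default margin of 10 degrees Celsius are assumptions made for illustration, not recommended values; choose a margin that suits your own hardware.

```python
from statistics import mean


def summarize_baseline(samples):
    """Return (minimum, average, maximum) for a list of Celsius readings."""
    if not samples:
        raise ValueError("need at least one sample")
    return min(samples), mean(samples), max(samples)


def is_unusually_hot(reading, baseline_samples, margin=10.0):
    """True when a reading exceeds the hottest baseline value by more than margin."""
    _, _, highest = summarize_baseline(baseline_samples)
    return reading > highest + margin
```

For example, with baseline readings of 40, 42, and 44 degrees taken on a new system, a later reading of 60 degrees would be flagged while a reading of 50 degrees would not.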

Taking Action

Doing something about high temperatures is pretty straightforward. It is usually a matter of replacing defective fans; installing newer, higher-capacity fans; and reducing the ambient temperature.

When building new computers, or refurbishing older ones, I always install additional case fans or replace existing ones with larger ones where possible. Maximum airflow is important to efficient cooling. In some extreme environments, such as for gamers, liquid cooling can replace air cooling; most of us don’t need to take it to that level.

I also typically replace the standard CPU cooling units with high capacity ones. At the very least, I replace the thermal compound between the CPU and the cooling radiator. I find that the thermal compound from the factory or computer store is not always evenly distributed over the surface of the CPU, which can leave some areas of the CPU with insufficient heat dissipation.

I have a large room over my attached garage that my wife and I use for our offices. Altogether I have ten running computers, two laser printers (in sleep mode most of the time), multiple external hard drive enclosures holding from one to four drives each, and six uninterruptible power supplies (UPS). These devices all generate significant amounts of heat.

Over the years I have had to deal with several window-mounted air-conditioning units to keep our home office at a reasonable temperature. A couple of years ago our HVAC unit died and it made sense to install a zoning system so that the upstairs office space would be cooled directly and the remaining cool air, being denser than any warm air downstairs, would flow down to the lower level. This works very well for me and it keeps me and the computers at a comfortable temperature.

It is also possible to test the efficacy of your cooling solutions. There are a number of options and the one I prefer also performs useful work.

I have BOINC (Berkeley Open Infrastructure for Network Computing) installed on many of my computers and I run SETI@home to do something productive with all of the otherwise wasted CPU cycles I own. It also provides a great test of my cooling solutions. There are also commercially available test suites that allow stress testing of memory, CPU, and I/O devices, which can be used to test cooling solutions as a side benefit.
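If neither BOINC nor a test suite is at hand, a crude CPU load can be generated with a few lines of Python. The sketch below is not BOINC and does no useful work; it simply keeps every core busy for a fixed time so that you can watch the temperatures climb in a monitoring tool. The function names burn and load_all_cores are hypothetical, and a busy loop like this may not heat a CPU as much as a real workload would.

```python
import multiprocessing
import time


def burn(seconds):
    """Keep one CPU core busy for roughly `seconds`; return the loop count."""
    deadline = time.monotonic() + seconds
    count = 0
    while time.monotonic() < deadline:
        count += 1
    return count


def load_all_cores(seconds):
    """Start one busy loop per CPU core and wait for all of them to finish."""
    workers = [
        multiprocessing.Process(target=burn, args=(seconds,))
        for _ in range(multiprocessing.cpu_count())
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return len(workers)
```

Calling load_all_cores(300) from a script, inside an "if __name__ == '__main__':" guard, would load the system for about five minutes. Watch the temperatures in another terminal while it runs, and stop the test if they approach the critical values your sensors report.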

So – Keep cool and compute on!