Troubleshooting Example

Last Updated on 12/19/2008 by dboth

One example of solving a problem from my own experience occurred recently in my role as a part-time Linux System Administrator. It is fairly simple, but it illustrates the flow of the steps I have outlined.

I received an email from one of our testers reporting that an application he had installed as part of a test was crashing with error messages indicating that it was out of swap space. This was the initial Observation, made by the user and passed on to me.

My Knowledge told me that the system being used to test this application had 16GB of RAM and 2GB of swap space. Previous experience (Knowledge) told me that swap space on these computers is almost never touched and that RAM usage is typically far below 25% of the 16GB installed.

At this point I Deduced that the problem was probably not a shortage of swap space, as that seemed highly improbable. I still held that possibility open, though only very slightly. You will find that many error messages provided by programs can be quite misleading, and user observations can be even more so.

I made some Observations of my own. I logged into the box and used the free command to view memory and swap space usage. There was plenty of free RAM, and swap space usage was at zero. I Know that if swap space usage is actually zero, then it is very likely that none of the available swap space has ever been allocated and no paging has occurred since the last boot.
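The same observation can be scripted. What follows is only a minimal sketch, assuming a Linux system where /proc/meminfo reports SwapTotal and SwapFree in kB. The function names and the sample figures are illustrative and were not taken from the system described here.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        # Keep only lines of the form "Name:   <integer> [kB]".
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def swap_used_kb(meminfo):
    """Swap in use is the total swap minus the swap that is still free."""
    return meminfo["SwapTotal"] - meminfo["SwapFree"]


# Hypothetical sample resembling a box with 16GB of RAM and 2GB of swap.
SAMPLE = """MemTotal:       16777216 kB
MemFree:        13000000 kB
SwapTotal:       2097152 kB
SwapFree:        2097152 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:  # present on Linux only
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the illustrative sample
    print("swap used (kB):", swap_used_kb(parse_meminfo(text)))
```

A result of zero matches what free showed in this case: the swap space was not in use, so the error message had to be pointing at something else.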

I also Deduced from previous experience (Knowledge) that there might be a kernel of truth in that error message: the application was very likely out of some resource or other. The other primary consumable resources are CPU cycles and disk space.

This did not seem like a CPU problem, so I Observed disk space using the df command, which showed that the /var filesystem was full. I Deduced that the full filesystem was the cause of the problem.
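A check like the one made with df can be approximated in a few lines of Python. This is a sketch under stated assumptions: the 95 percent threshold, the list of mount points, and the function names are my own choices for illustration, and the percentage computed here can differ slightly from the figure df prints because df accounts for reserved blocks differently.

```python
import shutil


def percent_used(usage):
    """Return the used percentage for an object with .total and .used."""
    return 100.0 * usage.used / usage.total if usage.total else 0.0


def full_filesystems(paths, threshold=95.0, usage_fn=shutil.disk_usage):
    """Return the paths whose usage is at or above threshold percent."""
    full = []
    for path in paths:
        try:
            usage = usage_fn(path)
        except OSError:
            continue  # path does not exist on this system; skip it
        if percent_used(usage) >= threshold:
            full.append(path)
    return full


if __name__ == "__main__":
    # Hypothetical list of mount points worth checking.
    print(full_filesystems(["/", "/var", "/opt", "/tmp"]))
```

The usage_fn parameter exists only so that the check can be exercised with made-up numbers instead of a real disk.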

All of our systems are kickstarted with a /var filesystem of 1.5GB. Our policy is to install application programs in /opt, which is where the ones we test are designed to be installed. The /opt filesystem is configured to take all remaining disk space, so it can easily be 100GB or more in size.

I discussed this with the tester and was told that he had indeed installed the application in /var. I told him to uninstall it from there and install it in /opt, where it belonged. After he took this Action, I had him Test the correction by performing the operation that had previously failed. The test was successful and the problem was solved.

As you work through a problem, it will be necessary to loop back through at least some of the steps. If, for example, a given corrective action does not resolve the problem, you may need to try another action that has been known to resolve it in the past. Or you may need to go back to the observation step and gather more information about the problem.
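The looping described here can be sketched as a small function. This is only an illustration of the idea, not a real tool: the troubleshoot function, the simulated state, and the two candidate actions are all hypothetical.

```python
def troubleshoot(candidate_actions, test):
    """Apply candidate fixes in order until test() reports success.

    candidate_actions is a list of (description, action) pairs, where
    action is a callable that attempts one fix. test is a callable that
    returns True once the original failure no longer occurs.
    Returns the description of the fix that worked, or None, which means
    it is time to go back and gather more observations.
    """
    for description, action in candidate_actions:
        action()
        if test():
            return description
    return None


if __name__ == "__main__":
    # Simulated system state standing in for the real machine.
    state = {"app_location": "/var"}

    def add_swap():
        pass  # simulated: does nothing about the real cause

    def reinstall_in_opt():
        state["app_location"] = "/opt"

    def app_runs():
        return state["app_location"] == "/opt"

    fixed_by = troubleshoot(
        [("add swap", add_swap), ("reinstall in /opt", reinstall_in_opt)],
        app_runs,
    )
    print(fixed_by)  # prints "reinstall in /opt"
```

A return value of None corresponds to the last case above: none of the known actions worked, so more observation is needed before trying anything else.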
