Once we know there is a problem, the second step is to get a clear description of the symptoms (which will hopefully lead us to an actual technical definition of the problem). And herein lies one of the biggest headaches for a troubleshooter, because the reports we get are often vague, inaccurate or misleading. An important skill for the troubleshooter is therefore the ability to extract accurate information from the people reporting the problem, to get detailed descriptions, and weed out what is just plain wrong.
There are various reasons why we can't simply trust early problem reporting, some of which have to do with exactly who is making the report. In particular, getting people to concentrate on describing the symptoms rather than jumping to conclusions can be a real chore. Here are some common issues I see with problem reporting:
- End users tend to describe what they feel rather than what they see, and to generalize - a lot. Descriptions such as "Everything is slow" are common. Users who can't get to a specific web site sometimes report that "The Internet is down."
- People who have experienced one kind of problem in the past sometimes think that every new problem is the same as the old one. A recent example occurred at my office when the users of an externally hosted web application experienced extreme slowness and broken app sessions due to packet loss along the path to the external hosting site. A couple of weeks after that was resolved, there were problems with a server hosting the application, and it was reported to my team that the network problem had come back, despite the fact that the symptoms were different (and that users were getting server-side error messages displayed onscreen).
- There may be one or more "human layers" between the people with the problem and the people troubleshooting, and they can muddy the waters. For example, many of our problems come to us by way of a helpdesk that takes problem calls. The helpdesk provides a vital function, but inexperienced or untrained personnel may not ask the right questions, or they may add their own interpretation before passing along the report.
- People sometimes report inaccurate information, and once it's been reported it may be hard to correct. In the case described above where users were having trouble with performance of an external site, a manager who received initial complaints from his users concluded that only users of Windows XP and older versions of Internet Explorer were affected, and that users on Windows 7 and newer browsers were fine. This incorrect information was the result of failing to gather enough data before calling the helpdesk - but it went into the ticket. The problem got kicked around for several days by other areas before landing in my team's lap. Although the manager had learned during that time that his Windows 7 users were indeed affected, that information never made it into the problem ticket, so our team started the troubleshooting process with inaccurate information.
- People sometimes think they already know what the problem is, and try to lead the troubleshooter to a particular conclusion that may not be warranted. A lot of the problems that come our way start out like this: "We need you to check the firewall, our app server can't get to the database server." Of course it may be the firewall, but troubleshooters who allow themselves to be led this way often lose precious time following false trails.
Here are some suggestions for getting past these issues:
- Whenever possible, talk to the people experiencing the problem. I know a lot of IT people who just HATE this - we like having the helpdesk act as a buffer between us and our users or customers (who may be in a foul mood by the time they call in a problem). But the more layers there are between us and the people who are actually experiencing the issue, the harder it will be to make sure the right questions are asked.
- Concentrate on the basics. What is the user doing when the problem happens? What application are they using? What web site are they accessing? What specific function within the application or site are they accessing? What is it supposed to do that it isn't doing? When did the problem start? Did it use to work and now it doesn't, or is it something we've never seen work properly? How many people are affected? Is the user aware of a change - a new operating system or browser, or maybe a patch that got pushed out? Did an application administrator push out a new code release?
- Try to see the problem for yourself. Can you try to run the same application under the same circumstances as the user? Can you remotely access a workstation in the same place and do the same thing? Can you shadow or monitor a user's session so you can see what they see? If you can't see it yourself, can someone reproduce the problem and give you a description? Can you get someone to take a screenshot, or send you an error message from an application screen or from a server or application log? (On that note, it's best if you can avoid having people write down or retype an error message, as it may not be faithfully transmitted to you - a screenshot or an actual snippet from an error log is better.)
- Try to recognize the difference between a problem description and a conclusion drawn by someone else about the nature of the problem - in other words, try not to be led. If the problem report tells you what needs to be checked, this should be an immediate red flag. It's especially difficult to avoid being led if you know the person making the report and you have some respect for their technical skills, but you need to think things through for yourself - which may mean getting the reporter to back up and walk you through the symptoms. If they want to describe how they came to their conclusion, that's fine as long as you can resist the temptation to let them do your work for you.
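One way to make the questions above harder to skip is to capture them in a structured intake record. What follows is only a minimal sketch in Python, not a description of any particular ticketing system: the `ProblemReport` class, its field names, and the `missing_basics` helper are all hypothetical names made up for illustration. The idea is simply that observed symptoms and the reporter's own conclusion live in separate fields, and that unanswered basics are easy to list.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class ProblemReport:
    """Hypothetical intake record; all field names are illustrative."""

    # The basics: observed facts, ideally in the user's own words.
    what_user_was_doing: Optional[str] = None
    application_or_site: Optional[str] = None
    function_affected: Optional[str] = None
    expected_behavior: Optional[str] = None
    observed_behavior: Optional[str] = None
    when_it_started: Optional[str] = None
    worked_before: Optional[bool] = None
    users_affected: Optional[int] = None
    recent_changes: Optional[str] = None

    # Screenshots or log snippets, rather than retyped error messages.
    evidence: List[str] = field(default_factory=list)

    # The reporter's theory, kept apart from the symptoms so it
    # cannot quietly become the problem definition.
    reporter_conclusion: Optional[str] = None

    def missing_basics(self) -> List[str]:
        """Return the names of basic questions that are still unanswered."""
        not_basics = {"evidence", "reporter_conclusion"}
        return [
            f.name
            for f in fields(self)
            if f.name not in not_basics and getattr(self, f.name) is None
        ]


# The firewall request quoted earlier, split into symptom and conclusion.
report = ProblemReport(
    observed_behavior="app server can't get to the database server",
    reporter_conclusion="check the firewall",
)

# Everything still left to ask before troubleshooting starts.
print(report.missing_basics())
```

Filed this way, the firewall request shows one recorded symptom, one untested theory, and a list of basics nobody has asked about yet.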