Monday, August 25, 2014

What's The Problem, Anyway?

The first step in troubleshooting a problem is knowing that you have one.  Hopefully you have some sort of monitoring system in place that can alert you to the existence of a problem in a timely manner.  Unfortunately this isn't always the case, and problems are reported to us by users, system or application administrators, or in the worst case by customers.

Once we know there is a problem, the second step is to get a clear description of the symptoms (which will hopefully lead us to an actual technical definition of the problem).  And herein lies one of the biggest headaches for a troubleshooter, because the reports we get are often vague, inaccurate or misleading.  An important skill for the troubleshooter is therefore the ability to extract accurate information from the people reporting the problem, to get detailed descriptions, and weed out what is just plain wrong.

There are various reasons why we can't simply trust early problem reporting, some of which have to do with exactly who is making the report.  In particular, getting people to concentrate on describing the symptoms rather than jumping to conclusions can be a real chore.  Here are some issues I see frequently with problem reporting:

  • End users tend to describe what they feel rather than what they see, and to generalize - a lot.  Descriptions such as "Everything is slow" are common.  Users who can't get to a specific web site sometimes report that "The Internet is down."
  • People who have experienced one kind of problem in the past sometimes think that every new problem is the same as the old one.  A recent example occurred at my office when the users of an externally hosted web application experienced extreme slowness and broken app sessions due to packet loss along the path to the external hosting site.  A couple of weeks after that was resolved, there were problems with a server hosting the application, and it was reported to my team that the network problem had come back, despite the fact that the symptoms were different (and that users were getting server-side error messages displayed onscreen).
  • There may be one or more "human layers" between the people with the problem and the people troubleshooting, and they can muddy the waters.  For example, many of our problems come to us by way of a helpdesk that takes problem calls.  The helpdesk staff perform a vital function, but inexperienced or untrained personnel may not ask the right questions, or they may add their own interpretation before passing along the report.
  • People sometimes report inaccurate information, and once it's been reported it may be hard to correct.  In the case described above where users were having trouble with performance of an external site, a manager who received initial complaints from his users concluded that only users of Windows XP and older versions of Internet Explorer were affected, and that users on Windows 7 and newer browsers were fine.  This incorrect information was the result of failing to gather enough data before calling the helpdesk - but it went into the ticket.  The problem got kicked around for several days by other areas before landing in my team's lap.  During that time the manager learned that his Windows 7 users were indeed affected, but that information never made it into the problem ticket, so our team started the troubleshooting process with inaccurate information.
  • People sometimes think they already know what the problem is, and try to lead the troubleshooter to a particular conclusion that may not be warranted.  A lot of the problems that come our way start out like this: "We need you to check the firewall, our app server can't get to the database server."  Of course it may be the firewall, but troubleshooters who allow themselves to be led this way often lose precious time following false trails.

It's hard to keep these issues from occurring entirely, but a good troubleshooter knows the importance of getting an accurate description of the symptoms.  Here are a few ways it's done:

  • Whenever possible, talk to the people experiencing the problem.  I know a lot of IT people who just HATE this - we like having the helpdesk act as a buffer between us and our users or customers (who may be in a foul mood by the time they call in a problem).  But the more layers there are between us and the people who are actually experiencing the issue, the harder it will be to make sure the right questions are asked.
  • Concentrate on the basics.  What is the user doing when the problem happens?  What application are they using?  What web site are they accessing?  What specific function within the application or site are they accessing?  What is it supposed to do that it isn't doing?  When did the problem start?  Did it used to work and now it doesn't, or is it something we've never seen work properly?  How many people are affected?  Is the user aware of a change - a new operating system or browser, or maybe a patch that got pushed out?  Did an application administrator push out a new code release?
  • Try to see the problem for yourself.  Can you try to run the same application under the same circumstances as the user?  Can you remotely access a workstation in the same place and do the same thing?  Can you shadow or monitor a user's session so you can see what they see?  If you can't see it yourself, can someone reproduce the problem and give you a description?  Can you get someone to take a screenshot, or send you an error message from an application screen or from a server or application log?  (On that note, it's best to avoid having people write down or retype an error message, as it may not be faithfully transmitted to you - a screenshot or an actual snippet from an error log is better.)
  • Try to recognize the difference between a problem description and a conclusion drawn by someone else about the nature of the problem - in other words, try not to be led.  If the problem report tells you what needs to be checked, this should be an immediate red flag.  Being led is especially difficult to avoid if you know the person making the report and you have some respect for their technical skills, but you need to think things through for yourself - which may mean getting the reporter to back up and walk you through the symptoms.  If they want to describe how they came to their conclusion, that's fine as long as you can resist the temptation to let them do your work for you.

This isn't meant to be rude or disrespectful, but remember this - problem reports can be wildly inaccurate or so vague as to be nearly useless.  An important part of troubleshooting is to get a clear, accurate description of the symptoms.  Without that, you're half-blind and may waste a lot of valuable time and effort on the wrong path.
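For anyone who likes a checklist, the "basics" questions above can be sketched as a small intake record.  This is only an illustration in Python, not a tool I'm describing from my own shop: the `ProblemReport` class, its field names, and the `missing_basics` helper are all made up for the example.  The one design point worth noticing is that the reporter's conclusion gets its own field, kept apart from the observed symptoms.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProblemReport:
    """Hypothetical intake record; every field name here is illustrative."""
    activity: str = ""                           # what the user was doing
    application: str = ""                        # application or web site involved
    function: str = ""                           # specific function being used
    expected: str = ""                           # what it is supposed to do
    observed: str = ""                           # what actually happens (symptoms)
    started: str = ""                            # when the problem started
    worked_before: Optional[bool] = None         # None means nobody has asked yet
    users_affected: Optional[int] = None         # None means nobody has asked yet
    recent_changes: Optional[List[str]] = None   # [] means asked, none known
    reporter_conclusion: str = ""                # someone's theory, not a symptom


# The questions that should be answered before troubleshooting starts.
# reporter_conclusion is deliberately left out: it is optional and unverified.
BASICS = (
    "activity", "application", "function", "expected", "observed",
    "started", "worked_before", "users_affected", "recent_changes",
)


def missing_basics(report: ProblemReport) -> List[str]:
    """Return the names of the basic questions that are still unanswered."""
    missing = []
    for name in BASICS:
        value = getattr(report, name)
        # Empty string or None means unanswered; False, 0 and [] are answers.
        if value is None or value == "":
            missing.append(name)
    return missing


# Example: a vague report that arrives with a conclusion already attached.
report = ProblemReport(
    application="externally hosted web application",
    observed="pages are extremely slow and sessions break",
    reporter_conclusion="only Windows XP users are affected",
)
print(missing_basics(report))
```

The list that comes back is simply the set of questions still to ask the people experiencing the problem; the conclusion field stays on the record but is never treated as a symptom.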

Thursday, August 21, 2014

It's The Network! (why network engineers get so much experience troubleshooting)

After I'd been on the job for a while I began to notice a disturbing trend - the network gets blamed for a lot of problems.  At first I thought it was something unique to my company and our IT staff, but I have learned that this is a common occurrence.  Every few months a vendor will come in trying to sell us some new-fangled network monitoring tool, and the opening pitch is always something like this:

"Are you tired of having to defend the network all the time?  With (insert product name here) you can instantly PROVE that it's not the network causing problems, and refocus your troubleshooting on finding the REAL cause!"

The fact that a market exists for such tools, and that pretty much every vendor chooses the same pitch to get network engineers to buy them, tells me that this problem is widespread.  It's very common for server and application administrators to blame the network when their systems aren't working the way they expect, and the same attitude is seen in management as well.

There are a number of reasons why this is so - I'll list and comment on some of them here, and later blog posts will explore them in further detail.

  1. The word "network" means something different to network engineers and to pretty much everyone else.  To a network engineer, the "network" is a collection of routers, switches, firewalls, and VPN devices.  When we're feeling generous it may also include other devices that can alter or affect traffic flow - security devices like IDS/IPS, load-balancers, etc.  But to many people the "network" is defined as "everything other than the system I'm responsible for."  This means that if a server or application administrator is having a problem and they don't see something wrong with their own systems (and frankly, they may not know how to look), they are going to toss it over the wall to the network team.
  2. Some parts of the network are designed to block traffic.  It's true - the very definition of a "firewall" is a system that blocks everything by default, and only allows traffic by explicit exceptions.  And intrusion prevention systems can interfere with traffic that fits (or fails to fit) a particular profile.  Which leads us to this little gem:
  3. Sometimes, it really IS the network.  No part of a large IT system is immune to problems.  Firewalls may be blocking traffic if a network engineer has failed to correctly configure a necessary exception (or if the application owner has failed to request it).  An IPS can mis-identify traffic as malicious.  Switch ports, line cards, and routers can have hardware problems, and as with any software, the operating code on these systems can be buggy.  And that leads to yet another item:
  4. It was the network last time, so it's the network this time.  I call this The Problem of Experience.  If a server or application admin has ever been the victim of a missing or misconfigured firewall rule or a bad IPS signature or a flaky switch port, the next time they have any sort of problem they're more likely to conclude that the network is causing it THIS time, too.
  5. The network team has unique powers of observation.  In addition to our ability to look at our own systems - our switches, routers, firewalls, VPN devices - we network engineers can also look at traffic.  We are usually the folks who own and operate the packet capture and analysis devices - which makes a certain kind of sense given that we have to configure the network to copy traffic to them.  Even when someone is kind enough not to actually blame the network, they often come straight to the network team for a "sniff" (and some expert assistance with the analysis) as a shortcut to resolving their issues.
  6. We're good at troubleshooting.  I addressed this briefly in my introductory post on this blog, but it comes down to one of those self-reinforcing cycles.  We get lots of problems, so we develop skill at solving them, first checking our own systems and then tackling other people's issues.  Then, because we're good at it, we get asked to do it some more, so we get better at it...you see where this goes, right?
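To make items 2 and 3 concrete, here is a minimal sketch in Python of the "block everything by default, allow by explicit exception" idea.  It is an illustration only, not how any particular vendor's firewall evaluates rules: the `AllowRule` class, the `is_allowed` function, and the addresses and ports are all invented for the example, and a real firewall also matches on protocol, connection state, zones, and more.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AllowRule:
    """Hypothetical explicit exception: source network, destination network, port."""
    source: str        # CIDR notation, e.g. "10.0.1.0/24"
    destination: str   # CIDR notation, e.g. "10.0.2.20/32"
    port: int          # destination port


def is_allowed(rules: Iterable[AllowRule], src: str, dst: str, port: int) -> bool:
    """Return True only if some rule explicitly permits this traffic."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    for rule in rules:
        if (src_ip in ipaddress.ip_network(rule.source)
                and dst_ip in ipaddress.ip_network(rule.destination)
                and port == rule.port):
            return True
    # No rule matched: default deny.
    return False


# Made-up example: app servers may reach one database server on one port.
rules = [AllowRule("10.0.1.0/24", "10.0.2.20/32", 5432)]

print(is_allowed(rules, "10.0.1.10", "10.0.2.20", 5432))  # exception exists
print(is_allowed(rules, "10.0.1.10", "10.0.2.20", 3306))  # no exception for this port
```

The second call is the "missing firewall rule" case from item 3: nothing is broken, but nobody configured (or requested) the exception, so the traffic falls through to the default deny.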

So if you're a network engineer wondering if it's just you, or just your company or your admins or your users...the answer is "No."  We get the same thing everywhere - if there's a problem, someone is bound to blame the network.  If you're lucky you will survive long enough to develop some skill at solving problems, and if you're really lucky you will eventually convince the people around you that it's not always the network.  But don't hold your breath waiting for them to stop asking for help.

Welcome!

Welcome!  You have found the blog home of James V. Fields (that's me, on the left there, looking just the right amount of suave and geeky all at once).  I'm a network engineer for a semi-large company, but I spend a lot of time - a HUGE amount of it, actually - troubleshooting.

We have thousands of users, thousands of desktop computers and servers, more than a thousand pieces of network gear, and lots of network connections, public and private.  We do a lot of in-house software development, as well as using plenty of off-the-shelf stuff. 

When things go wrong I'm one of the people they call to figure out the what and the why.  And not just network issues - I end up troubleshooting server configurations, application problems - you name it, I've probably had to work on it.

There are various reasons why this is the case, not the least of which is that I'm fairly good at it.  It also helps that I don't shrug off problems or try to get people to leave me alone - I have a strong sense that regardless of where a problem lies, if I have the ability to help, then it's my duty to pitch in.  I'm not really sure if I get called so much because I'm good at the work, or if I'm good because I get a lot of calls - I guess it's a little of both. 

It has occurred to me from time to time that there are themes to the work - issues that crop up again and again.  Some of these are technical issues, while others have more to do with the psychology of the whole thing, the approach to troubleshooting (or lack thereof, as is often the case).  I turn this stuff over and over in my head.  Sometimes my brain just gets tired of wrestling with it all, but sometimes I snag a little piece of "truth" about how we deal with complex problems.  Thus the blog - this is a place I can record my observations and revelations, and maybe try to put a little structure to them.

I'm not a professional blogger or writer, and it's hard to say where this will go.  I don't know if anyone will see it, but if you make it here, feel free to comment or drop me a line - I'd love to hear from you.