Thursday, February 26, 2015

James's Rules of Troubleshooting

While digging through some old documents this afternoon, I came across something I had written about 5 or 6 years ago.  It's a pretty fair summary of some things I have learned about troubleshooting.  I've made a few minor edits, but for the most part it's just as I wrote it back then.

James’s Rules of Troubleshooting

Before (things to do/know prior to a problem – when something breaks it’s too late to start working on these):

  • Know your stuff.  Be an expert on the technology for which you’re responsible.  “SME” means Subject Matter Expert – be one.  It’s MUCH easier to spot what’s wrong if you know what “right” looks like.
  • Know where to find your product manuals, configurations, and logs.  Know what’s in the logs. Know how to read them.  Make sure logs are tuned to show the right amount of data, kept for a sufficient period of time to be useful, and are time-synchronized with everything else in the network.  Keep frequent backups of configurations / changes.
  • Have the tools you need.  Have them installed.  Have them up-to-date.  Know how to use them. Try not to get tied to a single tool, no matter how good it is (when all you have is a hammer, everything looks like a nail).  Know more than one way to skin the cat.
  • Work on your people skills.  You got into I.T. to avoid dealing with people?  Hopefully I’m not the first person explaining to you that this is not possible.  The network, computers, and software exist to serve people (commonly known as “users”) and you will need to be able to deal with them.  This includes people on other teams and from other technical disciplines. Network people need to be able to talk to server people, Unix people need to be able to talk to Windows people, etc.  Don’t let personal issues fester – they’ll get in the way at the worst times.
  • Work on your communication skills.  Be able to speak and write clearly.  Clarity and accuracy are very important.  To the degree that it is possible, clarity is achieved by being as simple as you can be while retaining accuracy.  Get comfortable standing in front of small (5 to 10 people) and medium-sized (10 to 25 people) groups and explaining how your technology works, how it ties in with the rest of the system, etc.  Be good at drawing diagrams.  Have a diagram of what your stuff looks like before you ever need it.

During (what to do while working a problem):

  • Get a clear description of the problem.  This is often harder than it sounds.  “Users”  can talk in vague terms – “Everything is slow.” “The network is broken.”  You’ll need to elicit the right information through direct questions.  “Exactly what were you doing when the problem occurred?  Were there any error messages displayed on screen?  Were other applications affected?”  You may have to repeat questions multiple times in order to get the user to answer what you’re asking.
  • Oddly enough, it can be even harder to get a clear story from a technician – they may be giving an edited version of events based on their own bias (they think their stuff can’t be broken, or that they know where the problem lies).  They will also likely want to tell you all the steps that were tried before you were called.  Unless they kept very good records of what was done, in what order, and the results of every test, that information is likely to be less than helpful.  If you don’t feel you can completely rely on the source, no matter how good your relationship, it is best to do your own investigation – “See for yourself!”
  • If this is a problem with something that has been previously working and is now broken, find out what changed (if anything).  Software updated (on server or workstations or network infrastructure)?  Hardware changed?  Don’t be too quick to dismiss something that you don’t think is related – the installation of new DNS servers really can be the cause of slow logins to Unix boxes.  Use your company’s change control record-keeping system to research what changed.
  • Check your stuff first.  When you are asked to join in a troubleshooting effort, make sure your components are not misconfigured or broken.  When someone asks you to take a look, your response should not be “My part can't be broken…” – rather, it should be “I’ll go check that out and get right back to you.”  If you are prepared (see the “Before” section) this should take very little effort.
  • If the problem turns out to be in your area, you need to fix it, but you also need to report honestly and accurately to the team or leadership.  I won’t tell you which one to do first – that depends on what’s broken, the rules at your company, etc.  But you should make that report a priority. Honesty can be difficult when the problem reveals a personal error.  All I can do is urge you to suck it up and do the right thing – it’s worked remarkably well for me over a long period of time.  People understand that if you’re not making the occasional mistake, you probably aren’t working.  They will respect honesty (as long as you’re learning from your mistakes and not making the same ones over and over).
  • If the problem is not in your stuff, offer to help other areas however you can.  If you got called in because you have a reputation as a great troubleshooter, this will already be understood.  But even if that’s not the case, you may have valuable insight to offer – or maybe just a fresh set of eyes (and brains). If you can’t help directly, you can still learn a lot by watching the process and observing the resolution.
  • Try to recreate the problem.  If you can’t test things in the production environment, try to set up a test-bed.
  • Compare working and non-working configurations.  Got two servers that are supposed to be doing the same thing, but one isn’t working?  Find out how they are different!
  • Be persistent.  Don’t give up.  Stick with the job.  
  • Know when to get help.  Some I.T. people are loath to open problem records with a vendor (including yours truly).  But if a production system is down and the bottom line is hurting, bring the vendor in earlier rather than later.  There may be a known issue that they can recognize quickly.
  • Take breaks.  During a protracted issue, you’ll want to rest your brain on occasion.  Failing to do so can cause you to overlook otherwise obvious problems – when you look at something for too long, it starts to look normal even if it’s broken!  Get up and walk.  Drink water.  Don’t forget to eat.
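
As an illustration of the "compare working and non-working configurations" rule above, here is a minimal sketch in Python using the standard library's difflib module.  The two configuration snippets, their keys, and the helper name changed_lines are made up for the example; in practice you would load the real files from the two servers.  This is one possible approach, not the only way to do the comparison.

```python
import difflib


def changed_lines(working: str, broken: str) -> list[str]:
    """Return only the lines that differ between two configuration texts.

    Lines prefixed with "-" exist only in the working config;
    lines prefixed with "+" exist only in the broken one.
    """
    diff = list(
        difflib.unified_diff(
            working.splitlines(),
            broken.splitlines(),
            fromfile="working",
            tofile="broken",
            lineterm="",
        )
    )
    # The first two lines of a unified diff are the "---" / "+++" file
    # headers; skip them, then keep only the added and removed lines.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Hypothetical configuration contents, invented for this example.
WORKING = """hostname=app01
dns=10.0.0.53
ntp=10.0.0.123
"""

BROKEN = """hostname=app02
dns=10.0.0.99
ntp=10.0.0.123
"""

if __name__ == "__main__":
    for line in changed_lines(WORKING, BROKEN):
        print(line)
```

Some differences are expected (the hostname, in this made-up case); the ones you did not expect (the DNS server here) are where to look first.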

After – because eventually, the problem will be solved…

  • FIX the problem.  If workarounds were applied, remove them.  Patch software.  Reconfigure equipment.  Whatever was broken, make it whole again.  “Bandaids” that are applied to get through an initial rough spot should not be considered a complete fix.
  • Bring the system back to “standard”.  Don’t settle for a one-off solution that no one will remember exists in a week.  If there is something wrong with the standard, FIX THE STANDARD.
  • Document what you did.  The standard documentation for the technology in question should be updated to reflect new configurations, new software versions, etc.  Diagrams should be updated.  The problem isn’t fixed until this is done.
  • Be able to express in clear language to your superiors and peers what happened, the steps taken to fix it, etc.  Make sure you understand what happened, in technically accurate terms.
  • Review your performance (and that of your peers) during the troubleshooting process.  Did you discover systems that aren’t time synchronized?  Fix it! Did you discover you really DON’T know how to use that nifty tool?  Practice!  Did you take too long to come to the right conclusion, perhaps overlooking data that was obvious in hindsight?  Review the process you followed and try to understand how you might have gotten to the correct solution more quickly. The after-action "lessons learned" session is a valuable opportunity, not to be wasted.
