Monday, October 13, 2014

Bloody Turnips

“You can’t squeeze blood from a turnip.”  This old saying is a way of expressing that some things are so obviously impossible that they aren’t worth trying, that they are a waste of time.  But sometimes the problem isn’t that we’re trying to squeeze blood from a turnip - the problem is assuming that we’re looking at a turnip in the first place.

The other day I got “the call.”  “The call” usually comes late in the day, and frequently on a Friday.  It’s when someone has been working at a problem all day, or all week, realizes they are running out of time, and in a last-ditch effort at a resolution asks for a network admin to take a packet trace.  And I’m the person who frequently gets “the call.”

This time it was an application that picks up files from a server.  The application was locking up, and the people troubleshooting it explained that this is frequently a sign of a delay in picking up the files (the application was said to be super time-sensitive).  Server admins had found nothing wrong on the file server.  I was asked to see if anything was causing network-based latency, or if I could at least see something in the trace that might account for the issue.

I have to admit that I did not approach this problem with any enthusiasm.  I have a life.  I do not like getting called at 3:00PM to start a multi-hour troubleshooting session on something this vague.  But it’s part of the job, these were my customers, and apparently nobody else was making any headway (including the vendor of the application, who had been called in to work on it).

Now despite being pretty good with the sniffer - and sometimes enjoying the challenge - I know that it can be a hard way to get to the root of a problem, so I made an effort to do things the easier way.  I asked the usual questions - when did the problem start, did something change, could I get a more technically accurate description of the problem, etc.  I looked at the basics - located and checked for errors on the switch ports of the file server and application system and so forth.  And then, reluctantly, I fired up the sniffer and got started.

About an hour into the session, one of my teammates came up to watch, and he asked the obvious question - “Do you really think you’re going to find the problem by looking at the packet contents?”  He was, in essence, asking me if I was trying to squeeze blood from a turnip.  And honestly I did not know how to answer him.

It’s something I’ve thought about often over the years.  I am very interested in troubleshooting - the thought processes that go into it, the practice of it, the techniques that are used.  I think that the act of trying to reverse-engineer an application by staring at the sniffer until it feels like my head is bleeding is a really hard way to do things.  But while I have not come up with a lot of amazing answers to those questions, I have learned one thing:

I can’t solve a problem if I don’t try.

There are a lot of times it feels like I’m squeezing a turnip.  But the truth is I don’t know what I’m squeezing.  It’s like sticking my hand in a bag, grabbing something, and squeezing it.  Sometimes, after a long time, I get some blood out of it - in which case I find that it wasn’t a turnip.  And sometimes I get nothing but turnip guts.

So I just said to him - “I have no idea.”  And I kept on squeezing.

I’d like to conclude this post by telling you about the amazing discovery I made in the packet trace.  Unfortunately that didn’t happen.  What did happen is I was able to determine that when the application freezes up, it isn’t waiting for anything from the file server.  The application was getting a response that looked “complete” (for you packet monkeys, it had the PUSH flag set on the last packet of the response), the application system responded with an immediate ACK, and then sat there for a long time before doing anything else.  Then the application system sent a packet and things started up again.  I saw this happen multiple times during “freezing” episodes.
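For the packet monkeys, here is a rough sketch of how that pattern could be hunted for automatically instead of by eyeball.  It is not what I ran that day.  It assumes the trace has already been exported from the sniffer into simple records; the `Packet` fields, the `find_client_stalls` name, and both thresholds are made up for illustration, and the sample timestamps are invented rather than taken from the real trace.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Packet:
    # Hypothetical record format: one entry per packet, in capture order.
    time: float        # seconds since the start of the capture
    from_server: bool  # True: file server -> application system
    psh: bool          # TCP PUSH flag set
    payload_len: int   # TCP payload bytes (0 for a bare ACK)


def find_client_stalls(
    packets: List[Packet],
    min_gap: float = 1.0,         # assumed threshold for "a long time"
    max_ack_delay: float = 0.05,  # assumed threshold for "immediate" ACK
) -> List[Tuple[float, float]]:
    """Return (response_end_time, idle_seconds) for each stall found."""
    stalls = []
    for i in range(len(packets) - 2):
        last, ack, nxt = packets[i], packets[i + 1], packets[i + 2]
        # 1. Server data packet with PUSH set: the response looks complete.
        if not (last.from_server and last.psh and last.payload_len > 0):
            continue
        # 2. The application system answers with a prompt, empty ACK.
        if ack.from_server or ack.payload_len != 0:
            continue
        if ack.time - last.time > max_ack_delay:
            continue
        # 3. The next packet also comes from the application system,
        #    but only after a long silence.
        idle = nxt.time - ack.time
        if not nxt.from_server and idle >= min_gap:
            stalls.append((last.time, round(idle, 3)))
    return stalls


# Invented sample data: one 4.5 second stall, then a normal exchange.
sample = [
    Packet(0.000, False, True, 120),   # request
    Packet(0.002, True, False, 1460),  # response data
    Packet(0.003, True, True, 800),    # last response packet, PUSH set
    Packet(0.004, False, False, 0),    # immediate ACK
    Packet(4.504, False, True, 120),   # next request, after the stall
    Packet(4.506, True, True, 600),    # response, PUSH set
    Packet(4.507, False, False, 0),    # immediate ACK
    Packet(4.520, False, True, 120),   # next request, no stall
]

stalls = find_client_stalls(sample)
for end_time, idle in stalls:
    print(f"response complete at {end_time:.3f}s, client idle for {idle:.3f}s")
```

The point of keying on who speaks next is that a silence broken by the application system, right after it has ACKed a complete response, puts the wait on that side of the wire.  A silence broken by the file server would tell a different story.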

What does it mean?  Well, it means the problem isn’t a delay in getting information from the file server.  There could be a problem in the contents of the response, and being unfamiliar with the application itself I couldn’t speak to that.  Or there could be something happening on the application system causing it to freeze that has nothing to do with the network traffic.

This information didn’t solve the problem for the application folks.  It did get the file server admins off the hook, it pretty well proved the network infrastructure wasn’t the issue, and it gave the application admins and their vendor a push toward looking at their own system a little harder.  I hope it helped.

If there is a message here, it’s this - troubleshooting can be a painful, frustrating, and sometimes ultimately unrewarding process.  Problems can be really complicated, the tools can be hard to use, and the whole thing can just be a lot of work.  Even when you try your best you don’t always come up with a big win.  But if you don’t try, you don’t stand a chance.  I think a lot of people - including a lot of network people - think that problems can’t be solved with a sniffer, or maybe that they personally can’t solve them that way, so they don’t try.  All I can say is, I’ve done it often enough to know it’s not impossible.  Working a problem with a sniffer isn’t always fruitless.  So the moral of the story?

Keep squeezing.
