James V. Fields: Troubleshooting


Tuesday, February 18, 2020

Anytone DMR - Resolving Type Mismatch Errors

Note, the following post references the Anytone AT-D868UV handheld radio, but I imagine the information applies to the AT-D878UV and other similar models from Anytone.

-----

tl;dr - The virtual com ports created by the com0com software package are incompatible with Anytone QXCodePro and D868UV (used for updating and configuring Anytone DMR radios), causing those programs to crash when they try to start up and enumerate the com ports in the system. The error presented to the user is "Run-time error '13': Type mismatch". The fix is to (temporarily) uninstall com0com prior to using the Anytone programs.

-----

I have an Anytone DMR handheld radio, model AT-D868UV. Anytone makes several models of handheld and mobile radios and their DMR radios have become quite popular in the ham radio community, due to a combination of low price and extensive feature set. As with all DMR radios, the Anytone radios have to be connected to a computer for programming, both for updating the firmware to fix bugs and enable features, and for configuring the radio itself including features and frequencies.

There are two software packages for Anytone radios that must be installed to do these tasks - a firmware update tool called QXCodePro, and a program for creating and maintaining the "codeplug" (just a fancy name for "the configuration file for the radio") called D868UV. The Anytone radio connects to a PC using a custom USB cable. Special driver software allows this to be seen by the computer as a "com" port, like an old-fashioned serial port. Both the QXCodePro and D868UV programs connect to the radio through this serial com port to read and write firmware and configurations.

The software is updated on a somewhat irregular basis - downloaded either directly from Anytone or from a radio vendor.

Recently after not using my DMR radio for some time I decided to pick it back up and work with it. My first goal was to download the newest software package which would include new versions of QXCodePro, D868UV, and a firmware file. After installing the new program executables, I tried firing up QXCodePro and ran into this error - "Run-time error '13': Type mismatch".

Trying to load the D868UV software which manages codeplugs gave the same error. I tried uninstalling the software and reinstalling, no luck. I tried installing an older version, didn't work. I did some Googling around for the error message - it is real common with some Visual Basic stuff plugged into Excel spreadsheets, none of which helped me. After trying everything under the sun that I could think up, I gave in and posted questions all over a variety of Reddit and Facebook groups.

As sometimes happens, after I posted all those queries, I found my answer buried deep in the comments on a Facebook post. I'm sharing it here for others who may have the same issue.

In addition to my Anytone radio I have a Software Defined Radio (SDR) - an RSP1A from SDRPlay. I use a piece of software called CSV User Browser to import shortwave schedule and frequency lists, and one of the things CSVUB can do is control the SDR. It works by sending commands to the SDRUno control software that comes with the radio. That communication is accomplished through... wait for it... com ports. I have installed a program called "com0com" which creates two virtual com ports which are connected to one another. I program CSVUB to grab one of the com ports, and SDRUno to grab the other, and the virtual com port pair pass messages between the two pieces of software.

The issue seems to be some characteristic of these virtual com ports created by com0com. Remember that BOTH the QXCodePro and D868UV programs start by enumerating the com ports in the system upon startup. Apparently there's something about these virtual com ports created by com0com that the Anytone programs can't handle - thus the "type mismatch" error.
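I never established exactly why the enumeration fails. One plausible mechanism, offered purely as an assumption, is that the programs expect every port name to look like "COM" followed by a number and convert the trailing text to a number. com0com's default pair names (CNCA0 and CNCB0) would not fit that pattern if they were left unchanged. The Python sketch below illustrates that kind of failure and a more tolerant alternative. The function names are hypothetical and this is not Anytone's code.

```python
import re


def naive_port_number(name: str) -> int:
    """Assume the name is 'COM' + digits and convert the remainder to int."""
    # Raises ValueError for a name such as 'CNCA0' (remainder is 'A0').
    return int(name[3:])


def tolerant_port_number(name: str):
    """Return the number for 'COMn' style names, or None for anything else."""
    match = re.fullmatch(r"COM(\d+)", name, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


def usable_ports(names):
    """Keep only the names a COMn-only program could handle; skip the rest."""
    return [n for n in names if tolerant_port_number(n) is not None]
```

If this guess is right, giving the com0com pair ordinary COMn names might avoid the crash, but I have not verified that.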

The only solution I have found is to uninstall com0com temporarily when I need to use the Anytone programs, then reinstall it when I'm done messing with the DMR firmware or configuration changes. It's sort of a pain in the butt, but it doesn't take terribly long. There may be a way to configure the com0com ports so they don't trigger the type-mismatch in the Anytone programs, but I haven't found one. I am toying with opening a bug report to Anytone, but they're a Chinese company and I don't think they are going to be very responsive.

73 -
James

Wednesday, June 24, 2015

Printing Problems Redux

Some time ago I wrote a rather lengthy post about an old case where I had to troubleshoot a difficult problem with print jobs failing - The Case of the Silence on the Wire. Recently I have had to look at a problem that carried some of the same baggage. An external customer is printing to a print server at our print facility, over a VPN, and seeing some issues.

This is an LPD/LPR setup, with our server listening on TCP port 515 and the customer's system using the standard client side ports 721-731 (see https://www.ietf.org/rfc/rfc1179.txt). When the issue was reported a few weeks ago I didn't really see anything I could put my finger on. Our server response time (as calculated by my analysis software) was a little slow, and there were a few retransmissions from the client, but overall the connections looked healthy enough - TCP three-way handshakes looked ok, data being transferred with our server acknowledging, proper FIN exchanges at the end. At least that's what I saw the first two or three times we were contacted to check it out.

Today I was asked to take another look - the problem was reported to have occurred between 7:00AM and 9:00AM on June 23rd. I grabbed a capture off our sniffers and took a look. The sniffer is capturing everything both on the inside interface of our firewall and on the interface that goes to our private extranet connections, including VPNs, so I had a sort of "double trace", with one copy of the traffic showing the connection to the real internal IP address of our server, and the other showing the connection to the external NAT address.

Looking at a list of connections that took place during the time in question, I saw a bunch of connections that looked more or less like what I described above, but today I noticed something different - there was a connection listed that looked really small on the packet count. This was sourced from client-side port 722. I filtered on the trace and saw incoming SYN packets, but our server wasn't responding with SYN/ACKs - it was responding with plain ACK packets, and the ACK numbers weren't correct for the incoming SYNs.

Now, first guesses aren't always right and you must take care to check things thoroughly. On the other hand, when you've been working with a particular system as long as I've been working with TCP communications, you can sometimes get pretty close to the mark. In this case, I wondered whether the server was responding to an old connection - some previous connection sourced from port 722 that didn't close properly, so the server was still trying to reply using ACK numbers for that old connection.

I began working my way backwards through what our sniffer had captured - all the way back to about 6:35AM - and every time they tried connecting from port 722, we were just sending these ACKs. In all cases the ACK numbers we were sending back were the same (at least on the packets taken from the inside interface of the firewall), and none of them were anywhere close to being correct for the SYN packets. Earlier than that there were no connections on June 23rd.
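The check I was doing by eye can be written down as a small sketch. A healthy reply to a SYN is a SYN/ACK whose acknowledgment number is the SYN's sequence number plus one; a plain ACK carrying some unrelated number suggests the other end still believes an older connection exists. The Segment class and the labels returned here are hypothetical names for illustration, not output from any sniffer.

```python
from dataclasses import dataclass

# Standard TCP flag bit values.
SYN = 0x02
RST = 0x04
ACK = 0x10


@dataclass
class Segment:
    flags: int
    seq: int
    ack: int = 0


def classify_reply(syn: Segment, reply: Segment) -> str:
    """Classify the first reply seen to a client SYN."""
    expected_ack = (syn.seq + 1) % 2**32  # sequence numbers wrap at 32 bits
    if reply.flags & RST:
        return "reset"
    if reply.flags & SYN and reply.flags & ACK:
        if reply.ack == expected_ack:
            return "handshake"
        return "syn-ack-wrong-number"
    if reply.flags & ACK and reply.ack != expected_ack:
        # Plain ACK that does not acknowledge this SYN: likely left-over state.
        return "stale-connection-ack"
    return "other"
```

Every SYN from port 722 that morning would have landed in the "stale-connection-ack" bucket.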

I shifted focus to my Netflow tool. Hunting for connections out of the vast number captured on a big capture box being fed by multiple taps can be really difficult, but Netflow boils down connection information to the essentials. My Netflow records indicated that communications from this client had ceased a little after 10:00PM on the night of the 22nd, and further that there had been a connection from client port 722 that had enough packets in it to have been viable. With that information I delved back into the sniffer to find that connection.
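To show what "boils down to the essentials" means, here is a minimal sketch of flow-style summarization: many packets collapse into one record per connection with first and last timestamps plus packet and byte counts. Real Netflow records carry more fields (protocol, interfaces and so on) and are kept per direction; the tuple layout and addresses below are made up for illustration.

```python
def summarize_flows(packets):
    """Collapse packets into one summary record per (src, sport, dst, dport).

    packets: iterable of (timestamp, src_ip, src_port, dst_ip, dst_port, length)
    """
    flows = {}
    for ts, src, sport, dst, dport, length in packets:
        key = (src, sport, dst, dport)
        record = flows.setdefault(
            key, {"first": ts, "last": ts, "packets": 0, "bytes": 0}
        )
        record["first"] = min(record["first"], ts)
        record["last"] = max(record["last"], ts)
        record["packets"] += 1
        record["bytes"] += length
    return flows
```

A flow from client port 722 with a healthy packet count, ending the previous evening, is the sort of record that told me where to dig in the sniffer.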

It was exactly what I was looking for. The connection had started off fine, with a good three-way handshake, and for a time had proceeded normally - client sending data, server sending ACKs. At the end, the client sent a final packet with data in it and with the PUSH flag set, indicating the server should process it and acknowledge, which it did. I verified that this last ACK got through our firewall, as it appeared in the trace taken on the outside. After that the client didn't send anything further for about 25 seconds, no further data and no FINs - after which, it sent a new SYN packet from client port 722.

This packet did not get through the firewall - the firewall was monitoring state and still thought the old connection was open (the firewall has an idle timer of 1 hour for TCP connections in the state table). The firewall - apparently - sent an ACK using the last valid sequence numbering from the previous connection. The client resent the SYN several times, and the firewall sent those ACKs each time, never letting the SYN through to our server. After a while the client gave up and moved on to a new port number.

The difference between this and what we were seeing in the morning was this - by the time the client got around to starting up again on the 23rd, the firewall had forgotten about the old connection, removing it from the state table, so it was now allowing the SYN packets through to the server. But the server still had the old connection open (more than 8 hours later!!!) and was still sending ACKs for the sequence numbering of the old connection from the night before.
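The two behaviors can be modeled with a toy state table. This is only a sketch of the idea, assuming a simple idle timer and assuming that a blocked SYN does not refresh the timer; it is not how this firewall is actually implemented, and the class and method names are hypothetical.

```python
class StateTable:
    """Toy connection table with an idle timeout (1 hour, as in the post)."""

    def __init__(self, idle_timeout=3600):
        self.idle_timeout = idle_timeout
        self.last_seen = {}  # connection 4-tuple -> time of last traffic

    def _is_live(self, conn, now):
        seen = self.last_seen.get(conn)
        return seen is not None and now - seen < self.idle_timeout

    def established_traffic(self, conn, now):
        """Record normal traffic on an existing connection."""
        self.last_seen[conn] = now

    def new_syn(self, conn, now):
        """Decide what happens to a fresh SYN on this 4-tuple."""
        if self._is_live(conn, now):
            # Old connection still in the table: the SYN is not passed on.
            return "blocked"
        # Entry missing or timed out: treat as a new connection.
        self.last_seen[conn] = now
        return "forwarded"
```

With the last data packet at time 0, a SYN 25 seconds later is blocked, while one arriving more than 8 hours later is forwarded to a server that never closed its side.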

There were two interesting things about these traces that revealed something I'd never seen before, and which challenge my assumptions about how much I know about this particular brand of firewall. First, I've never seen the firewall send an ACK of its own to a connection like we saw on the final connection on the night of the 22nd. At that time the firewall was not letting the SYNs through and my traces on the inside interface of the firewall confirm this - the print server was not getting them and was not sending the ACKs, yet I could see ACKs on the outside interface of the firewall. As near as I can tell the firewall had to be sending them.

Second, in my experience this particular type of firewall is very strict about keeping track of state, and I would not have expected it to allow those ACKs from the server the next morning - once the firewall was letting the SYNs through it should have been watching for SYN/ACKs from the server, and also watching to make sure the ACK numbers were correct. Instead, it was letting those ACKs go right through. I am thinking maybe the firewall is programmed to do this in case there are out-of-order packets on the wire, but it still seems a little freaky and I'm going to have to read up on it.

Thursday, June 4, 2015

Some Things Shouldn't Have To Be Hard

Part of my job involves supporting the network for a business unit with government contracts. We have a connection to a private government extranet, over which our users connect to several websites required to fulfill the contract work.

Monday morning the senior director of IT operations for this business unit called me to say that his users couldn't log into one of these sites. He had already been in touch with tech support for the site, and they had confirmed that it was up and running, and suggested we had a problem on our end.

I started my troubleshooting by logging into the perimeter router connecting to the private extranet, and saw that the connection was up. Next I logged into a perimeter firewall and checked that there was live traffic passing in both directions - everything looked healthy there as well.

Finally I logged into a PC on the affected network and tried connecting to the external site myself using a web browser. I was unable to connect. Browsers these days do a pretty poor job of indicating what the problem is if a site can't be reached. I was using IE 11, and it gave me a list of possible causes that covered just about every possible issue.

I decided to look up the IP address of the remote site so that I could trace the path through the network and double check firewall rules. Using nslookup at the command prompt, I got a good indication of the problem right away - I was unable to resolve the IP address of the site. My computer was configured to point to our internal DNS servers, which in turn forward certain domains to DNS servers located across the private extranet.
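The same quick check can be scripted. The sketch below uses Python's standard socket.getaddrinfo, which goes through the operating system's resolver (hosts file included) rather than querying a DNS server directly the way nslookup does, so treat it as a rough equivalent. The resolve function name is my own.

```python
import socket


def resolve(hostname):
    """Return (sorted list of addresses, None) or ([], error text) on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror as exc:
        return [], str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr); the address
    # is the first element of sockaddr. A set removes duplicates.
    addresses = sorted({info[4][0] for info in infos})
    return addresses, None
```

An empty list with an error string is the scripted version of what I saw at the command prompt: the name simply would not resolve.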

Since I was unable to resolve the IP address, I suggested that we needed to get the on-call DNS administrator to check things out. In the meantime we also started a conference call with the tech support people for the remote network. While waiting for our own DNS administrator to join, I described the issue I was seeing.

The remote technician asked me, "Well, what did you change?" I told him we hadn't made any changes. He asked, "Did you do anything to your network connection?" No, we hadn't. "Did you make any firewall changes over the weekend?" was the next question. No, we didn't. I reiterated to the remote tech that our connection was up, everything seemed to be working, but we just couldn't get DNS resolution.

After a short while our local DNS admin joined the call. In short order he confirmed that the DNS servers were working properly, no changes were made on our end, and we seemed to be getting "denied" messages back from the remote DNS server. The remote tech repeated just about every possible iteration of the question about what WE had done to break things.

Only after more than an hour of this line of questioning did the remote technician finally reveal that the remote DNS servers had been changed over the weekend - completely replaced with entirely new devices. It took a little longer, but it was eventually discovered that the new devices had a built-in ACL which was blocking our requests. The old servers hadn't had this capability, and the ACL which the remote DNS admins had put in place didn't allow our servers to talk to theirs.
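I never saw the ACL itself, so its format is unknown to me. As a rough sketch of the general idea, assume it is a list of permitted source networks: any query from an address outside every listed network gets refused. The addresses below are documentation examples, not the real ones, and the function name is hypothetical.

```python
import ipaddress


def is_allowed(client_ip, allowed_networks):
    """Return True if client_ip falls inside any permitted network."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in allowed_networks)
```

If our forwarders' addresses were never added when the new servers went in, every query from us would fail this test no matter how healthy our side was.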

So riddle me this, Batman - you know you changed out your DNS servers, but when I call and tell you my DNS queries are being refused, you spend an hour making me repeatedly assert that I didn't change anything? I lost two hours of my time, and more importantly my business lost two hours of productive work for dozens of users trying to fulfill their quota of work on a government contract because some bozo didn't want to admit that his change broke the system? Priceless.

Thursday, February 26, 2015

James's Rules of Troubleshooting

While digging through some old documents this afternoon, I came across something I had written about 5 or 6 years ago. It's a pretty fair summarization of some things I have learned about troubleshooting. I've made a few minor edits, but for the most part it's just as I wrote it back then.

James’s Rules of Troubleshooting

Before - (things to do/know prior to a problem – when something breaks it’s too late to start working on these):

Know your stuff. Be an expert on the technology for which you’re responsible. “SME” means Subject Matter Expert – be one. It’s MUCH easier to spot what’s wrong if you know what “right” looks like.
Know where to find your product manuals, configurations, and logs. Know what’s in the logs. Know how to read them. Make sure logs are tuned to show the right amount of data, kept for a sufficient period of time to be useful, and are time-synchronized with everything else in the network. Keep frequent backups of configurations / changes.
Have the tools you need. Have them installed. Have them up-to-date. Know how to use them. Try not to get tied to a single tool, no matter how good it is (when all you have is a hammer, everything looks like a nail). Know more than one way to skin the cat.
Work on your people skills. You got into I.T. to avoid dealing with people? Hopefully I’m not the first person explaining to you that this is not possible. The network, computers, and software exist to serve people (commonly known as “users”) and you will need to be able to deal with them. This includes people on other teams and from other technical disciplines. Network people need to be able to talk to server people, Unix people need to be able to talk to Windows people, etc. Don’t let personal issues fester – they’ll get in the way at the worst times.
Work on your communication skills. Be able to speak and write clearly. Clarity and accuracy are very important. To the degree that it is possible, clarity is achieved by being as simple as you can be while retaining accuracy. Get comfortable standing in front of small (5 to 10 people) and medium sized (10 to 25 people) groups and explaining how your technology works, how it ties in with the rest of the system, etc. Be good at drawing diagrams. Have a diagram of what your stuff looks like before you ever need it.

During (what to do while working a problem):

Get a clear description of the problem. This is often harder than it sounds. “Users” can talk in vague terms – “Everything is slow.” “The network is broken.” You’ll need to elicit the right information through direct questions. “Exactly what were you doing when the problem occurred? Were there any error messages displayed on screen? Were other applications affected?” You may have to repeat questions multiple times in order to get the user to answer what you’re asking.
Oddly enough, it can be even harder to get a clear story from a technician – they may be giving an edited version of events based on their own bias (they think their stuff can’t be broken, or that they know where the problem lies). They will also likely want to tell you all the steps that were tried before you were called. Unless they kept very good records of what was done, in what order, and the results of every test, that information is likely to be less than helpful. If you don’t feel you can completely rely on the source, no matter how good your relationship, it is best to do your own investigation – “See for yourself!”
If this is a problem with something that has been previously working and is now broken, find out what changed (if anything). Software updated (on server or workstations or network infrastructure)? Hardware changed? Don’t be too quick to dismiss something that you don’t think is related – the installation of new DNS servers really can be the cause of slow network performance logging into Unix boxes. Use your company’s change control record-keeping system to research.
Check your stuff first. When you are asked to join in a troubleshooting effort, make sure your components are not misconfigured or broken. When someone asks you to take a look, your response should not be “My part can't be broken…” – rather, it should be “I’ll go check that out and get right back to you.” If you are prepared (see the “Before” section) this should take very little effort.
If the problem turns out to be your area, you need to fix it, but you also need to report honestly and accurately to the team or leadership. I won’t tell you which one to do first – that depends on what’s broken, the rules at your company, etc. But you should make that report a priority. Honesty can be difficult when the problem reveals a personal error. All I can do is urge you to suck it up and do the right thing – it’s worked remarkably well for me over a long period of time. People understand that if you’re not making the occasional mistake, you probably aren’t working. They will respect honesty (as long as you’re learning from your mistakes and not making the same ones over and over).
If the problem is not in your stuff, offer to help other areas however you can. If you got called in because you have a reputation as a great troubleshooter, this will already be understood. But even if that’s not the case, you may have valuable insight to offer – or maybe just a fresh set of eyes (and brains). If you can’t help directly, you can still learn a lot by watching the process and observing the resolution.
Try to recreate the problem. If you can’t test things in the production environment, try to set up a test-bed.
Compare working and non-working configurations. Got two servers that are supposed to be doing the same thing, but one isn’t working? Find out how they are different!
Be persistent. Don’t give up. Stick with the job.
Know when to get help. Some I.T. people are loath to open problem records with a vendor (including yours truly). But if a production system is down and the bottom line is hurting, bring the vendor in earlier rather than later. There may be a known issue that they can recognize quickly.
Take breaks. During a protracted issue, you’ll want to rest your brain on occasion. Failing to do so can cause you to overlook otherwise obvious problems – when you look at something for too long, it starts to look normal even if it’s broken! Get up and walk. Drink water. Don’t forget to eat.
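The "compare working and non-working configurations" rule above lends itself to a quick script. This is a minimal sketch using Python's standard difflib; the configuration text in the usage note is invented for illustration.

```python
import difflib


def config_differences(working: str, broken: str):
    """Return only the lines that differ between two configuration texts."""
    diff = difflib.ndiff(working.splitlines(), broken.splitlines())
    # ndiff prefixes removed lines with '- ' and added lines with '+ ';
    # unchanged lines ('  ') and hint lines ('? ') are dropped here.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

For example, comparing "hostname a" / "ntp server 192.0.2.1" against "hostname a" / "ntp server 192.0.2.9" reports just the two ntp lines, which is exactly the difference worth chasing.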

After – because eventually, the problem will be solved…

FIX the problem. If workarounds were applied, remove them. Patch software. Reconfigure equipment. Whatever was broken, make it whole again. “Bandaids” that are applied to get through an initial rough spot should not be considered a complete fix.
Bring the system back to “standard”. Don’t settle for a one-off solution that no one will remember exists in a week. If there is something wrong with the standard, FIX THE STANDARD.
Document what you did. The standard documentation for the technology in question should be updated to reflect new configurations, new software versions, etc. Diagrams should be updated. The problem isn’t fixed until this is done.
Be able to express in clear language to your superiors and peers what happened, the steps taken to fix it, etc. Make sure you understand what happened, in technically accurate terms.
Review your performance (and that of your peers) during the troubleshooting process. Did you discover systems that aren’t time synchronized? Fix it! Did you discover you really DON’T know how to use that nifty tool? Practice! Did you take too long to come to the right conclusion, perhaps overlooking data that was obvious in hindsight? Review the process you followed and try to understand how you might have gotten to the correct solution more quickly. The after-action "lessons learned" session is a valuable opportunity, not to be wasted.

Saturday, February 21, 2015

The Case Of The Silence On The Wire

I have spent a lot of my career as a network engineer in front of packet sniffers. I've often heard it said that "the wire doesn't lie", and that's true - as far as it goes. But packet sniffers (and other analysis tools) don't show you the "truth" either, unless you define truth as just a set of data points. "Truth" as most of us understand it requires deriving meaning from the facts, and sniffers are pretty limited in this aspect. The following story will (hopefully) illustrate the sometimes difficult process of extracting the truth from the facts, how different people sometimes draw different conclusions from the same facts, and the importance of persistence in the pursuit of the truth.

My company has a big print and mail facility, and years ago some smart person realized that we had enough excess production capacity to offer our services to other companies. They lined up their first prospective customer and things got underway. We set up a VPN over the public Internet between the other company's network and our own. Our Internet connection came into our primary datacenter, and from there we had a private connection to our print and mail facility.

The print server was a Unix system running the standard line printer daemon (LPD), and the client was running the line printer remote (LPR) protocol on a Windows server with Microsoft Print Services for Unix. Their system would create a print job and connect to our server over the VPN, whereupon the job would be queued and printed. We went through a short POC phase, and when everything worked to the satisfaction of the print facility and the customer, contracts were signed and work got underway.

Not long afterwards, my manager was contacted by the print facility folks and asked to look into a problem - the customer was reporting occasional problems connecting to the print server. We didn't have a lot of VPNs at that time, and the combined mistrust of the Internet and VPNs had led to a suspicion that the VPN was the culprit. We therefore combed through our VPN logs for evidence of problems, as did an engineer at the customer's network. Neither we, nor the customer's engineer (who I will refer to hereafter as "Steve") found any evidence implicating the VPN. However, the problems continued.

We had recently gotten new packet capture devices with large amounts of storage, designed for more or less permanent installation in potentially high-value locations on the network. We deployed one of these to be able to watch all traffic on both sides of our VPN concentrator - one side would see the encrypted stream, while the other would be able to see the unencrypted stream. We deliberately chose this spot on the network because it was the "furthest out" on the perimeter network - if we didn't see problems here, we could start looking further into the network, while on the other hand if we did, we could safely ignore our internal network.

We didn't have long to wait until the problem resurfaced, and the packet traces were instructive. We saw the communications from the client, and every one of these communications looked perfectly clean. Every connection included a complete TCP three-way handshake, what appeared to be normal communications between the LPR and LPD, and a standard four-way teardown of the session. Not once during the entire time troubleshooting this application did we ever see anything like a failed connection - TCP handshakes were always complete, the client and server communications were always successful, there were never any sessions that died midstream, and the teardowns were always textbook clean. It's been a long time but if memory serves, I am nearly certain we never even saw a single TCP retransmission.
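What "textbook clean" means can be expressed as a small check: the session opens with the three-way handshake, contains no resets, and closes with a four-way teardown. The sketch below works on a simplified list of (sender, flags) labels that I made up for illustration; it is not the output format of any capture tool.

```python
HANDSHAKE = [("client", "SYN"), ("server", "SYN-ACK"), ("client", "ACK")]


def is_clean_session(events):
    """Return True if events show a full handshake, no RST, and a 4-way close."""
    if len(events) < 7 or events[:3] != HANDSHAKE:
        return False
    if any(flags == "RST" for _, flags in events):
        return False
    first_fin, first_ack, second_fin, last_ack = events[-4:]
    closer, other = first_fin[0], first_ack[0]
    return (
        closer != other
        and first_fin == (closer, "FIN-ACK")
        and first_ack == (other, "ACK")
        and second_fin == (other, "FIN-ACK")
        and last_ack == (closer, "ACK")
    )
```

Every session in our traces would have passed a check like this, which is exactly what made the problem so puzzling.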

What we DID see were inexplicable absences of connections - periods of time usually lasting several minutes in which there were no packets of any kind coming from the client's systems. During these periods there were no TCP attempts at all. The trace taken outside the VPN box was similarly devoid of traffic. In effect, the wire was silent.
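Finding these silent periods amounts to looking for large gaps between consecutive packet timestamps. A minimal sketch, assuming the timestamps have already been pulled out of the capture as seconds (the function name and the threshold are my own choices):

```python
def find_silent_periods(timestamps, min_gap_seconds):
    """Return (start, end) pairs where no packet was seen for min_gap_seconds."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier >= min_gap_seconds
    ]
```

For example, with packets at 0, 5, 12, 400 and 410 seconds and a threshold of 180 seconds, the only silent period reported runs from 12 to 400.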

There was a chance - a very slim one - that something was going on in what little of our network existed outside the VPN box, so we looked for evidence of that. The VPN box plugged into an ethernet switch, as did our Internet routers. The switch was clean, as were the routers. We were not experiencing any interruption of other Internet traffic, our other VPNs were all running clean, and the VPN to this partner was not having problems.

Given that we saw NO issues of any kind, I concluded that the issue was occurring at the customer's end. I reasoned that intermittent issues on the Internet, or within our infrastructure, would not "respect" the boundaries of TCP sessions - in other words, I would have expected to see problems occur within the TCP sessions. I might expect to see failed TCP handshakes, or some irregularity within the print jobs streaming over. The fact that this never happened led me to believe that something was preventing our customer's systems from even attempting to connect for short periods of time. I should also mention that our print server was handling lots of internal jobs with no issues.

I packaged up my sniffer traces and forwarded them to Steve, outlining my conclusions and the reasoning behind them, and asking him if he could take local traces of his own and confirm whether his systems were making any attempt to connect. He promised to do so. I didn't hear from him for a while, and we continued to get reports through the print facility that the customer was complaining about the connection problems, so I reached out to Steve again.

I asked directly whether Steve had taken traces. He said he had. I asked if he had seen any irregularities in the traces on his side. He said he had not. This might have been an error on my part - maybe I should have asked whether he saw anything at all when the dropouts were occurring, the same "silence on the wire" evident in my traces, but I didn't think of it. I did ask if I could get copies of his traces to compare with mine, and Steve said he would share them, but they were never forthcoming - so to this day I do not know if he actually took any.

I reported my research to my management, along with my conclusion that the problem must be at the client's end, and that there was not likely anything we could do about it. Their response was to urge me to keep looking, and so I did. I took a dozen or more traces, all containing perfect, complete sessions, and usually also containing some of these weird silent periods.

I dug into the traces and started looking at everything that was there in the communications. Let's see - client SYN packet, source port 721, server port 515, server SYN-ACK, client ACK, some kind of "hello" packet from the client and a response from the server, print job streaming over, client FIN-ACK, server ACK, server FIN-ACK, client ACK. All perfect. Next job - identical except for client source port which is now 722 (and the sequence numbers, of course). And another - client side port 723, etc. It did occur to me that the client-side port numbers were almost TOO sequential - there were never any skips, say from 721 to 725 - which made me think the client's system must not have been too busy, or that it might be passing through some device that was altering the client ports on the way out of their network. I also thought most clients should pick their ephemeral ports from a range of 1024 and above, but I wasn't too bothered by it.
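The port observation can be checked mechanically. RFC 1179 requires the LPR client's source port to be in the range 721 to 731 inclusive, so the sketch below flags any port outside that range and any step that is not "previous port plus one". Wrapping from 731 back to 721 is my assumption about how a client would cycle through the range, and the function name is hypothetical.

```python
LPR_FIRST_PORT = 721
LPR_LAST_PORT = 731  # RFC 1179: source port must be 721-731 inclusive


def check_source_ports(ports):
    """Return (ports outside the LPR range, list of (prev, cur) skips)."""
    out_of_range = [
        p for p in ports if not LPR_FIRST_PORT <= p <= LPR_LAST_PORT
    ]
    skips = []
    for prev, cur in zip(ports, ports[1:]):
        # Assumed behavior: after the last port, start over at the first.
        expected = prev + 1 if prev < LPR_LAST_PORT else LPR_FIRST_PORT
        if cur != expected:
            skips.append((prev, cur))
    return out_of_range, skips
```

A run such as 721, 722, 723 comes back with no findings, while a jump from 721 to 725 is reported as a skip.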

I looked at this for a while - several times, on various days, over several weeks. I couldn't find anything that I thought would help, and I slowly lost interest, especially as the rate of complaints dwindled. After a while I just didn't think about it any more. It wasn't exactly a matter of "giving up" - I thought I'd done a good job of isolating the issue to the customer's end, and lacking visibility into their network I just didn't see that I could do more to help.

For several months it dropped completely off my radar. I guess I should have known it would come back though, because unresolved issues never really go away. One day my manager and director both asked me to get back on the case. They had been contacted by the manager over the print and mail facility. The issue was still going on, the customer was complaining more than ever, and now the print facility was having to reboot the server on a regular basis to clear up the issue. To make matters more urgent, they were hoping to add another customer, but until this issue was resolved they were unable to take that next step.

I started by calling our print server operator - I wanted to know why we were rebooting the server. He told me that after countless "outages", the customer had requested that we try rebooting the server. This had been done, and the customer was then able to connect and send print jobs. Ever since then the customer had gotten into the habit of calling and requesting a reboot whenever they had trouble connecting.

I have to tell you that this made me really angry. The lack of logic involved here was staggering. I had clear, indisputable evidence that when the customer was "having trouble connecting", we weren't getting anything from them at all. I had evidence that every time we did receive a connection request, we answered appropriately - our server always responded. And in addition, our print server never failed to pick up and handle internal print jobs, which, by the way, were now being interrupted by these frequent reboots. The whole thing made no sense.

I then talked to Steve. He told me that since the problem had persisted, they (the customer's company) had started using a command-line utility to check the print server. They would fire up a command prompt on the system creating the print jobs and run a command that would connect to the print server and display the jobs. When "the problem" was occurring, the command-line utility would also be unable to connect. Their operators would sit there rerunning the utility every couple of minutes until they got a successful connection, and then try to restart the print jobs. Usually after this, they could print again, but sometimes not.

What I got from all of this was that the issue had some sort of time component to it, resolving itself within a few minutes. I was still convinced that the issue was on the customer's end - there was never any evidence otherwise. The server reboots simply gave time for the problem to correct itself, but that had always happened anyway. To test this theory, I advised the print server operator not to reboot the server any more. I suggested that he didn't have to tell the customer he wasn't rebooting the server - he could just say "OK, try again in a few minutes." He began doing this, and sure enough there was no difference in the behavior of the whole system. After a wait of a few minutes, their print jobs would start coming through.

Now that I had gotten the "reboot monkey" off our backs, I went back to the traces. I took a bunch of new ones and started going through the connections again and again, looking for anything out of the ordinary. They looked just like the ones from before - client sends SYN packet with client port 721 (or something similar), server side port 515, server sends SYN-ACK, etc. The connections were as perfect as ever. In fact, they looked so familiar that I began to wonder if I was looking at my old trace files. Nope, these were new. I pulled up some of the old original files to make sure, and it was at this point that I began to grasp the faint outlines of the problem.

The client side port numbers had always bothered me a little bit. Aside from the fact that they were all under 1024, the range of port numbers was always very consistent - and very small. The client side port numbers were always within the range 721 - 731. Eleven port numbers, always in succession, reused again and again. I would see a connection from port 728, 729, 730, 731, then it would loop back around to 721. And every so often, usually after a bunch of successful connections, silence on the wire.

I began to wonder if there was some issue with port-exhaustion - this thing was using such a small pool of client-side ports. I wondered, how quickly is a client-side port allowed to be reused? I dug out my trusty copy of W. Richard Stevens' TCP/IP Illustrated Volume 1 and found the TCP state diagram. I saw something called the "2MSL" wait state which occurs before the socket is fully closed. The MSL is the "maximum segment lifetime" which is supposed to be two minutes. The standards for the protocols we still use today were created back when computers and the Internet were MUCH slower, and back then there might be conditions on the network that could cause a packet to arrive late - very late indeed. Anyway, the standards also said that the partner in a TCP session which initiates an active close (through the use of a FIN-ACK packet) MUST then hold the connection for two times the maximum segment lifetime (2MSL) before it can consider the socket closed.

In other words, the connection doesn't truly close for four minutes after all the teardown messages have been exchanged. If you do a "netstat" command on a system you will often see sessions in something called the "TIME_WAIT" state. These are sessions waiting out the 2MSL period so the system can close them. Basically, from the time I observed the client FIN-ACK and other teardown packets, four minutes would have to elapse before the client-side port number would again be released to the operating system for reuse.
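For anyone who wants to check this on their own system, here is a minimal sketch of picking the TIME_WAIT sessions out of "netstat" output. The sample text is made up for illustration (the addresses are hypothetical, and real netstat output has headers and varies by operating system), and the function name is mine, not part of any tool.

```python
# Hypothetical sample in the style of "netstat -an" output.
# The addresses and ports are invented for illustration only.
SAMPLE = """\
  TCP    10.0.0.5:721    192.0.2.10:515    TIME_WAIT
  TCP    10.0.0.5:722    192.0.2.10:515    TIME_WAIT
  TCP    10.0.0.5:723    192.0.2.10:515    ESTABLISHED
  TCP    10.0.0.5:49152  192.0.2.20:443    ESTABLISHED
"""

def time_wait_ports(netstat_text, remote_port=515):
    """Return the sorted local ports of TIME_WAIT sessions to remote_port."""
    ports = []
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything that isn't a 4-column TCP row.
        if len(fields) != 4 or fields[0] != "TCP":
            continue
        _, local, remote, state = fields
        if state == "TIME_WAIT" and remote.rsplit(":", 1)[1] == str(remote_port):
            ports.append(int(local.rsplit(":", 1)[1]))
    return sorted(ports)

print(time_wait_ports(SAMPLE))
```

If every port from 721 through 731 shows up in that list at the same moment, the client has nothing left to connect from until the oldest session finishes its wait.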

With respect to our customer's printing problem, the issue was now in pretty sharp focus. For some reason, the client was only using 11 port numbers (721-731). After 11 successive print jobs, if the timespan of those jobs was less than 4 minutes, all of the TCP sessions involved would be in the TIME_WAIT state. Until the oldest sessions completed the 2MSL wait, there would not be any available ports for new sessions. But why were they using such a small pool?
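The arithmetic is easy to sketch. The following is a rough simulation, not a model of the actual Windows print service: it assumes each print job finishes instantly, that a blocked job is simply dropped rather than retried, and that the client always takes whichever port has been free the longest. The function and constant names are my own.

```python
MSL_SECONDS = 120                     # the two-minute MSL described above
TIME_WAIT_SECONDS = 2 * MSL_SECONDS   # 2MSL = four minutes
LPR_PORTS = range(721, 732)           # 721 to 731 inclusive, 11 ports

def simulate(job_times, hold=TIME_WAIT_SECONDS, ports=LPR_PORTS):
    """Return the start times of jobs that find no free source port.

    job_times: ascending job start times in seconds. Each job is assumed
    to complete instantly, so its port enters TIME_WAIT at the start time.
    """
    free_at = {p: 0.0 for p in ports}   # port -> time it becomes reusable
    blocked = []
    for t in job_times:
        port = min(ports, key=free_at.get)   # port that frees up earliest
        if free_at[port] > t:
            blocked.append(t)                # all 11 ports are in TIME_WAIT
        else:
            free_at[port] = t + hold
    return blocked

# One job every 10 seconds for five minutes.
print(simulate(range(0, 300, 10)))
```

With a job every 10 seconds, the first 11 jobs go through and then every job is blocked until the four-minute mark, when port 721 finally frees up. Under these assumptions the pool can sustain at most 11 jobs per 240 seconds, or about one job every 22 seconds; at one job every 30 seconds the simulation never blocks.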

The answer to this is in the RFC which defines the LPD/LPR services. RFC 1179 says that "The source port must be in the range 721 to 731 inclusive." To be honest, I didn't actually find the answer in the RFC - but some Googling led me to a Microsoft Knowledge Base document which described exactly the problem we were seeing, which mainly occurred on a specific version of Windows Server, and with a suggested fix - a registry setting to cause the Print Services for Unix to use standard ephemeral ports from the much larger pool above 1023. The document outlined that the command-line utility they were using also drew client ports from the same range, so effectively, if all the ports were tied up in TIME_WAIT sessions, the utility would similarly fail to connect. In fact, when they DID connect with the utility, they were actually putting an available port into a 2MSL wait again!

I sent an email to Steve, asking what version Windows Server was in use. He confirmed the affected version. I then sent an email telling him what I thought was happening - that his server was using a very limited range of client side ports, that the speed/volume of print jobs was outpacing the system's ability to clear the sessions for reuse resulting in client port exhaustion, and that there was a suggested fix involving a registry setting. I even sent a link to the Knowledge Base document.

Steve responded that he would look into it, but my interpretation of his response was that he wasn't sure he believed me. I couldn't do much about that - whenever you are dealing with another company, and that company is your customer, and when you are telling a peer engineer that you have remotely diagnosed a problem in his systems...well, there may be some resistance to the idea. So I waited to see what would happen.

What happened was exactly zilch - the problem persisted, and again we were being begged by our print facility manager to intervene. But this time, there really was nothing more we could do - except that I now understood I would have to force the issue with Steve. I wrote - to my management, and copied to Steve - an exhaustive (and at times pointed) accounting of the entire troubleshooting effort, including my early work and its (accurate) conclusions about the problem being in the customer's network, my difficulties getting information out of Steve, my work in understanding and putting a stop to the reboots, and finally my conclusion - backed up by sniffer traces and documentation from the customer's server OS vendor - that the problem was caused by port exhaustion. I included the Knowledge Base document for reference, and stated that my department was finished working the issue, once and for all.

Within a week of sending that email, Steve (or someone else at his company) had made the registry changes to their system, and the problem never surfaced again. It had taken six months, dozens of hours looking at traces, emails back and forth with an incompetent or unhelpful peer, a lot of pain and suffering on the part of our print facility, and research into another company's network and systems, but the problem was finally resolved.

Whenever I work a problem - especially when it's such a challenging and painful one - I always look for "lessons learned." This one was particularly fruitful:

Persistence, persistence, persistence - over and over throughout my career troubleshooting, I have run into problems where it seems like I just stare at packet traces or logs until my head is about to burst, and then, like a ray of sunshine coming through the storm clouds - the solution appears. This case was somewhat rare in that there was a period where I accepted that the problem was "solved" even when it wasn't. I had felt that in proving the problem was on the customer's end, my work was done. Figuring out when to stop, when enough is enough, is part of maturing as a troubleshooter and I may do a blog post about that later - but in this case, my real customer was always the people at my company's print and mail facility, and until things were completely resolved, my work was not truly done.
Troubleshooting a problem that exists on a foreign network is really hard, but NOT always impossible - this probably doesn't require much more explanation than what is available in the story above, but I've often seen network engineers focus on this sort of "us VS. them" strategy in problem solving. The idea is that if we can prove it isn't US, then it must be THEM, and we can't do anything more. A lot of times there is a bit of shaky logic employed, something like "Well, our printer works fine for everybody else, and we don't have any other VPN or Internet problems, so it's not us." Of course, Steve always insisted that they were not having problems sending print jobs to anywhere but us, which is possible if we were the only LPD server they were targeting. All of this may have been true - and in my case I had even better evidence from my traces that we weren't even receiving communications from the customer's network - but the fact remains that the answer was always right in front of me, in the packet traces I was so proud of analyzing.
Dealing with "peers" on other teams or at other companies can be just as challenging as the technical act of troubleshooting - "Steve" is a prime example of something I've dealt with many times over the years. He was either unwilling to really look at his network and systems, or incompetent, or both. I believe that if he had actually performed network sniffer traces he would have noticed that there was NO communication coming from his print system during the outages, which would have led him to the same conclusion I had reached. The fact that this didn't happen, combined with his apparent unwillingness to share the traces he claimed to have taken, leads me to believe he never did them at all. Of course it's possible he did the traces but just didn't interpret them properly. I don't suppose I will ever know. I also strongly suspect that the Windows Server in question would likely have been writing event log messages regarding the connection problems, had anyone over there cared to take a look. But Steve was what I had to work with - I had to continually reach out to him, probe for information and prod for action, while trying not to upset or insult him, in order to finally get the action required. It was neither easy nor pleasant.
No matter how much you know, there's always room for more - going into this problem I thought I knew the basics of socket communications pretty well. But it took far too long for me to notice the oddly low port numbers, or the small pool in use. There's a lot of detail in packets and packet traces, and it takes diligence to spot patterns like these.

And so ends the tale of the Silence On The Wire - hopefully you made it here to the end, and it was worth coming along for the ride.

Monday, October 13, 2014

Bloody Turnips

“You can’t squeeze blood from a turnip.” This old saying is a way of expressing that some things are so obviously impossible that they aren’t worth trying, that they are a waste of time. But sometimes the problem isn’t that we’re trying to squeeze blood from a turnip - the problem is assuming that we’re looking at a turnip in the first place.

The other day I got “the call.” “The call” usually comes late in the day, and frequently on a Friday. It’s when someone has been working at a problem all day, or all week, realizes they are running out of time, and in a last ditch effort at a resolution they ask for a network admin to take a packet trace. And I’m the person that frequently gets “the call.”

This time it was an application which picks up files from a server, the application was locking up, and the people troubleshooting it explained that this is frequently a sign that there was a delay in picking up the files (this application was said to be super time-sensitive). Server admins had found nothing wrong on the file server. I was asked to see if there was anything causing network-based latency, or if I could at least see something in the trace that might account for the issue.

I have to admit that I did not approach this problem with any enthusiasm. I have a life. I do not like getting called at 3:00PM to start a multi-hour troubleshooting session on something this vague. But it’s part of the job, these were my customers, and apparently nobody else was making any headway (including the vendor of the application, who had been called in to work on it).

Now despite being pretty good with the sniffer - and sometimes enjoying the challenge - I know that it can be a hard way to get to the root of a problem, so I made an effort to do things the easier way. I asked the usual questions - when did the problem start, did something change, could I get a more technically accurate description of the problem, etc. I looked at the basics - located and checked for errors on the switch ports of the file server and application system and so forth. And then, reluctantly, I fired up the sniffer and got started.

About an hour into the session, one of my teammates came up to watch, and he asked the obvious question - “Do you really think you’re going to find the problem by looking at the packet contents?” He was, in essence, asking me if I was trying to squeeze blood from a turnip. And honestly I did not know how to answer him.

It’s something I’ve thought about often over the years. I am very interested in troubleshooting - the thought processes that go into it, the practice of it, the techniques that are used. I think that the act of trying to reverse-engineer an application by staring at the sniffer until it feels like my head is bleeding is a really hard way to do things. But while I have not come up with a lot of amazing answers to those questions, I have learned one thing:

I can’t solve a problem if I don’t try.

There are a lot of times it feels like I’m squeezing a turnip. But the truth is I don’t know what I’m squeezing. It’s like sticking my hand in a bag and grabbing something, and squeezing it, and after a long time I get some blood out of it - in which case I find that it wasn’t a turnip. And sometimes I get nothing but turnip guts.

So I just said to him - “I have no idea.” And I kept on squeezing.

I’d like to conclude this post by telling you about the amazing discovery I made in the packet trace. Unfortunately that didn’t happen. What did happen is I was able to determine that when the application freezes up, it isn’t waiting for anything from the file server. The application was getting a response that looked “complete” (for you packet monkeys, it had the PUSH flag set on the last packet of the response), the application system responded with an immediate ACK, and then sat there for a long time before doing anything else. Then the application system sent a packet and things started up again. I saw this happen multiple times during “freezing” episodes.

What does it mean? Well, it means the problem isn’t a delay in getting information from the file server. There could be a problem in the contents of the response, and being unfamiliar with the application itself I couldn’t speak to that. Or there could be something happening on the application system causing it to freeze that has nothing to do with the network traffic.
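For the packet monkeys, here is a rough sketch of how that pattern could be picked out of a trace mechanically rather than by eye. It assumes the capture has already been reduced to a simple list of (time, sender, flags) records for a single connection; the Packet type, the function name, the one-second threshold, and the sample trace are all hypothetical and not taken from the real capture.

```python
from typing import NamedTuple

class Packet(NamedTuple):
    time: float        # seconds since start of capture
    sender: str        # "server" (file server) or "app" (application system)
    flags: frozenset   # TCP flags seen on the packet

def stalls_after_response(packets, threshold=1.0):
    """Find gaps where the app goes quiet after acknowledging a response.

    Returns (ack_time, gap_seconds) for each server packet with PSH set
    that is followed by a bare ACK from the app, and then by more than
    `threshold` seconds of silence before the app sends again.
    """
    stalls = []
    for prev, ack, nxt in zip(packets, packets[1:], packets[2:]):
        response_done = prev.sender == "server" and "PSH" in prev.flags
        acked = ack.sender == "app" and ack.flags == {"ACK"}
        gap = nxt.time - ack.time
        if response_done and acked and nxt.sender == "app" and gap > threshold:
            stalls.append((ack.time, gap))
    return stalls

# Made-up trace: one long freeze after the first response, none after the second.
trace = [
    Packet(0.000, "app", frozenset({"PSH", "ACK"})),
    Packet(0.004, "server", frozenset({"ACK"})),
    Packet(0.005, "server", frozenset({"PSH", "ACK"})),
    Packet(0.006, "app", frozenset({"ACK"})),
    Packet(12.506, "app", frozenset({"PSH", "ACK"})),
    Packet(12.510, "server", frozenset({"PSH", "ACK"})),
    Packet(12.511, "app", frozenset({"ACK"})),
    Packet(12.530, "app", frozenset({"PSH", "ACK"})),
]

for ack_time, gap in stalls_after_response(trace):
    print(f"app idle for {gap:.1f}s after ACK at t={ack_time:.3f}")
```

The point of checking who speaks next is the same as in the trace: if the silence ends with the application system sending, the file server wasn't the one being waited on.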

This information didn’t solve the problem for the application folks. It did get the file server admins off the hook, and it pretty well proved the network infrastructure wasn’t at issue, and it gave the application admins and their vendor a little push in the direction of looking at their own system a little harder. I hope it helped.

If there is a message here, it’s this - troubleshooting can be a painful, frustrating, and sometimes ultimately unrewarding process. Problems can be really complicated, the tools can be hard to use, and the whole thing can just be a lot of work. Even when you try your best you don’t always come up with a big win. But if you don’t try, you don’t stand a chance. I think a lot of people - including a lot of network people - think that problems can't be solved with a sniffer, or maybe that they can't solve them, so they don't try. All I can say is, I've done it often enough to know it's not impossible. Working a problem with a sniffer isn't always fruitless. So the moral of the story?

Keep squeezing.

Monday, August 25, 2014

What's The Problem, Anyway?

The first step in troubleshooting a problem is knowing that you have one. Hopefully you have some sort of monitoring system in place that can alert you to the existence of a problem in a timely manner. Unfortunately this isn't always the case, and problems are reported to us by users, system or application administrators, or in the worst case by customers.

Once we know there is a problem, the second step is to get a clear description of the symptoms (which will hopefully lead us to an actual technical definition of the problem). And herein lies one of the biggest headaches for a troubleshooter, because the reports we get are often vague, inaccurate or misleading. An important skill for the troubleshooter is therefore the ability to extract accurate information from the people reporting the problem, to get detailed descriptions, and weed out what is just plain wrong.

There are various reasons why we can't simply trust early problem reporting, some of which have to do with exactly who is making the report. In particular, getting people to concentrate on describing the symptoms rather than jumping to conclusions can be a real chore. Here are some common issues I see frequently with problem reporting:

End users frequently tend to describe what they feel rather than what they see, and to generalize - a lot. Descriptions such as "Everything is slow" are common. Users who can't get to a specific web site sometimes report that "The Internet is down."
People who have experienced one kind of problem in the past sometimes think that every new problem is the same as the old one. A recent example occurred at my office when the users of an externally hosted web application experienced extreme slowness and broken app sessions due to packet loss along the path to the external hosting site. A couple of weeks after that was resolved, there were problems with a server hosting the application, and it was reported to my team that the network problem had come back, despite the fact that the symptoms were different (and that users were getting server-side error messages displayed onscreen).
There may be one or more "human layers" between the people with the problem and the people troubleshooting, and they can muddy the waters. For example, many of our problems come to us by way of a helpdesk which takes problem calls. They provide a vital function, but inexperienced or untrained personnel may not ask the right questions, or they may provide their own interpretation before passing along the report.
People sometimes report inaccurate information, and once it's been reported it may be hard to correct. In the case described above where users were having trouble with performance of an external site, a manager who received initial complaints from his users concluded that only users of Windows XP and older versions of Internet Explorer were affected, but that users on Windows 7 and newer browsers were fine. This incorrect information was the result of failing to gather enough data before calling the helpdesk - but it went into the ticket. The problem got kicked around for several days by other areas before landing in my team's lap, but although the manager had learned during that time that his Windows 7 users were indeed affected, that information never made it into the problem ticket. Our team started the troubleshooting process with inaccurate information.
People sometimes think they already know what the problem is, and try to lead the troubleshooter to a particular conclusion that may not be warranted. A lot of the problems that come our way start out like this: "We need you to check the firewall, our app server can't get to the database server." Of course it may be the firewall, but troubleshooters who allow themselves to be led this way often lose precious time following false trails.

It's difficult to always keep these issues from occurring, but a good troubleshooter knows the importance of getting an accurate description of the symptoms. Here are a few ways it's done:

Whenever possible, talk to the people experiencing the problem. I know a lot of IT people who just HATE this - we like having the helpdesk act as a buffer between us and our users or customers (who may be in a foul mood by the time they call in a problem). But the more layers there are between us and the people who are actually experiencing the issue, the harder it will be to make sure the right questions are asked.
Concentrate on the basics. What is the user doing when the problem happens? What application are they using? What web site are they accessing? What specific function within the application or site are they accessing? What is it supposed to do that it isn't doing? When did the problem start? Did it use to work and now it doesn't, or is it something we've never seen work properly? How many people are affected? Is the user aware of a change - a new operating system or browser, or maybe a patch that got pushed out? Did an application administrator push out a new code release?
Try to see the problem for yourself. Can you try to run the same application under the same circumstances as the user? Can you remotely access a workstation in the same place and do the same thing? Can you shadow or monitor a user's session so you can see what they see? If you can't see it yourself, can someone reproduce the problem and give you a description? Can you get someone to take a screen shot, or send you an error message from an application screen or from a server or application log? (On that note, it's best if you can avoid having people try to write down or type an error message, as it may not be faithfully transmitted to you - a screenshot or actual snippet from an error log is better).
Try to recognize the difference between a problem description and a conclusion drawn by someone else about the nature of the problem - in other words, try not to be led. If the problem report tells you what needs to be checked this should be an immediate red flag. It's especially difficult to avoid if you know the person making the report and you have some respect for their technical skills, but you need to think things through for yourself - which may mean getting the reporter to back up and walk you through the symptom. If they want to describe how they came to their conclusion, that's fine as long as you can resist the temptation to let them do your work for you.

This isn't meant to be rude or disrespectful, but remember this - problem reports can be wildly inaccurate or so vague as to be nearly useless. An important part of troubleshooting is to get a clear, accurate description of the symptoms. Without that, you're half-blind and may waste a lot of valuable time and effort on the wrong path.

Thursday, August 21, 2014

It's The Network! (why network engineers get so much experience troubleshooting)

After I'd been on the job for a while I began to notice a disturbing trend - the network gets blamed for a lot of problems. At first I thought it was something unique to my company and our IT staff, but I have learned that this is a common occurrence. Every few months a vendor will come in trying to sell us some new-fangled network monitoring tool, and the opening pitch is always something like this:

"Are you tired of having to defend the network all the time? With (insert product name here) you can instantly PROVE that it's not the network causing problems, and refocus your troubleshooting on finding the REAL cause!"

The fact that a market exists for such tools, and that pretty much every vendor chooses the same pitch to get network engineers to buy them, tells me that this problem is widespread. It's very common for server and application administrators to blame the network when their systems aren't working the way they expect, and this attitude is seen in management as well.

There are a number of reasons why this is so - I'll list and comment on some of them here, and later blog posts will explore them in further detail.

The word "network" means something different to network engineers and to pretty much everyone else. To a network engineer, the "network" is a collection of routers, switches, firewalls, and VPN devices. When we're feeling generous it may also include other devices that can alter or affect traffic flow - security devices like IDS/IPS, load-balancers, etc. But to many people the "network" is defined as "everything other than the system I'm responsible for." This means that if a server or application administrator is having a problem and they don't see something wrong with their own systems (and frankly, they may not know how to look), they are going to toss it over the wall to the network team.
Some parts of the network are designed to block traffic. It's true - the very definition of a "firewall" is a system that blocks everything by default, and only allows traffic by explicit exceptions. And intrusion prevention systems can interfere with traffic that fits (or fails to fit) a particular profile. Which leads us to this little gem:
Sometimes, it really IS the network. No part of a large IT system is immune to problems. Firewalls may be blocking traffic if a network engineer has failed to correctly configure a necessary exception (or if the application owner has failed to request it). IPS systems can mis-identify traffic as malicious. Switch ports, line cards, and routers can have hardware problems, and as with any software, the operating code on these systems can be buggy. And that leads to yet another item:
It was the network last time, so it's the network this time. I call this The Problem of Experience. If a server or application admin has ever been the victim of a missing or misconfigured firewall rule or a bad IPS signature or a flaky switch port, the next time they have any sort of problem they're more likely to conclude that the network is causing it THIS time, too.
The network team has unique powers of observation. In addition to our ability to look at our own systems - our switches, routers, firewalls, VPN devices - we network engineers can also look at traffic. We are usually the folks who own and operate the packet capture and analysis devices - which makes a certain kind of sense given that we have to configure the network to copy traffic to them. Even when someone is kind enough not to actually blame the network, they often come straight to the network team for a "sniff" (and some expert assistance with the analysis) as a shortcut to resolving their issues.
We're good at troubleshooting. I addressed this briefly in my introductory post on this blog, but it comes down to one of those self-reinforcing cycles. We get lots of problems so we develop skill at solving problems, we get good at checking our own systems first and then tackling other people's issues, and then because we're good at it, we get asked to do it some more, so we get better at it...you see where this goes, right?

So if you're a network engineer wondering if it's just you, or just your company or your admins or your users...the answer is "No." We get the same thing everywhere - if there's a problem, someone is bound to blame the network. If you're lucky you will survive long enough to develop some skill at solving problems, and if you're really lucky you will eventually convince the people around you that it's not always the network. But don't hold your breath waiting for them to stop asking for help.

James V. Fields