My company has a big print and mail facility, and years ago some smart person realized that we had enough excess production capacity to offer our services to other companies. They lined up their first prospective customer and things got underway. We set up a VPN over the public Internet between the other company's network and our own. Our Internet connection came into our primary datacenter, and from there we had a private connection to our print and mail facility.
The print server was a Unix system running the standard line printer daemon (LPD), and the client was running the line printer remote (LPR) protocol on a Windows server with Microsoft Print Services for Unix. Their system would create a print job and connect to our server over the VPN, whereupon the job would be queued and printed. We went through a short POC phase, and when everything worked to the satisfaction of the print facility and the customer, contracts were signed and work got underway.
Not long afterwards, my manager was contacted by the print facility folks and asked to look into a problem - the customer was reporting occasional problems connecting to the print server. We didn't have a lot of VPNs at that time, and the combined mistrust of the Internet and VPNs had led to a suspicion that the VPN was the culprit. We therefore combed through our VPN logs for evidence of problems, as did an engineer on the customer's side. Neither we nor the customer's engineer (whom I will refer to hereafter as "Steve") found any evidence implicating the VPN. However, the problems continued.
We had recently gotten new packet capture devices with large amounts of storage, designed for more or less permanent installation in potentially high-value locations on the network. We deployed one of these to be able to watch all traffic on both sides of our VPN concentrator - one side would see the encrypted stream, while the other would be able to see the unencrypted stream. We deliberately chose this spot on the network because it was the "furthest out" on the perimeter network - if we didn't see problems here, we could start looking further into the network, while on the other hand if we did, we could safely ignore our internal network.
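Our capture boxes were dedicated appliances, but the same idea in miniature can be sketched in Python with the scapy library - this is purely illustrative, with a placeholder interface name rather than anything we actually ran:

```python
from scapy.all import sniff, wrpcap  # third-party library: pip install scapy

# Watch only LPD traffic (TCP port 515) on the "inside" (decrypted) leg of the
# VPN concentrator. "eth1" is a placeholder interface name, and sniffing
# requires elevated privileges.
packets = sniff(iface="eth1", filter="tcp port 515", timeout=600)

# Save the capture for offline analysis, e.g. in Wireshark.
wrpcap("lpd_inside.pcap", packets)
print("captured %d packets" % len(packets))
```

Repeating the capture on the "outside" (encrypted) interface, filtered on the VPN peer's address instead, gives you the second vantage point described above.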
We didn't have long to wait until the problem resurfaced, and the packet traces were instructive. We saw the communications from the client, and every one of these communications looked perfectly clean. Every connection included a complete TCP three-way handshake, what appeared to be normal communications between the LPR client and the LPD, and a standard four-way teardown of the session. Not once during the entire time we spent troubleshooting this application did we see anything like a failed connection - TCP handshakes were always complete, the client and server communications were always successful, there were never any sessions that died midstream, and the teardowns were always textbook clean. It's been a long time, but if memory serves, we never saw even a single TCP retransmission.
What we DID see were inexplicable absences of connections - periods of time usually lasting several minutes in which there were no packets of any kind coming from the client's systems. During these periods there were no TCP attempts at all. The trace taken outside the VPN box was similarly devoid of traffic. In effect, the wire was silent.
There was a chance - a very slim one - that something was going on in what little of our network existed outside the VPN box, so we looked for evidence of that. The VPN box plugged into an ethernet switch, as did our Internet routers. The switch was clean, as were the routers. We were not experiencing any interruption of other Internet traffic, our other VPNs were all running clean, and the VPN to this partner was not having problems.
Given that we saw NO issues of any kind, I concluded that the issue was occurring at the customer's end. I reasoned that intermittent issues on the Internet, or within our infrastructure, would not "respect" the boundaries of TCP sessions - in other words, I would have expected to see problems occur within the TCP sessions. I might expect to see failed TCP handshakes, or some irregularity within the print jobs streaming over. The fact that this never happened led me to believe that something was preventing our customer's systems from even attempting to connect for short periods of time. I should also mention that our print server was handling lots of internal jobs with no issues.
I packaged up my sniffer traces and forwarded them to Steve, outlining my conclusions and the reasoning behind them, and asking him if he could take local traces of his own and confirm whether his systems were making any attempt to connect. He promised to do so. I didn't hear from him for a while, and we continued to get reports through the print facility that the customer was complaining about the connection problems, so I reached out to Steve again.
I asked directly whether Steve had taken traces. He said he had. I asked if he had seen any irregularities in the traces on his side. He said he had not. This might have been an error on my part - maybe I should have asked whether he saw anything at all when the dropouts were occurring, the same "silence on the wire" evident in my traces, but I didn't think of it. I did ask if I could get copies of his traces to compare with mine, and Steve said he would share them, but they were never forthcoming - so to this day I do not know if he actually took any.
I reported my research to my management, along with my conclusion that the problem must be at the client's end, and that there was not likely anything we could do about it. Their response was to urge me to keep looking, and so I did. I took a dozen or more traces, all containing perfect, complete sessions, and usually also containing some of these weird silent periods.
I dug into the traces and started looking at everything that was there in the communications. Let's see - client SYN packet, source port 721, server port 515, server SYN-ACK, client ACK, some kind of "hello" packet from the client and a response from the server, print job streaming over, client FIN-ACK, server ACK, server FIN-ACK, client ACK. All perfect. Next job - identical except for the client source port, which is now 722 (and the sequence numbers, of course). And another - client side port 723, etc. It did occur to me that the client-side port numbers were almost TOO sequential - there were never any skips, say from 721 to 725 - which made me think the client's system must not have been too busy, or that it might be passing through some device that was altering the client ports on the way out of their network. I also thought most clients should pick their ephemeral ports from the range 1024 and above, but I wasn't too bothered by it.
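For anyone who has never looked at LPR/LPD on the wire, the whole exchange is simple enough to sketch. What follows is not the customer's code - just a minimal, hypothetical Python client speaking the RFC 1179 protocol, with made-up queue and file names, to show the shape of the sessions I was seeing:

```python
import socket

LPD_PORT = 515   # the server-side port in every trace

def send_job(server, queue, control, data, source_port):
    """Send one print job over a single LPR session (RFC 1179)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # The client-side port (721, 722, 723, ...) that kept catching my eye;
        # binding below 1024 requires root or administrator rights.
        s.bind(("", source_port))
        s.connect((server, LPD_PORT))

        def command(payload):
            s.sendall(payload)
            ack = s.recv(1)   # a single 0x00 byte from the server means "OK"
            if ack != b"\x00":
                raise RuntimeError("server refused: %r" % ack)

        # 02 = "receive a printer job" for the named queue (the "hello" exchange)
        command(b"\x02" + queue + b"\n")
        # Subcommands: 02 = control file, 03 = data file. Each is
        # "<cmd><length> <name>\n", then the file bytes, then a terminating 0x00.
        command(b"\x02" + str(len(control)).encode() + b" cfA001client\n")
        command(control + b"\x00")
        command(b"\x03" + str(len(data)).encode() + b" dfA001client\n")
        command(data + b"\x00")
    # Leaving the "with" block closes the socket - the clean four-way teardown.
```

Every job opens one such TCP session, and in the traces the client-side ports just marched upward, one session at a time.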
I looked at this for a while - several times, on various days, over several weeks. I couldn't find anything that I thought would help, and I slowly lost interest, especially as the rate of complaints dwindled. After a while I just didn't think about it any more. It wasn't exactly a matter of "giving up" - I thought I'd done a good job of isolating the issue to the customer's end, and lacking visibility into their network I just didn't see that I could do more to help.
For several months it dropped completely off my radar. I guess I should have known it would come back though, because unresolved issues never really go away. One day my manager and director both asked me to get back on the case. They had been contacted by the manager of the print and mail facility. The issue was still going on, the customer was complaining more than ever, and now the print facility was having to reboot the server on a regular basis to clear up the issue. To make matters more urgent, they were hoping to add another customer, but until this issue was resolved they were unable to take that next step.
I started by calling our print server operator - I wanted to know why we were rebooting the server. He told me that after countless "outages", the customer had requested that we try rebooting the server. This had been done, and the customer was then able to connect and send print jobs. Ever since then the customer had gotten into the habit of calling and requesting a reboot whenever they had trouble connecting.
I have to tell you that this made me really angry. The lack of logic involved here was staggering. I had clear, indisputable evidence that when the customer was "having trouble connecting", we weren't getting anything from them at all. I had evidence that every time we did receive a connection request, we answered appropriately - our server always responded. And in addition, our print server never failed to pick up and handle internal print jobs, which, by the way, were now being interrupted by these frequent reboots. The whole thing made no sense.
I then talked to Steve. He told me that since the problem had persisted, they (the customer's company) had started using a command-line utility to check the print server. They would fire up a command prompt on the system creating the print jobs and run a command that would connect to the print server and display the jobs. When "the problem" was occurring, the command-line utility would also be unable to connect. Their operators would sit there rerunning the utility every couple of minutes until they got a successful connection, and then try to restart the print jobs. Usually after this, they could print again, but sometimes not.
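I never learned exactly which utility they used (lpq.exe ships with Microsoft's TCP/IP printing components, so that would be my guess), but at the protocol level such a check is just one more short TCP session to port 515 - RFC 1179's "send queue state" command. A rough sketch, with placeholder server and queue names; which client-side port that session uses turns out to matter later in this story:

```python
import socket

def query_queue(server, queue):
    """Ask an LPD server for its queue state (RFC 1179 command 03, short form)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect((server, 515))
        s.sendall(b"\x03" + queue + b"\n")   # 03 = "send queue state (short)"
        chunks = []
        while True:
            chunk = s.recv(4096)
            if not chunk:                    # server closes when the listing is done
                break
            chunks.append(chunk)
    return b"".join(chunks).decode(errors="replace")

# Example (placeholder names):
# print(query_queue("printserver.example.com", b"rawq"))
```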
What I got from all of this was that the issue had some sort of time component to it, resolving itself within a few minutes. I was still convinced that the issue was on the customer's end - there was never any evidence otherwise. The server reboots simply gave the problem time to correct itself, which it would have done anyway. To test this theory, I advised the print server operator not to reboot the server any more. I suggested that he didn't have to tell the customer he wasn't rebooting the server - he could just say "OK, try again in a few minutes." He began doing this, and sure enough there was no difference in the behavior of the whole system. After a wait of a few minutes, their print jobs would start coming through.
Now that I had gotten the "reboot monkey" off our backs, I went back to the traces. I took a bunch of new ones and started going through the connections again and again, looking for anything out of the ordinary. They looked just like the ones from before - client sends SYN packet with client port 721 (or something similar), server side port 515, server sends SYN-ACK, etc. The connections were as perfect as ever. In fact, they looked so familiar that I began to wonder if I was looking at my old trace files. Nope, these were new. I pulled up some of the old original files to make sure, and it was at this point that I began to grasp the faint outlines of the problem.
The client side port numbers had always bothered me a little bit. Aside from the fact that they were all under 1024, the range of port numbers was always very consistent - and very small. The client side port numbers were always within the range 721 - 731. Eleven port numbers, always in succession, reused again and again. I would see a connection from port 728, 729, 730, 731, then it would loop back around to 721. And every so often, usually after a bunch of successful connections, silence on the wire.
I began to wonder if there was some issue with port exhaustion - this thing was using such a small pool of client-side ports. I wondered, how quickly is a client-side port allowed to be reused? I dug out my trusty copy of W. Richard Stevens' TCP/IP Illustrated, Volume 1 and found the TCP state diagram. I saw something called the "2MSL" wait state, which occurs before the socket is fully closed. The MSL is the "maximum segment lifetime," which the TCP standard defines as two minutes. The standards for the protocols we still use today were created back when computers and the Internet were MUCH slower, and back then there might be conditions on the network that could cause a packet to arrive late - very late indeed. Anyway, the standards also said that the partner in a TCP session which initiates an active close (by sending the first FIN) MUST then hold the connection for two times the maximum segment lifetime (2MSL) before it can consider the socket closed.
In other words, the connection doesn't truly close for four minutes after all the teardown messages have been exchanged. If you do a "netstat" command on a system you will often see sessions in something called the "TIME_WAIT" state. These are sessions waiting out the 2MSL period so the system can close them. Basically, from the time I observed the client FIN-ACK and other teardown packets, four minutes would have to elapse before the client-side port number would again be released to the operating system for reuse.
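You can watch this on just about any busy system. Here is a quick-and-dirty sketch that shells out to netstat (which exists on both Windows and Unix, though the column layouts differ) and counts the LPD-related sockets stuck in TIME_WAIT:

```python
import subprocess

def lpd_time_wait_count():
    """Count sockets sitting in TIME_WAIT that involve the LPD port (515)."""
    out = subprocess.run(["netstat", "-an"], capture_output=True, text=True).stdout
    # Crude string match: good enough for eyeballing, not for real monitoring.
    return sum(1 for line in out.splitlines()
               if "TIME_WAIT" in line and ":515" in line)

print("LPD sockets in TIME_WAIT:", lpd_time_wait_count())
```

Had anyone run something like this on the client during one of the "outages", those stuck sessions would have been sitting right there in the list.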
With respect to our customer's printing problem, the issue was now in pretty sharp focus. For some reason, the client was only using 11 port numbers (721-731). After 11 successive print jobs, if the timespan of those jobs was less than 4 minutes, all of the TCP sessions involved would be in the TIME_WAIT state. Until the oldest sessions completed the 2MSL wait, there would not be any available ports for new sessions. But why were they using such a small pool?
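The back-of-the-envelope arithmetic makes the ceiling clear (assuming the client performs the active close, which is what every trace showed):

```python
PORTS = 11                            # source ports 721 through 731
MSL_SECONDS = 120                     # maximum segment lifetime, two minutes
TIME_WAIT_SECONDS = 2 * MSL_SECONDS   # 240 seconds parked after each active close

# Each completed job ties up one port for TIME_WAIT_SECONDS, so the sustainable
# ceiling is one job roughly every TIME_WAIT_SECONDS / PORTS seconds.
print("sustained ceiling: one job every ~%.0f seconds" % (TIME_WAIT_SECONDS / PORTS))
print("burst capacity: %d jobs inside %d minutes, then silence on the wire"
      % (PORTS, TIME_WAIT_SECONDS // 60))
```

Eleven quick jobs in under four minutes, and the client simply cannot open another socket until the oldest session ages out - which lines up with the several-minute dropouts in the traces.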
The answer to this is in the RFC which defines the LPD/LPR services. RFC 1179 says that "The source port must be in the range 721 to 731 inclusive." To be honest, I didn't actually find the answer in the RFC - but some Googling led me to a Microsoft Knowledge Base document which described exactly the problem we were seeing, a problem that mainly occurred on a specific version of Windows Server, along with a suggested fix - a registry setting to make Print Services for Unix use standard ephemeral ports from the much larger pool above 1023. The document also noted that the command-line utility they were using drew its client ports from the same range, so if all the ports were tied up in TIME_WAIT sessions, the utility would fail to connect as well. In fact, whenever they DID connect with the utility, they were putting yet another available port into a 2MSL wait!
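I never saw their code, and the actual fix was a registry setting rather than a code change, but the effect of the fix is easy to show in socket terms. This is just an illustration of the two behaviors, not the vendor's implementation:

```python
import socket

def bound_socket(port):
    """Return a TCP socket bound to the given local port (0 = let the OS choose)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("", port))
    return s

# The troubled behavior: force the source port into the eleven-port 721-731
# range (binding below 1024 also requires administrator or root rights):
#   s = bound_socket(721)
# The fixed behavior: ask for port 0 and let the OS hand out an ephemeral
# port from the far larger pool above 1023.
s = bound_socket(0)
print("OS-assigned ephemeral port:", s.getsockname()[1])
s.close()
```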
I sent an email to Steve, asking what version of Windows Server was in use. He confirmed the affected version. I then sent an email telling him what I thought was happening - that his server was using a very limited range of client-side ports, that the speed and volume of print jobs was outpacing the system's ability to clear the sessions for reuse, resulting in client port exhaustion, and that there was a suggested fix involving a registry setting. I even sent a link to the Knowledge Base document.
Steve responded that he would look into it, but my interpretation of his response was that he wasn't sure he believed me. I couldn't do much about that - whenever you are dealing with another company, and that company is your customer, and when you are telling a peer engineer that you have remotely diagnosed a problem in his systems...well, there may be some resistance to the idea. So I waited to see what would happen.
What happened was exactly zilch - the problem persisted, and again we were being begged by our print facility manager to intervene. But this time, there really was nothing more we could do - except that I now understood I would have to force the issue with Steve. I wrote - to my management, and copied to Steve - an exhaustive (and at times pointed) accounting of the entire troubleshooting effort, including my early work and their (accurate) conclusions about the problem being in the customer's network, my difficulties getting information out of Steve, my work in understanding and putting a stop to the reboots, and finally my conclusion - backed up by sniffer traces and documentation from the customer's server OS vendor - that the problem was caused by port exhaustion. I included the Knowledge Base document for reference, and stated that my department was finished working the issue, once and for all.
Within a week of sending that email, Steve (or someone else at his company) had made the registry changes to their system, and the problem never surfaced again. It had taken six months, dozens of hours looking at traces, emails back and forth with an incompetent or unhelpful peer, a lot of pain and suffering on the part of our print facility, and research into another company's network and systems, but the problem was finally resolved.
Whenever I work a problem - especially when it's such a challenging and painful one - I always look for "lessons learned." This one was particularly fruitful:
- Persistence, persistence, persistence - over and over throughout my troubleshooting career, I have run into problems where it seems like I just stare at packet traces or logs until my head is about to burst, and then, like a ray of sunshine coming through the storm clouds, the solution appears. This case was somewhat rare in that there was a period where I accepted that the problem was "solved" even when it wasn't. I had felt that, by proving the problem was on the customer's end, my work was done. Figuring out when to stop, when enough is enough, is part of maturing as a troubleshooter, and I may do a blog post about that later - but in this case, my real customer was always the people at my company's print and mail facility, and until things were completely resolved, my work was not truly done.
- Troubleshooting a problem that exists on a foreign network is really hard, but NOT always impossible - this probably doesn't require much more explanation than what is available in the story above, but I've often seen network engineers focus on this sort of "us VS. them" strategy in problem solving. The idea is that if we can prove it isn't US, then it must be THEM, and we can't do anything more. A lot of times there is a bit of shaky logic employed, something like "Well, our printer works fine for everybody else, and we don't have any other VPN or Internet problems, so it's not us." Of course, Steve always insisted that they were not having problems sending print jobs to anywhere but us, which is possible if we were the only LPD server they were targeting. All of this may have been true - and in my case I had even better evidence from my traces that we weren't even receiving communications from the customer's network - but the fact remains that the answer was always right in front of me, in the packet traces I was so proud of analyzing.
- Dealing with "peers" on other teams or at other companies can be just as challenging as the technical act of troubleshooting - "Steve" is a prime example of something I've dealt with many times over the years. He was either unwilling to really look at his network and systems, or incompetent, or both. I believe that if he had actually performed network sniffer traces he would have noticed that there was NO communication coming from his print system during the outages, which would have led him to the same conclusion I had reached. The fact that this didn't happen, combined with his apparent unwillingness to share the traces he claimed to have taken, leads me to believe he never did them at all. Of course it's possible he did the traces but just didn't interpret them properly. I don't suppose I will ever know. I also strongly suspect that the Windows Server in question would likely have been writing event log messages regarding the connection problems, had anyone over there cared to take a look. But Steve was what I had to work with - I had to continually reach out to him, probe for information and prod for action, while trying not to upset or insult him, in order to finally get the action required. It was neither easy nor pleasant.
- No matter how much you know, there's always room for more - going into this problem I thought I knew the basics of socket communications pretty well. But it took far too long for me to notice the oddly low port numbers, or the small pool in use. There's a lot of detail in packets and packet traces, and it takes diligence to spot patterns like these.
And so ends the tale of the Silence On The Wire - hopefully you made it here to the end and found it worth coming along for the ride.