Some time ago I wrote a rather lengthy post about an old case where I had to troubleshoot a difficult problem with print jobs failing - The Case of the Silence on the Wire. Recently I have had to look at a problem that carried some of the same baggage. An external customer is printing to a print server at our print facility, over a VPN, and seeing some issues.
This is a LPD/LPR setup, with our server listening on TCP port 515 and the customer's system using the standard client side ports 721-731 (see https://www.ietf.org/rfc/rfc1179.txt). When the issue was reported a few weeks ago I didn't really see anything I could put my finger on. Our server response time (as calculated by my analysis software) was a little slow, and there were a few retransmissions from the client, but overall the connections looked healthy enough - TCP three-way handshakes looked ok, data being transferred with our server acknowledging, proper FIN exchanges at the end. At least that's what I saw the first two or three times we were contacted to check it out.
Today I was asked to take another look - the problem was reported to have occurred between 7:00AM and 9:00AM on June 23rd. I grabbed a capture off our sniffers and took a look. The sniffer is capturing everything both on the inside interface of our firewall and on the interface that goes to our private extranet connections, including VPNs, so I had a sort of "double trace", with one copy of the traffic showing the connection to the real internal IP address of our server, and the other showing the connection to the external NAT address.
Looking at a list of connections that took place during the time in question, I saw a bunch of connections that looked more or less like what I described above, but today I noticed something different - there was a connection listed that looked really small on the packet count. This was sourced from client-side port 722. I filtered on the trace and saw incoming SYN packets, but our server wasn't responding with SYN/ACKS - it was responding with plain ACK packets, and the ACK numbers weren't correct for the incoming SYNs.
Now, first guesses aren't always right and you must take care to check things thoroughly. On the other hand, when you've been working with a particular system as long as I've been working with TCP communications, you can sometimes get pretty close to the mark. In this case, I wondered whether the server was responding to an old connection - some previous connection sourced from port 722 that didn't close properly, so the server was still trying to reply using ACK numbers for that old connection.
I began working my way backwards through what our sniffer had captured - all the way back to about 6:35AM - and every time they tried connecting from port 722, we were just sending these ACKs. In all cases the ACK numbers we were sending back were the same (at least on the packets taken from the inside interface of the firewall), and none of them were anywhere close to being correct for the SYN packets. Earlier than that there were no connections on June 23rd.
I shifted focus to my Netflow tool. Hunting for connections out of the vast number captured on a big capture box being fed by multiple taps can be really difficult, but Netflow boils down connection information to the essentials. My Netflow records indicated that communications from this client had ceased a little after 10:00PM on the night of the 22nd, and further that there had been a connection from client port 722 that had enough packets in it to have been viable. With that information I delved back into the sniffer to find that connection.
It was exactly what I was looking for. The connection had started off fine, with a good three-way handshake, and for a time had proceeded normally - client sending data, server sending ACKs. At the end, the client send a final packet with data in it and with the PUSH flag set indicating the server should process it and acknowledge, which it did. I verified that this last ACK got through our firewall, as it appeared in the trace taken on the outside. After that the client didn't send anything further for about 25 seconds, no further data and no FINs - after which, it sent a new SYN packet from client port 722.
This packet did not get through the firewall - the firewall was monitoring state and still thought the old connection was open (the firewall has an idle timer of 1 hour for TCP connections in the state table). The firewall - apparently - sent an ACK using the last valid sequence numbering from the previous connection. The client resent the SYN several times, and the firewall sent those ACKs each time, never letting the SYN through to our server. After a while the client gave up and moved on to a new port number.
The difference between this and what we were seeing in the morning was this - by the time the client got around to starting up again on the 23rd, the firewall had forgotten about the old connection, removing it from the state table, so it was now allowing the SYN packets through to the server. But the server still had the old connection open (more than 8 hours later!!!) and was still sending ACKs for the sequence numbering of the old connection from the night before.
There were two interesting things about these traces that revealed something I'd never seen before, and which challenge my assumptions about how much I know about this particular brand of firewall. First, I've never seen the firewall send an ACK of its own to a connection like we saw on the final connection on the night of the 22nd. At that time the firewall was not letting the SYNs through and my traces on the inside interface of the firewall confirm this - the print server was not getting them and was not sending the ACKs, yet I could see ACKs on the outside interface of the firewall. As near as I can tell the firewall had to be sending them.
Second, in my experience this particular type of firewall is very strict about keeping track of state, and I would not have expected it to allow those ACKs from the server the next morning - once the firewall was letting the SYNs through it should have been watching for SYN/ACKs from the server, and also watching to make sure the ACK numbers were correct. Instead, it was letting those ACKs go right through. I am thinking maybe the firewall is programmed to do this in case there are out-of-order packets on the wire, but it still seems a little freaky and I'm going to have to read up on it.