Monday, August 23, 2010

To keep alive or not to keep alive

Networks are what make computers cool to me. Without a network a computer is like a car without wheels (perhaps not that bad, but close). Ever since I wrote my first dumb networking code, it still blows my mind that I can send data from one computer to another. One thing I have learned over my short programming career is that writing good networking code turns out to be hard. Operations on the local computer can and do fail, but for the most part it is not too hard to check for the error and do something about it. Networks, on the other hand, tend to be very asynchronous and can fail in all kinds of fun ways.

The Internet, as I understand it, is based on the IP protocol. IP is built on the idea of best-effort delivery, which has proven to be a powerful idea. It also means that just about anything can happen and the endpoints of the network will have no idea. In the case of TCP, one end of the connection can tell the other that it is going to close the connection. That happens when the process owning the connection is killed or the computer is shut down cleanly. But if one computer dies in a blaze of glory, or the connection breaks (someone pulls the Ethernet cable out to use in their laptop), the other end never finds out what happened. TCP is just a pseudo connection after all.

Enter the keepalive. The idea is simple: send some traffic over the connection and make sure a response comes back in a reasonable amount of time. The question is what level of the networking stack this should happen at. So far I have done it at the application level: in the protocols I make, I have a ping/pong message that I send so both sides know the connection is up. It turns out it can also be done at the TCP level. It seems most OSes can send a TCP segment with no data that the other side still has to acknowledge. In Linux there are a number of settings that control how long the connection sits idle before it is tested, the number of times to retry, and the amount of time between retries.
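For reference, here is a minimal sketch of what turning the OS keepalive on might look like from Go, assuming the net package exposes the keepalive knobs (the host and timings here are made up). On Linux the kernel defaults behind this live in net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, and net.ipv4.tcp_keepalive_probes.

package main

import (
	"log"
	"net"
	"time"
)

func main() {
	conn, err := net.Dial("tcp", "example.com:9000") // hypothetical peer
	if err != nil {
		log.Fatal(err)
	}
	tcp := conn.(*net.TCPConn)

	// Ask the OS to probe the idle connection with empty keepalive segments.
	if err := tcp.SetKeepAlive(true); err != nil {
		log.Fatal(err)
	}
	// How long the connection can sit idle before the kernel starts probing
	// (on Linux this also controls the spacing between probes); the retry
	// count still comes from the system-wide default.
	if err := tcp.SetKeepAlivePeriod(30 * time.Second); err != nil {
		log.Fatal(err)
	}

	// Use conn as normal; if the peer vanishes, a later read or write
	// eventually fails instead of hanging forever.
}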

I have yet to test this out compared to the application-level stuff I have done, but I have high hopes. The application-level ping/pong does provide more feedback about what is happening, and it is up to the application what to do about it. But the ping/pong approach has a number of issues. The first is that it makes the application code more complex. The application not only has to handle the case of the connection closing, it has to have timers and other code to decide the connection is dead and should be aborted. The second, and perhaps most important, is that if the application is busy sending a lot of data, the ping/pong message might not get sent within the timeout. That means one side of the connection might decide the connection is dead just because the other side was too busy. Now, this "problem" only happens if you depend solely on the ping/pong messages to reset your timers, so I think it can be worked around (any traffic on the connection should count as proof it is alive). But it is still messy, in the code and in the protocol. And what if you are not the one making the binary protocol in the first place?
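To make the ping/pong idea concrete, here is a rough sketch of the kind of thing I mean (the message format, address, and timeouts are all made up). One goroutine writes a ping on a timer, and a read deadline is what finally decides the connection is dead; note that any incoming traffic, not just pongs, pushes the deadline forward, which is the workaround for the busy-connection problem above.

package main

import (
	"bufio"
	"log"
	"net"
	"time"
)

// keepAlive sends "PING\n" every interval and treats the connection as dead
// when nothing at all (ping, pong, or real data) arrives within timeout.
func keepAlive(conn net.Conn, interval, timeout time.Duration) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for range t.C {
			if _, err := conn.Write([]byte("PING\n")); err != nil {
				return // write failed; the read loop will notice too
			}
		}
	}()

	r := bufio.NewReader(conn)
	for {
		conn.SetReadDeadline(time.Now().Add(timeout))
		line, err := r.ReadString('\n')
		if err != nil {
			conn.Close() // deadline passed or connection gone: abort it
			return
		}
		if line == "PING\n" {
			conn.Write([]byte("PONG\n")) // answer the other side's probe
		}
		// Real protocol messages would be handled here; anything received
		// counts as proof of life since the deadline is reset every loop.
	}
}

func main() {
	conn, err := net.Dial("tcp", "example.com:9000") // hypothetical peer
	if err != nil {
		log.Fatal(err)
	}
	keepAlive(conn, 10*time.Second, 30*time.Second)
}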

To end the rant: if the OS can be set to keep the connection alive, and it works well, I say let it. Now, all of this assumes a long-lived TCP connection. UDP has its own issues, which I have yet to try to work with. I think UDP puts managing the connection back at the application level.

(Update)
After more thought, I think the most flexible design is to use a ping/pong at the application level, even if it is more work to code. It provides more portability and tests for more kinds of failure. Now, if you are just writing a bunch of server-to-server code with something like Google Go, it might work out to have the OS handle it (assuming Linux).