With the normal TCP/IP setup it can take up to two hours for a dropped connection to terminate — the Samba project faced this problem when creating clustered Samba.

In a video interview at linux.conf.au in Melbourne, the founder of the Samba project, Andrew Tridgell, explained how Samba tackled the problem of node failure in a cluster.

“The problem is that the client doesn’t know it’s happened — the client is waiting for a reply from the previous server and doesn’t know the new server has taken over. It can take up to 2 hours with normal TCP setup for what’s called a keep-alive packet to kick in and cause the connection to reset.”
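For context, the two-hour figure matches the conventional default keep-alive idle time of 7200 seconds. The Python sketch below shows how an application could shorten that timer on its own socket. It is only an illustration of the keep-alive mechanism Tridgell mentions, not what clustered Samba does, and the function name and timer values are assumptions chosen for the example.

```python
import socket

# Conventional default idle time before the first keep-alive probe: two hours.
DEFAULT_IDLE_S = 7200


def enable_keepalive(sock, idle_s=60, interval_s=10, probes=5):
    """Turn on TCP keep-alive and, where the platform allows, shorten its timers.

    The timer values are illustrative assumptions, not recommendations.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket options are platform specific, so each is applied
    # only if this Python build exposes it.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),
        ("TCP_KEEPINTVL", interval_s),
        ("TCP_KEEPCNT", probes),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keep-alive enabled:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

Shortening the timer this way requires changing every client, which is why a server-side fix is attractive for a cluster.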

Clustered Samba solves this problem with a “tickle ACK”: an exchange of acknowledgement packets that lets the replacement node issue a proper reset packet.

The “tickle ACK” mechanism is necessary because the reset packet needs a valid sequence number to be obeyed — an invalid reset packet is ignored. The catch is that only the client and the failed node know the correct sequence number, and this is where the “tickle ACK” proves useful.

Since every node knows the connections on every other node, a node that takes over can send the client an acknowledgement packet with an invalid sequence number. The client responds with an acknowledgement packet carrying the correct sequence number, which the new node then uses to issue a valid reset.
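The exchange described above can be sketched as a toy simulation in Python. This is not Samba's code: the `Segment` and `Client` classes and the `take_over` function are hypothetical names invented for illustration, and the model simplifies real TCP by accepting a reset only when its sequence number matches exactly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    flags: str    # "ACK" or "RST" in this toy model
    seq: int      # sender's sequence number
    ack: int = 0  # next sequence number the sender expects from its peer


class Client:
    """Toy client that still believes its connection to the failed node is up."""

    def __init__(self, snd_nxt: int, rcv_nxt: int):
        self.snd_nxt = snd_nxt  # next sequence number the client will send
        self.rcv_nxt = rcv_nxt  # next sequence number it expects from the server
        self.connected = True

    def receive(self, seg: Segment) -> Optional[Segment]:
        if seg.flags == "RST":
            # A reset is obeyed only if its sequence number is valid;
            # an invalid reset is silently ignored.
            if seg.seq == self.rcv_nxt:
                self.connected = False
            return None
        if seg.flags == "ACK" and seg.seq != self.rcv_nxt:
            # Unexpected sequence number: answer with an ACK that
            # reveals the connection's real state.
            return Segment("ACK", seq=self.snd_nxt, ack=self.rcv_nxt)
        return None


def take_over(client: Client) -> bool:
    """Replacement node: send a tickle ACK, learn the sequence number, reset."""
    # The new node does not know the sequence numbers, so it sends an ACK
    # with an arbitrary (almost certainly invalid) one.
    reply = client.receive(Segment("ACK", seq=0, ack=0))
    if reply is None:
        return False
    # The client's ack field is exactly the sequence number a reset needs.
    client.receive(Segment("RST", seq=reply.ack))
    return not client.connected


if __name__ == "__main__":
    client = Client(snd_nxt=1000, rcv_nxt=5000)
    client.receive(Segment("RST", seq=0))  # blind reset: ignored
    print("connected after blind reset:", client.connected)
    print("reset after tickle ACK:", take_over(client))
```

In the sketch, the blind reset leaves the client connected, while the reset sent after the tickle ACK closes the connection immediately, so the client can reconnect without waiting for a keep-alive timeout.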

“The end result is that you can flick a service backwards and forwards between nodes in a cluster incredibly fast,” said Tridgell.

“Which, for a person like myself who enjoys dealing with TCP packages and low level networking, is really great fun.”