TL;DR

I was having trouble with socket connections timing out reliably. Sometimes, my timeout would be reached. Other times, the connect would fail after three to six seconds. I finally figured out it had to do with trying to connect to a routable, non-localhost address. This function is what I finally ended up with that reliably connects to a working server, fails quickly for a server that has an address/port that is not reachable and will reach the timeout for routable addresses that are not up.

I have put a version of my final function into a Gist on Github. I hope someone finds it useful.

Full Story

So, it seems that when you try and connect to an IP that is routable on the network, but not answering, the TCP stack has some built in timeouts that are not obvious. This differs from trying to connect to an IP address that is up, but not listening on a given port. We took a Gearman server down for maintenance and I noticed our warning logs were showing a 3 to 7 second delay between the attempt to queue jobs and the warning log. The timeout we had set was only 100ms. So, this seemed odd.

After a lot of messing around, a coworker pointed out that in production, the failures were happening for an IP that was routable on the network, but that had no host listening on the IP. I had been using localhost and some foreign port for my "failed" server. After using an IP that was local to our LAN but had no host listening on the IP, I was able to recreate it on a dev server. I figured out that if you set the send and receive timeouts really low before calling connect, you can loop while calling connect. You check the error state and timeout. As long as the error is an acceptable one and the timeout is not reached, keep trying until it connects. It works like a charm.

I found several similar examples to this on the web. However, none of them mixed all these techniques.

You can simply set the send and receive timeouts to your actual timeout and it will return quicker. However, the timeouts apply to the packets. And there are retry rules in place. So, I found that a 100ms timeout for each send and receive would wind up taking 500ms or so to actually fail. This was not what I wanted. I wanted more control. So, I set a 100 microsecond timeout during connect. This makes socket_connect return quickly. As long as the socket error is 115 (in progress) or 114 (already trying), we keep calling it. Unless of course our timeout is reached. Then we fail.

It works really well. Should help for doing server maintenance on our Gearman servers.