Found the nastiest Enomalism bug yet!

Wow. That was a painful week and a half.

Periodically, Enomalism would hang, and I could not find out why. The system would fill up with CLOSE_WAIT sockets, and Enomalism would fall over (not to be confused with the "Apache is stupid" bug, which is a separate problem that has been remedied with nginx. Trust the Russians to produce a nearly unknown, indestructible, high-quality web server with all of lighttpd's functionality, but without the memory leaks, and in half the RAM).

Note to Derek: Brackets considered harmful, especially when they contain run-on sentences.

So, right, the bug again... When Enomalism starts and stops machines, and has the SimpleFirewall (note to self, bad name) module installed, iptables is called via subprocess, and this shuts down the firewall for that machine (or starts it). Starting is no problem, since we perform a series of operations out of sync, but stopping is more problematic. When we stop the machines, iptables is called with -F (to flush the filter chain), -D (to remove the jump to the filter chain), and then -X (to delete the old filter chain). The problem is that sometimes iptables hangs indefinitely, which blocks the next filter chain and, worse, leaves us stuck in a futex_wait, waiting for the subprocess to complete. The solution? Not sure yet. I am sure that I will be able to work around it nicely, but the sequential blocking behavior is necessary to prevent early shutdown of the firewall, and also to do the operations without stepping on each other's toes. Perhaps a clever del P in the right place? More details in a couple of days!
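
For the curious, here is a rough sketch of one possible workaround. It is not the actual SimpleFirewall code, and not necessarily the fix I will end up with. It keeps the -F, -D, -X calls strictly sequential, but puts a timeout on each one so a stuck iptables gets killed instead of blocking forever. The function names, the FORWARD parent chain, and the 10 second timeout are all made up for illustration, and it assumes Python 3.3 or later for the timeout argument to wait().

```python
import subprocess


def run_with_timeout(cmd, timeout=10.0):
    """Run cmd and return its exit code, or None if it had to be killed."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()  # reap the child so it does not linger as a zombie
        return None


def teardown_commands(chain, parent="FORWARD"):
    """Build the flush / unlink / delete sequence for one machine's chain.

    The parent chain name is an assumption; use whichever chain
    actually holds the jump rule.
    """
    return [
        ["iptables", "-F", chain],                # flush the filter chain
        ["iptables", "-D", parent, "-j", chain],  # remove the jump to it
        ["iptables", "-X", chain],                # delete the empty chain
    ]


def run_sequence(commands, timeout=10.0):
    """Run commands strictly in order, stopping at the first failure or hang."""
    results = []
    for cmd in commands:
        rc = run_with_timeout(cmd, timeout)
        results.append((cmd, rc))
        if rc != 0:  # non-zero exit, or None because we killed it
            break
    return results
```

Calling run_sequence(teardown_commands("some-chain")) as root would then either finish all three steps or give up at the first one that fails or hangs, leaving it to the caller to decide whether to retry. The blocking order is preserved, so nothing steps on anything else's toes; the only difference is that the wait is no longer unbounded.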
