Hello again.
We recently noticed a behavior in stunnel that surprised us and that seems to be inconsistent with the man page description. We use "failover = prio" in our configuration file. To refresh everyone's memory, the man page describes this option as follows:
failover = rr | prio
    Failover strategy for multiple "connect" targets.

    rr (round robin) - fair load distribution
    prio (priority) - use the order specified in config file

    default: rr
Our config file contains a list of connect targets, like this:
[...]
connect = h1.example.com
connect = h2.example.com
connect = h3.example.com
connect = h4.example.com
[...]
Based on the description, we expect each connection to try h1 first and, if h1 fails, to fail over through the remainder of the list in order. However, what we observe is that connections start at a seemingly random place in the list. They do appear to proceed in order; it is only the starting place that seems incorrect. For example, we might get a sequence that proceeds "h3->h4->h1->h2".
Please let me know if I am misunderstanding the intended behavior of the 'prio' failover strategy.
Here is a trial patch that hard-codes the starting place for the failover to the beginning of the list. I am not sure if this change has any unintended consequences for other configurations. But as a proof-of-concept, it does seem to fix the behavior to be consistent with our reading of the man page. This patch is against 5.20 but it appears that 5.21b2 acts the same way.
Thanks!
Michael
diff -ru stunnel-5.20.orig/src/client.c stunnel-5.20-prio-fix/src/client.c
--- stunnel-5.20.orig/src/client.c 2015-07-16 15:55:52.213064746 -0700
+++ stunnel-5.20-prio-fix/src/client.c 2015-07-17 16:13:12.129184507 -0700
@@ -1285,7 +1285,8 @@
         *c->connect_addr.rr_ptr=(i+1)%c->connect_addr.num;
         s_log(LOG_INFO, "failover: round-robin");
     } else {
-        s_log(LOG_INFO, "failover: priority");
+        s_log(LOG_INFO, "failover: priority, starting at first entry");
+        i=0;
     }
     return i;
 }
On 21.07.2015 00:09, Michael Gebis wrote:
> Please let me know if I am misunderstanding the intended behavior of the 'prio' failover strategy.
You understand it correctly. The "failover = prio" feature is indeed broken since stunnel 5.15. Your fix is also correct: the rr_val/rr_ptr values should not be used for the 'prio' strategy.
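For anyone following along, here is a minimal sketch of the intended selection logic. It is not stunnel's actual code: the names struct targets, rr_next, and first_target are hypothetical stand-ins for the rr_ptr bookkeeping in client.c. It only illustrates that 'rr' should consult and advance the shared cursor, while 'prio' should ignore it and always start at index 0.

```c
/* Hypothetical illustration only; these types and names are assumptions,
 * not taken from the stunnel sources. */
enum failover { FAILOVER_RR, FAILOVER_PRIO };

struct targets {
    unsigned num;      /* number of "connect" targets (must be > 0) */
    unsigned rr_next;  /* shared round-robin cursor */
};

/* Return the index of the first target a new connection should try. */
static unsigned first_target(struct targets *t, enum failover mode) {
    unsigned i = 0;    /* prio: always start at the first entry */

    if (mode == FAILOVER_RR) {
        i = t->rr_next % t->num;        /* start where the cursor points */
        t->rr_next = (i + 1) % t->num;  /* advance it for the next connection */
    }
    return i;          /* prio leaves the cursor untouched */
}
```

With four targets and the cursor at 2, this sketch returns 2 for 'rr' (and moves the cursor to 3) but returns 0 for 'prio', which matches the h1->h2->h3->h4 order the man page describes.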
Thank you for reporting the bug. It will be fixed in the next release.
Mike