read(7, "HTTP/1.0 200 OK\r\nContent-Type: text/plain\r\nContent-Length: 56391\r\n\r\nXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"..., 18432) = 18432
poll([{fd=7, events=0}, {fd=3, events=POLLIN|POLLOUT}], 2, 43200000) = 2 ([{fd=7, revents=POLLHUP}, {fd=3, revents=POLLOUT}])
read(7, "", 0) = 0
sendto(6, "<31>Feb 5 15:08:44 stunnel: LOG7[20119:0]: Socket closed on read", 65, MSG_NOSIGNAL, NULL, 0) = 65
write(2, "2013.02.05 15:08:44 LOG7[20119:0]: Socket closed on read\n", 57) = 57

Here read() was being called with a size of zero, which according it the man page should return 0 and have no other results. I added a and additional guard to the read from socket portion of transfer() and have not been able to reproduce the truncated response issue yet. Unfortunately the transfer loop watchgaurd is still occurring.

Dustin Lundquist

Blue Box Group

On Mon, Feb 4, 2013 at 5:39 PM, Stephen Balukoff <sbalukoff@bluebox.net> wrote:

Hello, y'all!

I've recruited the help of my colleague Dustin Lundquist, and he's
been able to create a reproducible test case for this bug, which will
hopefully shed some more light on this. We're specifically seeing two
error behaviors which we haven't yet been able to determine whether
they're related. These are:

1. transfer() loop error, generating an error code telling us to
contact Michal.Trojnara@mirt.net
2. truncated response from the web server (always at byte 46693, for
whatever reason).

These errors can occur independently or simultaneously, and
occasionally not at all on different runs of the same script on the
same machine. When we attempt to strace the stunnel process, the
problem goes away entirely, which screams to me of a timing issue or
race condition. In any case, it's a bit of a heisenbug, which is
annoying in and of itself.

And, again, these go away when we communicate to the back-end web
server process using TCP sockets instead of unix sockets.

Attached are output from the four permutations of the above bug, as
well as the sample perl script used to generate them.

If anyone has any ideas of how we might go about fixing this one, or
knows what significant difference between TCP and unix sockets might
be causing the error here, we'd love to hear it!

Stephen

On Thu, Jan 24, 2013 at 6:18 PM, Stephen Balukoff <sbalukoff@bluebox.net> wrote:
> Hello!
>
> Recently we started seeing some erroneous behavior in one of our load
> balancer configurations in which we're using stunnel for SSL tunnel
> termination with web clients. Specifically, the setup looks like the
> following:
>
>
> Web client ---(SSL)---> stunnel ---(unix socket)--> haproxy
> ---(tcp/http)---> web server
>
> After extensive troubleshooting we were able to determine that when
> the following parameters are present:
>
> * Client forces non-keepalive connection
> * Client generates a POST request, the response to which (generated by
> the back-end web server) is greater than about 8k
> * stunnel and haproxy communicate via unix socket
>
> ...then most responses end up being truncated by an extraneous TCP RST
> packet from stunnel, and the following log lines are generated:
>
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> transfer() loop executes not transferring any data
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: please
> report the problem to Michal.Trojnara@mirt.net
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: stunnel
> 4.54 on x86_64-unknown-linux-gnu platform
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> Compiled/running with OpenSSL 1.0.0-fips 29 Mar 2010
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> Threading:PTHREAD SSL:+ENGINE+OCSP+FIPS Auth:none Sockets:POLL+IPv6
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> protocol=TLSv1, SSL_pending=0
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> sock_open_rd=n, sock_open_wr=Y
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> SSL_RECEIVED_SHUTDOWN=n, SSL_SENT_SHUTDOWN=Y
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> sock_can_rd=Y, sock_can_wr=n
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> ssl_can_rd=n, ssl_can_wr=n
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> read_wants_read=Y, read_wants_write=n
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> write_wants_read=n, write_wants_write=n
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]:
> shutdown_wants_read=n, shutdown_wants_write=n
> Jan 23 15:24:17 ds420a stunnel: LOG3[22586:139835614443264]: socket
> input buffer: 0 byte(s), ssl input buffer: 0 byte(s)
>
> The work arounds we have found are either:
>
> 1. Make stunnel and haproxy communicate over a TCP socket bound to
> the loopback interface (less ideal, as then we're limited to 65k
> connections)
> 2. Add the "reset=no" parameter to our stunnel.conf (which seems to
> fix the problem, but makes me wonder if we're not opening ourselves up
> to other problems-- like really getting stuck in an infinite loop
> that that watchdog is intended to avoid.)
>
> In any case based on what "reset=no" is supposed to do as well as
> where that first line of the error output above is being generated in
> the code, it's clear that the main client data transfer loop in
> client.c (starting line 519) is being called too often with no data to
> transfer, which triggers the infinite loop detection in the transfer()
> function (client.c, starting line 832).
>
> My best hypothesis at this point is that the transfer() code was
> written only with TCP sockets in mind, and that something about the
> unix socket in use here is triggering that loop too often. Maybe
> s_poll_wait on line 541 is being called with parameters which are
> right for a TCP socket but inappropriate for a unix socket? At this
> point though, we're at the limit of my socket programming ability, so
> I'm a little stuck. (Would like to have simply fixed this and
> submitted a patch... but cest la vie, eh.)
>
> We'd like to avoid using a TCP socket to communicate between stunnel
> and haproxy simply because it's a little less efficient, and because
> there's an artificial 65k limit on the number of back-end connections
> we can support per loopback IP if we do that. Alternatively, if
> someone could verify that "reset=no" is an acceptable / safe work
> around in this case, I'd appreciate that (though, of course, I'd
> prefer my logs not to get spammed on every connection. :P )
>
> Is there any other data I can provide to help troubleshoot and fix this?
>
> Thanks,
> Stephen
>
>
>
> --
> Stephen Balukoff
> Blue Box Group, LLC
> (800)613-4305 x807

--
Stephen Balukoff
Blue Box Group, LLC
(800)613-4305 x807

_______________________________________________
stunnel-users mailing list
stunnel-users@stunnel.org
https://www.stunnel.org/cgi-bin/mailman/listinfo/stunnel-users