Hi Michal (and if there is anyone else remotely interested :-) )
After a lot more digging, I've now confirmed it to be a race condition in the handling of the read socket in stunnel, although not the one I originally suspected. The problem occurs when the remote program feeds some data to stunnel, then immediately closes the socket, i.e.:
    write(sock, buff, length);
    shutdown(sock, SHUT_WR);
    close(sock);
The write returns a valid count, and this looks like legal code to me - the client program should not have to wait around to confirm that data has been received at the other end before starting the shutdown, OK?
Looking at the way stunnel checks for incoming data, it polls the file descriptors and then takes action on the returned status - particularly POLLIN and/or POLLHUP. When the transfer succeeded, the poll status was just POLLIN, and stunnel then read the socket correctly. In the failure scenario, though, the returned status was (POLLIN|POLLHUP). Stunnel processed the POLLHUP case first, and marked the socket as closed without checking for any data still within it. Switching the order, so that any data is first unloaded from the socket and ONLY THEN is the socket marked as shut, fixes the problem. The read on the fd with the HUP state marked does work, as confirmed by the returned byte count, and the shutdown proceeds as normal after that has been processed.
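To make the ordering concrete, here is a minimal sketch of the idea in plain POSIX C. It is not the stunnel source and not my actual patch: the helper names drain_before_hup() and demo_write_then_close() are made up for illustration, and a local socketpair stands in for the real connection. The point is only that POLLIN is serviced before POLLHUP is acted on.

```c
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not stunnel code): read whatever the peer sent on fd,
 * servicing POLLIN before acting on POLLHUP. Returns the number of bytes
 * read before EOF/hang-up, or -1 on error or timeout. */
static ssize_t drain_before_hup(int fd, char *out, size_t cap)
{
    size_t total = 0;

    for (;;) {
        struct pollfd p = { .fd = fd, .events = POLLIN };

        if (total == cap)
            return (ssize_t)total;          /* caller's buffer is full */
        if (poll(&p, 1, 1000) <= 0)
            return -1;                      /* timeout or poll error */

        if (p.revents & POLLIN) {           /* unload data first */
            ssize_t n = read(fd, out + total, cap - total);
            if (n < 0)
                return -1;
            if (n == 0)
                return (ssize_t)total;      /* EOF: now safe to mark closed */
            total += (size_t)n;
            continue;                       /* poll again, more may be queued */
        }
        if (p.revents & (POLLHUP | POLLERR | POLLNVAL))
            return (ssize_t)total;          /* hang-up with nothing left to read */
    }
}

/* Reproduces the write/shutdown/close sequence on a local socketpair and
 * returns how many bytes the reading side still managed to collect. */
static ssize_t demo_write_then_close(void)
{
    int sv[2];
    char buf[64];
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (write(sv[0], "hello", 5) != 5)
        return -1;
    shutdown(sv[0], SHUT_WR);
    close(sv[0]);                           /* writer is gone before we poll */

    n = drain_before_hup(sv[1], buf, sizeof buf);
    close(sv[1]);
    return n;
}
```

If the POLLHUP test were moved above the POLLIN test, the loop would return with the written bytes still sitting unread in the socket, which is the failure I was seeing.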
I attach a log showing the desired and failure scenarios for both the original ordered code and the fixed version. I also attach a log of the diffs between the working code and the corresponding sources in Version 5.02. The extra debugging output I added to track down the problem is included, although it is clearly not strictly necessary. The version of the utility s_poll_value() for the non USE_POLL case is just a best guess, as my configuration appears to use the polling mechanism rather than select.
I've only tried to fix this particular problem. I don't know whether there is a similar scenario possible for the SSL side of the transfer as well - my knowledge of SSL is even scantier than my knowledge of sockets. I can't think of a similar problem on the write side...but again my knowledge of comms is purely on a need-to-know basis!
Graham
-----Original Message-----
From: Graham Nayler (work)
Sent: Monday, September 15, 2014 3:04 PM
To: stunnel-users@stunnel.org
Subject: [stunnel-users] Premature socket closure - race condition bug?
Dear All,
After a recent upgrade I'm currently experiencing intermittent problems with securing bidirectional comms traffic for a monitoring program with stunnel.
The system is: 70+ Client machines running BBWin on Windows (mostly 7) -> stunnel 5.01 .....internet......stunnel 5.02 -> Xymon running on 64-bit Linux Mint 17 (Virtual machine inside 2012 R2 Essentials server)
Prior to the recent upgrade, the server was an approx 3 year old 32-bit Ubuntu server, running stunnel 4.56. Comms then worked (mostly) fine for our client machines.
Since the upgrade, client requests for information from the server have been largely failing. Running the comms with direct unsecured socket connections works fine.
I've spent a bit of time over the last couple of days looking at the source for both Stunnel and BBWin and it looks to me as if there is a disconnect in understanding between BBWin and Stunnel as to how read and write connections work.
The BBWin Client makes the connection, then issues (in essence) the following sequence:
    send(connection, msg)
    shutdown(connection, SHUT_WR)
    do recv(connection....) until it returns zero or SOCKET_ERROR
    shutdown(connection, SHUT_RD)
    shutdown(connection, SHUT_BOTH)
    closesocket(connection)
i.e. the client shuts down the transmission side as soon as it's done, then shuts down the receive side only once it's finished receiving any returned data.
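For anyone who wants to play with this pattern outside BBWin, the following is a rough POSIX sketch of the same half-close sequence. It is not BBWin code: the function names request_with_half_close() and demo_half_close() are invented here, POSIX close() stands in for closesocket(), and a forked child on a socketpair plays the part of the server.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical client routine: send the request, half-close the sending
 * side, then keep receiving until the peer signals EOF. Returns the number
 * of reply bytes, or -1 on error. Closes fd before returning. */
static ssize_t request_with_half_close(int fd, const char *msg,
                                       char *reply, size_t cap)
{
    size_t len = strlen(msg), total = 0;
    ssize_t result = -1;

    if (send(fd, msg, len, 0) == (ssize_t)len &&
        shutdown(fd, SHUT_WR) == 0) {       /* done sending, still listening */
        for (;;) {
            ssize_t n;
            if (total == cap) { result = (ssize_t)total; break; }
            n = recv(fd, reply + total, cap - total, 0);
            if (n < 0)
                break;                      /* receive error */
            if (n == 0) { result = (ssize_t)total; break; }  /* peer is done */
            total += (size_t)n;
        }
    }
    close(fd);
    return result;
}

/* Demo: a forked child acts as the server. It reads the request until EOF
 * (the client's half-close), then replies and closes. */
static ssize_t demo_half_close(void)
{
    int sv[2];
    char buf[64], reply[64];
    pid_t pid;
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                         /* child: stand-in server */
        close(sv[0]);
        while (read(sv[1], buf, sizeof buf) > 0)
            ;                               /* consume request until EOF */
        if (write(sv[1], "pong", 4) != 4)
            _exit(1);
        close(sv[1]);
        _exit(0);
    }
    close(sv[1]);                           /* parent: client */
    n = request_with_half_close(sv[0], "ping", reply, sizeof reply);
    waitpid(pid, NULL, 0);
    return n;
}
```

The server only learns that the request is complete from the client's half-close, and the client only learns that the reply is complete from the server's close, so anything in between that drops the return path early loses the reply.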
I attach Stunnel logs for both client and server for both failed and successful transfers. I've added a little more debugging output to the server Stunnel instance to display the data being read and written to both the socket and SSL side of the comms. This shows that the only difference between the two is that in the successful transfer the server receives and passes on data from the socket to SSL before starting the shutdown. So that looks fine - when it works. But when it doesn't, it looks just as if the return path is shut down before the server app has had time to retrieve the data to be returned.
Looking at the stunnel code though, I'm confused - and my suspicion is that stunnel (on the client machine) is closing the SSL connection prematurely. It looks as if it issues the SSL_shutdown command (client.c line 855) if:
    - it has not already sent the shutdown
    - the read fd on the socket is closed
    - there's nothing left in the outbound queue (sock_ptr is 0)
    - and SSL wants a retry (? I don't yet understand write_wants_write usage).
That's all very well for closing the outbound side, but what about the inbound? Surely it should keep the SSL open until either it's notified by the other side that everything is closed down there, or BOTH read and write on the socket side have been shut down.

A further point of confusion is that the stunnel code handles read and write fds for each of SSL and socket independently, but for most cases they are both set to the same value. Is there some confusion about handling s_poll_hup()? I freely admit I don't understand fully how this works, as I've only had a day's experience of this comms stuff, and it looks pretty well thought out, but there's logic here for handling the inbound and outbound sides independently, so SSL should remain open while either of those two channels remains active?
The bottom line is that the comms:
    a) works reliably when not routed through stunnel
    b) works reliably to transmit from client to server
    c) now works (in reception) less than 10% of the time when using stunnel - but does work occasionally
It worked fairly reliably with 5.01 on the clients and 4.56 on a slowish server, and now doesn't with 5.02 on a highish spec server, with client software/hardware unchanged. My suspicion is that improving the spec on the server has exposed a race condition on the client installation.
Any thoughts?
Graham Nayler
_______________________________________________ stunnel-users mailing list stunnel-users@stunnel.org https://www.stunnel.org/cgi-bin/mailman/listinfo/stunnel-users