Obviously stunnel problems are like London buses - you wait years for one, then two come along inside a week ;-)
Having fixed my server-side problems by installing 4.14b1, I now have a problem on the client side. Apologies for the voluminous diagnostics below, but hopefully they will help make the problem clear.
I am running Apache + mod_jk, stunneling to a remote JBoss server. The client is
stunnel 4.14 on i686-pc-linux-gnu UCONTEXT+POLL+IPv4 with OpenSSL 0.9.7i 14 Oct 2005
When things are running normally, everything is sweetness and light. However, if I 'kill -HUP' the Apache server (eg to do a log rotation), the mod_jk connection to stunnel gets closed, and stunnel goes into a loop. If I enable DEBUG_UCONTEXT in network.c, and add a few lines of diagnostics:-
$ diff -c network.c.orig network.c.loop *** network.c.orig Thu Oct 20 08:37:47 2005 --- network.c.loop Thu Oct 27 13:07:10 2005 *************** *** 31,37 **** #include "common.h" #include "prototypes.h"
! /* #define DEBUG_UCONTEXT */
#ifndef USE_WIN32 static int signal_pipe[2]={-1, -1}; --- 31,37 ---- #include "common.h" #include "prototypes.h"
! #define DEBUG_UCONTEXT
#ifndef USE_WIN32 static int signal_pipe[2]={-1, -1}; *************** *** 156,167 **** for(i=0; i<ctx->fds->nfds; i++) { ctx->fds->ufds[i].revents=ufds[nfds].revents; #ifdef DEBUG_UCONTEXT ! s_log(LOG_DEBUG, "CONTEXT %ld, FD=%d, (%s%s)->(%s%s)", ctx->id, ufds[nfds].fd, ufds[nfds].events & POLLIN ? "IN" : "", ufds[nfds].events & POLLOUT ? "OUT" : "", ufds[nfds].revents & POLLIN ? "IN" : "", ! ufds[nfds].revents & POLLOUT ? "OUT" : ""); #endif if(ufds[nfds].revents & (POLLIN|POLLOUT)) ctx->ready++; --- 156,170 ---- for(i=0; i<ctx->fds->nfds; i++) { ctx->fds->ufds[i].revents=ufds[nfds].revents; #ifdef DEBUG_UCONTEXT ! s_log(LOG_DEBUG, "CONTEXT %ld, FD=%d, (%s%s)->(%s%s%s%s%s)", ctx->id, ufds[nfds].fd, ufds[nfds].events & POLLIN ? "IN" : "", ufds[nfds].events & POLLOUT ? "OUT" : "", ufds[nfds].revents & POLLIN ? "IN" : "", ! ufds[nfds].revents & POLLOUT ? "OUT" : "", ! ufds[nfds].revents & POLLERR ? "ERR" : "", ! ufds[nfds].revents & POLLHUP ? "HUP" : "", ! ufds[nfds].revents & POLLNVAL ? "NVAL" : ""); #endif if(ufds[nfds].revents & (POLLIN|POLLOUT)) ctx->ready++;
I get (with fd=9 being the mod_jk socket, fd=10 the SSL one, and TIMEOUTidle set to 35 seconds for demo purposes) ...
2005.10.27 13:13:42 LOG7[1286:0]: Waiting 35 second(s) for 7 file descriptor(s) 2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 2, FD=9, (INOUT)->(OUT) 2005.10.27 13:13:42 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->(IN) 2005.10.27 13:13:42 LOG7[1286:2]: SSL_read returned WANT_READ: retrying 2005.10.27 13:13:42 LOG7[1286:0]: Waiting 35 second(s) for 7 file descriptor(s) 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 2, FD=9, (IN)->(IN) 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->() 2005.10.27 13:13:43 LOG7[1286:2]: Socket closed on read
... ie mod_jk has disconnected. Everything carries on for a while ...
2005.10.27 13:13:43 LOG7[1286:2]: SSL write shutdown 2005.10.27 13:13:43 LOG7[1286:0]: Waiting 60 second(s) for 6 file descriptor(s) 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:13:43 LOG7[1286:0]: CONTEXT 2, FD=10, (INOUT)->(OUT) 2005.10.27 13:13:43 LOG7[1286:2]: SSL alert (write): warning: close notify 2005.10.27 13:13:43 LOG7[1286:2]: SSL_shutdown retrying 2005.10.27 13:13:43 LOG7[1286:2]: SSL doesn't need to read or write 2005.10.27 13:13:43 LOG7[1286:0]: Waiting 60 second(s) for 6 file descriptor(s) 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->(IN) 2005.10.27 13:13:45 LOG7[1286:0]: Waiting 35 second(s) for 7 file descriptor(s) 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->() 2005.10.27 13:13:45 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(OUT) [...] 2005.10.27 13:13:46 LOG7[1286:0]: Waiting 60 second(s) for 6 file descriptor(s) 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->(IN) 2005.10.27 13:13:46 LOG7[1286:2]: SSL_read returned WANT_READ: retrying 2005.10.27 13:13:46 LOG7[1286:0]: Waiting 60 second(s) for 6 file descriptor(s) 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->(IN) 2005.10.27 13:13:46 LOG7[1286:0]: Waiting 35 second(s) for 7 file descriptor(s) 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP)
... but eventually we start getting ERRHUP (ie POLLERR | POLLHUP), not OUT, on fd 9, and then we end up in a tight loop of
2005.10.27 13:13:46 LOG7[1286:0]: Waiting 35 second(s) for 7 file descriptor(s) 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=10, (IN)->() 2005.10.27 13:13:46 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP)
... until eventually we get
[...] 2005.10.27 13:14:23 LOG7[1286:0]: Waiting 2 second(s) for 6 file descriptor(s) 2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:14:23 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP) 2005.10.27 13:14:25 LOG7[1286:0]: Waiting 0 second(s) for 6 file descriptor(s) 2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:14:25 LOG7[1286:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP) 2005.10.27 13:14:25 LOG6[1286:2]: s_poll_wait timeout: connection reset 2005.10.27 13:14:25 LOG5[1286:2]: Connection reset: 260 bytes sent to SSL, 24734 bytes sent to socket
The problem seems to be in network.c:s_poll_wait(), in the code
while(!ready_head) /* no context ready */ scan_waiting_queue();
because poll() returns > 0, but scan_waiting_queue() never sets ready_head because no fd has yielded POLLIN or POLLOUT:-
if(ufds[nfds].revents & (POLLIN|POLLOUT)) ctx->ready++;
The end result is that we chew up CPU until TIMEOUTidle expires (which is rather a long time if left at the default of 12 hours!!)
I have fudged a circumvention as follows (because I couldn't follow the logic well enough to do it cleanly, and time is running out on me, I'm afraid.)
$ diff -c network.c.loop network.c *** network.c.loop Thu Oct 27 13:07:10 2005 --- network.c Thu Oct 27 13:05:21 2005 *************** *** 166,172 **** ufds[nfds].revents & POLLHUP ? "HUP" : "", ufds[nfds].revents & POLLNVAL ? "NVAL" : ""); #endif ! if(ufds[nfds].revents & (POLLIN|POLLOUT)) ctx->ready++; nfds++; } --- 166,172 ---- ufds[nfds].revents & POLLHUP ? "HUP" : "", ufds[nfds].revents & POLLNVAL ? "NVAL" : ""); #endif ! if(ufds[nfds].revents & (POLLIN|POLLOUT|POLLERR|POLLHUP|POLLNVAL)) ctx->ready++; nfds++; }
... which seems to solve the problem but produces a rather untidy result:-
2005.10.27 13:17:46 LOG7[1759:0]: Waiting 35 second(s) for 7 file descriptor(s) 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 2, FD=10, (IN)->(IN) 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 2, FD=9, (OUT)->(OUT) 2005.10.27 13:17:46 LOG7[1759:0]: Waiting 35 second(s) for 6 file descriptor(s) 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=3, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=5, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=6, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=7, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 1, FD=8, (IN)->() 2005.10.27 13:17:46 LOG7[1759:0]: CONTEXT 2, FD=9, (OUT)->(ERRHUP) 2005.10.27 13:17:46 LOG3[1759:2]: INTERNAL ERROR: s_poll_wait returned 1, but no descriptor is ready 2005.10.27 13:17:46 LOG5[1759:2]: Connection reset: 260 bytes sent to SSL, 41132 bytes sent to socket
... i.e it sorts itself out quickly, but at the cost of logging an INTERNAL ERROR, which doesn't look very good ;-) As far as I could tell (but haven't shown), everything actually got close'd correctly.
As I say, I appear to have a circumvention, so I'm not hanging on a proper fix (unless anyone can see any fatal flaw in what I've done?)
Regards, Richard