Hi... I'm deploying stunnel on some servers. I did some tests, and i never had problems. For example, i tried 5k parallel connections, and i didn't have any problem.
But, i have problems when a small number of connections wants to transfer big chunks of information; the connection hangs.
This is a snippet of logs:
Server side: 2006.10.20 16:00:58 LOG7[20302:1]: oracle11521 accepted FD=226 from 192.168.22.10:39721 2006.10.20 16:00:58 LOG7[20302:200]: oracle11521 started 2006.10.20 16:00:58 LOG7[20302:200]: FD 226 in non-blocking mode 2006.10.20 16:00:58 LOG7[20302:200]: TCP_NODELAY option set on local socket 2006.10.20 16:00:58 LOG5[20302:200]: oracle11521 connected from 192.168.22.10:39721 2006.10.20 16:00:58 LOG7[20302:200]: SSL state (accept): before/ accept initialization 2006.10.20 16:00:58 LOG7[20302:200]: SSL state (accept): SSLv3 read client hello A 2006.10.20 16:00:58 LOG7[20302:200]: SSL state (accept): SSLv3 write server hello A 2006.10.20 16:00:58 LOG7[20302:200]: SSL state (accept): SSLv3 write change cipher spec A 2006.10.20 16:00:58 LOG7[20302:200]: SSL state (accept): SSLv3 write finished A 2006.10.20 16:00:58 LOG7[20302:200]: SSL state (accept): SSLv3 flush data 2006.10.20 16:00:58 LOG7[20302:200]: SSL state (accept): SSLv3 read finished A 2006.10.20 16:00:58 LOG7[20302:200]: 3 items in the session cache 2006.10.20 16:00:58 LOG7[20302:200]: 0 client connects (SSL_connect ()) 2006.10.20 16:00:58 LOG7[20302:200]: 0 client connects that finished 2006.10.20 16:00:58 LOG7[20302:200]: 0 client renegotiations requested 2006.10.20 16:00:58 LOG7[20302:200]: 199 server connects (SSL_accept()) 2006.10.20 16:00:58 LOG7[20302:200]: 194 server connects that finished 2006.10.20 16:00:58 LOG7[20302:200]: 0 server renegotiations requested 2006.10.20 16:00:58 LOG7[20302:200]: 173 session cache hits 2006.10.20 16:00:58 LOG7[20302:200]: 0 session cache misses 2006.10.20 16:00:58 LOG7[20302:200]: 18 session cache timeouts 2006.10.20 16:00:58 LOG6[20302:200]: SSL accepted: previous session reused 2006.10.20 16:00:58 LOG7[20302:200]: FD 227 in non-blocking mode 2006.10.20 16:00:58 LOG7[20302:200]: oracle11521 connecting 192.168.22.13:1521 2006.10.20 16:00:58 LOG7[20302:200]: connect_wait: waiting 10 seconds 2006.10.20 16:00:58 LOG7[20302:200]: connect_wait: connected 2006.10.20 16:00:58 LOG7[20302:200]: Remote FD=227 initialized 2006.10.20 16:00:58 LOG7[20302:200]: TCP_NODELAY option set on remote socket 2006.10.20 16:00:58 LOG7[20302:75]: SSL_read returned WANT_READ: retrying 2006.10.20 16:00:58 LOG7[20302:75]: SSL_read returned WANT_READ: retrying 2006.10.20 16:00:58 LOG7[20302:75]: SSL_read returned WANT_READ: retrying
client side: 2006.10.20 16:16:01 LOG6[12415:91]: SSL connected: new session negotiated 2006.10.20 16:16:01 LOG6[12415:91]: Negotiated ciphers: DES-CBC3- SHA SSLv3 Kx=RSA Au=RSA Enc=3DES(168) Mac=SHA1 2006.10.20 16:16:01 LOG7[12415:91]: Socket closed on read 2006.10.20 16:16:01 LOG7[12415:91]: SSL write shutdown 2006.10.20 16:16:01 LOG7[12415:91]: SSL alert (write): warning: close notify 2006.10.20 16:16:01 LOG7[12415:91]: SSL_shutdown retrying 2006.10.20 16:16:01 LOG7[12415:91]: SSL doesn't need to read or write 2006.10.20 16:16:01 LOG7[12415:91]: SSL alert (read): warning: close notify 2006.10.20 16:16:01 LOG7[12415:91]: SSL closed on SSL_read 2006.10.20 16:16:01 LOG7[12415:91]: Socket write shutdown 2006.10.20 16:16:01 LOG5[12415:91]: Connection closed: 3663 bytes sent to SSL, 5239 bytes sent to socket 2006.10.20 16:16:01 LOG7[12415:91]: oracle11521 finished (3 left)
The timings are shifted, i don't know if the delta of time is related to a misconfiguration of clock on the machines, or are snippets taken from different events... i don't have direct access to the machines.
I tried to make a test (client and server on the same machine)
client: nc localhost 10001 >>/dev/null stunnel accept on 10001 and connect to localhost:11521
on the server side: stunnel accept on 11521 and connect on 1521 cat 230megsfile | nc -l -p 1521 & 2>&1 >>/dev/null
so, when the nc on client side connects, the server starts the cat.
It did complete correctly within a pair of minutes on an ibook 64 1.33 1g ram, but with LOADS of want_read and want_write errors on both sides of stunnel.
The server and client are working on "big iron", i mean some superdome hardware with 30-40 cpu on both sides, and with an estimated traffic of 10gig/day
And, important... the version of the stunnel instances is 4.15
Can you help me?
Bye, dario.
On Thursday 26 October 2006 17:54, Dario Mariani wrote:
I'm deploying stunnel on some servers. I did some tests, and i never had problems. For example, i tried 5k parallel connections, and i didn't have any problem.
Thank you for the information. What is your platform (hardware, operating system)?
2006.10.20 16:00:58 LOG7[20302:75]: SSL_read returned WANT_READ: retrying
[cut]
It did complete correctly within a pair of minutes on an ibook 64 1.33 1g ram, but with LOADS of want_read and want_write errors on both sides of stunnel.
They're not errors! They're debug (LOG7) messages. The message does not indicate anything wrong by itself.
Debugging should be only enabled when you're trying to diagnose a problem - not in a production system.
Can you help me?
What is the problem (besides those debug messages)? How can I reproduce the hang mentioned int the subject?
Best regards, Mike
Il giorno 27/ott/06, alle ore 09:31, Michal Trojnara ha scritto:
On Thursday 26 October 2006 17:54, Dario Mariani wrote:
I'm deploying stunnel on some servers. I did some tests, and i never had problems. For example, i tried 5k parallel connections, and i didn't have any problem.
Thank you for the information. What is your platform (hardware, operating system)?
Well, it was only a test that i did to help a user understand the concept of ulimit :) with the tunneling on (and an "ulimit -n 8192 in the stunnel.init script :) ) , we did 5k "telnet localhost 10001 &" or something similar, it wasn't a big stress test... the system worked like a charm, all connections went fine (until the oracle listener closed connection, of course), and without problems. I don't remember the system, but it was a solaris 9 on big iron (i think some sunfire 6800-15k-25k with 52 or 56 1.2-1.33g cpus and more than 200g of ram)
In some days (about the end of the _next_ week, sorry) i can give to you the results of some performance tests we did... but they are not very deep in details :)
2006.10.20 16:00:58 LOG7[20302:75]: SSL_read returned WANT_READ: retrying
[cut]
It did complete correctly within a pair of minutes on an ibook 64 1.33 1g ram, but with LOADS of want_read and want_write errors on both sides of stunnel.
They're not errors! They're debug (LOG7) messages. The message does not indicate anything wrong by itself.
Debugging should be only enabled when you're trying to diagnose a problem - not in a production system.
What is the problem (besides those debug messages)?
The problem is this: the system works well for about 45min, then gives these messages and hangs. Simple and useless :( The traffic "shape" is that of a datawarehouse, with a little number of connections (i think few 10s), that carries a good load of traffic from the db (stunnel server) to the appserver (stunnel client), with peaks every 15min. And i _think_ that sometimes there are big uploads (sql updates) from the client to the server. This is what i understood asking :)
Now, i'm a little confused... the server started giving these debug messages, and then HUNG HORRIBLY within minutes. :) With the tests that i made on my laptop, i had those debug messages, but it all worked well and in expected times (the path netcat 120m file -> stunnel client -> stunnel server -> openssl s_server >/dev/ null took 20 seconds!!! ) So, i think at this point the problem isn't the WANT_READ debug messaged, but something that can be (or not) related to this.
What i'm asking is: - what these messages _exactly_ means? reading some openssl related forums, i saw that this message is sent by the server when the read buffer is empty and the server is awaiting data. - do you have any idea on what topic i can direct my analysis?
How can I reproduce the hang mentioned int the subject?
Well, i have some problems with this point: i CANNOT put up stunnel on the system that had the problem, until i fix the problem :(
Excuse me for my lack of precision and details, but these are chaotic days here :)
Bye, dario.
On Friday 27 October 2006 12:41, Dario Mariani wrote:
The problem is this: the system works well for about 45min, then gives these messages and hangs.
System (kernel) hangs? Stunnel hangs (no longer accepts new connections)? This connection hangs (no longer transfers any data)?
With the tests that i made on my laptop, i had those debug messages, but it all worked well and in expected times (the path netcat 120m file -> stunnel client -> stunnel server -> openssl s_server >/dev/ null took 20 seconds!!! )
6MB/s (48Mbit/s) for two SSL connections on a laptop seems to be a reasonable performance.
What i'm asking is:
- what these messages _exactly_ means? reading some openssl related
forums, i saw that this message is sent by the server when the read buffer is empty and the server is awaiting data.
From http://www.openssl.org/docs/ssl/SSL_read.html:
If the underlying BIO is non-blocking, SSL_read() will also return when the underlying BIO could not satisfy the needs of SSL_read() to continue the operation. In this case a call to SSL_get_error(3) with the return value of SSL_read() will yield SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE. As at any time a re-negotiation is possible, a call to SSL_read() can also cause write operations! The calling process then must repeat the call after taking appropriate action to satisfy the needs of SSL_read(). The action depends on the underlying BIO. When using a non-blocking socket, nothing is to be done, but select() can be used to check for the required condition. When using a buffering BIO, like a BIO pair, data must be written into or retrieved out of the BIO before being able to continue.
- do you have any idea on what topic i can direct my analysis?
The problem is either in transfer() function in client.c file or somewhere in OpenSSL library.
How can I reproduce the hang mentioned int the subject?
Well, i have some problems with this point: i CANNOT put up stunnel on the system that had the problem, until i fix the problem :(
Excuse me for my lack of precision and details, but these are chaotic days here :)
I see.
Best regards, Mike
On Friday 27 October 2006 12:41, Dario Mariani wrote:
Now, i'm a little confused... the server started giving these debug messages, and then HUNG HORRIBLY within minutes. :)
I think I was able to locate and fix this bug. Please try the beta release and let me know if it works fine, now.
ftp://ftp.mirt.net/stunnel/stunnel-4.19-b1.tar.gz ftp://ftp.mirt.net/stunnel/stunnel-4.19-installer-b1.exe
Best regards, Mike