Hi all,

We have been doing some throughput testing and are wondering if you can
help explain some of the results we have been getting.
We are trying to establish the maximum number of handshakes per second
our appliances can do when using Stunnel.
Our tests have been done on a 4-core Intel(R) Xeon(R) CPU E3-1230 v5 @
3.40GHz machine with 8 GB of RAM, using stunnel 5.33 and OpenSSL 1.0.2j.
(We refer to stunnel 5.33 here, but this was also tested with stunnel
5.40 and we got the same issue.)
Stunnel was compiled with:

./configure --enable-fips --enable-ipv6 --with-threads=pthread
--sysconfdir=/etc
--with-ssl=/usr/src/built/OpenSSL_1_0_2j-no_march/usr/local/
--disable-libwrap --disable-systemd
I was using a Python script of my own design for load testing when I
noticed the problem; we then switched to an open-source package called
vegeta (https://github.com/tsenart/vegeta). The command I was using to
run it was:
echo "GET https://10.10.10.10:443" | ./vegeta attack -workers=30
-rate=3000 -duration=300s -insecure > test.out
We are essentially seeing a degradation in throughput at a fixed point
in time. I am seeing the problem at 3000 transactions per second. We
are doing HTTPS GETs through stunnel to an HTTP backend; one transaction
is an SSL handshake, the HTTP GET, and waiting for the response.
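For reference, one transaction is roughly equivalent to the minimal C
sketch below, written against the OpenSSL 1.0.2 API. It leaves
certificate verification at the library default of none (matching
vegeta's -insecure flag); the 10.10.10.10 address is just our test
listener, and error handling is mostly omitted.

/* One transaction: TCP connect, TLS handshake, HTTP GET, read response. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <openssl/ssl.h>

int main(void) {
    SSL_library_init();
    SSL_load_error_strings();

    /* plain TCP connection to the stunnel listener */
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo("10.10.10.10", "443", &hints, &res))
        return 1;
    int fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (fd < 0 || connect(fd, res->ai_addr, res->ai_addrlen))
        return 1;
    freeaddrinfo(res);

    /* TLS handshake (no certificate verification by default) */
    SSL_CTX *ctx = SSL_CTX_new(SSLv23_client_method());
    SSL *ssl = SSL_new(ctx);
    SSL_set_fd(ssl, fd);
    if (SSL_connect(ssl) != 1)
        return 1;

    /* HTTP GET, then wait for the full response */
    const char *req = "GET / HTTP/1.0\r\nHost: 10.10.10.10\r\n\r\n";
    SSL_write(ssl, req, (int)strlen(req));
    char buf[4096];
    int n;
    while ((n = SSL_read(ssl, buf, sizeof(buf))) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    SSL_shutdown(ssl);
    SSL_free(ssl);
    SSL_CTX_free(ctx);
    close(fd);
    return 0;
}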
Interestingly, if I try 5.06 it works fine: I get smooth, constant
loading.
Please take a look at 30worker3000rate5.33-300s.png, where you can see
that after about 55 seconds there is a jump in latency.
I get the same issue if I run the test at 2000 transactions per second
(please see 30worker20005.33-700s.png), but with the lower transaction
rate the issue takes longer to rear its head.
The issue doesn't happen if I run the test at 3000 transactions per
second using 5.06; please see 30worker3000rate5.06-300s.png.
From looking at the graphs, our thought was that we were hitting some
sort of cache limit, and we used that as a starting point. We have taken
a look at the code changes between the versions we know were working and
the problematic version, and the main thing that stands out is the
change to the leak detection. Previously its limit was partially based
on the number of clients, but it has been changed to use the number of
sections. This led to the decision to try a different limit value, and
we found that the number of connections we handled before seeing the
issue corresponds to the change in the limit value.
We think we have identified the code that is causing the problem (or at
least doing something funny); it was added in 5.33:
NOEXPORT long leak_threshold() {
    long limit;

    /* base allowance scales with the number of configured sections */
    limit=10000*((int)number_of_sections+1);
#ifndef USE_FORK
    /* plus a per-client allowance, except for the fork threading model */
    limit+=100*num_clients;
#endif
    return limit;
}
It is in src/str.c.
When we increase the limit, changing

limit=10000*((int)number_of_sections+1);

to

limit=20000*((int)number_of_sections+1);

we still see an increase in response time, but later in the test; please
see 30worker3000rate5.33-ben-highlimit.png.
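To put rough numbers on that (assuming number_of_sections is 1 for our
single service section, and ignoring the per-client term), the stock
threshold works out to 10000*(1+1) = 20000 tracked allocations, while
the doubled multiplier gives 20000*(1+1) = 40000. That would fit the
problem showing up later in the test rather than going away.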
Interestingly, once I end up in this state I am stuck with high
latencies until I restart stunnel. (The box I am testing on was left
overnight to make sure it's not hanging TCP connections or anything like
that.)
I must admit we are at the limit of our C understanding, and your input
would be welcomed.