Hi,
I've just added OpenSSL engine support for the Cavium Nitrox Lite PCI card. Here is the speed test output:
[engine] denotes speed test output with the engine; [soft] denotes speed test output without the engine (OpenSSL's internal software implementation is used).
~ # openssl speed -evp [rc4|aes256|sha1] [-engine /root/libcavium.so]
......
The 'numbers' are in 1000s of bytes per second processed.
type                    16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
rc4[engine]             6339.20k   25376.00k  100966.40k 3757056.00k  919961.60k
rc4[soft]               6056.66k    7017.34k    7228.37k    7339.13k    7356.14k
aes-256-cbc[engine]     3178.40k   25465.60k   50636.80k  194201.60k 2088140.80k
aes-256-cbc[soft]       2430.36k    2572.93k    2597.45k    2617.27k    2609.82k
sha1[engine]             774.20k    3115.20k   16486.40k   27033.60k  285593.60k
sha1[soft]               915.74k    2622.71k    5678.93k    7942.23k    9041.75k
~ # openssl speed rsa [-engine /root/libcavium.so]
                       sign       verify        sign/s   verify/s
rsa[soft]    512 bits  0.013810s  0.001349s       72.4      741.2
rsa[soft]   1024 bits  0.079200s  0.004342s       12.6      230.3
rsa[engine]  512 bits  0.000037s  0.000022s    26725.8    45181.5
rsa[engine] 1024 bits  0.000046s  0.000043s    21866.7    23489.1
The speed test shows that using the engine is much faster than the software implementation, but when testing with stunnel the results are really poor. I am still using ApacheBench 2 (ab2) as the test tool:
#ab2 -c 50 -n 100 https://192.168.10.5/test_1k.html   (an HTML file of 1 KB)
#ab2 -c 50 -n 100 https://192.168.10.5/test_4k.html   (an HTML file of 4 KB)
#ab2 -c 50 -n 100 https://192.168.10.5/test_50k.html  (an HTML file of 50 KB)
#ab2 -c 50 -n 100 https://192.168.10.5/test_100k.html (an HTML file of 100 KB)
#ab2 -c 50 -n 100 https://192.168.10.5/test_200k.html (an HTML file of 200 KB)
Here is the result (requests per second):
Algorithm          engine     1K     4K    50K   100K   200K
AES-256-CBC+SHA1   no      14.12  13.21   7.92   4.48   2.17
AES-256-CBC+SHA1   yes     51.10  23.34   7.30   3.12   1.91
The data says it all: when using the engine with stunnel, we get some performance increase for small files, but processing bigger files is even slower than with the software implementation. That is quite strange, because processing large data blocks with a hardware engine is expected to be much faster than processing small ones. I ran the same test several times, with the same result.
A few things I can be sure of:
1) During the stunnel engine test, all of the algorithms are processed by the engine; I can see this from the debugging information printed on the console.
2) The client running ab2 is an AMD64 with 1 GB RAM, and the web server (Apache) used in the tests above is a Xeon with 2 GB RAM, so the bottleneck should be neither ab2 nor Apache.
So here are my questions: has anyone tried stunnel with a hardware accelerator? How was the performance?
BTW, attached to this mail is a small patch for stunnel-4.14 which allows the engine directive in stunnel.conf to be set to either an engine ID or the path of a shared object implementing the engine (e.g. engine = /root/libcavium.so).
Best regards,
Zhuang Yuyao
--- stunnel-4.14/src/ssl.c	Thu Oct 27 09:42:54 2005 UTC
+++ stunnel/src/ssl.c	Tue Mar 28 08:48:03 2006 UTC
@@ -122,6 +143,21 @@
 }
 
 #if (SSLEAY_VERSION_NUMBER >= 0x00907000L) && defined(HAVE_OSSL_ENGINE_H)
+static ENGINE *load_dynamic_engine(const char *engine)
+{
+    ENGINE *e = ENGINE_by_id("dynamic");
+    if (e)
+    {
+        if (!ENGINE_ctrl_cmd_string(e, "SO_PATH", engine, 0)
+            || !ENGINE_ctrl_cmd_string(e, "LOAD", NULL, 0))
+        {
+            ENGINE_free(e);
+            e = NULL;
+        }
+    }
+    return e;
+}
+
 static void init_engine(void) {
     ENGINE *e;
 
@@ -133,9 +169,12 @@
 }
     e=ENGINE_by_id(options.engine);
     if(!e) {
+        e = load_dynamic_engine(options.engine);
+        if (!e) {
         s_log(LOG_ERR, "Engine not found");
         exit(1);
     }
+    }
     if(!ENGINE_init(e)) {
         s_log(LOG_ERR, "Engine can't be initialized");
         ENGINE_free(e);
On 2006-04-03, at 06:12, Zhuang Yuyao wrote:
> aes-256-cbc[engine]     3178.40k   25465.60k   50636.80k  194201.60k 2088140.80k
I guess there's something wrong with the OpenSSL benchmark. I really can't believe 2 GB/s (16 Gbit/s) throughput on a PCI device. 8-)
> Algorithm          engine     1K     4K    50K   100K   200K
> AES-256-CBC+SHA1   no      14.12  13.21   7.92   4.48   2.17  (requests per second)
> AES-256-CBC+SHA1   yes     51.10  23.34   7.30   3.12   1.91  (requests per second)
I'm pretty sure stunnel is *not* the performance bottleneck in your configuration. AFAIR, the most CPU-expensive operation is the SSL handshake on the client, not on the server. See: http://stunnel.mirt.net/perf.html
BTW, which version of the OpenSSL library are you using? According to my tests, OpenSSL 0.9.8x is *a lot* faster than OpenSSL 0.9.7x.
Best regards, Mike