Hi everyone,
I recently upgraded from NixOS 24.05 to 24.11, which bumped stunnel from 5.72 to 5.73, and have since had stunnel crash on me multiple times on different servers (with pretty much identical setups) due to what looks like a memory allocation race.
Since 5.73 moved OCSP fetches into a separate thread, and the problem never happened with 5.72, it sounds like an issue specific to that change.
Someone familiar with the codebase probably has a better idea of what's going on.
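For what it's worth, the "Memory allocated in a different thread" abort looks like the kind of check a debug allocator does when a buffer allocated by one thread is freed (or detached) by another. I haven't read stunnel's str.c in any detail, so the following is only an illustrative C sketch of that general pattern; the names and layout are mine, not stunnel's:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical debug allocator: remember which thread owns each block. */
struct hdr {
    pthread_t owner;              /* thread that called tracked_malloc() */
};

static void *tracked_malloc(size_t n) {
    struct hdr *h = malloc(sizeof *h + n);
    if (!h)
        return NULL;
    h->owner = pthread_self();
    return h + 1;                 /* caller sees the memory after the header */
}

static void tracked_free(void *p) {
    struct hdr *h = (struct hdr *)p - 1;
    if (!pthread_equal(h->owner, pthread_self())) {
        /* roughly the situation the "INTERNAL ERROR: Memory allocated in a
           different thread" message seems to describe */
        fprintf(stderr, "freed from a different thread\n");
        abort();
    }
    free(h);
}

/* Worker frees a buffer that main() allocated -> hits the abort above. */
static void *worker(void *p) {
    tracked_free(p);
    return NULL;
}

int main(void) {
    void *buf = tracked_malloc(16);
    pthread_t t;
    pthread_create(&t, NULL, worker, buf);
    pthread_join(t, NULL);
    return 0;
}
```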
Given that my updates happened sometime in the middle of last month and there have been no certificate renewals since (except for one that wouldn't affect the systems in question; thanks to crt.sh and Certificate Transparency for making that easy to verify), the "OCSP routines::status expired" error also seems out of place, at least for the certificate itself. The staple may well be expired, but in that case stunnel should probably be fetching a new one, which may have to do with the error in question.
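For context, this is how I understand the OpenSSL side of that message: OCSP_check_validity() fails with "status expired" once the response's nextUpdate time is in the past, so a cached staple that is refreshed too late would trigger it even though the certificate itself is fine. A minimal sketch of that check (error handling and the surrounding response/request setup omitted, and this is my reading of the API rather than what stunnel actually does):

```c
#include <openssl/ocsp.h>

/* Returns 1 if an already-verified OCSP response is still usable as a
 * staple, 0 otherwise. "status expired" comes out of OCSP_check_validity()
 * when nextUpdate is already in the past. */
static int staple_still_valid(OCSP_BASICRESP *basic, OCSP_CERTID *id) {
    int status, reason;
    ASN1_GENERALIZEDTIME *revtime, *thisupd, *nextupd;

    if (!OCSP_resp_find_status(basic, id, &status, &reason,
                               &revtime, &thisupd, &nextupd))
        return 0;
    /* allow 60 s of clock skew, no limit on the age of thisUpdate */
    if (!OCSP_check_validity(thisupd, nextupd, 60, -1))
        return 0;             /* e.g. "OCSP routines::status expired" */
    return status == V_OCSP_CERTSTATUS_GOOD;
}
```

Here is the journal from one of the crashes: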
```
Dec 06 04:43:02 shell stunnel[1124111]: LOG5[per-minute]: OCSP: Certificate accepted
Dec 06 04:43:02 shell stunnel[1124111]: 2024.12.06 04:43:02 LOG5[per-minute]: OCSP: Certificate accepted
Dec 06 04:43:02 shell stunnel[1124111]: LOG5[per-minute]: OCSP: Certificate accepted
Dec 06 04:43:02 shell stunnel[1124111]: 2024.12.06 04:43:02 LOG5[per-minute]: OCSP: Certificate accepted
Dec 06 04:43:02 shell stunnel[1124111]: LOG5[per-minute]: OCSP: Certificate accepted
Dec 06 04:43:02 shell stunnel[1124111]: 2024.12.06 04:43:02 LOG5[per-minute]: OCSP: Certificate accepted
Dec 06 04:43:56 shell stunnel[1124111]: LOG5[ui]: Log file reopened
Dec 06 04:43:56 shell stunnel[1124111]: 2024.12.06 04:43:56 LOG5[ui]: Log file reopened
Dec 06 04:44:02 shell stunnel[1124111]: LOG3[per-minute]: OCSP: OCSP_check_validity: crypto/ocsp/ocsp_cl.c:351: error:1380007D:OCSP routines::status expired
Dec 06 04:44:02 shell stunnel[1124111]: 2024.12.06 04:44:02 LOG3[per-minute]: OCSP: OCSP_check_validity: crypto/ocsp/ocsp_cl.c:351: error:1380007D:OCSP routines::status expired
Dec 06 04:44:02 shell stunnel[1124111]: 2024.12.06 04:44:02 LOG4[per-minute]: OCSP: Unknown verification status
Dec 06 04:44:02 shell stunnel[1124111]: INTERNAL ERROR: Memory allocated in a different thread at ocsp.c, line 412
Dec 06 04:44:02 shell stunnel[1124111]: LOG4[per-minute]: OCSP: Unknown verification status
Dec 06 04:44:02 shell stunnel[1124111]: INTERNAL ERROR: Memory allocated in a different thread at ocsp.c, line 412
Dec 06 04:44:02 shell systemd-coredump[282485]: Process 1124111 (stunnel) of user 988 terminated abnormally with signal 6/ABRT, processing...
Dec 06 04:44:02 shell systemd[1]: Started Process Core Dump (PID 282485/UID 0).
Dec 06 04:44:02 shell systemd-coredump[282486]: [🡕] Process 1124111 (stunnel) of user 988 dumped core.
Module libgcc_s.so.1 without build-id.
Module libcap.so.2 without build-id.
Module stunnel without build-id.
Stack trace of thread 1124120:
#0 0x00007f9191c99a9c __pthread_kill_implementation (libc.so.6 + 0x92a9c)
#1 0x00007f9191c47576 raise (libc.so.6 + 0x40576)
#2 0x00007f9191c2f935 abort (libc.so.6 + 0x28935)
#3 0x000056142e23e435 fatal_debug.cold (stunnel + 0xc435)
#4 0x000056142e23eebc get_alloc_list_ptr (stunnel + 0xcebc)
#5 0x000056142e23f0b9 str_detach_debug.part.0 (stunnel + 0xd0b9)
#6 0x000056142e23f856 str_free_debug (stunnel + 0xd856)
#7 0x000056142e259635 ocsp_stapling (stunnel + 0x27635)
#8 0x000056142e25a18f per_minute_thread (stunnel + 0x2818f)
#9 0x00007f9191c97d02 start_thread (libc.so.6 + 0x90d02)
#10 0x00007f9191d173ac __clone3 (libc.so.6 + 0x1103ac)
Stack trace of thread 676245:
#0 0x00007f9191d090af __poll (libc.so.6 + 0x1020af)
#1 0x000056142e2509f1 s_poll_wait (stunnel + 0x1e9f1)
#2 0x000056142e241f82 client_run (stunnel + 0xff82)
#3 0x000056142e2438fc client_thread (stunnel + 0x118fc)
#4 0x00007f9191c97d02 start_thread (libc.so.6 + 0x90d02)
#5 0x00007f9191d173ac __clone3 (libc.so.6 + 0x1103ac)
Stack trace of thread 1124111:
#0 0x00007f9191d090af __poll (libc.so.6 + 0x1020af)
#1 0x000056142e2509f1 s_poll_wait (stunnel + 0x1e9f1)
#2 0x000056142e25b5f6 daemon_loop (stunnel + 0x295f6)
#3 0x000056142e23e680 main (stunnel + 0xc680)
#4 0x00007f9191c3127e __libc_start_call_main (libc.so.6 + 0x2a27e)
#5 0x00007f9191c31339 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x2a339)
#6 0x000056142e23e875 _start (stunnel + 0xc875)
Stack trace of thread 1124119:
#0 0x00007f9191d090af __poll (libc.so.6 + 0x1020af)
#1 0x000056142e2509f1 s_poll_wait (stunnel + 0x1e9f1)
#2 0x000056142e250acd s_poll_sleep (stunnel + 0x1eacd)
#3 0x000056142e25a21c per_second_thread (stunnel + 0x2821c)
#4 0x00007f9191c97d02 start_thread (libc.so.6 + 0x90d02)
#5 0x00007f9191d173ac __clone3 (libc.so.6 + 0x1103ac)
Stack trace of thread 676190:
#0 0x00007f9191d090af __poll (libc.so.6 + 0x1020af)
#1 0x000056142e2509f1 s_poll_wait (stunnel + 0x1e9f1)
#2 0x000056142e241f82 client_run (stunnel + 0xff82)
#3 0x000056142e2438fc client_thread (stunnel + 0x118fc)
#4 0x00007f9191c97d02 start_thread (libc.so.6 + 0x90d02)
#5 0x00007f9191d173ac __clone3 (libc.so.6 + 0x1103ac)
Stack trace of thread 1124121:
#0 0x00007f9191d090af __poll (libc.so.6 + 0x1020af)
#1 0x000056142e2509f1 s_poll_wait (stunnel + 0x1e9f1)
#2 0x000056142e250acd s_poll_sleep (stunnel + 0x1eacd)
#3 0x000056142e259fab per_day_thread (stunnel + 0x27fab)
#4 0x00007f9191c97d02 start_thread (libc.so.6 + 0x90d02)
#5 0x00007f9191d173ac __clone3 (libc.so.6 + 0x1103ac)
Stack trace of thread 676445:
#0 0x00007f9191d090af __poll (libc.so.6 + 0x1020af)
#1 0x000056142e2509f1 s_poll_wait (stunnel + 0x1e9f1)
#2 0x000056142e241f82 client_run (stunnel + 0xff82)
#3 0x000056142e2438fc client_thread (stunnel + 0x118fc)
#4 0x00007f9191c97d02 start_thread (libc.so.6 + 0x90d02)
#5 0x00007f9191d173ac __clone3 (libc.so.6 + 0x1103ac)
Stack trace of thread 676300:
#0 0x00007f9191d090af __poll (libc.so.6 + 0x1020af)
#1 0x000056142e2509f1 s_poll_wait (stunnel + 0x1e9f1)
#2 0x000056142e241f82 client_run (stunnel + 0xff82)
#3 0x000056142e2438fc client_thread (stunnel + 0x118fc)
#4 0x00007f9191c97d02 start_thread (libc.so.6 + 0x90d02)
#5 0x00007f9191d173ac __clone3 (libc.so.6 + 0x1103ac)
Stack trace of thread 676491:
#0 0x00007f9191d090af __poll (libc.so.6 + 0x1020af)
#1 0x000056142e2509f1 s_poll_wait (stunnel + 0x1e9f1)
#2 0x000056142e241f82 client_run (stunnel + 0xff82)
#3 0x000056142e2438fc client_thread (stunnel + 0x118fc)
#4 0x00007f9191c97d02 start_thread (libc.so.6 + 0x90d02)
#5 0x00007f9191d173ac __clone3 (libc.so.6 + 0x1103ac)
ELF object binary architecture: AMD x86-64
Dec 06 04:44:02 shell systemd[1]: stunnel.service: Main process exited, code=dumped, status=6/ABRT
Dec 06 04:44:02 shell systemd[1]: stunnel.service: Failed with result 'core-dump'.
Dec 06 04:44:02 shell systemd[1]: stunnel.service: Consumed 2min 11.812s CPU time, 21.6M memory peak, 19.6M memory swap peak, 50.9M read from disk, 47.1M written to disk, 587.2M incoming IP traffic, 588.8M outgoing IP traffic.
```
Note that stunnel is handling mTLS for my Prometheus setup, which means it is indeed getting hammered by one scrape per exporter at a time; in this case that is exactly five scrapes per minute, but all of them happening at the same moment. If this is a race, that makes me much more likely to hit it.
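In case the exact shape of the setup matters, it is essentially one stunnel server service per exporter, roughly like the sketch below. This is paraphrased rather than copied (paths, ports, and names are placeholders, not my real config), so treat the option names and values as approximate:

```
; one stunnel service per exporter; Prometheus scrapes all of them
; at the same time, once a minute
[node-exporter]
accept  = 9101
connect = 127.0.0.1:9100
cert    = /etc/stunnel/shell.example.org.pem
key     = /etc/stunnel/shell.example.org.key
CAfile  = /etc/stunnel/prometheus-client-ca.pem
verifyChain = yes

; ...four more sections along the same lines, one per exporter...
```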
Also, a huge thank you for building what has been an integral part of my infrastructure for almost a decade now.