The loop which performs renewals in the background obtains a read lock
on the certificate cache map, so that it can be safely iterated. Before
this fix, it would obtain the renewals in the read lock. This has been
fine, except that the TLS-SNI challenge, when invoked after Caddy has
already started, requires adding a certificate to the cache. Doing this
requires an exclusive write lock. But it cannot obtain a write lock
because a read lock is obtained higher in the stack, while the loop
iterates. In other words, it's a deadlock.
I was able to reproduce this issue consistently locally, after jumping
through many hoops to force a renewal in a short time that bypasses
Let's Encrypt's authz caching. I was also able to verify that by queuing
renewals (like we do deletions and OCSP updates), lock contention is
relieved and the deadlock is avoided.
This only affects background renewals where the TLS-SNI(-01) challenge
are used. Users report seeing strange errors in the logs after this
happens ("tls: client offered an unsupported, maximum protocol version
of 301"), but I was not able to reproduce these locally. I was also not
able to reproduce the leak of sockets which are left in CLOSE_WAIT.
I am not sure if those are symptoms of running in production on Linux
and are related to this bug, or not.
Either way, this is an important fix. I do not yet know the ripple
effects this will have on other symptoms we've been chasing. But it
definitely resolves a deadlock during renewals.
By calling SetTLSAddress, the acme package reset the challenge provider
to the default one instead of keeping the custom one we specified before
with SetChallengeProvider. Yikes. This means that Caddy would try to
open a listener on port 443 even though we should have been handling it
with our provider, causing the challenge to fail, since usually port 443
is in use.
So this change just reorders the calls so that our provider takes
precedence.
cf. https://github.com/xenolf/lego/pull/292
We renamed caddytls.ErrStorageNotFound to caddytls.ErrNotExist to more
closely mirror the os package. We changed it to an interface wrapper
so that the custom error message can be preserved. Returning only "data
not found" was useless in debugging because we couldn't know the
concrete value of the error (like what it was trying to load).
Users can do a type assertion to determine if the error value is a "not
found" error instead of doing an equality check.
A Caddyfile using *.example.com as its site address would be subject to
this bug at renewal time, as it would use the literal "*.example.com"
value instead of the name being passed in to obtain a certificate.
This change fixes the LoadSite call so that it looks in the proper
directory for the certificate resources.
It was set by default on the caddy-internal config object, and even
checked for conflicts, but it was never actually reflected on the
tls.Config.
This will have user-visible changes: a client that prefers, say, AES-CBC
but also supports AES-GCM would have used AES-CBC befor this, and will
use AES-GCM after.
This is desirable and important behavior, because if for example the
server wanted to support 3DES, but *only if it was strictly necessary*,
it would have had no way of doing so with PreferServerCipherSuites
false, as the client preference would have won.
If another ACME client is trying to solve a challenge for a name not
being served by Caddy on the same machine where Caddy is running, the
HTTP challenge will be consumed by Caddy rather than allowing the owner
to use the Caddyfile to proxy the challenge.
With this change, we only consume requests for HTTP challenges for
hostnames that we recognize. Before doing the challenge, we add the
name to a set, and when seeing if we should proxy the challenge, we
first check the path of course to see if it is an HTTP challenge;
if it is, we then check that set to see if the hostname is in the
set. Only if it is, do we consume it.
Otherwise, the request is treated like any other, allowing the owner
to configure a proxy for such requests to another ACME client.