Automated Certificate Management for Internal Infrastructure

Certificate management is one of those operational burdens that accumulates silently until it doesn’t. A forgotten certificate expiry takes down a service at 3 AM. A self-signed cert prompts engineers to add NODE_TLS_REJECT_UNAUTHORIZED=0 to their environment, which then propagates to production. An internal CA with no revocation infrastructure leaves compromised certificates valid indefinitely.

Modern certificate management should be largely automated: issuance, renewal, and distribution handled by tooling rather than calendar reminders. This article covers the practical toolchain for automating certificates across internet-facing services, internal infrastructure, and air-gapped environments.

Let’s Encrypt and ACME for Internet-Facing Services

Let’s Encrypt provides free, automated, DV (Domain Validated) certificates through the ACME protocol (RFC 8555). For internet-facing services, it is the correct default choice — no cost, 90-day validity forcing regular rotation, and broad client trust store coverage.

The two primary ACME challenge types for infrastructure use:

  • HTTP-01: The ACME client places a token at http://yourdomain.com/.well-known/acme-challenge/TOKEN. Let’s Encrypt fetches it to verify control. Requires port 80 to be reachable from the internet. Simple but cannot be used for wildcard certificates.
  • DNS-01: The ACME client creates a TXT record at _acme-challenge.yourdomain.com. Let’s Encrypt queries DNS to verify. Works behind firewalls, supports wildcard certificates, and is the only option for internal hostnames that are not reachable from the internet but sit in a publicly delegated DNS zone.

DNS-01 is preferable for infrastructure-grade automation because it decouples certificate issuance from service availability: a server can renew its certificate even when port 80 is unreachable, whether during a maintenance window or behind a firewall that blocks inbound HTTP.
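To make the difference between the two challenge types concrete, the sketch below prints where each challenge response is expected to live, following RFC 8555. The function names are illustrative only and are not part of any ACME client.

```shell
# Where each ACME challenge response is expected to live (RFC 8555, sections 8.3 and 8.4).
# Function names are hypothetical helpers, not part of any ACME client.

# HTTP-01: the token is served over plain HTTP on port 80.
acme_http01_url() {
  # $1 = domain, $2 = token
  printf 'http://%s/.well-known/acme-challenge/%s\n' "$1" "$2"
}

# DNS-01: the TXT record sits under _acme-challenge. A wildcard name is
# validated at its base domain, so a leading "*." is dropped.
acme_dns01_record() {
  # $1 = domain, possibly with a "*." prefix
  printf '_acme-challenge.%s\n' "${1#\*.}"
}
```

Calling acme_dns01_record '*.example-corp.com' prints _acme-challenge.example-corp.com, which is why a wildcard certificate and a certificate for the bare domain are validated through the same record name.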

Traefik: Automated Renewal with Let’s Encrypt

Traefik has first-class ACME support built in. It manages the full lifecycle: initial issuance, storage, and automatic renewal before expiry. Configuration is minimal.

Static configuration (traefik.yml):

certificatesResolvers:
  letsencrypt:
    acme:
      email: [email protected]
      storage: /data/acme.json
      dnsChallenge:
        provider: cloudflare
        resolvers:
          - "1.1.1.1:53"
          - "8.8.8.8:53"

entryPoints:
  web:
    address: ":80"
    http:
      redirections:
        entryPoint:
          to: websecure
          scheme: https
  websecure:
    address: ":443"

Dynamic configuration for a service router:

http:
  routers:
    api-router:
      rule: "Host(`api.example-corp.com`)"
      entryPoints:
        - websecure
      tls:
        certResolver: letsencrypt
      service: api-service

  services:
    api-service:
      loadBalancer:
        servers:
          - url: "http://10.0.20.10:3001"

Traefik stores certificates in acme.json (a JSON file with 600 permissions). It checks certificate expiry on startup and every 24 hours, renewing certificates within 30 days of expiry. The acme.json file should be on persistent storage, not an ephemeral container volume — losing it forces re-issuance for all managed domains simultaneously, which can trigger Let’s Encrypt rate limits.
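Because a lost or wrongly permissioned acme.json is a common source of surprise re-issuance, a small pre-flight check before starting Traefik can help. The following is a minimal sketch; check_acme_storage is a hypothetical helper name, and the path you pass in is whatever your storage setting points to.

```shell
# Pre-flight check for Traefik's ACME storage file (hypothetical helper).
# Succeeds only if the file exists and has mode 600.
check_acme_storage() {
  acme_file=$1
  if [ ! -f "$acme_file" ]; then
    echo "missing: $acme_file" >&2
    return 1
  fi
  # Try GNU stat first, then fall back to BSD/macOS stat.
  acme_mode=$(stat -c '%a' "$acme_file" 2>/dev/null || stat -f '%Lp' "$acme_file")
  if [ "$acme_mode" != "600" ]; then
    echo "unexpected mode $acme_mode on $acme_file (want 600)" >&2
    return 1
  fi
}

# Example: check_acme_storage /data/acme.json || exit 1
```

On a host that has already issued certificates, a "missing" result is the signal that the storage was not persistent and that a full re-issuance is about to happen.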

For the Cloudflare DNS provider, supply credentials via environment variables:

[email protected]
CF_DNS_API_TOKEN=your_cloudflare_dns_edit_token

Create a scoped API token in the Cloudflare dashboard with only Zone DNS Edit permissions for the specific zone. Do not use a Global API Key.

Internal CAs with step-ca

Internet-facing services are well-served by Let’s Encrypt. Internal services — databases, message queues, internal APIs, management consoles — have different requirements: they may not have public DNS, they need certificates that internal clients trust but the public internet does not, and they may need to issue client certificates for mTLS.

step-ca (from Smallstep) is the best open-source option for an internal CA that supports ACME, JWK, and OIDC provisioners. It is production-grade, actively maintained, and designed for automation.

Initialize a new CA:

step ca init \
  --name "Example Corp Internal CA" \
  --dns "ca.internal.example-corp.com" \
  --address ":9000" \
  --provisioner "[email protected]" \
  --provisioner-password-file /etc/step/provisioner-password

# Start the CA as a systemd service (assumes a step-ca.service unit is already installed)
systemctl enable --now step-ca

The root CA certificate needs to be distributed to every client that needs to trust internal certificates:

# Linux: copy to system trust store
sudo cp root_ca.crt /usr/local/share/ca-certificates/internal-ca.crt
sudo update-ca-certificates

# macOS: add to system keychain (run once per device)
sudo security add-trusted-cert -d -r trustRoot \
  -k /Library/Keychains/System.keychain root_ca.crt

Distribute the root CA via configuration management (Puppet, Ansible, or your preferred tool) so it is consistently present on all managed hosts. New services automatically trust the internal CA without manual steps.
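Since every client will trust whatever file is pushed out as the root, it is worth comparing its fingerprint against a pinned value before installing it. The sketch below uses openssl for that comparison; the helper names and the PINNED_FINGERPRINT variable are assumptions for illustration, and the pinned value would be the root fingerprint you recorded when the CA was initialized.

```shell
# Print a certificate's SHA-256 fingerprint as lowercase hex without colons.
cert_fingerprint() {
  # $1 = PEM certificate file
  openssl x509 -in "$1" -noout -fingerprint -sha256 \
    | cut -d= -f2 | tr -d ':' | tr 'A-F' 'a-f'
}

# Succeed only if the certificate matches the expected fingerprint.
verify_root_ca() {
  # $1 = certificate file, $2 = expected fingerprint (hex; case and colons ignored)
  fp_want=$(printf '%s' "$2" | tr -d ':' | tr 'A-F' 'a-f')
  fp_got=$(cert_fingerprint "$1")
  [ -n "$fp_got" ] && [ "$fp_got" = "$fp_want" ]
}

# Example: install the root only once the fingerprint matches.
# verify_root_ca root_ca.crt "$PINNED_FINGERPRINT" \
#   && sudo cp root_ca.crt /usr/local/share/ca-certificates/internal-ca.crt
```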

ACME with an Internal CA

step-ca supports the ACME protocol, which means any ACME-compatible client — including Traefik, Certbot, and the step CLI — can request certificates from your internal CA using the same workflow as Let’s Encrypt. No custom integration code required.

Configure Traefik to use the internal ACME endpoint for internal hostnames:

certificatesResolvers:
  internal-ca:
    acme:
      email: [email protected]
      storage: /data/acme-internal.json
      caServer: "https://ca.internal.example-corp.com:9000/acme/acme/directory"
      httpChallenge:
        entryPoint: web

For step-ca, configure an ACME provisioner:

step ca provisioner add acme --type ACME

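Services on the internal network can now get certificates from the internal CA automatically, with the same renewal mechanics as internet-facing services. The trust chain terminates at your internal root CA rather than Let’s Encrypt. Traefik itself must also trust that root in order to talk to the CA over HTTPS; its ACME library (lego) reads additional root certificates from the file named in the LEGO_CA_CERTIFICATES environment variable.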
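The doubled "acme" in the caServer URL is not a typo: step-ca serves each ACME provisioner's directory under /acme/PROVISIONER-NAME/directory, and the provisioner here happens to be named acme. A small sketch that builds the URL from its parts makes this explicit; stepca_acme_directory is a hypothetical helper name.

```shell
# Build the ACME directory URL for a step-ca provisioner.
stepca_acme_directory() {
  # $1 = CA host:port, $2 = name of the ACME provisioner
  printf 'https://%s/acme/%s/directory\n' "$1" "$2"
}

# Example: fetch the directory to confirm the provisioner is live,
# trusting only the internal root.
# curl --cacert root_ca.crt "$(stepca_acme_directory ca.internal.example-corp.com:9000 acme)"
```

If the provisioner were added under a different name, only the second path segment would change.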

DNS-01 Challenges: The Reliable Path for Wildcard Certs

Wildcard certificates (*.example-corp.com) are attractive for reducing certificate sprawl — one cert covers all subdomains. Let’s Encrypt issues wildcard certificates, but only via DNS-01 challenge. There is no HTTP-based path to a wildcard cert.

For organizations using Cloudflare, AWS Route 53, or other supported DNS providers, the ACME client performs the DNS challenge automatically. For custom or split-horizon DNS, you may need to implement a manual DNS hook or use a DNS API proxy.

A minimal DNS-01 hook script for a custom nameserver (called by Certbot):

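#!/bin/bash
# deploy-challenge.sh: passed to certbot via --manual-auth-hook.
# Certbot exports CERTBOT_DOMAIN (the name being validated, without any
# "*." prefix) and CERTBOT_VALIDATION (the TXT record value).
set -euo pipefail

NSUPDATE_SERVER="10.0.10.5"
NSUPDATE_KEY="/etc/certbot/nsupdate.key"

# Publish the challenge record through a dynamic DNS update
nsupdate -k "$NSUPDATE_KEY" << EOF
server $NSUPDATE_SERVER
zone example-corp.com.
update add _acme-challenge.${CERTBOT_DOMAIN}. 60 TXT "${CERTBOT_VALIDATION}"
send
EOF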
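The hook above only adds the record. Certbot also accepts a --manual-cleanup-hook that runs after validation, and removing the TXT record there keeps the zone tidy. One possible way to share the logic between both hooks is a function that prints the nsupdate input for either action. This is a sketch: nsupdate_challenge and the NSUPDATE_ZONE variable are hypothetical names, with defaults mirroring the values in the hook above.

```shell
# Print nsupdate input that adds or removes one ACME challenge TXT record.
nsupdate_challenge() {
  # $1 = add|delete, $2 = domain, $3 = validation string
  case $1 in
    add)    ns_update="update add _acme-challenge.$2. 60 TXT \"$3\"" ;;
    delete) ns_update="update delete _acme-challenge.$2. TXT \"$3\"" ;;
    *)      echo "usage: nsupdate_challenge add|delete DOMAIN VALIDATION" >&2
            return 2 ;;
  esac
  printf 'server %s\n' "${NSUPDATE_SERVER:-10.0.10.5}"
  printf 'zone %s\n' "${NSUPDATE_ZONE:-example-corp.com.}"
  printf '%s\n' "$ns_update"
  printf 'send\n'
}

# In the cleanup hook:
# nsupdate_challenge delete "$CERTBOT_DOMAIN" "$CERTBOT_VALIDATION" \
#   | nsupdate -k /etc/certbot/nsupdate.key
```

The delete form names the exact TXT value rather than the whole record set. That matters when a certificate covers both example-corp.com and *.example-corp.com, because both validations place a TXT record at the same name.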

Air-Gapped Environments

ACME requires the CA to reach DNS resolvers or the certificate requester to expose an HTTP endpoint. In a fully air-gapped environment with no external connectivity, Let's Encrypt is not an option. The alternatives:

  • Internal CA with pre-issued certificates: step-ca operates entirely on the internal network. Issue certificates with longer validity (1 year is common for air-gapped systems where automated renewal is harder), and automate renewal via an internal cronjob calling the step CLI against the internal CA endpoint.
  • PKCS#12 bundles distributed via configuration management: For hosts that cannot run an ACME client, issue certificates from the internal CA, package them as PKCS#12, and distribute via Puppet/Ansible with appropriate file permissions. Track expiry centrally and alert before the 30-day mark.
  • Short-lived certificates with frequent renewal: In environments where automation is possible but external connectivity is not, issue 24-hour certificates and automate renewal. This largely removes the need for revocation infrastructure, because a compromised certificate expires within a day.
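For the cron-driven and short-lived options above, the renewal job needs a rule for when a certificate is due. A minimal sketch of one such rule, renewing once a fixed share of the lifetime has elapsed, is shown below. The two-thirds default is an assumption chosen for illustration, and needs_renewal is a hypothetical helper that works on Unix timestamps.

```shell
# Succeed if a certificate is due for renewal, i.e. at least the given
# percentage of its lifetime has elapsed. All times are Unix timestamps.
needs_renewal() {
  # $1 = notBefore, $2 = notAfter, $3 = now, $4 = percent elapsed (default 66)
  lifetime=$(( $2 - $1 ))
  elapsed=$(( $3 - $1 ))
  [ $(( elapsed * 100 )) -ge $(( lifetime * ${4:-66} )) ]
}

# Example cron body (paths are placeholders):
# needs_renewal "$NOT_BEFORE" "$NOT_AFTER" "$(date +%s)" \
#   && step ca renew --force /etc/ssl/svc.crt /etc/ssl/svc.key
```

With a 24-hour certificate this rule triggers at roughly the 16-hour mark, leaving several hours of slack for a failed renewal attempt to be retried before the certificate actually expires.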

The NODE_TLS_REJECT_UNAUTHORIZED=0 Anti-Pattern

This environment variable disables TLS certificate validation in Node.js. It is used when developers hit self-signed or untrusted certificate errors in development, add the variable to make it go away, and then forget to remove it before deploying. In production, it means TLS provides encryption but no authentication — a man-in-the-middle attack becomes trivially easy.

The correct fix is always to distribute the root CA certificate to the system trust store or provide it explicitly to the application:

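# Instead of:
NODE_TLS_REJECT_UNAUTHORIZED=0

# Provide the CA bundle explicitly:
NODE_EXTRA_CA_CERTS=/etc/ssl/certs/internal-ca.crt

// Or in application code:
const fs = require('fs');
const https = require('https');

// Note: `ca` replaces the default trust store for requests made through
// this agent, whereas NODE_EXTRA_CA_CERTS adds to it.
const agent = new https.Agent({
  ca: fs.readFileSync('/etc/ssl/certs/internal-ca.crt')
});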

Treat any occurrence of NODE_TLS_REJECT_UNAUTHORIZED=0 in production code or environment configuration as a P0 security finding. Automated scanning tools like Semgrep can detect it in CI pipelines before it reaches deployment.
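Where a full static-analysis tool is not yet wired into CI, even a plain text search catches the most common forms. The following is a rough sketch rather than a replacement for a real scanner: scan_tls_bypass is a hypothetical helper, and its two patterns (the environment variable set to 0, and rejectUnauthorized: false in code) are assumptions about what is worth flagging.

```shell
# Scan a directory tree for disabled TLS verification in Node.js projects.
# Prints matching lines and returns non-zero if anything is found.
scan_tls_bypass() {
  # $1 = directory to scan (defaults to the current directory)
  scan_hits=$(find "${1:-.}" \
    \( -name node_modules -o -name .git \) -prune -o \
    -type f -exec grep -nE \
      -e "NODE_TLS_REJECT_UNAUTHORIZED[]\"' =:]{1,8}0" \
      -e 'rejectUnauthorized[[:space:]]*:[[:space:]]*false' \
      /dev/null {} +)
  if [ -n "$scan_hits" ]; then
    printf '%s\n' "$scan_hits"
    return 1
  fi
}

# Example CI step: scan_tls_bypass . || exit 1
```

Dependencies under node_modules are skipped on purpose, since third-party test fixtures often contain these strings and would drown out findings in your own code.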

Monitoring Certificate Expiry

Automated renewal is reliable, but not infallible. Storage failures, network issues, or DNS propagation problems can cause renewal to fail silently. Monitor certificate expiry as a separate control from the renewal mechanism.

A simple check with OpenSSL:

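#!/bin/bash
# cert-expiry-check.sh: warn when a host's certificate is close to expiry.
# Usage: cert-expiry-check.sh DOMAIN   (requires GNU date for `date -d`)
DOMAIN=${1:?usage: $0 DOMAIN}
WARN_DAYS=30

# Fetch the leaf certificate and extract its notAfter date
EXPIRY=$(echo | openssl s_client -connect "${DOMAIN}:443" -servername "$DOMAIN" 2>/dev/null \
  | openssl x509 -noout -enddate 2>/dev/null \
  | cut -d= -f2)

# Treat a failed connection or unreadable certificate as an error, not as a date
if [ -z "$EXPIRY" ]; then
  echo "ERROR: could not read certificate for ${DOMAIN}"
  exit 2
fi

EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s)
NOW_EPOCH=$(date +%s)
DAYS_LEFT=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))

if [ "$DAYS_LEFT" -lt "$WARN_DAYS" ]; then
  echo "WARN: ${DOMAIN} expires in ${DAYS_LEFT} days (${EXPIRY})"
  exit 1
fi
echo "OK: ${DOMAIN} expires in ${DAYS_LEFT} days"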

Run this as a scheduled job against all critical hostnames and route alerts to your monitoring system. Prometheus users can use the ssl_exporter to expose certificate expiry as a metric and alert at the 30-day and 7-day marks.
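When the check feeds an alerting pipeline, it helps to map the remaining days onto severity levels instead of a single warning. A minimal sketch of that mapping, using the 30-day and 7-day marks mentioned above as defaults, might look like this; expiry_status is a hypothetical helper and the level names are placeholders for whatever your monitoring system expects.

```shell
# Map days-until-expiry to a severity level.
expiry_status() {
  # $1 = days left, $2 = warning threshold (default 30), $3 = critical threshold (default 7)
  if [ "$1" -lt 0 ]; then
    echo EXPIRED
  elif [ "$1" -lt "${3:-7}" ]; then
    echo CRITICAL
  elif [ "$1" -lt "${2:-30}" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Example: expiry_status "$DAYS_LEFT"
```

The comparisons are strict, matching the check script: a certificate with exactly 30 days left still reports OK, and one with 29 reports WARNING.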

Conclusion

Certificate management becomes a solved problem when you treat it as an infrastructure automation challenge rather than a manual operational task. For internet-facing services, Traefik with Let's Encrypt DNS-01 challenges handles the full lifecycle. For internal services, step-ca provides a production-grade internal CA with ACME protocol support, keeping the same client toolchain. Air-gapped environments require more planning but are fully manageable with an internal CA and configuration management distribution.

The key discipline is eliminating manual processes — no spreadsheets tracking expiry dates, no calendar reminders for renewals, and absolutely no NODE_TLS_REJECT_UNAUTHORIZED=0 shortcuts. Automate the mechanics; monitor the outcomes.
