How NATS Lame Duck Mode Actually Works

Every production NATS deployment eventually needs a rolling restart. Lame duck mode is how NATS coordinates graceful server shutdown without dropping all connections simultaneously. The mechanism is straightforward once you see the moving parts, but most operators only encounter it during their first maintenance window – usually under pressure.

The Protocol

When a NATS server enters lame duck mode (via SIGUSR2, nats-server --signal ldm, or a programmatic LameDuckShutdown() call), it executes a multi-phase shutdown:¹

1. Stops accepting new connections (closes the listener)
2. Transfers Raft leadership roles (if JetStream enabled)
3. Shuts down JetStream
4. Shuts down Raft nodes
5. Waits for the accept loops to fully drain
6. Sends an INFO protocol message with ldm: true (and an updated connect_urls list that no longer advertises this server) to every connected client²
7. Waits through a grace period (default 10 seconds)
8. Gradually disconnects clients over the remaining duration

The client side responds automatically:

1. Detects the ldm: true flag in the INFO message
2. Fires the LameDuckModeHandler callback if configured
3. Removes the draining server from its connection pool – but only because the server’s INFO update no longer advertises itself in connect_urls. The ldm: true flag alone does not trigger pool removal; the standard connect_urls update path does
4. When disconnected, reconnects to one of the alternative servers provided in the INFO message
5. Restores all subscriptions on the new connection

The result: clients migrate to healthy servers without application-level intervention.
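To make the client-side detection concrete, here is a minimal sketch of reading the two fields discussed above out of a raw INFO line. It uses only the Go standard library and is not how any client library is actually implemented; the serverInfo type, the parseInfo helper, and the sample addresses are hypothetical names chosen for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// serverInfo holds only the two INFO fields discussed above. Real INFO
// messages carry many more fields, which json.Unmarshal ignores here.
type serverInfo struct {
	LameDuckMode bool     `json:"ldm"`
	ConnectURLs  []string `json:"connect_urls"`
}

// parseInfo decodes a raw protocol line of the form `INFO {...}`.
func parseInfo(line string) (serverInfo, error) {
	var info serverInfo
	line = strings.TrimSpace(line)
	if !strings.HasPrefix(line, "INFO ") {
		return info, fmt.Errorf("not an INFO line: %q", line)
	}
	payload := strings.TrimPrefix(line, "INFO ")
	err := json.Unmarshal([]byte(payload), &info)
	return info, err
}

func main() {
	// Hypothetical INFO line from a draining server; the addresses are made up.
	line := `INFO {"server_id":"example","ldm":true,"connect_urls":["10.0.0.2:4222","10.0.0.3:4222"]}`
	info, err := parseInfo(line)
	if err != nil {
		panic(err)
	}
	if info.LameDuckMode {
		fmt.Println("server is draining; reconnect candidates:", info.ConnectURLs)
	}
}
```

In practice the client library does this parsing for you; the sketch only shows which parts of the message drive the behavior.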

Timing

The defaults balance two competing concerns. Disconnecting clients too fast causes a reconnection storm on the remaining servers. Disconnecting too slowly delays the maintenance operation.

DEFAULT_LAME_DUCK_DURATION    = 2 * time.Minute  // spread disconnections
DEFAULT_LAME_DUCK_GRACE_PERIOD = 10 * time.Second // initial breathing room

These are only defaults; both values can be overridden in the server config.³

The server calculates a sleep interval between disconnections:

interval = (duration - grace_period) / number_of_clients

For 1,000 clients with default settings, that’s roughly 110ms between disconnections. The server randomizes each sleep to between 50% and 100% of the calculated interval to prevent synchronized reconnection waves.⁴ number_of_clients is snapshotted once, after the accept loops drain and before any client receives the LDM INFO. Clients that disconnect on their own during the grace period are still counted, so the interval is not recalculated for the smaller population.
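The arithmetic can be sketched in a few lines of Go. This is an illustration of the formula and the jitter described here, not a copy of the server code: baseInterval and jitteredSleep are hypothetical helper names, and the sketch leaves out edge cases such as a grace period longer than the duration or more clients than nanoseconds in the window.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// baseInterval applies (duration - grace) / clients and caps the
// result at one second, as described in the footnote.
func baseInterval(duration, grace time.Duration, clients int) time.Duration {
	if clients <= 0 {
		return 0
	}
	si := (duration - grace) / time.Duration(clients)
	if si > time.Second {
		si = time.Second
	}
	return si
}

// jitteredSleep picks a random sleep in [si/2, si).
func jitteredSleep(si time.Duration) time.Duration {
	if si <= 0 {
		return 0
	}
	v := time.Duration(rand.Int63n(int64(si)))
	if v < si/2 {
		v = si / 2
	}
	return v
}

func main() {
	si := baseInterval(2*time.Minute, 10*time.Second, 1000)
	fmt.Println("base interval:", si) // 110ms with the defaults
	fmt.Println("one jittered sleep:", jitteredSleep(si))
}
```

Because every sleep falls below the base interval, the disconnection phase usually finishes somewhat sooner than duration minus grace period would suggest.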

For large deployments (10k+ connections), increase the duration:

lame_duck_duration: 5m
lame_duck_grace_period: 30s

The Handler

Most stateless services don’t need a handler – the automatic reconnection is sufficient. For stateful services, the handler provides a window to prepare:

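nc, err := nats.Connect(servers,
    nats.LameDuckModeHandler(func(nc *nats.Conn) {
        log.Println("server draining, preparing for reconnection")
        // Stop taking new work first, then finish what is in flight.
        pauseNewRequests()
        // Run in a goroutine so the callback itself returns quickly.
        go completeInFlightWork()
    }),
)
if err != nil {
    log.Fatal(err)
}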

The handler runs asynchronously and should not block for long. Its purpose is to signal your application to wrap up current work before the disconnection arrives.
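What pauseNewRequests() and completeInFlightWork() do is application-specific. One possible shape, sketched below with only the standard library, is a small gate that refuses new requests once paused and lets the caller wait for outstanding ones. The gate type and its method names are hypothetical, not part of nats.go.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errPaused is returned for requests that arrive after the gate is paused.
var errPaused = errors.New("gate paused: server is draining")

// gate tracks in-flight requests and can refuse new ones.
type gate struct {
	mu       sync.Mutex
	paused   bool
	inFlight sync.WaitGroup
}

// begin registers a request, or fails once the gate is paused.
func (g *gate) begin() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.paused {
		return errPaused
	}
	g.inFlight.Add(1)
	return nil
}

// end marks a request started with begin as finished.
func (g *gate) end() { g.inFlight.Done() }

// pause stops new requests; it returns immediately.
func (g *gate) pause() {
	g.mu.Lock()
	g.paused = true
	g.mu.Unlock()
}

// wait blocks until every in-flight request has called end.
func (g *gate) wait() { g.inFlight.Wait() }

func main() {
	var g gate
	if err := g.begin(); err != nil {
		panic(err)
	}
	g.pause() // roughly what pauseNewRequests() might do
	fmt.Println("new request after pause:", g.begin())
	g.end()
	g.wait() // roughly what completeInFlightWork() might do
	fmt.Println("in-flight work finished")
}
```

Pausing is cheap and can happen inside the handler; the wait is the part that may take time, which is why it belongs outside the callback. A real service would also reopen the gate after reconnecting, which this sketch omits.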

Rolling Restart Sequence

For a 3-node cluster:

# Node 1 -- preferred: signal via the CLI (no PID required, works cross-platform)
nats-server --signal ldm

# Or via Unix signal (pgrep -x matches the exact process name; pidof also works on Linux)
kill -USR2 $(pgrep -x nats-server)

# Wait for connections to drain, then restart with new version
systemctl restart nats-server

# Verify node 1 is healthy before proceeding
nats server report jetstream

# Repeat for nodes 2 and 3, one at a time

Wait for each node to fully rejoin the cluster and restore its Raft group memberships before proceeding to the next. The nats server report jetstream output confirms when all stream replicas are healthy. If a node fails to rejoin – particularly after a name change – see Recovering a JetStream Cluster After Quorum Loss for the recovery procedure.

Health Checks

During lame duck mode, the server’s /healthz endpoint returns 500 Internal Server Error (the listener is closed, so the readiness check fails).⁵ If you’re running behind a load balancer or using Kubernetes readiness probes, this automatically stops new traffic from being routed to the draining server.

In Kubernetes:

readinessProbe:
  httpGet:
    path: /healthz
    port: 8222

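The probe starts failing as soon as lame duck mode begins, but Kubernetes only marks the pod unready after failureThreshold consecutive failures. With the Kubernetes defaults (periodSeconds: 10, failureThreshold: 3) that can take up to about 30 seconds, longer than the default 10-second grace period, so tighten both settings if the pod must leave the Service endpoints before disconnections start.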
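The same endpoint can act as a first gate in a rolling restart script: poll it after restarting a node and move on only once it answers 200 OK. The sketch below is one way to do that with the standard library. waitHealthy and httpProbe are hypothetical helpers, the address is an assumption, and a passing /healthz does not replace the nats server report jetstream check on replica health.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitHealthy calls probe every interval until it reports 200 OK
// or the timeout elapses.
func waitHealthy(timeout, every time.Duration, probe func() (int, error)) error {
	deadline := time.Now().Add(timeout)
	for {
		status, err := probe()
		if err == nil && status == http.StatusOK {
			return nil
		}
		if time.Now().Add(every).After(deadline) {
			return fmt.Errorf("not healthy after %v (last status %d, err %v)", timeout, status, err)
		}
		time.Sleep(every)
	}
}

// httpProbe returns a probe that GETs url and reports the status code.
func httpProbe(url string) func() (int, error) {
	client := &http.Client{Timeout: 2 * time.Second}
	return func() (int, error) {
		resp, err := client.Get(url)
		if err != nil {
			return 0, err
		}
		resp.Body.Close()
		return resp.StatusCode, nil
	}
}

func main() {
	// Hypothetical address; 8222 matches the monitoring port in the probe above.
	probe := httpProbe("http://127.0.0.1:8222/healthz")
	// Short timeout for demonstration; a real restart script would wait longer.
	if err := waitHealthy(3*time.Second, time.Second, probe); err != nil {
		fmt.Println("node not ready:", err)
		return
	}
	fmt.Println("node healthy, safe to move to the next one")
}
```

Passing the probe in as a function keeps the polling loop easy to exercise without a running server.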

Common Mistakes

Using SIGKILL during lame duck. The whole point is graceful shutdown. Sending SIGKILL after SIGUSR2 interrupts the client migration mid-flight, causing exactly the connection storm lame duck mode is designed to prevent.

Not accounting for in-flight work. The grace period delays the start of client disconnections, but does not guarantee each client time to finish work. Use the LameDuckModeHandler callback to drain in-flight requests, and ensure lame_duck_duration is long enough for all clients to be disconnected gracefully. Note that lame_duck_duration is not a strict ceiling on total drain time – Raft leader transfer (up to ~1 second) and JetStream shutdown run before the duration timer starts, so the elapsed time from LDM entry to final disconnect can exceed the configured duration.

Skipping lame duck for “quick” restarts. Even a fast restart drops all connections simultaneously, causing a thundering herd on the remaining servers. Lame duck mode takes roughly 2 minutes by default, which is a small price for avoiding that.

¹ NATS Server Signals - SIGUSR2 triggers lame duck mode. The SIGTERM handler in server/signal.go checks s.ldm; when the server is already in lame duck mode, the entire SIGTERM handling block is a no-op – Shutdown(), WaitForShutdown(), and os.Exit(1) are all skipped, allowing the in-flight graceful drain to run to completion.
² The full shutdown sequence in lameDuckMode() (server/server.go): close listener, transfer Raft leaders (up to ~1 second), shutdown JetStream, shutdown Raft nodes, wait for accept loops to fully drain via ldmCh, send LDM INFO to routes, send LDM INFO to clients, grace period, gradual client disconnection, Shutdown(). JetStream is fully shut down before clients receive the ldm: true INFO.
³ Constants DEFAULT_LAME_DUCK_DURATION and DEFAULT_LAME_DUCK_GRACE_PERIOD are defined in server/const.go. Configurable via lame_duck_duration and lame_duck_grace_period in the server config.
⁴ The disconnection interval randomization uses rand.Int63n(si) clamped at floor si/2, producing a sleep in [si/2, si) where si is the calculated per-client interval. The interval is also capped at 1 second – with few clients over a long duration, the server still waits at most 1 second between disconnections. The sleep is skipped for the last client: lameDuckMode() breaks out of the loop after closing the final connection without a trailing wait. See lameDuckMode() in server/server.go.
⁵ NATS Monitoring - The /healthz endpoint calls readyForConnections(), which checks s.listener != nil. Since lameDuckMode() sets s.listener = nil as its first action, the health check returns 500 Internal Server Error immediately (StatusServiceUnavailable – 503 – is reserved for the “JetStream not enabled” branch). See server/monitor.go.

Published September 20, 2025 Updated May 15, 2026