How NATS Lame Duck Mode Actually Works
Every production NATS deployment eventually needs a rolling restart. Lame duck mode is how NATS coordinates graceful server shutdown without dropping all connections simultaneously. The mechanism is straightforward once you see the moving parts, but most operators only encounter it during their first maintenance window – usually under pressure.
The Protocol
When a NATS server enters lame duck mode (via `SIGUSR2`, `nats-server --signal ldm`, or a programmatic `LameDuckShutdown()` call), it executes a multi-phase shutdown:[^1]
- Stops accepting new connections
- Transfers Raft leadership roles (if JetStream enabled)
- Shuts down JetStream and Raft nodes
- Sends an `INFO` protocol message with `ldm: true` to every connected client
- Clears its own URLs from the `connect_urls` list and provides alternative servers
- Waits through a grace period (default 10 seconds)
- Gradually disconnects clients over the remaining duration
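On the wire, that announcement is just another `INFO` protocol line. A trimmed sketch of what clients receive (server ID and URLs are illustrative, and a real message carries many more fields):

```
INFO {"server_id":"S1","ldm":true,"connect_urls":["nats://n2:4222","nats://n3:4222"]}
```

Note that the draining server's own URL is absent from `connect_urls`, which is how clients know where to reconnect.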
The client side responds automatically:
- Detects the `ldm: true` flag in the `INFO` message
- Fires the `LameDuckModeHandler` callback if configured
- Excludes the draining server from its connection pool
- When disconnected, reconnects to one of the alternative servers provided in the INFO message
- Restores all subscriptions on the new connection
The result: clients migrate to healthy servers without application-level intervention.
Timing
The defaults balance two competing concerns. Disconnecting clients too fast causes a reconnection storm on the remaining servers. Disconnecting too slowly delays the maintenance operation.
```go
DEFAULT_LAME_DUCK_DURATION     = 2 * time.Minute  // spread disconnections
DEFAULT_LAME_DUCK_GRACE_PERIOD = 10 * time.Second // initial breathing room
```
These constants are configurable via the server config.[^2]
The server calculates a sleep interval between disconnections:
```
interval = (duration - grace_period) / number_of_clients
```
For 1,000 clients with default settings, that’s roughly 110ms between each disconnection. The server randomizes each sleep to between 50% and 100% of the calculated interval to prevent synchronized reconnection waves.[^3]
For large deployments (10k+ connections), increase the duration:
```
lame_duck_duration: 5m
lame_duck_grace_period: 30s
```
JetStream Coordination
On JetStream-enabled servers, lame duck mode handles Raft and JetStream shutdown before notifying clients. The sequence is: transfer Raft leadership, shut down JetStream, shut down Raft nodes, then send the `ldm: true` INFO to clients. This ensures JetStream operations continue on the remaining servers before the grace period even begins.[^4]
The Handler
Most stateless services don’t need a handler – the automatic reconnection is sufficient. For stateful services, the handler provides a window to prepare:
```go
nc, err := nats.Connect(servers,
	nats.LameDuckModeHandler(func(nc *nats.Conn) {
		// The server is draining; wrap up before the disconnect lands.
		log.Info("server draining, preparing for reconnection")
		completeInFlightWork()
		pauseNewRequests()
	}),
)
if err != nil {
	log.Fatal(err)
}
```
The handler runs asynchronously and should not block for long. Its purpose is to signal your application to wrap up current work before the disconnection arrives.
Rolling Restart Sequence
For a 3-node cluster:
```shell
# Node 1
kill -USR2 $(pidof nats-server)   # enter lame duck (pgrep -x nats-server on macOS)
# wait for connections to drain
systemctl restart nats-server     # restart with new version
# Verify node 1 is healthy before proceeding
nats server report jetstream
# Repeat for nodes 2 and 3, one at a time
```
Wait for each node to fully rejoin the cluster and restore its Raft group memberships before proceeding to the next. The `nats server report jetstream` output confirms when all stream replicas are healthy.
Health Checks
During lame duck mode, the server’s `/healthz` endpoint returns an error status (the listener is closed, so the readiness check fails).[^5] If you’re running behind a load balancer or using Kubernetes readiness probes, this automatically stops new traffic from being routed to the draining server.
In Kubernetes:
```yaml
readinessProbe:
  httpGet:
    path: /healthz
    port: 8222
```
The readiness probe will fail as soon as lame duck mode begins, and Kubernetes will stop sending new connections to that pod before the grace period even expires.
Common Mistakes
Using SIGKILL during lame duck. The whole point is graceful shutdown. Sending SIGKILL after SIGUSR2 interrupts the client migration mid-flight, causing exactly the connection storm lame duck mode is designed to prevent.
Not accounting for in-flight work. The grace period delays the start of client disconnections, but does not guarantee each client time to finish work. Use the LameDuckModeHandler callback to drain in-flight requests, and ensure lame_duck_duration is long enough for all clients to be disconnected gracefully.
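One way to wire that up is a drain gate flipped by the handler: new requests are refused once lame duck begins, while in-flight work runs to completion. The `gate` type and its methods below are hypothetical names for illustration, not part of the NATS API; only the `nats.LameDuckModeHandler` hook that would call `onLameDuck` comes from the client library.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gate tracks whether the local server has announced lame duck mode.
type gate struct {
	draining atomic.Bool
}

// onLameDuck is what you would invoke from the LameDuckModeHandler callback.
func (g *gate) onLameDuck() {
	g.draining.Store(true)
}

// accept reports whether new work should be taken on. In-flight work
// is unaffected; only new requests are refused once draining starts.
func (g *gate) accept() bool {
	return !g.draining.Load()
}

func main() {
	g := &gate{}
	fmt.Println(g.accept()) // true: drain has not begun
	g.onLameDuck()          // fired by the lame duck handler
	fmt.Println(g.accept()) // false: refuse new work, finish what's in flight
}
```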
Skipping lame duck for “quick” restarts. Even a fast restart drops all connections simultaneously, causing a thundering herd on the remaining servers. Lame duck mode takes 2 minutes by default – cheap insurance against that storm.
[^1]: NATS Server Signals – SIGUSR2 triggers lame duck mode. The SIGTERM handler in `server/signal.go` checks `s.ldm`; if the server is already in lame duck mode, it skips `Shutdown()` entirely, allowing the graceful drain to complete. ↩︎
[^2]: The constants `DEFAULT_LAME_DUCK_DURATION` and `DEFAULT_LAME_DUCK_GRACE_PERIOD` are defined in `server/server.go`. Configurable via `lame_duck_duration` and `lame_duck_grace_period` in the server config. ↩︎
[^3]: The disconnection interval randomization uses `rand.Int63n(si)` clamped at a floor of `si/2`, producing a sleep in `[si/2, si)` where `si` is the calculated per-client interval. The interval is also capped at 1 second – with few clients over a long duration, the server still disconnects at most once per second. See `lameDuckMode()` in `server/server.go`. ↩︎
[^4]: The full shutdown sequence in `lameDuckMode()`: close listener, transfer Raft leaders, shut down JetStream, shut down Raft nodes, send the LDM INFO to routes and clients, wait out the grace period, then gradually disconnect clients. ↩︎
[^5]: NATS Monitoring – The `/healthz` endpoint calls `readyForConnections()`, which checks `s.listener != nil`. Since `lameDuckMode()` sets `s.listener = nil` as its first action, the health check returns `503` immediately. See `server/monitor.go`. ↩︎