Most Kubernetes outages I've seen weren't infrastructure failures. They were graceful shutdown done wrong.
If you've ever seen dropped requests during a deployment, a pod that wouldn't die, or health checks failing during a rollout - this post is for you.
SIGTERM sounds simple. It's not.
What you think happens
The mental model most engineers carry:
kubectl delete pod → pod gets SIGTERM → pod shuts down → done
┌─────────────┐    SIGTERM     ┌─────────────┐    exit 0     ┌──────────┐
│   kubectl   │ ─────────────► │     Pod     │ ────────────► │   Gone   │
└─────────────┘                └─────────────┘               └──────────┘
Simple. Clean. Wrong.
What actually happens is messier, has a race condition baked in, and will bite you in production if you don't account for it.
The full termination sequence
When a pod is deleted (rolling update, node drain, or kubectl delete pod), Kubernetes kicks off a sequence that most engineers have never seen written down completely.
                 kubectl delete pod
                          │
                          ▼
                     API Server
                      marks pod
                    "Terminating"
                          │
          ┌───────────────┼───────────────┐
          │               │               │
          ▼               ▼               ▼
       kubelet        Endpoints      ReplicaSet
    sends SIGTERM    controller      controller
    to container     removes pod    detects count
          │         from Service    below desired
          │               │               │
          ▼               ▼               ▼
     App starts      kube-proxy       Scheduler
      shutdown         updates       places new
                      iptables       pod on node
                   (takes seconds)
t=0s    API Server marks pod Terminating in etcd
t=0s    THREE things happen simultaneously:
          1. kubelet sends SIGTERM to PID 1 in the container
             → your application's shutdown handler fires (if you wrote one)
          2. Endpoints controller removes pod IP from Service endpoints
             → kube-proxy starts updating iptables rules on every node
             → new connections stop being routed to this pod... eventually
          3. ReplicaSet controller detects pod count below desired
             → creates a new pod → Scheduler assigns it to a node
t=0–?   kube-proxy propagates iptables changes across all nodes
        (this takes seconds - it is NOT instant)
t=Xs    Your application finishes in-flight work and exits cleanly
        kubelet sees process exit → container removed
t=30s   If process hasn't exited by terminationGracePeriodSeconds (default 30s)
        kubelet sends SIGKILL → container forcefully terminated
Notice what happens at t=0. SIGTERM fires at the exact same moment the Endpoints controller starts removing the pod. The key word: starts.
The race condition nobody talks about
kube-proxy doesn't update iptables instantly. The kube-proxy instance on each node watches the Endpoints object, detects the change, and rewrites that node's iptables rules. This takes seconds - sometimes 5-10 seconds on a large cluster - and the nodes don't all finish at the same moment.
During those seconds:
t=0s ┌─────────────────────────────────────────────────────────┐
     │ SIGTERM sent to pod         iptables update starts      │
     │ Pod: "I'm shutting down"    kube-proxy: "working on it" │
     └─────────────────────────────────────────────────────────┘
t=3s ┌─────────────────────────────────────────────────────────┐
     │ Pod: refuses new connections                            │
     │ iptables: STILL routing traffic to this pod  ← BUG      │
     │ User: gets 502 / connection refused                     │
     └─────────────────────────────────────────────────────────┘
t=8s ┌─────────────────────────────────────────────────────────┐
     │ kube-proxy: iptables updated, traffic stops             │
     │ Pod: already refusing connections (too late)            │
     └─────────────────────────────────────────────────────────┘
Your pod got SIGTERM and started refusing connections. But traffic is still being sent to it because kube-proxy hasn't caught up yet. This is the bug.
You can reproduce it yourself:
# Watch your error rate during a rolling deploy
kubectl rollout restart deployment/your-app -n your-namespace

# In another terminal, hammer the service
while true; do curl -s -o /dev/null -w "%{http_code}\n" http://your-service/health; sleep 0.1; done

# You'll see 502s and connection refused errors during the rollout
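If you'd rather get a tally than eyeball a scrolling list of status codes, the same loop can be sketched in Python. This is a rough equivalent, not a load-testing tool: `fetch_status` and `probe` are names made up for this example, and the URL you pass in is the same placeholder as in the curl loop.

```python
import http.client
import time
import urllib.error
import urllib.request
from collections import Counter
from typing import Callable


def fetch_status(url: str, timeout: float = 2.0) -> str:
    """Return the HTTP status code as a string, or the exception name on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return str(resp.status)
    except urllib.error.HTTPError as exc:
        # 4xx/5xx responses (including 502) are raised as HTTPError
        return str(exc.code)
    except (OSError, http.client.HTTPException) as exc:
        # connection refused, reset, timeout, or a broken response
        return type(exc).__name__


def probe(fetch: Callable[[], str], attempts: int, interval_s: float = 0.1) -> Counter:
    """Call fetch() repeatedly and count each distinct outcome."""
    tally = Counter()
    for _ in range(attempts):
        tally[fetch()] += 1
        time.sleep(interval_s)
    return tally
```

Run something like `probe(lambda: fetch_status("http://your-service/health"), attempts=600)` in one terminal while the rollout happens in another. Anything in the result that isn't "200" is a request your users would have lost.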
The fix: preStop hook
The solution is elegant once you understand the problem. You need to delay your application's shutdown long enough for kube-proxy to finish propagating the iptables changes.
containers:
  - name: your-app
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "15"]  # wait for kube-proxy to catch up
Without preStop:
  t=0s    SIGTERM → app shuts down
  t=0s    iptables still routing          ← bug
  t=3s    app refuses connections
  t=3s    users get 502s                  ← bug
  t=8s    iptables finally updated

With preStop (sleep 15):
  t=0s    preStop starts (sleep 15)
  t=0s    iptables update starts
  t=8s    iptables update complete
  t=15s   preStop done → SIGTERM fires
  t=15s   app shuts down gracefully
  t=15s   no traffic arriving → no 502s ✅
By the time your application gets SIGTERM, no new requests are arriving. The race condition is gone.
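To put numbers on that, here's a back-of-the-envelope model in Python. It's a sketch under simplifying assumptions, not a measurement: `error_window_s` and its parameters are invented for this example, and it assumes the app stops accepting connections a fixed delay after SIGTERM, with SIGTERM arriving only once the preStop sleep finishes.

```python
def error_window_s(propagation_s: float,
                   app_stop_delay_s: float = 0.0,
                   prestop_sleep_s: float = 0.0) -> float:
    """Seconds during which traffic can still reach a pod that already refuses it.

    propagation_s    - time for kube-proxy to finish updating iptables (assumed)
    app_stop_delay_s - time between SIGTERM and the app refusing connections (assumed)
    prestop_sleep_s  - length of the preStop sleep, 0 if there is no hook
    """
    if min(propagation_s, app_stop_delay_s, prestop_sleep_s) < 0:
        raise ValueError("durations must be non-negative")
    # The app starts refusing connections at this point on the timeline.
    refuses_at = prestop_sleep_s + app_stop_delay_s
    # Traffic keeps arriving until propagation finishes.
    return max(0.0, propagation_s - refuses_at)
```

With the numbers from the timeline above (8s propagation, app refusing connections 3s after SIGTERM), `error_window_s(8, 3)` gives a 5 second window of failed requests, and `error_window_s(8, 3, prestop_sleep_s=15)` gives zero.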
15 seconds works for most clusters. On larger clusters with many nodes, kube-proxy propagation can take longer. Measure it: watch your iptables update latency during a test rollout and tune the sleep value to match your actual environment.
One catch: the preStop hook counts against terminationGracePeriodSeconds. If your grace period is 30s and preStop sleeps 15s, your application only has 15s to drain. Set your grace period accordingly: terminationGracePeriodSeconds: 60 gives you preStop (15s) + app drain (45s).
Setting terminationGracePeriodSeconds
The default is 30 seconds. That's fine for a simple web server. It's not fine if your application has long-running operations.
Ask yourself: what is the longest operation my application can be in the middle of when it receives SIGTERM?
Simple web server:   30s is plenty (requests are fast)
Database:            60-120s (transactions need to commit)
Video transcoding:   300s+ (can't restart mid-transcode)
LLM inference:       depends on your workload
                     (model loading + inference × max concurrent requests)
Set it based on your worst case, not your average case.
spec:
  terminationGracePeriodSeconds: 300
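Since the preStop sleep eats into the grace period, the arithmetic is worth writing down once. A minimal sketch in Python; the function names and the optional safety margin are my own additions for illustration, not anything Kubernetes provides.

```python
def drain_budget_s(grace_period_s: int, prestop_sleep_s: int) -> int:
    """Seconds the app has left to drain once the preStop hook has finished."""
    return max(0, grace_period_s - prestop_sleep_s)


def required_grace_period_s(prestop_sleep_s: int,
                            worst_case_drain_s: int,
                            margin_s: int = 0) -> int:
    """Grace period that covers the preStop sleep plus the worst-case drain."""
    return prestop_sleep_s + worst_case_drain_s + margin_s
```

For example, `drain_budget_s(30, 15)` shows the default grace period leaves only 15 seconds after a 15 second sleep, while `required_grace_period_s(15, 45)` gives the 60 seconds mentioned earlier.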
What your application needs to do
Kubernetes sends the signal. Your application has to handle it.
import signal
import sys

shutdown_requested = False

def handle_sigterm(signum, frame):
    global shutdown_requested
    shutdown_requested = True  # tell main loop to stop accepting work
    # do NOT call sys.exit here - let the main loop drain and clean up

signal.signal(signal.SIGTERM, handle_sigterm)

# Main loop checks the flag
while not shutdown_requested:
    handle_next_request()

# Cleanup runs after the loop drains
flush_buffers()
close_db_connections()
sys.exit(0)
What your handler should do:
- Stop accepting new requests or work
- Finish all in-flight operations
- Flush any buffered writes to disk or remote storage
- Close database connections cleanly
- Exit with code 0 (clean exit)
If PID 1 in your container doesn't handle SIGTERM, the signal is ignored. Kubernetes waits for terminationGracePeriodSeconds, then sends SIGKILL. Your application is force-killed with no chance to clean up - data loss, corrupted state, dropped requests.
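The "stop accepting, then finish what's in flight" part is where most handlers fall short, so here's one way it could look for a queue-based worker. Treat it as a sketch: `GracefulWorker` and its methods are hypothetical names, and a real service would hang its own cleanup (flushing, closing connections) after `run()` returns.

```python
import queue
import signal
import threading


class GracefulWorker:
    """Processes queued jobs until shutdown is requested, then drains the rest."""

    def __init__(self, handle_job):
        self.jobs = queue.Queue()
        self.handle_job = handle_job
        self.shutdown = threading.Event()

    def on_sigterm(self, signum, frame):
        # Only set a flag here; the run loop does the draining.
        self.shutdown.set()

    def install(self):
        # signal.signal() must be called from the main thread.
        signal.signal(signal.SIGTERM, self.on_sigterm)

    def submit(self, job) -> bool:
        """Accept a job unless shutdown has already started."""
        if self.shutdown.is_set():
            return False
        self.jobs.put(job)
        return True

    def run(self, poll_s: float = 0.2):
        # Keep going until shutdown was requested AND the queue is empty.
        while not (self.shutdown.is_set() and self.jobs.empty()):
            try:
                job = self.jobs.get(timeout=poll_s)
            except queue.Empty:
                continue
            self.handle_job(job)
```

Call `install()` once at startup, then `run()`. After SIGTERM, `submit()` starts returning False while `run()` keeps working through whatever was already queued, and only then returns so cleanup and exit can happen.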
The PID 1 trap
This catches a lot of people.
# Wrong - runs your app as a child of sh, not as PID 1
CMD ["sh", "-c", "python app.py"]

# Right - your app IS PID 1, receives signals directly
ENTRYPOINT ["python", "-m", "your.app"]
When you use shell form (which Docker runs through sh -c), or wrap the command in sh -c yourself as above, the shell becomes PID 1. SIGTERM goes to the shell. The shell may or may not forward it to your app - usually it doesn't. Your app never gets the signal, never shuts down cleanly, and gets SIGKILL'd after the grace period.
Always use exec form in your Dockerfile ENTRYPOINT.
If you can't restructure the Dockerfile (third-party image, complex entrypoint script), use tini or dumb-init as a lightweight init process. They run as PID 1, forward signals correctly to child processes, and reap zombie processes. One line in your Dockerfile: ENTRYPOINT ["/tini", "--", "your-entrypoint.sh"].
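This is easy to catch in review or CI. Docker treats a CMD or ENTRYPOINT as exec form when its arguments are a JSON array, and as shell form otherwise, so a rough check fits in a few lines of Python. The sketch below is illustrative only: `classify_entrypoint` and the `SHELLS` set are made up for this example, and it looks at a single instruction line, not a whole Dockerfile.

```python
import json

# Shell binaries treated as wrappers in this sketch (an assumption, extend as needed).
SHELLS = {"sh", "bash", "/bin/sh", "/bin/bash"}


def classify_entrypoint(instruction: str) -> str:
    """Classify a CMD/ENTRYPOINT line as 'exec', 'shell', or 'shell-wrapper'."""
    keyword, _, rest = instruction.strip().partition(" ")
    if keyword.upper() not in {"CMD", "ENTRYPOINT"}:
        raise ValueError(f"not a CMD or ENTRYPOINT instruction: {instruction!r}")
    try:
        argv = json.loads(rest)
    except json.JSONDecodeError:
        return "shell"  # not a JSON array, so Docker runs it via a shell
    if not (isinstance(argv, list) and all(isinstance(a, str) for a in argv)):
        return "shell"
    if len(argv) >= 2 and argv[0] in SHELLS and argv[1] == "-c":
        return "shell-wrapper"  # exec form, but a shell still ends up as PID 1
    return "exec"
```

Both `CMD python app.py` and `CMD ["sh", "-c", "python app.py"]` get flagged, while `ENTRYPOINT ["python", "-m", "your.app"]` and the tini variant come back as "exec".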
Readiness probe: your last line of defense
Even with preStop and a proper SIGTERM handler, you want your readiness probe to fail immediately on shutdown:
def handle_sigterm(signum, frame):
    global is_ready, shutdown_requested
    is_ready = False           # readiness probe returns 503 immediately
    shutdown_requested = True  # tell main loop to stop accepting work
    # do NOT exit here - let the main loop drain first
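When readiness starts returning 503, the kubelet marks the pod NotReady once the probe has failed failureThreshold times in a row, periodSeconds apart. That isn't instant, and a Terminating pod is already being removed from Service endpoints, so this doesn't replace the preStop sleep. What it does cover is anything that checks your readiness endpoint directly, such as an external load balancer health check. Belt and suspenders.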
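For completeness, here's what the endpoint side of that flag might look like using only the standard library. It's a minimal sketch: the `/ready` path, `readiness_response`, and `ReadinessHandler` are assumptions for this example, and most real services would wire the same flag into whatever web framework they already use.

```python
import threading
from http.server import BaseHTTPRequestHandler
from typing import Tuple

ready = threading.Event()
ready.set()  # in this sketch the app counts as ready from the start


def readiness_response(is_ready: bool) -> Tuple[int, bytes]:
    """Status code and body the readiness endpoint should return."""
    return (200, b"ok\n") if is_ready else (503, b"shutting down\n")


class ReadinessHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/ready":  # hypothetical probe path
            self.send_error(404)
            return
        status, body = readiness_response(ready.is_set())
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def handle_sigterm(signum, frame):
    # Flip readiness first; draining and exit happen elsewhere.
    ready.clear()
```

Serve it from a background thread with `http.server.HTTPServer(("", 8080), ReadinessHandler)` (the port is arbitrary) and point the readinessProbe's httpGet at the same path and port.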
The complete picture
Rolling update starts
│
├── t=0s ────────────────────────────────────────────────────────
│     OLD POD                           NEW POD
│     preStop fires (sleep 15s)         starts booting
│     removed from endpoints            readiness → 503
│                                       not yet in endpoints
│
├── t=10s ───────────────────────────────────────────────────────
│     OLD POD                           NEW POD
│     preStop still sleeping            passes readiness ✅
│     no new traffic arriving           added to endpoints
│                                       starts receiving traffic
│
├── t=15s ───────────────────────────────────────────────────────
│     OLD POD                           NEW POD
│     preStop done → SIGTERM fires      serving 100% of traffic
│     readiness → 503
│     finishes in-flight requests
│     flushes state to storage
│     exits cleanly (code 0)
│
└── t=20s ───────────────────────────────────────────────────────
      OLD POD                           NEW POD
      container removed                 serving 100% of traffic

      Zero dropped requests. Zero data loss. ✅
The checklist
- ✅ Use exec form in Dockerfile ENTRYPOINT (not shell form)
- ✅ Handle SIGTERM in your application
- ✅ preStop sleep 15s (measure on large clusters) - let kube-proxy propagate
- ✅ terminationGracePeriodSeconds = preStop + worst-case drain time
- ✅ Readiness probe fails immediately on shutdown
- ✅ Finish in-flight work before exiting
- ✅ Exit with code 0
Miss any of these and you're relying on luck during your next deployment.
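Some of this checklist can be checked mechanically. Here's a rough sketch in Python that inspects a pod spec already parsed into a plain dict (from YAML or JSON). The function names are hypothetical, it only recognizes a preStop of the exact form ["sleep", "N"], and it can't see the items that live in your Dockerfile or application code.

```python
def prestop_sleep_s(container: dict) -> int:
    """Seconds slept by an exec preStop hook of the form ["sleep", "N"], else 0."""
    command = (container.get("lifecycle", {})
                        .get("preStop", {})
                        .get("exec", {})
                        .get("command", []))
    if len(command) == 2 and command[0] == "sleep" and str(command[1]).isdigit():
        return int(command[1])
    return 0


def check_pod_spec(spec: dict, worst_case_drain_s: int) -> list:
    """Return human-readable warnings for the manifest-level checklist items."""
    warnings = []
    grace = spec.get("terminationGracePeriodSeconds", 30)  # Kubernetes default
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sleep_s = prestop_sleep_s(container)
        if sleep_s == 0:
            warnings.append(f"{name}: no preStop sleep found")
        if "readinessProbe" not in container:
            warnings.append(f"{name}: no readinessProbe")
        if grace < sleep_s + worst_case_drain_s:
            warnings.append(
                f"{name}: grace period {grace}s < preStop {sleep_s}s "
                f"+ drain {worst_case_drain_s}s"
            )
    return warnings
```

An empty list means the manifest side looks fine. SIGTERM handling, exec-form entrypoints, and exit codes still need a human, or a test that actually kills the pod.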
Most K8s outages I've seen weren't infrastructure failures. They were graceful shutdown done wrong. The good news: once you understand the sequence, it's completely preventable.