The Ultimate Guide to Kubernetes Troubleshooting
Understanding the Fundamentals of Kubernetes Troubleshooting

Introduction
A critical pod fails. Your alerts are firing. What's next? Every DevOps/SRE engineer needs a clear plan to handle cluster issues. This guide provides a systematic strategy and the essential tools to diagnose any problem, from pod health to the control plane.
The First Steps: Observing the Problem
When a pod is failing, your first task is to become a detective. You need to gather clues and observations before you jump to conclusions. Fortunately, Kubernetes provides a suite of essential commands for this initial investigation. Mastering these fundamentals will save you a lot of time and frustration.
1. The Overview: kubectl get pods
The most commonly used command is kubectl get pods. It provides a quick, high-level snapshot of your pods’ current state.
kubectl get pods -n <namespace>
By running this, you can immediately see:
The pod's STATUS (Running, Pending, CrashLoopBackOff, etc.).
The number of times the pod's containers have restarted (RESTARTS).
How long ago the pod was created (AGE).
A pod in a CrashLoopBackOff state is crashing and being restarted repeatedly, while a pod stuck in Pending usually has not been scheduled yet, for example because no node has enough free CPU or memory.
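In a busy namespace, it can help to filter the listing down to the pods that need attention. The following is a minimal sketch, not an official kubectl feature: the function name unhealthy_pods is hypothetical, and it assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print only pods that look unhealthy, reading `kubectl get pods` output on stdin.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY looks like "1/2"
    # Skip finished jobs; report anything not Running, or Running but not fully ready.
    if ($3 != "Completed" && ($3 != "Running" || ready[1] != ready[2]))
      print $1, $3, "restarts=" $4
  }'
}
```

A possible invocation is kubectl get pods -n <namespace> | unhealthy_pods, which prints one line per pod that is not fully ready.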
2. The Details: kubectl describe pod
Once you have a suspicious-looking pod, the next step is to run a more detailed inspection with kubectl describe pod.
kubectl describe pod <pod-name> -n <namespace>
This command provides a wealth of information, including:
Events: A list of recent events related to the pod, which can show you exactly why a container failed to start or why a volume couldn't be mounted.
Containers: Details about each container in the pod, including its image, State, and Restart Count.
State and Exit Code: If a container has terminated, its State or Last State shows the Reason and Exit Code. An exit code of 1 or a specific error message can be your first solid clue.
For example, the Events section of the output might look like this:
Events:
Type     Reason     Age                  From               Message
----     ------     ----                 ----               -------
Normal   Scheduled  4m1s                 default-scheduler  Successfully assigned my-app-pod to node-01
Normal   Pulling    4m                   kubelet            Pulling image "my-app:v1"
Normal   Pulled     3m59s                kubelet            Successfully pulled image "my-app:v1" in 987ms
Normal   Created    3m59s                kubelet            Created container my-app-container
Normal   Started    3m59s                kubelet            Started container my-app-container
Warning  Failed     3m1s (x3 over 3m1s)  kubelet            Failed to pull image "my-app:v2": rpc error: code = NotFound
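Exit codes are easier to act on once you know the common conventions behind them. The sketch below maps a few well-known codes to likely causes and reads the last exit code with a jsonpath query. Both function names are hypothetical, the mapping covers only general Unix conventions (codes above 128 usually mean 128 plus a signal number), and the query assumes the container of interest is the first one in the pod.

```shell
# Translate a container exit code into a likely cause (common conventions only).
explain_exit_code() {
  case "$1" in
    0)   echo "exited normally" ;;
    1)   echo "application error" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (often OOMKilled)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   # Codes above 128 conventionally mean "128 + signal number".
         if [ "$1" -gt 128 ] 2>/dev/null; then
           echo "killed by signal $(( $1 - 128 ))"
         else
           echo "application-specific exit code"
         fi ;;
  esac
}

# Read the exit code of the last terminated run of the pod's first container.
# Arguments: pod name, namespace.
last_exit_code() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
}
```

The two can be combined as explain_exit_code "$(last_exit_code <pod-name> <namespace>)". Treat the result as a hint; the application's own logs remain the authoritative source.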
3. The Output: kubectl logs
The kubectl logs command is your window into what's happening inside the container itself.
kubectl logs <pod-name> -n <namespace>
By default, this shows you the standard output and error streams from your container. It's the go-to command for finding application-level errors, startup failures, or unexpected behavior.
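When a container is crash-looping, the current instance may have no useful output yet, and multi-container pods need an explicit container name. The wrappers below are a small sketch using standard kubectl logs flags (--previous, --tail, -c, -f). The function names and the line counts are arbitrary choices for illustration.

```shell
# Logs from the previous (crashed) instance of a container.
# Arguments: pod name, namespace.
previous_logs() {
  kubectl logs "$1" -n "$2" --previous --tail=100
}

# Stream logs from one named container in a multi-container pod.
# Arguments: pod name, namespace, container name.
follow_container_logs() {
  kubectl logs "$1" -n "$2" -c "$3" --tail=50 -f
}
```

For a pod in CrashLoopBackOff, previous_logs <pod-name> <namespace> is often the quickest way to see the output of the run that just failed.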
Advanced Pod Troubleshooting: When a Pod is Running but Not Responding
What do you do when kubectl get pods shows a pod is Running, but your application still isn't working?
1. The First Test: Is the Application Itself Unresponsive?
The quickest way to check if your application is simply hung is by trying to access it from a shell inside the container. This tests the application directly, bypassing the Kubernetes service layer and network policies.
Use kubectl exec to run a simple connectivity test.
kubectl exec -it <pod-name> -n <namespace> -- curl localhost:<app-port>
If the curl command fails to connect or times out, the problem is most likely inside the container itself. If it returns the expected response, the issue is more likely in the Kubernetes networking or Service configuration. Note that this test only works if curl is installed in the container image.
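The decision described above can be scripted by inspecting curl's exit status, which kubectl exec normally passes through. This is a sketch under assumptions: the function name check_app_locally is hypothetical, the application is assumed to serve plain HTTP on /, and the 5-second timeout is arbitrary. Exit codes 7 (could not connect) and 28 (timeout) are documented curl codes; 126 or 127 typically indicates that curl is missing from the image.

```shell
# Probe the app from inside its own container and interpret curl's exit code.
# Arguments: pod name, namespace, application port.
check_app_locally() {
  pod=$1; ns=$2; port=$3
  rc=0
  kubectl exec "$pod" -n "$ns" -- \
    curl -sS -o /dev/null --max-time 5 "http://localhost:${port}/" || rc=$?
  case "$rc" in
    0)       echo "app responded: check Service, Endpoints, and NetworkPolicy next" ;;
    7)       echo "could not connect: likely nothing is listening on port ${port}" ;;
    28)      echo "timed out: the app may be hung" ;;
    126|127) echo "curl is probably not available in this image" ;;
    *)       echo "curl failed with exit code ${rc}" ;;
  esac
  return "$rc"
}
```

An exit code of 0 only means an HTTP response arrived, including error responses such as 500. Adding curl's -f flag would make HTTP errors fail as well.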
2. The Network Check: Can the Pod Reach Out?
If the pod is supposed to be making external connections but isn't, you need to check its network connectivity. You can use kubectl exec to run a few checks, provided the image includes the necessary tools; many minimal images do not.
DNS Resolution: Use nslookup or dig to check if the pod can resolve a service name or an external domain.
Connectivity: Use ping, nc, or curl to test connectivity to a service, database, or external API.
For example, to check DNS resolution for a service:
# If the service is in the same namespace as the pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>
# If the service is in a different namespace,
# use the fully qualified domain name (FQDN) for reliability.
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>.<service-namespace>.svc.cluster.local
If these tests fail, it points to a network configuration issue (like a misconfigured DNS or Network Policy) rather than an application-level bug.
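Building the fully qualified name by hand is error-prone, so a small helper can assemble it. This sketch assumes the default cluster domain cluster.local, which some clusters override, and that nslookup exists in the image. Both function names are hypothetical.

```shell
# Build the in-cluster DNS name for a Service.
# Arguments: service name, service namespace, optional cluster domain.
service_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "$2" "${3:-cluster.local}"
}

# Resolve a Service by FQDN from inside a pod.
# Arguments: pod name, pod namespace, service name, service namespace.
check_service_dns() {
  kubectl exec "$1" -n "$2" -- nslookup "$(service_fqdn "$3" "$4")"
}
```

If resolution fails, inspecting /etc/resolv.conf inside the pod (kubectl exec <pod-name> -n <namespace> -- cat /etc/resolv.conf) shows which nameserver and search domains the pod is actually using.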
3. Process and Resource Checks
Even if the application is reachable, its internal state might be unhealthy. You can use process inspection tools to see whether a process is consuming excessive CPU or is hung.
# Check running processes and their status
kubectl exec -it <pod-name> -n <namespace> -- ps aux
# Check resource usage (if available in the image)
kubectl exec -it <pod-name> -n <namespace> -- top
These commands give you a live view of the pod's workload and can help identify bottlenecks or application deadlocks.
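Long process listings are easier to scan with a filter. The sketch below flags processes above a CPU threshold or in the D (uninterruptible sleep) or Z (zombie) state. It assumes the procps-style ps aux layout (USER, PID, %CPU, %MEM, VSZ, RSS, TTY, STAT, START, TIME, COMMAND); BusyBox ps prints different columns, so this will not work there. The function name and the default threshold of 80 are illustrative.

```shell
# Flag busy or stuck processes from procps-style `ps aux` output on stdin.
# Argument: CPU percentage threshold (default 80).
suspicious_processes() {
  awk -v limit="${1:-80}" 'NR > 1 && ($3 + 0 > limit || $8 ~ /^[DZ]/) {
    print $2, $3, $8, $11   # PID, %CPU, STAT, first word of COMMAND
  }'
}
```

A possible invocation is kubectl exec <pod-name> -n <namespace> -- ps aux | suspicious_processes 80. A process briefly in the D state is normal during I/O, so only a persistent D state suggests a hang.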
4. Filesystem and Configuration Inspection
Sometimes a pod fails because a required file is missing or a volume has the wrong permissions.
# Check if a mount point is full on the pod
kubectl exec -it <pod-name> -n <namespace> -- df -h
# List files and check permissions
kubectl exec -it <pod-name> -n <namespace> -- ls -la /app/config
# View the contents of a configuration file
kubectl exec -it <pod-name> -n <namespace> -- cat /app/config/settings.yaml
These commands help you quickly rule out common configuration and filesystem-related issues.
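To spot a full volume without reading every row, the df output can be filtered by usage. This is a minimal sketch: the function name full_mounts and the default threshold of 90 percent are arbitrary, and it assumes the usual six-column df layout with the usage percentage in the fifth column and the mount point in the sixth.

```shell
# Print mount points at or above a usage threshold, reading df output on stdin.
# Argument: usage percentage threshold (default 90).
full_mounts() {
  awk -v limit="${1:-90}" 'NR > 1 {
    use = $5
    sub(/%$/, "", use)                  # "95%" -> "95"
    if (use + 0 >= limit) print $6, $5
  }'
}
```

A possible invocation is kubectl exec <pod-name> -n <namespace> -- df -P | full_mounts 90. The -P flag keeps each filesystem on a single line, which makes the column positions reliable.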
5. The Problem with kubectl exec
While kubectl exec is a familiar command, its significant flaw is that it only works on containers that have a shell and the tools you need. This makes it unusable for troubleshooting secure, distroless containers.
The Solution: kubectl debug
To solve this, we can use kubectl debug to inject a new, ephemeral container into the pod that contains the tools we need.
The command below adds a temporary busybox container to your pod and targets the process namespace of the application container, allowing you to debug the original container.
kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name> --container=debugger
In this command:
--image=busybox tells Kubernetes to use a small debug image that contains essential Linux utilities like ps and ls.
--target=<container-name> is the critical flag: it places the debugger in the process namespace of the named application container so it can see and interact with that container's processes. The container runtime must support this. The --share-processes flag only takes effect together with --copy-to, when debugging a copy of the pod.
--container=debugger gives our temporary debugging container a clear, identifiable name.
Once the command runs, you will be dropped into a shell inside the busybox container, where you can run commands like ps aux to inspect the processes of the main application container.
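An alternative to attaching an ephemeral container is to debug a copy of the pod, which leaves the original untouched. In this mode the --copy-to flag names the copy and --share-processes enables process namespace sharing between its containers. The sketch below uses hypothetical function names and an arbitrary -debug suffix for the copy.

```shell
# Start a copy of the pod with an added busybox container and shared processes.
# Arguments: pod name, namespace.
debug_copy() {
  kubectl debug "$1" -n "$2" -it --image=busybox \
    --copy-to="$1-debug" --share-processes
}

# Remove the copy when finished; it is a regular pod and is not cleaned up automatically.
# Arguments: original pod name, namespace.
cleanup_debug_copy() {
  kubectl delete pod "$1-debug" -n "$2"
}
```

Because the copy is a new pod, it does not reproduce the in-memory state of the original. This approach suits startup and configuration problems better than a live hang.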
Troubleshoot smarter. Not harder.



