Kubernetes Troubleshooting: Ultimate Guide

Introduction

A critical pod fails. Your alerts are firing. What's next? Every DevOps/SRE engineer needs a clear plan to handle cluster issues. This guide provides a systematic strategy and the essential tools to diagnose any problem, from pod health to the control plane.

The First Steps: Observing the Problem

When a pod is failing, your first task is to become a detective. You need to gather clues and observations before you jump to conclusions. Fortunately, Kubernetes provides a suite of essential commands for this initial investigation. Mastering these fundamentals will save you a lot of time and frustration.

1. The Overview: kubectl get pods

The most commonly used command is kubectl get pods. It provides a quick, high-level snapshot of your pods' current state.

kubectl get pods -n <namespace>

By running this, you can immediately see:

The pod's STATUS (Running, Pending, CrashLoopBackOff, etc.).
The number of times the pod's containers have restarted (RESTARTS).
How long ago the pod was created (AGE).

A pod in CrashLoopBackOff has a container that keeps crashing and being restarted, while a pod stuck in Pending usually indicates a scheduling problem, such as insufficient node resources or an unbound PersistentVolumeClaim.
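
To narrow a long listing down to the pods that need attention, the output can be filtered. The following is a minimal sketch, assuming the default column layout of kubectl get pods (NAME, READY, STATUS, RESTARTS, AGE). The function name unhealthy_pods is hypothetical, not part of kubectl.

```shell
# Print name, status, and restart count for pods that are neither Running
# nor Completed, or that have restarted at least once.
# Assumes the default columns: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  kubectl get pods -n "$1" --no-headers |
    awk '($3 != "Running" && $3 != "Completed") || $4 > 0 { print $1, $3, $4 }'
}

# Usage: unhealthy_pods my-namespace
```

An empty result means every pod in the namespace is Running or Completed with zero restarts.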

2. The Details: kubectl describe pod

Once you have a suspicious-looking pod, the next step is to run a more detailed inspection with kubectl describe pod.

kubectl describe pod <pod-name> -n <namespace>

This command provides a wealth of information, including:

Events: A list of recent events related to the pod, which can show you exactly why a container failed to start or why a volume couldn't be mounted.
Containers: Details about the containers within the pod, including their image, current state, and restart count.
Status & Exit Reason: For a container that has terminated, the State, Reason, and Exit Code. A non-zero exit code (such as 1) or a specific error message can be your first solid clue.

The Events section of the output looks something like this:

Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  4m1s                 default-scheduler  Successfully assigned my-app-pod to node-01
  Normal   Pulling    4m                   kubelet            Pulling image "my-app:v1"
  Normal   Pulled     3m59s                kubelet            Successfully pulled image "my-app:v1" in 987ms
  Normal   Created    3m59s                kubelet            Created container my-app-container
  Normal   Started    3m59s                kubelet            Started container my-app-container
  Warning  Failed     3m1s (x3 over 3m1s)  kubelet            Failed to pull image "my-app:v2": rpc error: code = NotFound
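
The same details can be pulled out without scrolling through the full describe output. This is a sketch rather than a definitive recipe: the function names pod_events and last_exit_codes are hypothetical, while the flags and field paths are standard kubectl and Pod status fields.

```shell
# Print the events that reference a pod, oldest first.
pod_events() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Print each container's name followed by the exit code of its previous
# termination (the code is blank if the container has never terminated).
last_exit_codes() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{" "}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Usage: pod_events my-app-pod my-namespace
#        last_exit_codes my-app-pod my-namespace
```

An exit code of 137 (128 + 9, meaning the process received SIGKILL) often accompanies the OOMKilled reason, so it is worth checking the container's memory limit when you see it.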

3. The Output: kubectl logs

The kubectl logs command is your window into what's happening inside the container itself.

kubectl logs <pod-name> -n <namespace>

By default, this shows you the standard output and error streams from your container. It's the go-to command for finding application-level errors, startup failures, or unexpected behavior.
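
A few kubectl logs flags are especially useful during an incident. The wrappers below are a small sketch; the function names are hypothetical, and the tail length and time window are arbitrary values chosen for illustration.

```shell
# Show the last 100 lines from the previous instance of the container.
# Useful for CrashLoopBackOff, where the current instance may have no logs yet.
previous_logs() {
  kubectl logs "$1" -n "$2" --previous --tail=100
}

# Stream logs from the last 10 minutes for one named container
# in a multi-container pod.
follow_container_logs() {
  kubectl logs "$1" -n "$2" -c "$3" --since=10m -f
}

# Usage: previous_logs my-app-pod my-namespace
#        follow_container_logs my-app-pod my-namespace my-app-container
```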

Advanced Pod Troubleshooting: When a Pod is Running but Not Responding

What do you do when kubectl get pods shows a pod is Running, but your application still isn't working?

1. The First Test: Is the Application Itself Unresponsive?

The quickest way to check whether your application is simply hung is to send it a request from inside its own container. This tests the application directly, bypassing the Kubernetes Service layer and any network policies.

Use kubectl exec to run a simple connectivity test.

kubectl exec -it <pod-name> -n <namespace> -- curl localhost:<app-port>

If curl fails to connect or times out, the problem is inside the container itself (assuming curl is present in the image). If it returns the expected response, the issue is more likely in the Kubernetes networking or Service configuration.
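
The check can be wrapped so that it reports a clear result. This is a minimal sketch, assuming the image contains curl and the application speaks HTTP on the given port. The function name check_app_local and the 5-second timeout are illustrative choices.

```shell
# Request the app on localhost from inside the pod and report the outcome.
# kubectl exec passes through curl's exit status, so a failed connection
# or a timeout takes the else branch.
check_app_local() {
  pod="$1"; namespace="$2"; port="$3"
  if code="$(kubectl exec "$pod" -n "$namespace" -- \
      curl -s -o /dev/null -w '%{http_code}' --max-time 5 "http://localhost:$port/")"; then
    echo "application answered with HTTP $code"
  else
    echo "no response on localhost:$port (curl missing, app hung, or port closed)"
    return 1
  fi
}

# Usage: check_app_local my-app-pod my-namespace 8080
```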

2. The Network Check: Can the Pod Reach Out?

If the pod is supposed to make outbound connections but can't, check its network connectivity. You can run the checks below with kubectl exec, but only if the tools exist in the image; many images don't ship standard network utilities.

DNS Resolution: Use nslookup or dig to check if the pod can resolve a service name or an external domain.
Connectivity: Use ping, nc, or curl to test connectivity to a service, database, or external API.

For example, to check DNS resolution for a service:

# If the service is in the same namespace as the pod
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>

# If the service is in a different namespace,
# use the fully qualified domain name (FQDN) for reliability.
kubectl exec -it <pod-name> -n <namespace> -- nslookup <service-name>.<service-namespace>.svc.cluster.local

If these tests fail, the cause is more likely a network configuration issue (such as misconfigured DNS or a restrictive NetworkPolicy) than an application-level bug.
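
Because the available tools vary from image to image, it can help to try more than one. The sketch below assumes that at least one of nslookup or getent exists in the image and that the image's nc supports the -z and -w options. The function names are hypothetical.

```shell
# Resolve a name from inside the pod: try nslookup first, and fall back
# to getent if nslookup fails or is not installed.
resolve_from_pod() {
  pod="$1"; namespace="$2"; name="$3"
  kubectl exec "$pod" -n "$namespace" -- nslookup "$name" ||
    kubectl exec "$pod" -n "$namespace" -- getent hosts "$name"
}

# Check whether a TCP port is reachable from inside the pod.
# Succeeds (exit status 0) if the connection opens within 3 seconds.
tcp_check_from_pod() {
  kubectl exec "$1" -n "$2" -- nc -z -w 3 "$3" "$4"
}

# Usage: resolve_from_pod my-app-pod my-namespace my-service
#        tcp_check_from_pod my-app-pod my-namespace my-database 5432
```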

3. Process and Resource Checks

Even if the application is reachable, its internal state might be unhealthy. Process inspection tools show whether a process is consuming excessive CPU or appears to be hung.

# Check running processes and their status
kubectl exec -it <pod-name> -n <namespace> -- ps aux

# Check resource usage (if top is available in the image)
kubectl exec -it <pod-name> -n <namespace> -- top

These commands give you a live view of the pod's workload and can help identify bottlenecks or application deadlocks.
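
When ps or top is missing from the image, there are two other places to look. The sketch below assumes a metrics API provider such as metrics-server is installed for kubectl top, and that the image at least contains cat. The function names are hypothetical.

```shell
# Per-container CPU and memory usage as reported by the metrics API.
pod_usage() {
  kubectl top pod "$1" -n "$2" --containers
}

# Show the name and state of PID 1 by reading /proc directly.
# The grep runs locally, so only cat is needed inside the container.
pid1_state() {
  kubectl exec "$1" -n "$2" -- cat /proc/1/status | grep -E '^(Name|State):'
}

# Usage: pod_usage my-app-pod my-namespace
#        pid1_state my-app-pod my-namespace
```

A state of D (uninterruptible sleep) that persists across several checks suggests the process is stuck waiting on I/O rather than busy on the CPU.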

4. Filesystem and Configuration Inspection

Sometimes a pod fails because a required file is missing or a volume has the wrong permissions.

# Check whether a filesystem mounted in the pod is full
kubectl exec -it <pod-name> -n <namespace> -- df -h

# List files and check permissions
kubectl exec -it <pod-name> -n <namespace> -- ls -la /app/config

# View the contents of a configuration file
kubectl exec -it <pod-name> -n <namespace> -- cat /app/config/settings.yaml

These commands help you quickly rule out common configuration and filesystem-related issues.
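
The disk check can be reduced to just the mounts that are nearly full. This is a minimal sketch, assuming df in the image supports the POSIX -P output format and that mount paths contain no spaces. The function name full_mounts and the default threshold of 90% are illustrative.

```shell
# Print mount point and usage for filesystems at or above a usage threshold.
# $3 is the threshold in percent and defaults to 90.
full_mounts() {
  pod="$1"; namespace="$2"; threshold="${3:-90}"
  kubectl exec "$pod" -n "$namespace" -- df -P |
    awk -v limit="$threshold" \
      'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= limit + 0) print $6, $5 }'
}

# Usage: full_mounts my-app-pod my-namespace
#        full_mounts my-app-pod my-namespace 75
```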

5. The Problem with kubectl exec

kubectl exec is familiar, but it has a significant limitation: it can only run programs that already exist in the container image. That makes it of little use for minimal or distroless images, which ship without a shell or debugging tools.

The Solution: kubectl debug

To solve this, use kubectl debug to add a new, ephemeral container to the running pod. The ephemeral container brings its own image, and therefore its own tools.

The command below adds a temporary busybox container to the pod and targets the application container's process namespace, so you can inspect the application's processes from the debugger.

kubectl debug -it <pod-name> -n <namespace> --image=busybox --target=<container-name> --container=debugger

In this command:

--image=busybox tells Kubernetes to use a small debug image that contains essential Linux utilities like ps and ls.
--target=<container-name> is the critical flag: it places the ephemeral container in the process namespace of the named application container, so the debugger can see and interact with its processes. The container runtime must support this. (The --share-processes flag belongs to a different mode: it applies only when kubectl debug copies the pod with --copy-to.)
--container=debugger gives the temporary debugging container a clear, identifiable name.

Once the command runs, you are dropped into a shell inside the busybox container, where you can run commands like ps aux to inspect the processes of the main application container.
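
kubectl debug can also work on a copy of the pod instead of the live one, which is useful when you don't want to touch a pod that is serving traffic. The sketch below uses the --copy-to flag together with --share-processes. The function names and the "-debug" name suffix are hypothetical choices.

```shell
# Create a copy of the pod with a busybox debug container added and
# process namespace sharing enabled. The original pod is left untouched.
debug_copy() {
  kubectl debug "$1" -n "$2" -it --image=busybox \
    --copy-to="$1-debug" --share-processes
}

# Delete the copy once the investigation is finished.
debug_cleanup() {
  kubectl delete pod "$1-debug" -n "$2"
}

# Usage: debug_copy my-app-pod my-namespace
#        debug_cleanup my-app-pod my-namespace
```

The copy is a new pod, so it does not share the original's in-memory state. It is best suited to problems that reproduce at startup.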

Troubleshoot smarter. Not harder.
