
Troubleshooting

Common issues and their resolutions for the cluster.


Node Not Joining Cluster

Symptoms: Node shows as NotReady, or does not appear in kubectl get nodes.

Check Talos Health

talosctl health --nodes <node-ip>

Look for failures in etcd, kubelet, or API server connectivity.

Check etcd Membership

talosctl etcd members --nodes 192.168.0.201

If the node was previously part of the cluster and was reset, its stale etcd member entry may need to be removed, using the member ID from the output above:

talosctl etcd remove-member <member-id> --nodes 192.168.0.201

Verify Machine Config

Ensure the node has the correct machine config applied:

talosctl apply-config --nodes <node-ip> --file ./clusterconfig/<node-config>.yaml --dry-run

Check kubelet-csr-approver

New nodes need their kubelet CSRs approved before they become Ready. Verify the kubelet-csr-approver is running:

kubectl get pods -n kube-system -l app.kubernetes.io/name=kubelet-csr-approver
kubectl get csr
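
If the approver is healthy but CSRs remain Pending, they can also be approved manually as a stopgap:

kubectl certificate approve <csr-name>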

Bootstrap Addons

If kubelet-csr-approver is not running, apply the bootstrap addons:

cd pitower/talos && just addons


Pod Stuck in Pending State

Symptoms: Pod stays in Pending status and never gets scheduled.

Check Node Resources

kubectl describe node <node-name> | grep -A10 "Allocated resources"
kubectl top nodes

Check Storage

If the pod requires a PVC, verify the storage class and available capacity:

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>

For Rook Ceph:

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd df

For OpenEBS (local PV):

kubectl get blockdevice -n openebs

Check Pod Events

kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20

Check Node Taints

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints
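
If an unexpected taint is blocking scheduling and the pod should not tolerate it, remove the taint (the trailing dash deletes it):

kubectl taint nodes <node-name> <taint-key>:NoSchedule-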

DNS Not Resolving

Symptoms: Services cannot resolve DNS names, or external DNS records are not created.

Ubiquiti DNS Interception

Port 53 Interception

The Ubiquiti router intercepts all DNS traffic on port 53. This means standard DNS lookups may return the router's cached results rather than actual Cloudflare records.

Verify with DoH (DNS over HTTPS)

To check actual Cloudflare DNS records, bypass the router's interception using DoH:

# Using curl to query Cloudflare DoH
curl -sH 'accept: application/dns-json' \
  'https://cloudflare-dns.com/dns-query?name=echo.example.com&type=A' | jq

# Using dig with DoH (if supported)
dig @1.1.1.1 echo.example.com +https

Check CoreDNS

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
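
To confirm CoreDNS answers queries from inside the cluster, run a one-off lookup from a throwaway pod:

kubectl run -it --rm dnstest --image=busybox --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local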

Check external-dns

kubectl get pods -n networking -l app.kubernetes.io/name=external-dns
kubectl logs -n networking -l app.kubernetes.io/name=external-dns --tail=50

Verify external-dns is watching the correct gateways:

kubectl get gateways -A -l external-dns.alpha.kubernetes.io/enabled=true

Gateway Label Filter

external-dns uses --gateway-label-filter=external-dns.alpha.kubernetes.io/enabled=true to select which gateways to process. Ensure the target gateway has this label.
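
If the label is missing, add it to the target gateway (shown here for a gateway in the networking namespace; adjust the name and namespace as needed):

kubectl label gateway <gateway-name> -n networking external-dns.alpha.kubernetes.io/enabled=true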

Check HTTPRoute and Gateway

kubectl get httproutes -A
kubectl get gateways -A

Certificate Issues

Symptoms: TLS errors, expired certificates, or certificates not being issued.

Check cert-manager

kubectl get pods -n cert-manager
kubectl logs -n cert-manager deploy/cert-manager --tail=50

Check Certificate Status

kubectl get certificates -A
kubectl get certificaterequests -A
kubectl get orders.acme.cert-manager.io -A
kubectl get challenges.acme.cert-manager.io -A

Check ClusterIssuer

kubectl get clusterissuers
kubectl describe clusterissuer letsencrypt-production

Force Certificate Renewal

Delete the Certificate to trigger re-issuance (ArgoCD will recreate the resource on its next sync):

kubectl delete certificate <cert-name> -n <namespace>
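
Note that cert-manager only issues a new certificate when the backing Secret is missing or no longer matches the spec, so delete the Secret as well if re-issuance does not start. Alternatively, if the cmctl CLI is installed, it can mark the certificate for renewal without deleting anything:

kubectl delete secret <cert-secret-name> -n <namespace>
cmctl renew <cert-name> -n <namespace>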

DNS-01 Challenges

If using DNS-01 challenges with Cloudflare, verify the API token has the required permissions (Zone:Read and DNS:Edit for the target zone) and that the DNS zone is accessible.
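
A quick way to sanity-check the token is Cloudflare's token verification endpoint. The secret and key names below are examples; adjust them to match your issuer configuration:

# Extract the token from the cluster (secret/key names are examples)
CF_API_TOKEN=$(kubectl get secret cloudflare-api-token -n cert-manager \
  -o jsonpath='{.data.api-token}' | base64 -d)

# Ask Cloudflare whether the token is valid and active
curl -s -H "Authorization: Bearer $CF_API_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify | jq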


Service Not Accessible

Symptoms: Cannot reach a service via its URL, connection timeouts, or 404 errors.

Check Gateway Status

kubectl get gateways -n networking
kubectl describe gateway envoy-external -n networking
kubectl describe gateway envoy-internal -n networking

Check HTTPRoute

kubectl get httproutes -A
kubectl describe httproute <route-name> -n <namespace>

Verify the route's parentRefs point to the correct gateway:

  • envoy-external: For services accessed via Cloudflare tunnel (proxied)
  • envoy-internal: For services accessed via Tailscale/LAN
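
To check which gateway a route is actually bound to, inspect its parentRefs directly:

kubectl get httproute <route-name> -n <namespace> -o jsonpath='{.spec.parentRefs}'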

Check Cilium L2 Announcements

Verify LoadBalancer IPs are being announced:

kubectl get svc -A | grep LoadBalancer
cilium status

Check that the Cilium L2 announcement policy is active:

kubectl get ciliuml2announcementpolicies
kubectl get ciliumbgppeeringpolicies
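
Cilium's L2 announcements use Kubernetes Leases for leader election, so you can see which node currently answers ARP for each LoadBalancer service; lease names typically follow the pattern cilium-l2announce-<namespace>-<service>:

kubectl get leases -n kube-system | grep cilium-l2announce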

Check Cloudflare Tunnel

For externally exposed services:

kubectl get pods -n networking -l app.kubernetes.io/name=cloudflared
kubectl logs -n networking -l app.kubernetes.io/name=cloudflared --tail=50

Check nginx Reverse Proxy

kubectl get pods -n networking -l app.kubernetes.io/name=nginx
kubectl get svc -n networking | grep nginx

End-to-End Request Flow

flowchart LR
    Client --> CF[Cloudflare]
    CF --> Tunnel[cloudflared]
    Tunnel --> Nginx[nginx]
    Nginx --> EE[envoy-external<br/>192.168.0.239]
    EE --> Route[HTTPRoute]
    Route --> Svc[Service]
    Svc --> Pod[Pod]

Verify each hop in the chain to isolate where the failure occurs.
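
For example, to test the envoy-external hop directly from the LAN, bypass Cloudflare and nginx by sending a request straight to the gateway's LoadBalancer IP with the expected Host header (echo.example.com is a placeholder hostname):

# Plain HTTP against the gateway IP
curl -v -H "Host: echo.example.com" http://192.168.0.239/

# HTTPS, resolving the hostname to the gateway IP locally
curl -v --resolve echo.example.com:443:192.168.0.239 https://echo.example.com/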


ArgoCD Sync Failed

Symptoms: Application shows OutOfSync, Degraded, or Unknown in ArgoCD.

Check Application Status

kubectl get applications -n argocd
kubectl describe application <app-name> -n argocd

ArgoCD CLI

argocd app list
argocd app get <app-name>
argocd app diff <app-name>

Common Sync Failures

Resource Already Exists

If a resource was manually created, ArgoCD may fail to adopt it:

argocd app sync <app-name> --force

Schema Validation Errors

CRDs may not be installed yet when the app tries to sync:

# Check if CRDs exist
kubectl get crds | grep <crd-name>

# Sync CRDs first if needed
argocd app sync <crd-app-name>

Health Check Failures

Check pod health and events:

kubectl get pods -n <namespace> -l app.kubernetes.io/name=<app>
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

Helm Template Errors

For apps using Helm, test rendering locally:

cd pitower/kubernetes/apps/<category>/<app>
kustomize build . --enable-helm

Storage Issues

Rook Ceph Degraded

kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph health detail
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd tree

PVC Stuck in Pending

kubectl get pvc -A | grep Pending
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get sc

VolSync Backup Failures

kubectl get replicationsources -A
kubectl describe replicationsource <name> -n <namespace>
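
VolSync runs each backup as a mover Job; when a ReplicationSource reports errors, the Job's pod logs usually show the underlying cause. For the restic mover, source Jobs are typically named volsync-src-<replicationsource-name>:

kubectl get jobs -n <namespace> | grep volsync
kubectl logs -n <namespace> job/volsync-src-<name> --tail=50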

Network Issues

Pod-to-Pod Communication

# Test from a debug pod
kubectl run -it --rm debug --image=busybox -- sh
# Inside the pod:
wget -qO- http://<service>.<namespace>.svc.cluster.local:<port>

Cilium Connectivity

cilium connectivity test
cilium status --verbose

Envoy Gateway Logs

kubectl logs -n envoy-gateway-system deploy/envoy-gateway --tail=50
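
The controller logs above cover configuration and translation errors; data-plane problems surface in the per-gateway Envoy proxy pods instead, which Envoy Gateway labels with the owning gateway's name:

kubectl logs -n envoy-gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-name=envoy-external --tail=50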