A Kubernetes failure story (#1)

k8s meetup at NMBRS


# whoami

kubernetes addicts support group

k8s.af

The setting

Kubernetes authentication

  • Dex
  • OIDC "proxy"
  • Storage backends

Config

    config:
      storage:
        type: kubernetes
        config:
          inCluster: true

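With the kubernetes storage backend, Dex keeps all of its state, every pending login included, as custom resources in the cluster. A quick way to see what that means in practice (a sketch, assuming the dex.coreos.com CRD group Dex normally registers):

    kubectl get crds | grep dex.coreos.com
    # authcodes.dex.coreos.com, authrequests.dex.coreos.com, ...
    kubectl -n dex get authrequests.dex.coreos.com

Every login attempt creates an AuthRequest object. Keep that in mind.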
The problem

Bad defaults

  • AKA "stop hitting yourself"
  • Insanely long default TTL on AuthRequests (see the sketch below)
  • Auto-refreshing dashboards minting a fresh AuthRequest on every reload
  • Actual security bug
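The TTL in question lives in Dex's expiry settings. A hedged sketch of tightening it, using what I believe are the current Dex option names (the stock authRequests expiry is reportedly a full 24h, which is why every dashboard refresh left an object behind for a day):

    expiry:
      authRequests: "10m"   # assumed default: 24h
      idTokens: "24h"
      signingKeys: "6h"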

I got these hints earlier…

Now I've got $x AuthRequests

  • How TF do I clean this up?!
  • Listing all AuthRequests times out
  • Kinda need that list to delete the resources (workaround sketched below)
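The first thing worth reaching for is pagination: the apiserver can serve a LIST in chunks instead of materializing one enormous response. A sketch (note that kubectl delete --all still lists under the hood, so it can run into the very same timeout):

    # count them in pages of 500 instead of one huge LIST
    kubectl -n dex get authrequests.dex.coreos.com --chunk-size=500 --no-headers | wc -l
    # delete with a generous client-side timeout
    kubectl -n dex delete authrequests.dex.coreos.com --all --request-timeout=10m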

What would you do?

  1. Tweak timeouts in etcd/apiserver
  2. Delete namespace and let k8s clean up for us
  3. Bypass k8s and delete the resources in etcd (sketched below)
  4. (Did I miss any?)
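For reference, door number three would look roughly like this. A sketch only: the prefix follows how the apiserver stores custom resources (/registry/<group>/<plural>/<namespace>/<name>), and real invocations also need the cluster's etcd TLS flags:

    # from an etcd member: confirm the prefix before touching anything
    ETCDCTL_API=3 etcdctl get --prefix --keys-only /registry/dex.coreos.com/authrequests/ | head
    # then drop the whole prefix, bypassing the apiserver entirely
    ETCDCTL_API=3 etcdctl del --prefix /registry/dex.coreos.com/authrequests/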

What's behind door number two?

It's magic. The End.

Reconciliation loops

  • List all AuthRequests in order to delete them
  • Timeout
  • Repeat
  • Masters' load spikes to 60 (see below)
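That loop is the namespace controller: deleting a namespace means listing everything in it first, and a LIST that times out simply gets retried, forever. On apiservers recent enough to report deletion progress, the stall at least shows up on the namespace object (a sketch; condition names as I recall them from newer Kubernetes releases):

    kubectl get namespace dex -o jsonpath='{.status.conditions}'
    # look for NamespaceDeletionContentFailure / NamespaceContentRemaining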

New situation

Royally screwed

  • Masters down
  • Can't access monitoring (no auth, no port-forward)
  • Plenty of alerts in Slack, though…

Dirty tricks

    apiVersion: v1
    kind: Namespace
    metadata:
      name: dex
    spec:
      finalizers:
      - kubernetes
    status:
      phase: Active

You can't actually remove that finalizer with kubectl
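For the record, the trick itself: namespace finalizers live under spec and are only honored through the finalize subresource, which kubectl edit/apply never touches. The usual incantation looks something like this sketch:

    # dump the namespace, delete "kubernetes" from spec.finalizers, save as ns.json
    kubectl get namespace dex -o json > ns.json
    # PUT the edited object to the finalize subresource via a local proxy
    kubectl proxy &
    curl -X PUT -H "Content-Type: application/json" \
      --data-binary @ns.json \
      http://127.0.0.1:8001/api/v1/namespaces/dex/finalize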

NEVER DO THIS

That "fixed" it

  • Still left with 0.5 GB of AuthRequests in etcd (see below)
  • Fairly confident kubectl create namespace dex kills the cluster
  • Deployed Dex to a new namespace with SANE DEFAULTS
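Both the leftover half-gig and any cleanup of it are visible from etcd itself. A sketch (endpoint status reports the on-disk DB size, and after deleting keys the space only comes back once you defragment):

    ETCDCTL_API=3 etcdctl endpoint status --write-out=table   # check the DB SIZE column
    ETCDCTL_API=3 etcdctl defrag                              # reclaim space after deletes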

Postmortem

No business applications were harmed during this outage.

Questions?