Advanced · For Platform Engineers, Cloud Security Engineers, Architects

July 5, 2026 · 7 min

Secure-by-Default GKE: A Reference Architecture for 2026

A practical secure-by-default GKE reference architecture for 2026: defense in depth across identity, network, supply chain, admission, runtime, and observability — with Workload Identity Federation and Binary Authorization at the core.

gke, kubernetes-security, gcp, workload-identity-federation, binary-authorization, zero-trust, reference-architecture

Contents

Layer 1 — Identity: kill the static keys
Layer 2 — Network: private by default
Layer 3 — Supply chain: only signed images run
Layer 4 — Admission and policy: deny by default
Layer 5 — Workload runtime: shrink the blast radius
Layer 6 — Observability: assume you will need the tape
How to adopt it without a big bang
Verdict

The first GKE cluster you stand up feels reassuringly managed: Google runs the control plane, nodes auto-repair, upgrades are a click. It is easy to assume “managed” means “secure.” It does not. A default cluster is built to work, and almost every security control that matters is opt-in. Leave the defaults alone and you can end up with a public endpoint, broad node permissions, no image provenance, and containers running as root. None of that is a bug. It just does not feel like your problem until it is.

This is a secure-by-default GKE reference architecture: a blueprint where the hardened configuration is the baseline, not a backlog item. It is organized as defense in depth — six independent layers, each of which assumes the one in front of it might fail.

[Figure: Secure-by-default GKE. A default cluster is insecure by default; this architecture makes hardening the baseline.]

The goal is not a single wall. It is that an attacker who gets past identity still meets the network controls, who gets past the network still meets a supply-chain gate, and so on. Break one layer and the others still hold.

This is the security half of a bigger GCP architecture story. For the messaging side, where the goal is keeping an event-driven system clean rather than secure, see “Pub/Sub or Eventarc? Event-Driven GCP Without the Spaghetti.”

[Figure: Six independent layers of defense in depth around a GKE workload: identity, network, supply chain, admission and policy, workload runtime, observability.]

Let’s walk the layers, from the outside in.

Layer 1 — Identity: kill the static keys

The highest-impact change you can make to a GKE cluster has nothing to do with the network. It is getting rid of long-lived service-account keys.

The old pattern is a JSON key file mounted into the pod. It works, and it is a liability: that file gets copied into images, committed to Git, printed in an env dump, and it keeps granting access until a human notices and rotates it. Leaked static keys are among the most common, and most preventable, causes of cloud compromise.

Workload Identity Federation replaces it. A Kubernetes service account is bound to a Google identity; when the pod calls a GCP API, Google mints a short-lived, IAM-scoped token on the spot. There is no key at rest, nothing to leak, and the token expires on its own.

[Figure: Zero Trust on GKE. A pod’s Kubernetes service account federates to a Google identity, GCP STS issues a short-lived token, and the workload calls the API with no static key involved.]

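# enable Workload Identity Federation for GKE on the cluster
# (existing Standard node pools also need --workload-metadata=GKE_METADATA)
gcloud container clusters update CLUSTER \
  --workload-pool=PROJECT_ID.svc.id.goog

# allow the Kubernetes SA to impersonate the Google SA
gcloud iam service-accounts add-iam-policy-binding \
  GSA@PROJECT_ID.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:PROJECT_ID.svc.id.goog[NAMESPACE/KSA]"

# annotate the Kubernetes SA so pods using it resolve to that Google SA
kubectl annotate serviceaccount KSA \
  --namespace NAMESPACE \
  iam.gke.io/gcp-service-account=GSA@PROJECT_ID.iam.gserviceaccount.com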

The rule for this layer: no service-account JSON keys anywhere in the cluster. If a workload needs GCP access, it federates. If a human needs access, they use short-lived credentials too.
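One way to back this rule with a guardrail, sketched here on the assumption that you manage Organization Policy at the project level: the boolean constraint iam.disableServiceAccountKeyCreation stops new user-managed keys from being created. PROJECT_ID is a placeholder and the file name policy.yaml is hypothetical; a file like this would typically be applied with gcloud org-policies set-policy.

```yaml
# policy.yaml (hypothetical file name): block creation of new service-account keys.
# Assumes the boolean constraint iam.disableServiceAccountKeyCreation.
# Existing keys are not deleted by this policy; they still have to be found and revoked.
name: projects/PROJECT_ID/policies/iam.disableServiceAccountKeyCreation
spec:
  rules:
    - enforce: true
```

Setting the same policy at the folder or organization level would cover every project underneath it, which is closer to the "baseline, not backlog" idea than enforcing it one project at a time.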

Layer 2 — Network: private by default

The next assumption to flip is reachability. A default cluster can expose its control plane to the public internet and place nodes on routable addresses. Neither should be true in production.

Private cluster — nodes get internal IPs only; the control-plane endpoint is private.
Authorized networks — the API server is reachable only from the CIDR ranges you name (your CI, your bastion, your VPN).
Cloud Armor — a WAF and DDoS layer in front of anything you do expose through a load balancer, with rules tuned to your traffic instead of the defaults.
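As an illustration of the Cloud Armor item, here is a minimal sketch of attaching an existing security policy to a Service exposed through GKE Ingress. The names web, web-backendconfig, and edge-policy are assumptions, and the Cloud Armor policy itself is assumed to exist already; clusters that use the Gateway API attach policies through a different resource.

```yaml
# BackendConfig pointing the load balancer backend at a Cloud Armor policy.
# "edge-policy" is a hypothetical, pre-existing Cloud Armor security policy.
apiVersion: cloud.google.com/v1
kind: BackendConfig
metadata:
  name: web-backendconfig
  namespace: web
spec:
  securityPolicy:
    name: edge-policy
---
# The Service opts in to the BackendConfig through an annotation.
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: web
  annotations:
    cloud.google.com/neg: '{"ingress": true}'
    cloud.google.com/backend-config: '{"default": "web-backendconfig"}'
spec:
  type: ClusterIP
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```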

The principle is that nothing is reachable unless there is a reason for it to be. The network is not the whole defense — it is the layer that buys you time when identity is bypassed.

Layer 3 — Supply chain: only signed images run

Most “container security” effort goes into scanning images. Scanning tells you what is in an image; it does not guarantee that only your images run. That guarantee is the supply-chain layer, and on GKE it is Binary Authorization.

Binary Authorization is an admission gate. Every image must carry the attestations your policy demands — typically a signature from your CI proving it was built from reviewed source and passed its gates. Anything unsigned or untrusted is blocked before it reaches a node.

[Figure: Supply chain on GKE. CI builds and signs an image, Artifact Registry stores it, and Binary Authorization admits signed images while blocking unsigned ones before they reach a node.]

# Binary Authorization policy (excerpt)
globalPolicyEvaluationMode: ENABLE
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  enforcementMode: ENFORCED_BLOCK_AND_AUDIT_LOG
  requireAttestationsBy:
    - projects/PROJECT_ID/attestors/prod-attestor

This is what closes the gap between “we build securely” and “only what we built securely actually runs,” which is exactly the space supply-chain attacks live in. Roll it out in dry-run mode first (Binary Authorization’s audit-only setting) so you can see what would be blocked, then switch to enforcement.
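A sketch of what that first rollout step could look like: the same policy excerpt as above, with only the enforcement mode changed so that violations are logged instead of blocked. The attestor name is the placeholder from the excerpt, not a real resource.

```yaml
# Binary Authorization policy (excerpt), dry-run variant
globalPolicyEvaluationMode: ENABLE
defaultAdmissionRule:
  evaluationMode: REQUIRE_ATTESTATION
  # Log would-be violations to Cloud Audit Logs without blocking the deployment.
  enforcementMode: DRYRUN_AUDIT_LOG_ONLY
  requireAttestationsBy:
    - projects/PROJECT_ID/attestors/prod-attestor
```

Once the audit log shows only images you expect to be rejected, switching enforcementMode back to ENFORCED_BLOCK_AND_AUDIT_LOG turns the gate on.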

Layer 4 — Admission and policy: deny by default

Binary Authorization answers “is this image trusted?” It does not answer “does this workload obey our rules?” That is the job of a policy engine — OPA Gatekeeper or Kyverno — running as an admission controller.

This is where you encode the rules specific to your platform, as code, enforced deny-by-default:

only images from your Artifact Registry
required labels and owners on every workload
resource requests and limits mandatory
no hostPath, no privileged containers, no latest tags

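# Kyverno: block privileged containers, cluster-wide
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
    - name: privileged-containers
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Privileged containers are not allowed."
        pattern:
          spec:
            # init and ephemeral containers are checked only when present
            =(ephemeralContainers):
              - =(securityContext):
                  =(privileged): "false"
            =(initContainers):
              - =(securityContext):
                  =(privileged): "false"
            containers:
              - =(securityContext):
                  =(privileged): "false"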
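The other rules in the list follow the same shape. As a rough sketch, the registry and tag rules might look like the policy below; the registry path us-docker.pkg.dev/PROJECT_ID is a placeholder, the policy name is hypothetical, and the policy starts in Audit so it reports violations without blocking anything.

```yaml
# Kyverno: restrict image sources and tags (sketch; checks regular containers only)
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-images
spec:
  validationFailureAction: Audit   # report only; switch to Enforce once reports are clean
  background: true
  rules:
    - name: artifact-registry-only
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must come from the platform Artifact Registry."
        pattern:
          spec:
            containers:
              - image: "us-docker.pkg.dev/PROJECT_ID/*"
    - name: require-image-tag
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "Images must specify a tag or digest."
        pattern:
          spec:
            containers:
              - image: "*:*"
    - name: no-latest-tag
      match:
        any:
          - resources:
              kinds: ["Pod"]
      validate:
        message: "The latest tag is not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
```

Extending the patterns to initContainers and ephemeralContainers, as in the privileged-container policy, would close the remaining gap before moving to Enforce.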

Pair this with Kubernetes Pod Security Admission as the built-in floor. Pod Security Admission enforces the Pod Security Standards (for example, the baseline profile) per namespace; the policy engine enforces everything your organization specifically cares about.
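The built-in floor is switched on with namespace labels. A minimal sketch, assuming a namespace called payments (a hypothetical name): enforce the baseline profile now, and surface violations of the stricter restricted profile as warnings and audit annotations so you can see what tightening would break.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    # Reject pods that violate the baseline Pod Security Standard.
    pod-security.kubernetes.io/enforce: baseline
    # Do not block, but record and warn on anything that would fail "restricted".
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
```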

Layer 5 — Workload runtime: shrink the blast radius

Assume something eventually gets a shell in a container. Layer 5 is about making that as boring as possible.

Run as non-root — a runAsNonRoot security context, no root in the image.
Read-only root filesystem — writable paths are explicit emptyDir mounts, nothing else.
Drop capabilities — start from none and add back only what is needed.
seccomp — the RuntimeDefault profile filters dangerous syscalls.

securityContext:
  runAsNonRoot: true
  runAsUser: 10001
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]
  seccompProfile:
    type: RuntimeDefault

None of these stop an intrusion on their own. Together they mean a compromised container is a locked room, not a hallway.
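To show where those settings sit in a real manifest, here is a sketch of a complete Pod. The pod name, namespace, image path, and the choice of /tmp as the only writable path are assumptions for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: api
  namespace: payments
spec:
  containers:
    - name: api
      image: us-docker.pkg.dev/PROJECT_ID/apps/api:1.4.2   # placeholder image
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        readOnlyRootFilesystem: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
        seccompProfile:
          type: RuntimeDefault
      volumeMounts:
        # The only writable path: an explicit, size-capped scratch volume.
        - name: tmp
          mountPath: /tmp
  volumes:
    - name: tmp
      emptyDir:
        sizeLimit: 64Mi
```

If the application fails to start with a read-only root filesystem, the error usually names the path it tried to write to, which tells you which additional emptyDir mount it needs.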

Layer 6 — Observability: assume you will need the tape

The final layer is not prevention; it is knowing. When something does go wrong, the difference between a five-minute triage and a five-day incident is whether you can see it.

Audit logs — Kubernetes and Cloud Audit Logs, retained and queryable.
Runtime detection — a tool like Falco or GKE’s built-in threat detection watching for anomalous syscalls and process activity.
SLO-based alerting — alerts that fire on symptoms your users feel, with clear ownership, not a wall of noise nobody reads.
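For the runtime-detection item, a custom Falco rule might look like the sketch below. The rule name and output format are hypothetical, and the condition assumes the macros spawned_process, container, and shell_procs from Falco's default ruleset are loaded.

```yaml
# Hypothetical custom Falco rule; relies on macros from Falco's default ruleset.
- rule: Shell spawned in hardened workload
  desc: A shell process started inside a container that should never need one.
  condition: spawned_process and container and shell_procs
  output: >
    Shell in container (user=%user.name pod=%k8s.pod.name
    ns=%k8s.ns.name container=%container.name cmdline=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```

A rule this broad will be noisy on workloads that legitimately shell out, so it is worth tuning with exceptions before routing it to a pager.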

Security observability is what turns the first five layers from “we hope it worked” into “we can prove what happened.”

How to adopt it without a big bang

You do not roll this out all at once. Order the work by how much standing risk each layer removes:

Identity first. Migrate off static keys to Workload Identity Federation — highest impact, and a leaked key is the cheapest possible compromise.
Network second. Private control plane, authorized networks.
Supply chain and policy third, both in audit mode before enforcement so you can see what would break.
Runtime hardening as you touch each workload.
Observability wired throughout, so you can watch each layer as you enable it.

Enforce nothing in production until you have watched it in audit mode. Secure-by-default is a destination you reach deliberately — but once you are there, the safe configuration is the one you get for free, and every new cluster starts hardened instead of hoping someone remembers to do it later.

Verdict

“Managed” is not “secure,” and a default GKE cluster proves it every day. The fix is not a single control but a stack of independent ones: kill static keys with Workload Identity Federation, make the network private, admit only signed images with Binary Authorization, enforce deny-by-default policy with a policy engine, harden the runtime, and watch all of it with real observability. Adopt them in risk order, run everything in audit before you enforce, and secure-by-default stops being a slogan and becomes the baseline every workload inherits.

Frequently asked questions

Is a default GKE cluster secure?

No. A default GKE cluster is functional, not hardened. Out of the box you can end up with a public control plane endpoint, nodes with broad service-account scopes, no image-provenance enforcement, workloads running as root, and no admission policy. Each of those is a reasonable default for getting started, but together they leave a wide attack surface. Secure-by-default means flipping those defaults deliberately — private cluster, Workload Identity, Binary Authorization, deny-by-default policy, hardened pods — so the safe configuration is the baseline rather than a project you get to later.

What is Workload Identity Federation and why does it matter for GKE?

Workload Identity Federation lets a Kubernetes service account impersonate a Google identity and receive short-lived, IAM-scoped tokens, instead of mounting a long-lived service-account JSON key in the pod. It matters because static keys are the single most common cloud credential leak: they end up in Git history, images, and env dumps, and they keep working until someone notices and rotates them. With federation there is no key to leak — the identity is bound to the workload and the token expires on its own, usually within about an hour.

How does Binary Authorization protect the supply chain?

Binary Authorization is an admission gate: before any image is allowed to run on the cluster, it must carry the attestations your policy requires, for example a signature from your CI system proving the image was built from reviewed source and passed its checks. Unsigned or untrusted images are blocked at admission and never reach a node. This closes the gap between “we build securely” and “only what we built securely actually runs,” which is exactly where supply-chain attacks live.

Do I need both OPA Gatekeeper/Kyverno and Pod Security?

They solve overlapping but different problems, and in a hardened cluster you usually want both. Pod Security (the built-in admission standard) enforces a baseline of pod-level controls like blocking privileged containers. OPA Gatekeeper or Kyverno add flexible, organization-specific policy — required labels, allowed registries, resource limits, network rules — expressed as code and enforced deny-by-default. Pod Security is the floor; a policy engine is how you encode the rules that are specific to your platform.

Where do I start if my cluster is already running in production?

Start with the layer that removes standing risk fastest: identity. Migrate off static service-account keys to Workload Identity Federation, because a leaked key is the highest-impact, lowest-effort compromise. Next, lock the network: make the control plane private and restrict authorized networks. Then add supply-chain and admission controls (Binary Authorization, a policy engine) in audit mode first so you can see what would break before you enforce. Harden the runtime as you touch each workload, and wire up observability throughout so you can watch each layer as you enable it. Doing it in that order retires the biggest risks first and lets you roll out enforcement without a big-bang outage.