Jun 16, 2026 · 14 min

I Became a Google Cloud Ambassador — Infrastructure: Here Is What It Actually Takes in 2026

A candid look at what it means to become a Google Cloud Ambassador — Infrastructure in 2026: the real criteria, the community work, and what production GCP experience actually looks like at this level.

Most people think the Google Cloud Ambassador title is about passing certifications. It is not.

I have held five Google Cloud professional certifications since 2022. I reached the Diamond League on Cloud Skills Boost with over 180,000 points and 1,000+ hands-on labs. I am a Google Product Expert. I have published 100+ technical articles read by tens of thousands of engineers.

None of that is why I became an ambassador.

The title came from something harder to measure: consistent, production-grade engineering work that other engineers could actually use — combined with the kind of public technical presence that makes Google trust you to represent them.

This post is a technical account of what that work looked like. Not the marketing version. The real one.

What the Google Cloud Ambassador — Infrastructure Program Actually Is

The Google Cloud Ambassador program recognizes practitioners who demonstrate deep Google Cloud expertise and actively contribute to the community. It is not self-nomination. It is not a certification program. Selection is based on what you have actually built and published.

The Infrastructure track covers the hardest parts of cloud engineering: multi-region platform design, Kubernetes at production scale, network security, supply chain integrity, cost architecture, and increasingly — AI infrastructure. These are the systems where misconfiguration causes outages and security gaps cause breaches.

The bar is not "deployed a GKE cluster." The bar is: have you solved the problems that do not have Stack Overflow answers?

The Real Criteria: What Google Actually Evaluates

There is no public scoring rubric. But across community conversations and my own experience, selection consistently looks at three dimensions.

Dimension 1: Production depth that is hard to fake

Anyone can complete a lab. Production depth means you have operated systems where failure has real consequences — financial, reputational, regulatory.

For me, this meant:

  • Multi-region GKE clusters with Fleet-based config management, where workload placement is governed by policy and failover is automatic, not manual
  • Terraform-managed infrastructure across multiple projects with shared VPCs, private Google access, and org-level policy constraints
  • Zero-trust network architecture where every service-to-service call is mutually authenticated and workloads obtain Google Cloud credentials through Workload Identity Federation, with no static service account keys, ever
  • Binary authorization pipelines where unsigned container images cannot reach production regardless of who pushes them
  • Cloud Armor WAF configurations that have blocked real DDoS attempts and OWASP Top 10 attacks against production Kubernetes ingress

That last one deserves its own section.

Cloud Armor: What Production WAF Actually Looks Like

Most engineers who "use Cloud Armor" have enabled it with default rules and called it done. That is not a security posture. That is a checkbox.

Real Cloud Armor configuration at Ambassador-level depth involves:

Custom security policies with adaptive protection

Adaptive Protection uses machine learning to detect layer 7 DDoS attacks and automatically suggest — or automatically apply — blocking rules. In production, this means your WAF learns the baseline traffic pattern for each service and flags anomalies in real time.

The key insight most engineers miss: Adaptive Protection is not a fire-and-forget feature. You need to tune the confidence threshold. A threshold too low generates false positives that block legitimate traffic. Too high and you miss real attacks. The right value depends on your traffic volume and pattern variance — and it takes weeks of observation to calibrate.
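
One way to approach that calibration is to replay past alerts against candidate thresholds and count what each threshold would have blocked or missed. The sketch below is illustrative only: the `Alert` records, their `confidence` scores, and the `was_attack` labels are hypothetical data you would assemble from your own incident reviews, not a Cloud Armor export format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    confidence: float  # score attached to the alert, 0.0 to 1.0 (hypothetical field)
    was_attack: bool   # label assigned later during manual review (hypothetical field)


def evaluate_threshold(alerts, threshold):
    """Return (false_positives, missed_attacks) if rules were auto-applied at `threshold`."""
    false_positives = sum(
        1 for a in alerts if a.confidence >= threshold and not a.was_attack
    )
    missed_attacks = sum(
        1 for a in alerts if a.confidence < threshold and a.was_attack
    )
    return false_positives, missed_attacks


def pick_threshold(alerts, candidates, max_false_positives=0):
    """Lowest candidate threshold that stays within the false positive budget."""
    for threshold in sorted(candidates):
        false_positives, _ = evaluate_threshold(alerts, threshold)
        if false_positives <= max_false_positives:
            return threshold
    return None


# Made-up review history: three benign spikes and three real attacks.
history = [
    Alert(0.55, False), Alert(0.62, False), Alert(0.71, True),
    Alert(0.78, False), Alert(0.86, True), Alert(0.93, True),
]
```

With this made-up history, a zero false positive budget pushes the threshold up to 0.8 at the cost of missing the 0.71 attack, which is exactly the tradeoff that takes observation time to settle.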

Rate limiting that actually works

Basic rate limiting at the IP level is trivially bypassed by distributing attacks across thousands of IPs. Effective rate limiting requires:

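  • Per-user rate limits for authenticated APIs, keyed on a request header or cookie that identifies the caller (a JWT claim has to be surfaced as a header first, because the rate limit key is a request attribute rather than decoded token contents)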
  • Throttling rules that distinguish between human browsing patterns and automated scraping
  • A deliberate choice between the throttle and rate_based_ban actions: banning raises attacker cost, while throttling is recoverable for legitimate users who hit limits
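
The behavioral difference between the two actions is easier to see in a toy model. The following is a minimal sketch of a fixed-window limiter keyed on a caller identifier; the `RateLimiter` class and its parameters are hypothetical and do not reproduce how Cloud Armor implements either action.

```python
from collections import defaultdict


class RateLimiter:
    """Fixed-window limiter keyed on a caller identifier (for example a header value)."""

    def __init__(self, limit, window_s, ban_s=0):
        self.limit = limit        # requests allowed per window
        self.window_s = window_s  # window length in seconds
        self.ban_s = ban_s        # 0 = throttle only; >0 = ban for this long once exceeded
        self._counts = defaultdict(int)  # (key, window index) -> count; never evicted here
        self._banned_until = {}          # key -> timestamp when the ban ends

    def allow(self, key, now):
        """Return True if the request at time `now` (seconds) should pass."""
        if self._banned_until.get(key, 0) > now:
            return False
        window = int(now // self.window_s)
        self._counts[(key, window)] += 1
        if self._counts[(key, window)] <= self.limit:
            return True
        if self.ban_s:
            self._banned_until[key] = now + self.ban_s
        return False
```

In throttle mode a caller who exceeds the limit is served again as soon as the next window opens. In ban mode the same caller stays blocked for `ban_s` seconds, which is the property that raises the cost of an attack and also the reason it hurts more when the key belongs to a legitimate user.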

OWASP Top 10 preconfigured rule sets with override logic

Cloud Armor ships preconfigured rule sets for SQL injection, XSS, remote file inclusion, and local file inclusion. But the preconfigured rules generate false positives in production — especially for applications that accept complex JSON payloads or use non-standard URL patterns.

The production approach: enable preconfigured rules in preview mode first. Monitor the preview logs for two weeks. Identify the false positive patterns. Write exclusion rules. Then switch to enforce mode. Skip this step and you will block legitimate users within hours of going live.
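
The "identify the false positive patterns" step is mostly counting. As a rough sketch, assume the preview-mode matches have already been extracted from the load balancer logs into simple dictionaries; the `rule_id` and `path` keys and the short rule names below are hypothetical simplifications, not the actual log schema or rule identifiers.

```python
from collections import Counter


def false_positive_candidates(preview_hits, known_good_paths, min_hits=2):
    """Count preview-mode matches per (rule, path) on paths that carry legitimate traffic.

    Pairs that fire repeatedly on known-good paths are the first candidates
    for an exclusion rule. Results are ordered by hit count, highest first.
    """
    counts = Counter(
        (hit["rule_id"], hit["path"])
        for hit in preview_hits
        if hit["path"] in known_good_paths
    )
    return [(pair, n) for pair, n in counts.most_common() if n >= min_hits]


# Hypothetical two-week sample, already reduced to rule and path.
hits = (
    [{"rule_id": "sqli", "path": "/api/search"}] * 3
    + [{"rule_id": "xss", "path": "/api/comments"}] * 2
    + [{"rule_id": "lfi", "path": "/etc/passwd"}]
    + [{"rule_id": "sqli", "path": "/api/orders"}]
)
known_good = {"/api/search", "/api/comments", "/api/orders"}
```

Matches on paths outside the known-good set are left alone on purpose: those are the requests the rules are supposed to catch.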

Edge security for GKE Ingress at global scale

Attaching Cloud Armor to a global external HTTPS load balancer in front of GKE means your WAF operates at Google's network edge — before traffic reaches your cluster. This architecture absorbs attack volume at Google's infrastructure layer, protecting your cluster nodes from resource exhaustion.

The configuration requires understanding how backend services, URL maps, and security policies interact. Getting this wrong — for example, attaching a policy to the wrong backend or missing a path rule — creates blind spots where unprotected traffic reaches your workloads.
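
A coverage check for that kind of blind spot can be sketched in a few lines. The dictionaries below are simplified, snake_case stand-ins for a URL map and its backend services; the real API resources use different field names and more nesting, so treat the shapes as assumptions.

```python
def unprotected_backends(url_map, backend_services):
    """Return backend services reachable through the URL map that lack a security policy."""
    referenced = {url_map["default_service"]}
    for matcher in url_map.get("path_matchers", []):
        referenced.add(matcher["default_service"])
        for rule in matcher.get("path_rules", []):
            referenced.add(rule["service"])
    return sorted(
        name
        for name in referenced
        if not backend_services.get(name, {}).get("security_policy")
    )


# Hypothetical layout: /admin/* was added later and nobody attached a policy.
url_map = {
    "default_service": "web",
    "path_matchers": [
        {
            "default_service": "web",
            "path_rules": [
                {"paths": ["/api/*"], "service": "api"},
                {"paths": ["/admin/*"], "service": "admin"},
            ],
        }
    ],
}
backend_services = {
    "web": {"security_policy": "edge-policy"},
    "api": {"security_policy": "edge-policy"},
    "admin": {"security_policy": None},
}
```

Running a check like this in CI against the rendered Terraform plan or an exported configuration is one way to catch a missing attachment before traffic does.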

Kubernetes Security: The Layers Most Teams Skip

Running Kubernetes on GKE in 2026 means navigating a security model with at least eight distinct layers. Most teams handle two or three. Ambassador-level work means understanding all of them and making deliberate choices about each.

Layer 1: Node security

Shielded GKE nodes with Secure Boot and vTPM prevent boot-level compromise. Container-Optimized OS with no SSH access eliminates a common lateral movement path. Workload identity on node pools removes the need for node-level service account credentials.

Layer 2: Network policy

Default GKE clusters allow all pod-to-pod communication. In a real multi-tenant cluster, this means a compromised pod can reach every other pod. Kubernetes NetworkPolicy (or better, Cilium with eBPF-based enforcement) should restrict traffic to declared communication paths only.
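
"Declared communication paths only" usually starts with a default-deny policy per namespace, followed by explicit allows. The sketch below builds both manifests as plain dictionaries and prints them as JSON, which `kubectl apply -f -` accepts; the namespace, labels, and port are hypothetical.

```python
import json


def default_deny(namespace):
    """NetworkPolicy that selects every pod in the namespace and allows no traffic."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_ingress(namespace, name, to_labels, from_labels, port):
    """NetworkPolicy that lets pods matching `from_labels` reach `to_labels` on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": to_labels},
            "policyTypes": ["Ingress"],
            "ingress": [
                {
                    "from": [{"podSelector": {"matchLabels": from_labels}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(default_deny("payments"), indent=2))
    print(json.dumps(
        allow_ingress("payments", "api-from-frontend",
                      {"app": "api"}, {"app": "frontend"}, 8080),
        indent=2,
    ))
```

Note that denying all egress also blocks DNS lookups, so a real rollout needs an explicit egress allowance for the cluster DNS service before the default-deny policy is applied.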

Layer 3: Pod security

Pod Security Admission in enforce mode, non-root containers by policy, read-only root filesystems, dropped Linux capabilities, seccomp profiles. Every relaxation of these defaults is a documented exception with a business justification — not a convenience shortcut.
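
The same list can be turned into a review-time check. This is a small sketch, not a replacement for Pod Security Admission: it only inspects container-level `securityContext` fields (some of these settings can also be inherited from the pod level, which this sketch ignores), and the function name is hypothetical.

```python
def pod_security_violations(pod_spec):
    """List hardening settings missing from each container's securityContext."""
    violations = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        sc = container.get("securityContext", {})
        if sc.get("runAsNonRoot") is not True:
            violations.append(f"{name}: runAsNonRoot must be true")
        if sc.get("readOnlyRootFilesystem") is not True:
            violations.append(f"{name}: readOnlyRootFilesystem must be true")
        if sc.get("allowPrivilegeEscalation") is not False:
            violations.append(f"{name}: allowPrivilegeEscalation must be false")
        if "ALL" not in sc.get("capabilities", {}).get("drop", []):
            violations.append(f"{name}: capabilities.drop must include ALL")
        if sc.get("seccompProfile", {}).get("type") not in ("RuntimeDefault", "Localhost"):
            violations.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return violations


hardened = {
    "containers": [{
        "name": "api",
        "securityContext": {
            "runAsNonRoot": True,
            "readOnlyRootFilesystem": True,
            "allowPrivilegeEscalation": False,
            "capabilities": {"drop": ["ALL"]},
            "seccompProfile": {"type": "RuntimeDefault"},
        },
    }]
}
```

A documented exception then becomes an explicit allowlist entry next to a check like this, rather than a quietly relaxed manifest.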

Layer 4: Supply chain integrity

Binary Authorization with attestation. Every container image that enters production must have a cryptographic attestation from the vulnerability scanner (Artifact Analysis, formerly Container Analysis) and the build system (CI/CD pipeline). Images without valid attestations are rejected at admission: not flagged, rejected.
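
The admission decision itself reduces to a set comparison. The sketch below models only that logic; it does not verify signatures, and the attestor names and record layout are hypothetical rather than the Binary Authorization data model.

```python
def admit(image_digest, attestations, required_attestors):
    """Admit an image only if every required attestor has attested this exact digest.

    Returns (admitted, missing_attestors).
    """
    signed_by = {
        a["attestor"] for a in attestations if a["digest"] == image_digest
    }
    missing = sorted(set(required_attestors) - signed_by)
    return (not missing, missing)


REQUIRED = ["vulnerability-scan", "ci-build"]  # hypothetical attestor names

attestations = [
    {"attestor": "ci-build", "digest": "sha256:aaa"},
    {"attestor": "vulnerability-scan", "digest": "sha256:aaa"},
    {"attestor": "ci-build", "digest": "sha256:bbb"},  # built but never scanned
]
```

Keying on the digest rather than the tag matters: an attestation for `sha256:aaa` says nothing about whatever a mutable tag points to tomorrow.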

Layer 5: Runtime detection

GCP Security Command Center with Event Threat Detection and Container Threat Detection. Anomaly detection for credential access patterns, unusual network connections from containers, and cryptomining signatures. The alerts need to route to a real incident response workflow — not just a dashboard that nobody watches.

Layer 6: Secrets management

Secret Manager with automatic rotation, Workload Identity binding, and audit logging on every secret access. No environment variables containing credentials. No mounted ConfigMaps with sensitive values. Secrets delivered via sidecar injection or direct Secret Manager API calls with short-lived tokens.
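
For the direct API path, a minimal sketch looks like the following. The helper takes the client as a parameter so it can be exercised without GCP access; `FakeSecretClient`, the project ID, and the secret name are hypothetical, while the `access_secret_version` call shape follows the `google-cloud-secret-manager` Python client.

```python
from types import SimpleNamespace


def read_secret(client, project_id, secret_id, version="latest"):
    """Fetch one secret version at call time instead of baking it into env vars."""
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")


class FakeSecretClient:
    """Test double with the same method shape, so the sketch runs without GCP access."""

    def __init__(self, values):
        self._values = values

    def access_secret_version(self, request):
        data = self._values[request["name"]]
        return SimpleNamespace(payload=SimpleNamespace(data=data))


# With the real library installed, the client would be created as:
#   from google.cloud import secretmanager
#   client = secretmanager.SecretManagerServiceClient()
# On GKE, the call then authenticates as the pod's bound identity, with no key file.
```

Pinning `version` to a number instead of `latest` trades automatic pickup of rotated values for reproducible deployments; which one is right depends on how rotation is rolled out.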

Layer 7: Admission control

OPA/Gatekeeper or Kyverno policies enforcing organizational standards: required labels, resource limits, approved image registries, prohibited privileged containers. Policy violations are blocked at admission time — not discovered in audits weeks later.

Layer 8: Cluster network isolation

Private GKE clusters with private nodes, private control plane, authorized networks, and VPC Service Controls around the cluster API. The control plane should not be reachable from the public internet. In 2026, there is no justification for a public GKE API endpoint in production.

Gemini and AI Infrastructure: The Next Frontier for Platform Engineers

The most interesting infrastructure problems in 2026 are not in traditional cloud engineering. They are in AI infrastructure — and specifically in making Gemini and other large models work reliably, securely, and cost-effectively at production scale.

Vertex AI and Gemini in production

Using Gemini via the Vertex AI API is straightforward. Running it as part of a reliable production system is not. The challenges are:

Quota and rate limits: Gemini API quotas are per-project and per-region. Production systems need quota planning, per-service budget controls, and fallback logic when quota is exhausted. The fallback strategy — retry with exponential backoff, degrade gracefully, or fail fast — depends on the use case.
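
The retry-then-degrade variant of that fallback logic can be sketched as below. `QuotaExhausted` is a hypothetical stand-in for whatever quota error the model client raises, and `call` and `fallback` are caller-supplied functions; the delay values are placeholders, not recommendations.

```python
import random
import time


class QuotaExhausted(Exception):
    """Stand-in for the quota error (typically HTTP 429) raised by the model client."""


def call_with_backoff(call, fallback, max_attempts=4, base_delay_s=0.5,
                      sleep=time.sleep, rng=random.random):
    """Retry `call` with exponential backoff and full jitter, then fall back."""
    for attempt in range(max_attempts):
        try:
            return call()
        except QuotaExhausted:
            if attempt == max_attempts - 1:
                break  # out of attempts: do not sleep again, go to the fallback
            # Full jitter: wait a random time up to base_delay_s * 2^attempt.
            sleep(rng() * base_delay_s * (2 ** attempt))
    return fallback()
```

The `sleep` and `rng` parameters exist only so the behavior can be tested deterministically. A fail-fast variant would set `max_attempts=1`, which suits interactive requests where a slow answer is worse than a degraded one.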

Context window management: Gemini 1.5 Pro supports a 1M token context window. But sending 1M tokens per request has cost and latency implications. Production RAG systems need intelligent chunking, embedding-based retrieval, and context assembly that balances recall quality against token cost.
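
Context assembly under a token budget can be as simple as a greedy pass over retrieved chunks. This is one possible sketch, assuming each chunk already carries a relevance score and a token count from the retrieval step; the tuple layout and the numbers are hypothetical.

```python
def assemble_context(chunks, token_budget):
    """Pick chunks by descending relevance until the token budget is used up.

    chunks: list of (score, token_count, text) tuples.
    Returns (selected_texts, tokens_used).
    """
    selected, used = [], 0
    for score, tokens, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if used + tokens <= token_budget:
            selected.append(text)
            used += tokens
    return selected, used


retrieved = [
    (0.9, 400, "runbook: failover steps"),
    (0.8, 700, "design doc: region layout"),
    (0.7, 300, "postmortem: 2025 outage"),
    (0.2, 100, "meeting notes"),
]
```

With a 1,000 token budget this keeps the runbook, skips the 700 token design doc because it no longer fits, and fills the remainder with smaller chunks. Whether a low-scoring chunk should be allowed to fill leftover space is a recall-versus-noise decision that a score floor would make explicit.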

Grounding with Google Search and Vertex AI Search: Grounding reduces hallucination by anchoring model responses to retrieved documents. In enterprise contexts, grounding against internal knowledge bases via Vertex AI Search gives you RAG without building the retrieval pipeline from scratch. The tradeoff: you lose fine-grained control over retrieval logic.

Building secure AI pipelines on GCP

AI workloads introduce new attack surfaces that traditional security models do not cover. Prompt injection — where malicious content in user input or retrieved documents manipulates model behavior — is not a network security problem. It requires input validation, output filtering, and guardrails at the application layer.

The infrastructure response: deploy LLM guardrails as a separate service in the request path. Validate inputs against a classifier before they reach the model. Filter outputs for PII, harmful content, and off-topic responses. Log every model call with full input/output for audit purposes — noting that this creates significant storage and privacy obligations.
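
The shape of that request path can be sketched in a few functions. Everything here is a deliberately naive placeholder: the regular expressions stand in for a real input classifier and PII detector, `model` is any callable that takes a prompt and returns text, and the in-memory `audit_log` list stands in for a durable audit store.

```python
import re

# Placeholder patterns; a production guardrail would use a trained classifier.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def check_input(text):
    """Return True if the input passes the (naive) injection screen."""
    return not any(p.search(text) for p in INJECTION_PATTERNS)


def redact_output(text):
    """Mask email addresses in model output before it leaves the service."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def guarded_call(model, user_input, audit_log):
    """Screen the input, call the model, filter the output, and record the exchange."""
    if not check_input(user_input):
        audit_log.append({"input": user_input, "output": None, "blocked": True})
        return "Request blocked by input policy."
    output = redact_output(model(user_input))
    audit_log.append({"input": user_input, "output": output, "blocked": False})
    return output
```

This sketch logs the redacted output. Logging the raw output instead gives a fuller audit trail but moves unredacted PII into the log store, which is where the storage and privacy obligations come from.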

MLOps on Vertex AI Pipelines

Production ML is not Jupyter notebooks. It is reproducible training pipelines with versioned datasets, tracked experiments, validated models, and automated deployment with rollback capability.

Vertex AI Pipelines with Kubeflow components gives you a serverless pipeline execution environment where each step runs in an isolated container, inputs and outputs are versioned in Google Cloud Storage and Artifact Registry, and the entire pipeline run is auditable. The infrastructure challenge: designing pipeline components that are reusable, testable in isolation, and produce artifacts that downstream steps can validate before consuming.

Zero Trust Architecture: Beyond the Buzzword

Zero Trust is one of the most misused terms in cloud security. In most organizations it means "we added MFA." Real zero trust architecture changes the fundamental trust model of your infrastructure.

Workload Identity Federation across clouds

The classic problem: your CI/CD pipeline runs on GitHub Actions (or GitLab) but needs to deploy to GCP. The traditional solution is a service account key stored as a GitHub secret. The problem: long-lived credentials that can be exfiltrated, that do not expire, and that are often over-privileged because it is easier to grant broad access than to enumerate minimum required permissions.

Workload Identity Federation solves this by establishing a trust relationship between the external identity provider (GitHub's OIDC endpoint) and GCP's IAM system. The pipeline authenticates using a short-lived OIDC token, which GCP exchanges for a temporary GCP credential scoped to exactly the permissions needed for that pipeline. No stored credentials. No rotation burden. Every credential access logged.
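
The exchange step is a single call to Google's Security Token Service. The sketch below only builds that request so it can be inspected without network access; the project number, pool, and provider IDs are hypothetical, and in practice a client library or CI integration performs this call for you.

```python
def sts_exchange_request(project_number, pool_id, provider_id, oidc_token):
    """Build the token-exchange request that trades an external OIDC token for a Google token."""
    audience = (
        f"//iam.googleapis.com/projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )
    return {
        "url": "https://sts.googleapis.com/v1/token",
        "data": {  # sent as form-encoded fields in a POST
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            "audience": audience,
            "scope": "https://www.googleapis.com/auth/cloud-platform",
            "requested_token_type": "urn:ietf:params:oauth:token-type:access_token",
            "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
            "subject_token": oidc_token,  # short-lived token issued by the CI provider
        },
    }
```

The `audience` ties the request to one specific provider in one specific pool, and the provider's attribute conditions decide which repositories or branches are accepted at all. The returned federated token is short-lived and is commonly used to impersonate a narrowly scoped service account.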

VPC Service Controls

VPC Service Controls create a security perimeter around GCP services that prevents data exfiltration even if credentials are compromised. A service perimeter around BigQuery means that even if an attacker obtains a valid BigQuery credential, they cannot exfiltrate data to a bucket outside the perimeter.

The nuance most engineers miss: VPC Service Controls break legitimate cross-project access patterns. You need to explicitly define access policies for every authorized cross-perimeter flow — including your own CI/CD pipelines, monitoring agents, and backup systems. The configuration work is significant. The security benefit is significant. Skipping it because it is complex is how data exfiltration attacks succeed.

FinOps: The Infrastructure Skill Nobody Talks About

Cloud cost optimization is infrastructure work. It requires the same depth of understanding as security architecture — and it has direct business impact that security work often does not.

The patterns that actually move the number:

Committed use discounts with CUD sharing

Resource-based CUDs purchased centrally and shared across every project under the same Cloud Billing account via CUD sharing give you discount coverage without requiring individual teams to forecast their own usage. The infrastructure team owns the commitment strategy. Individual product teams get the benefit.
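
A small worked example shows why pooling helps. The usage numbers below are made up: two teams with opposite load patterns, measured in vCPUs per hour.

```python
def uncovered_usage(usage_by_hour, commitment):
    """vCPU-hours billed on demand because usage exceeded the commitment."""
    return sum(max(0, used - commitment) for used in usage_by_hour)


def wasted_commitment(usage_by_hour, commitment):
    """vCPU-hours that were committed (and paid for) but not used."""
    return sum(max(0, commitment - used) for used in usage_by_hour)


# Hypothetical hourly vCPU usage for two teams with opposite peaks.
team_a = [10, 10, 2, 2]
team_b = [2, 2, 10, 10]
pooled = [a + b for a, b in zip(team_a, team_b)]  # flat at 12 every hour

# Each team commits to its own average (6) versus one shared commitment of 12.
separate_uncovered = uncovered_usage(team_a, 6) + uncovered_usage(team_b, 6)
separate_wasted = wasted_commitment(team_a, 6) + wasted_commitment(team_b, 6)
shared_uncovered = uncovered_usage(pooled, 12)
shared_wasted = wasted_commitment(pooled, 12)
```

The total committed amount is identical in both cases, but separate commitments leave 16 vCPU-hours on demand and 16 unused, while the shared commitment covers everything. Real usage is never this perfectly complementary, so the effect is smaller in practice, but the direction is the same.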

Spot instance orchestration for batch workloads

GCP Spot VMs are preemptible but 60–91% cheaper than on-demand. For ML training jobs, data pipeline workers, and non-latency-sensitive batch processing, Spot VMs with checkpointing reduce infrastructure costs dramatically. The engineering investment: instrumenting workloads to checkpoint state and resume gracefully after preemption.
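
The checkpoint-and-resume pattern can be sketched with nothing but the standard library. In this sketch `should_stop` is a hypothetical hook that a signal handler would flip when the preemption notice arrives, and the checkpoint is a local JSON file; a real job would write it to durable storage such as a bucket.

```python
import json
import os


def _save_checkpoint(path, next_index):
    # Write to a temp file and rename so a preemption mid-write cannot corrupt the checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_batch(items, checkpoint_path, process, should_stop=lambda: False):
    """Process items in order, persisting progress so a restart resumes where it left off.

    Returns True when all items are done, False if stopped early.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(items)):
        if should_stop():
            return False
        process(items[i])
        _save_checkpoint(checkpoint_path, i + 1)
    return True
```

Because the checkpoint is written after each item, a preemption between `process` and the save means that one item runs again on resume, so `process` needs to be idempotent. Checkpointing less often reduces write overhead at the cost of more repeated work.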

GKE node auto-provisioning with bin packing

Node Auto-Provisioning creates node pools optimized for the workloads actually running in the cluster — matching CPU/memory ratios to pod resource requests. Combined with Vertical Pod Autoscaler recommendations (applied as suggestions, not automatically, to avoid disruption), this eliminates the systematic over-provisioning that inflates most Kubernetes bills.
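
To build intuition for why request sizes and node shapes interact, here is a first-fit-decreasing bin packing sketch. It is a textbook heuristic used for illustration, not a description of how Node Auto-Provisioning or the Kubernetes scheduler actually place pods; the pod requests and node shape are hypothetical.

```python
def first_fit_decreasing(pod_requests, node_capacity):
    """Pack (cpu, mem_gib) pod requests onto identical nodes, largest pods first.

    Returns a list of nodes, each a list of the pods placed on it.
    """
    nodes = []  # each entry: {"free": [cpu, mem], "pods": [...]}
    for cpu, mem in sorted(pod_requests, reverse=True):
        if cpu > node_capacity[0] or mem > node_capacity[1]:
            raise ValueError(f"pod ({cpu}, {mem}) does not fit on an empty node")
        for node in nodes:
            if node["free"][0] >= cpu and node["free"][1] >= mem:
                break  # first existing node with room
        else:
            node = {"free": list(node_capacity), "pods": []}
            nodes.append(node)
        node["free"][0] -= cpu
        node["free"][1] -= mem
        node["pods"].append((cpu, mem))
    return [n["pods"] for n in nodes]


# Hypothetical requests in whole vCPUs and GiB, on a 4 vCPU / 8 GiB node shape.
pods = [(2, 4), (1, 2), (1, 2), (2, 4), (1, 1), (1, 1)]
```

These six pods fit on two nodes. Doubling each pod's CPU request "to be safe" would double the node count for the same workload, which is the kind of systematic over-provisioning that right-sizing recommendations are meant to expose.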

In real implementations, these three patterns combined have reduced infrastructure spend by 30–40% without changing application SLOs. The work is not glamorous. It requires understanding GCP billing at a level most engineers never reach. But it is the kind of outcome that makes an organization trust you with their infrastructure.

What the Path Actually Looks Like

The Ambassador journey took four years of consistent work. Not four years of studying for certifications — four years of building real systems, writing honestly about what worked and what did not, and showing up in the community consistently.

The milestones that mattered were not the ones I planned:

A Cloud Armor post I wrote about false positive tuning got shared in several security Slack communities and generated more inbound conversations than anything else I had published. Not because it was the most technically impressive thing I had written — but because it solved a specific, painful problem that many engineers were hitting and no one had documented clearly.

A multi-region GKE architecture I published with Terraform module links got forked dozens of times. Engineers were not just reading it — they were using it as a starting point for real deployments. That is the difference between content that demonstrates expertise and content that creates value.

The Google Cloud teams noticed both. Not because I submitted them as evidence of anything. They were already watching.

The Honest Part: What Most Engineers Miss

I spent over 1,000 hours preparing for Google Cloud professional certifications: not passively watching videos, but working through real architectures, building labs from scratch, and going back to the documentation every time something did not make sense. Google does not publish pass rates, but the Professional Cloud Network Engineer and Professional Security Engineer exams are widely regarded as among the hardest in the catalog. Getting through all five required sustained, deliberate effort over multiple years.

But the certifications were only the foundation. The real acceleration came from enterprise work — designing and operating production systems for organizations where downtime has financial and regulatory consequences. When you are responsible for a platform that processes real transactions, stores real customer data, and must satisfy real compliance audits, your understanding of GCP changes fundamentally. You stop thinking about features and start thinking about failure modes, blast radius, and recovery time objectives.

After four years of this I can say honestly: I know Google Cloud deeply — the architecture decisions that matter, where the defaults are wrong, where the documentation is misleading, and where the problems only surface at scale. That knowledge cannot be acquired through certifications alone. It requires structured study, enterprise production exposure, and the discipline to keep documenting what you learn.

That combination is genuinely rare. Most engineers stop at one or two of those three. The ones who do all three, consistently, over years — those are the engineers Google selects as Ambassadors.

Three things that separate Ambassador-level work from competent cloud engineering:

You need to have been wrong, publicly, and recovered well. The most credible technical writers are the ones who document failures as clearly as successes. I have published post-mortems on architectural decisions I made that turned out to be wrong. Those posts generated more trust than any "best practices" guide I have written.

Depth over breadth, always. It is better to be the definitive resource on Cloud Armor production configuration than to have a surface-level post about every GCP service. Google's Ambassador program is not looking for generalists. It is looking for practitioners who have gone deep enough to find the problems that only appear at scale.

The community work has to be genuine. Ambassadors who are in it for the badge show up in communities when they have something to promote. The engineers who get selected are the ones who answer questions, review architecture proposals, and share knowledge without expecting anything in return — consistently, over years.

What Changes After the Title

The practical benefits: early access to new GCP features before public release, direct communication channels with Google Cloud engineering teams, invitations to Google Cloud Next and partner summits, and a peer network of practitioners building at the same level.

The more important change: accountability. When you carry this title, the quality bar for your public technical work rises. You are no longer just a practitioner sharing experiences. You represent a standard. Engineers look at your work differently. That weight is real, and it is appropriate.

What I Am Building Now

Three areas define my current infrastructure work:

AI-native platform engineering on GCP: building the infrastructure layer for AI systems — Gemini-powered applications with proper security boundaries, cost controls, and operational instrumentation. Most AI infrastructure in 2026 is held together with duct tape. Production-grade AI infrastructure requires the same rigor as production-grade transactional systems — plus new disciplines around model observability, prompt audit trails, and inference cost management.

Automated compliance for regulated industries in DACH: combining policy-as-code (OPA, Kyverno), continuous control monitoring, and AI-assisted audit evidence generation to make compliance a continuous engineering process rather than a periodic manual exercise. The regulatory landscape in Germany — NIS-2, GDPR, DORA — creates specific requirements that generic cloud security frameworks do not address.

Open infrastructure tooling: contributing to tools that make the patterns above accessible to teams that do not have a dedicated platform engineering function. The work that matters most is the work that scales beyond one organization.

If you are working on any of these problems — or want to talk about the Ambassador path — reach out.

Summary

Becoming a Google Cloud Ambassador — Infrastructure in 2026 required four years of production work across the full stack of cloud infrastructure: security architecture with Cloud Armor and zero trust, Kubernetes hardening across eight security layers, AI infrastructure with Gemini and Vertex AI, FinOps patterns that move real numbers, and consistent community contribution that helped other engineers solve real problems.

The certifications were table stakes. The labs were practice. The work that mattered was the production systems, the documented failures, and the community presence maintained long before any recognition came.

The bar is high. It should be.