Every EKS Cluster I Audit Has the Same Five Problems

Kenneth Kasuba


Director of Security, AI Research

14 min read

Over the past four years, I've been responsible for architecting and hardening EKS clusters across organizations ranging from Series B startups to Fortune 500 financial services firms. Across every engagement, the pattern is the same: teams adopt Kubernetes for velocity, then discover, usually through a painful audit or an actual incident, that their security posture is nowhere near where it needs to be.

This post distills the architecture patterns I've standardized on after securing EKS clusters running production workloads across AWS, GCP (via GKE with EKS migration patterns), and Azure (AKS). These aren't theoretical recommendations. Every pattern here has been implemented, tested under load, and validated against the CIS Kubernetes Benchmark and the NSA/CISA Kubernetes Hardening Guide v1.2.

The 5 Misconfigurations I Find in Every Audit

Before diving into architecture, let me share the five issues I find in literally every Kubernetes security assessment. If you recognize your environment here, you're not alone, but you do need to act.

  1. Overprivileged IRSA roles (or no IRSA at all). Teams either skip IAM Roles for Service Accounts entirely and let pods inherit the node's instance profile, or they create a single "app" IAM role with s3:* and secretsmanager:* and bind it to every service account. In my experience, this is the single highest-risk finding because it turns any container escape into a full AWS account compromise.
  2. Default namespace usage with no network policies. I've walked into environments with 200+ pods in default, zero NetworkPolicy objects, and flat east-west traffic. Any compromised pod can reach every other pod, the metadata service, and often the Kubernetes API server.
  3. No admission control. No OPA Gatekeeper, no Kyverno, no Pod Security Standards enforcement. Developers can deploy privileged containers, mount the host filesystem, and disable all security contexts. The AWS EKS Best Practices Guide explicitly recommends admission control as a baseline, yet I find it absent in roughly 70% of clusters I audit.
  4. No runtime detection. No Falco, no Tetragon, no syscall monitoring of any kind. This means that even if an attacker exploits something like CVE-2022-23648 (the containerd image pull vulnerability) or CVE-2021-25741 (kubelet path traversal allowing host filesystem access), you won't know until they've exfiltrated data or moved laterally.
  5. Stale, unpatched AMIs and control plane versions. I routinely find clusters running EKS versions two or three minor releases behind, with node AMIs that haven't been rotated in months. The control plane and data plane version skew alone violates the Kubernetes version skew policy, and unpatched nodes carry known CVEs in containerd, runc, and the kubelet.

Zero-Trust Pod Architecture

The pattern I've standardized on treats every pod as a potential adversary. This isn't paranoia. It's the only defensible architecture when you're running multi-tenant workloads at scale. The diagram below shows how I layer security controls around every pod in a production namespace:

[Diagram: Zero-Trust Pod Architecture.] In a restricted production namespace, the layers wrap each pod from the outside in: OPA Gatekeeper / Kyverno admission control (a ValidatingWebhookConfiguration that denies non-compliant specs); Cilium eBPF network policy with L7 visibility and mTLS; a per-pod AWS IRSA identity (the eks.amazonaws.com/role-arn annotation plus a projected serviceAccountToken volume); a distroless, non-root container with readOnlyRootFilesystem: true and a RuntimeDefault or custom seccomp/AppArmor profile; and Falco + Tetragon runtime monitoring, with eBPF kernel hooks feeding syscall audits into the alert pipeline.

Let me walk through each layer and the specific implementation details.

Layer 1: Admission Control with OPA Gatekeeper and Kyverno

In my experience, the most impactful security control you can deploy to an EKS cluster is admission control. I've shifted from using OPA Gatekeeper exclusively to a hybrid approach where Kyverno handles the common cases and Gatekeeper handles complex cross-resource policies.

Here's the Kyverno ClusterPolicy I deploy to every cluster as a baseline. This single policy prevents the majority of privilege escalation attacks:

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: baseline-pod-security
  annotations:
    policies.kyverno.io/title: Baseline Pod Security
    policies.kyverno.io/category: Pod Security
    policies.kyverno.io/severity: high
    policies.kyverno.io/description: >-
      Enforces baseline pod security standards across all namespaces.
      Maps to CIS Kubernetes Benchmark sections 5.2.x.
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: deny-privileged-containers
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Privileged containers are not allowed. Set securityContext.privileged to false."
        pattern:
          spec:
            =(initContainers):
              - =(securityContext):
                  =(privileged): false
            containers:
              - =(securityContext):
                  =(privileged): false
    - name: require-non-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Containers must run as non-root. Set runAsNonRoot to true."
        pattern:
          spec:
            =(securityContext):
              =(runAsNonRoot): true
            containers:
              - securityContext:
                  runAsNonRoot: true
    - name: deny-host-namespaces
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Host namespaces (hostPID, hostIPC, hostNetwork) are not allowed."
        pattern:
          spec:
            =(hostPID): false
            =(hostIPC): false
            =(hostNetwork): false
    - name: restrict-volume-types
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "hostPath volumes are not allowed. Use configMap, emptyDir, projected, secret, downwardAPI, persistentVolumeClaim, or CSI volumes instead."
        pattern:
          spec:
            =(volumes):
              - X(hostPath): "null"
    - name: require-read-only-root
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Root filesystem must be read-only."
        pattern:
          spec:
            containers:
              - securityContext:
                  readOnlyRootFilesystem: true

For more complex policies, like ensuring every ServiceAccount that has an IRSA annotation also has a corresponding NetworkPolicy in the same namespace, I use OPA Gatekeeper's ConstraintTemplate with Rego:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequirenetworkpolicy
spec:
  crd:
    spec:
      names:
        kind: K8sRequireNetworkPolicy
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequirenetworkpolicy

        violation[{"msg": msg}] {
          input.review.kind.kind == "Pod"
          namespace := input.review.object.metadata.namespace
          not has_network_policy(namespace)
          msg := sprintf("Namespace %v has no NetworkPolicy. All namespaces with workloads must have at least one NetworkPolicy.", [namespace])
        }

        has_network_policy(namespace) {
          count(data.inventory.namespace[namespace]["networking.k8s.io/v1"]["NetworkPolicy"]) > 0
        }

Layer 2: Network Segmentation with Cilium

I migrated away from the default VPC CNI's network policy support to Cilium two years ago and haven't looked back. The reasons are concrete: Cilium's eBPF-based datapath gives you L7 visibility (you can see HTTP methods, gRPC calls, and DNS queries at the pod level), kernel-level enforcement that doesn't depend on iptables chains, and transparent mTLS via Cilium's service mesh integration.

Here's the Terraform I use to deploy Cilium on EKS via the Helm provider, with the security-relevant options enabled:

resource "helm_release" "cilium" {
  name       = "cilium"
  namespace  = "kube-system"
  repository = "https://helm.cilium.io"
  chart      = "cilium"
  version    = "1.15.3"

  values = [yamlencode({
    kubeProxyReplacement = "strict"
    # Cilium expects a bare hostname here, not the https:// URL EKS returns
    k8sServiceHost       = trimprefix(var.eks_cluster_endpoint, "https://")
    k8sServicePort       = "443"

    hubble = {
      enabled = true
      relay   = { enabled = true }
      ui      = { enabled = true }
      metrics = {
        enabled = [
          "dns:query;rcode;ips",
          "drop:sourceContext;destinationContext;reason",
          "tcp:flag;sourceContext;destinationContext",
          "flow:sourceContext;destinationContext",
          "http:method;status;sourceContext;destinationContext"
        ]
      }
    }

    # Enable transparent encryption
    encryption = {
      enabled = true
      type    = "wireguard"
    }

    # Default deny all ingress/egress
    policyEnforcementMode = "always"

    # eBPF-based masquerading (replaces iptables)
    bpf = {
      masquerade = true
      tproxy     = true
    }

    # Host-level firewall
    hostFirewall = {
      enabled = true
    }

    # Enable bandwidth manager for fair queuing
    bandwidthManager = {
      enabled = true
    }
  })]
}

The critical setting is policyEnforcementMode = "always". This implements default-deny at the cluster level: no pod can communicate with any other pod unless explicitly allowed by a CiliumNetworkPolicy. Combined with Hubble's flow logs, this gives you complete visibility into every network connection in the cluster.
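Under default-deny, every allowed flow has to be declared explicitly. As a sketch (names, namespace, and port are illustrative), a CiliumNetworkPolicy that lets a frontend reach a backend over HTTP GET only might look like:

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: production
spec:
  # Applies to pods labeled app=backend in this namespace
  endpointSelector:
    matchLabels:
      app: backend
  ingress:
    - fromEndpoints:
        - matchLabels:
            app: frontend
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          # L7 enforcement: only GET requests are allowed through
          rules:
            http:
              - method: GET
```

Anything not matched by a policy like this is dropped, and the drop shows up in Hubble's flow logs with the reason attached.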

Layer 3: IRSA Hardening (The Pattern Most Teams Get Wrong)

IRSA (IAM Roles for Service Accounts) is the single most important security feature in EKS, and it's also the one I see misconfigured most frequently. The core issue is that teams create IAM trust policies that are too broad.

Here's the pattern I enforce. Notice the Condition block: it restricts the role to a specific service account in a specific namespace, not just "any service account in the OIDC provider":

resource "aws_iam_role" "app_payment_processor" {
  name = "${var.cluster_name}-payment-processor"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = var.oidc_provider_arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "${var.oidc_provider}:sub" = "system:serviceaccount:payments:payment-processor"
          "${var.oidc_provider}:aud" = "sts.amazonaws.com"
        }
      }
    }]
  })

  # Permission boundary: defense in depth
  permissions_boundary = aws_iam_policy.workload_boundary.arn

  tags = {
    managed-by  = "terraform"
    cluster     = var.cluster_name
    namespace   = "payments"
    service     = "payment-processor"
    data-class  = "pci"
  }
}

# Least-privilege policy: only the specific S3 bucket and KMS key needed
resource "aws_iam_role_policy" "payment_processor" {
  name = "payment-processor-access"
  role = aws_iam_role.app_payment_processor.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "s3:GetObject",
          "s3:PutObject"
        ]
        Resource = "${var.payment_bucket_arn}/transactions/*"
        Condition = {
          StringEquals = {
            "s3:x-amz-server-side-encryption" = "aws:kms"
          }
        }
      },
      {
        Effect = "Allow"
        Action = [
          "kms:Decrypt",
          "kms:GenerateDataKey"
        ]
        Resource = var.payment_kms_key_arn
      }
    ]
  })
}

The permission boundary (permissions_boundary) is the key defense-in-depth measure that most teams skip. Even if someone modifies the inline policy to grant broader access, the boundary policy caps the maximum permissions. I define the boundary to prohibit IAM modification, Organizations actions, and access to management-plane resources.
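The boundary itself is just another managed policy. A minimal sketch of the `workload_boundary` policy referenced above (the allowed service list is illustrative; tighten it for your environment):

```hcl
resource "aws_iam_policy" "workload_boundary" {
  name = "${var.cluster_name}-workload-boundary"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        # Ceiling: workloads may only ever use these service families,
        # no matter what their inline policies grant.
        Sid      = "AllowedServiceCeiling"
        Effect   = "Allow"
        Action   = ["s3:*", "kms:*", "sqs:*", "sns:*", "dynamodb:*"]
        Resource = "*"
      },
      {
        # Explicit deny always wins: no IAM mutation, no Organizations
        # actions, no role chaining out of the workload identity.
        Sid      = "DenyManagementPlane"
        Effect   = "Deny"
        Action   = ["iam:*", "organizations:*", "sts:AssumeRole"]
        Resource = "*"
      }
    ]
  })
}
```

Because an explicit Deny in a boundary overrides any Allow elsewhere, a later edit to the role's inline policy cannot widen access beyond this ceiling.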

Multi-Cloud IAM Integration Patterns

One of the questions I get most frequently is how pod identity works across cloud providers and which approach is "best." Having implemented all three in production, here's my honest comparison:

Cross-Provider Identity Comparison

IAM-to-pod identity integration compared across the three providers:

AWS IRSA
  - Mechanism: OIDC federation with a projected ServiceAccount token
  - Token lifetime: 12 hours (configurable)
  - Scope: per ServiceAccount, per namespace
  - Key advantages: no node-level credentials; fine-grained IAM
  - Watch out for: OIDC provider configuration; trust policy sprawl

GCP Workload Identity
  - Mechanism: Kubernetes ServiceAccount bound to a GCP service account via IAM policy
  - Token lifetime: 1 hour (auto-rotated)
  - Scope: per ServiceAccount, per namespace
  - Key advantages: native GCP integration; no key management
  - Watch out for: Fleet Workload Identity complexity; cross-project bindings

Azure Workload Identity
  - Mechanism: federated credentials via an Azure AD app registration
  - Token lifetime: 1 hour (auto-rotated)
  - Scope: per ServiceAccount, per namespace
  - Key advantages: AAD Conditional Access support; parity with Managed Identity
  - Watch out for: AAD Pod Identity (v1) is EOL; migrate to v2 (federated credentials)

Enforcing One Identity Per Service

All three providers have converged on the same fundamental pattern: projected service account tokens with OIDC federation. The implementation details differ, but the security model is equivalent. The critical thing to understand is that in all three cases, the security boundary is the ServiceAccount-to-Namespace binding. If you allow multiple services to share a ServiceAccount, you've collapsed your identity boundary.

The pattern I enforce is one ServiceAccount per microservice per namespace, with the IAM role scoped to exactly that combination. In Terraform, this looks like a module that takes the namespace, service account name, and a minimal IAM policy document as inputs, and produces the fully-scoped IRSA binding as output. This eliminates the manual error of copy-pasting trust policies with incorrect sub claims.
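A sketch of what calling such a module looks like (the module itself is hypothetical; its inputs mirror the pattern described above):

```hcl
module "irsa_order_service" {
  # Hypothetical internal module that stamps out one fully-scoped
  # IRSA binding: IAM role + trust policy + ServiceAccount annotation.
  source = "./modules/irsa-binding"

  cluster_name         = module.eks.cluster_name
  oidc_provider        = module.eks.oidc_provider
  oidc_provider_arn    = module.eks.oidc_provider_arn

  # The security boundary: exactly one namespace + ServiceAccount pair
  namespace            = "orders"
  service_account_name = "order-service"

  # Minimal policy document scoped to exactly this service's resources
  policy_json = data.aws_iam_policy_document.order_service.json

  # Defense in depth, applied uniformly to every role the module creates
  permissions_boundary_arn = aws_iam_policy.workload_boundary.arn
}
```

Because the module renders the `sub` claim from its namespace and service account inputs, a mistyped trust policy simply cannot happen.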


Layer 4: Runtime Detection with Falco and Tetragon

Admission control and network policies are preventive controls. You also need detective controls: something that watches what's actually happening inside containers at runtime. I deploy both Falco and Tetragon, because they complement each other: Falco excels at high-level behavioral rules with a rich community rule library, while Tetragon provides kernel-level enforcement via eBPF that can actually kill a process before it completes a malicious action.

Here's a custom Falco rule I've written that catches a specific attack pattern I've seen in the wild: an attacker exploiting a web application vulnerability to access the IMDS endpoint and steal IAM credentials:

- rule: Detect IMDS Token Theft via Web Process
  desc: >
    Detects when a web server process (nginx, apache, node, python, java)
    makes an HTTP connection to the EC2 Instance Metadata Service.
    This is a strong indicator of SSRF exploitation or container escape
    attempting to steal IAM credentials.
  condition: >
    evt.type in (connect, sendto) and
    evt.dir = < and
    fd.sip = "169.254.169.254" and
    container.id != host and
    (proc.name in (nginx, apache2, httpd, node, python, python3, java, dotnet) or
     proc.pname in (nginx, apache2, httpd, node, python, python3, java, dotnet))
  output: >
    IMDS access from web process detected
    (container=%container.name pod=%k8s.pod.name ns=%k8s.ns.name
    process=%proc.name parent=%proc.pname cmdline=%proc.cmdline
    connection=%fd.name user=%user.name image=%container.image.repository)
  priority: CRITICAL
  tags: [aws, imds, ssrf, credential_theft, mitre_credential_access]

- rule: Detect Service Account Token Read
  desc: >
    Detects processes reading the projected service account token.
    While normal for application initialization, reading this token
    from a shell or unexpected process indicates potential credential theft.
  condition: >
    open_read and
    fd.name startswith /var/run/secrets/kubernetes.io/serviceaccount and
    container.id != host and
    not proc.name in (vault-agent, aws-iam-authenticator, envoy, pilot-agent) and
    proc.pname in (bash, sh, dash, zsh, curl, wget)
  output: >
    Service account token read by suspicious process
    (container=%container.name pod=%k8s.pod.name ns=%k8s.ns.name
    process=%proc.name parent=%proc.pname file=%fd.name)
  priority: HIGH
  tags: [kubernetes, credential_access, service_account]

For Tetragon, I configure enforcement policies that go beyond detection: they actually prevent the malicious action. This TracingPolicy kills any process that attempts to load a kernel module from within a container:

apiVersion: cilium.io/v1alpha1
kind: TracingPolicy
metadata:
  name: block-kernel-module-load
spec:
  kprobes:
    - call: "init_module"
      syscall: false
      args:
        - index: 0
          type: "nop"
      selectors:
        - matchNamespaces:
            - namespace: Pid
              operator: NotIn
              values:
                - "host_ns"
          matchActions:
            - action: Sigkill
    - call: "finit_module"
      syscall: false
      args:
        - index: 0
          type: "nop"
      selectors:
        - matchNamespaces:
            - namespace: Pid
              operator: NotIn
              values:
                - "host_ns"
          matchActions:
            - action: Sigkill

This is particularly important for mitigating kernel exploit chains. If an attacker gains code execution inside a container and attempts to escalate privileges by loading a malicious kernel module, Tetragon terminates the process at the kernel level before the module loads. This would have been effective against several real-world container escape techniques that rely on CAP_SYS_MODULE or abusing writable cgroup mounts.

The Shift-Left Pipeline: Catching Misconfigurations Before They Deploy

All of the runtime controls I've described are your last line of defense. The real goal is to catch misconfigurations before they ever reach the cluster. Over the past two years, I've refined a shift-left pipeline that catches approximately 90% of security issues before they leave the developer's machine or CI environment:

[Diagram: Shift-Left Kubernetes Security Pipeline.] From the developer's git commit to the running cluster, each stage narrows what gets through: pre-commit checks (kyverno apply, kubesec scan, trivy config) catch roughly 60% of issues; the CI pipeline (OPA Conftest, Trivy image scanning, SBOM generation) catches roughly 30% more; ArgoCD is the final gate (admission webhooks, Kyverno enforcement, image verification); and EKS runtime controls (Falco alerts, Tetragon enforcement, Cilium Hubble) detect the unknowns. Remediation cost rises by orders of magnitude along the way: catch misconfigurations where they cost $1, not $10,000+ in incident response.

The pipeline has four stages, and the key insight is that each stage catches a different class of issues:

Stage 1: Pre-commit Hooks

I use Kyverno's CLI in pre-commit hooks. Developers get immediate feedback before they even push code:

# .pre-commit-config.yaml
repos:
  - repo: https://github.com/kyverno/kyverno
    rev: v1.11.4
    hooks:
      - id: kyverno-apply
        name: Kyverno Policy Check
        entry: kyverno apply ./policies/ --resource
        files: '(deployment|statefulset|daemonset|pod).*\.ya?ml$'
        language: golang
  - repo: https://github.com/aquasecurity/trivy
    rev: v0.50.1
    hooks:
      - id: trivy-config
        name: Trivy Config Scan
        entry: trivy config --severity HIGH,CRITICAL --exit-code 1
        files: '.*\.ya?ml$'
        language: golang

Stage 2: CI Policy Scanning

In CI (I use GitHub Actions for most clients, but this works with any CI system), I run a comprehensive policy scan that includes OPA Conftest for custom Rego policies, Trivy for both config and image scanning, and SBOM generation with Syft:

# .github/workflows/security-scan.yml
name: Security Scan
on:
  pull_request:
    paths:
      - 'k8s/**'
      - 'Dockerfile*'
      - 'terraform/**'

jobs:
  policy-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run OPA Conftest
        uses: open-policy-agent/conftest-action@v2
        with:
          files: k8s/
          policy: policies/

      - name: Trivy Image Scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: image
          # Assumes an earlier step built and pushed the image; IMAGE_NAME is
          # illustrative. A bare commit SHA is not a valid image reference.
          image-ref: ${{ env.IMAGE_NAME }}:${{ github.event.pull_request.head.sha }}
          severity: HIGH,CRITICAL
          exit-code: 1
          format: sarif
          output: trivy-results.sarif

      - name: Upload SARIF
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: trivy-results.sarif

      - name: Generate SBOM
        uses: anchore/sbom-action@v0
        with:
          format: spdx-json
          output-file: sbom.spdx.json

Stage 3: ArgoCD Admission Gate

ArgoCD serves as my GitOps deployment engine, and it's the final gate before resources reach the cluster. I configure ArgoCD to sync through the same admission webhooks that protect the cluster, so even if someone pushes a non-compliant manifest to the Git repo, ArgoCD's sync will fail with a clear policy violation message.

The pattern is straightforward: Kyverno or Gatekeeper is already running in the cluster in Enforce mode. When ArgoCD attempts to apply a resource that violates policy, the admission webhook rejects the request, ArgoCD marks the sync as failed, and the team gets alerted via Slack or PagerDuty. This creates a feedback loop where developers learn the policies by encountering them during deployment, not during an incident post-mortem.
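Argo CD Notifications makes the alerting half of that loop declarative. Assuming the notifications controller is installed and a Slack service is configured, a subscription annotation on the Application is all a team needs (names, repo URL, and channel are illustrative):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
  annotations:
    # Fire the built-in on-sync-failed trigger to this Slack channel,
    # which is exactly what happens when an admission webhook rejects a sync
    notifications.argoproj.io/subscribe.on-sync-failed.slack: team-payments-alerts
spec:
  project: payments
  source:
    repoURL: https://github.com/internal/payments-manifests.git
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The policy violation message from Kyverno or Gatekeeper lands in the sync error, so the Slack alert tells the team exactly which rule they tripped.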

Stage 4: Runtime Monitoring

Even with three layers of shift-left controls, runtime monitoring catches what static analysis cannot: zero-day exploits, supply chain attacks in base images, and insider threats. The Falco and Tetragon rules I showed earlier form this layer, with alerts routed to a SIEM (I typically use Datadog or Elastic Security) for correlation and incident response.

Putting It All Together: The Terraform Module Structure

I package all of these controls into a single Terraform module that can be applied to any EKS cluster. The module structure looks like this:

module "eks_security_baseline" {
  source = "git::https://github.com/internal/terraform-eks-security.git?ref=v3.2.1"

  cluster_name       = module.eks.cluster_name
  cluster_endpoint   = module.eks.cluster_endpoint
  oidc_provider_arn  = module.eks.oidc_provider_arn
  oidc_provider      = module.eks.oidc_provider

  # Cilium configuration
  enable_cilium             = true
  cilium_version            = "1.15.3"
  cilium_encryption         = "wireguard"
  cilium_default_deny       = true
  cilium_hubble_enabled     = true

  # Admission control
  enable_kyverno            = true
  kyverno_version           = "3.1.4"
  kyverno_validation_action = "Enforce"
  enable_gatekeeper         = true
  gatekeeper_version        = "3.15.1"

  # Runtime security
  enable_falco              = true
  falco_version             = "0.37.1"
  falco_driver              = "modern_ebpf"
  enable_tetragon           = true
  tetragon_version          = "1.0.2"

  # Monitoring integration
  alert_webhook_url         = var.slack_security_webhook
  siem_endpoint             = var.datadog_log_endpoint

  # IRSA defaults
  irsa_permission_boundary  = aws_iam_policy.workload_boundary.arn
  irsa_token_expiration     = 3600  # 1 hour, not the default 12

  tags = var.common_tags
}

Notice irsa_token_expiration = 3600. The default IRSA token lifetime is 12 hours, which is far too long. If an attacker exfiltrates a projected service account token, they have a 12-hour window to use it from outside the cluster. I reduce this to 1 hour for all workloads and 15 minutes for highly sensitive ones. The AWS SDK handles token refresh transparently, so there's no application impact.
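The knob behind that module input is the `eks.amazonaws.com/token-expiration` annotation, which the EKS pod identity webhook reads from the ServiceAccount (role ARN below is illustrative):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: payment-processor
  namespace: payments
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/prod-payment-processor
    # Shrink the projected token lifetime from the webhook's default
    # to one hour; the AWS SDK refreshes it transparently
    eks.amazonaws.com/token-expiration: "3600"
```

A stolen token is still a stolen token, but its usable window outside the cluster shrinks from hours to minutes once the rotation cycle is this tight.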

Real-World Incident: Why This Matters

Let me share a sanitized version of an incident that validated this architecture. During a penetration test on a financial services client's EKS cluster, the red team exploited a Server-Side Request Forgery (SSRF) vulnerability in a web application to reach the EC2 Instance Metadata Service (IMDS). In a cluster without IRSA, this would have given them the node's IAM role, which typically has permissions to pull ECR images, write CloudWatch logs, and interact with the EKS API. From there, lateral movement is trivial.

In this case, the layered defense worked exactly as designed:

  1. IRSA meant the node's instance profile had minimal permissions: no S3, no Secrets Manager, no ability to assume other roles. The IMDS credentials were nearly useless.
  2. Cilium network policy blocked the pod's connection to 169.254.169.254 entirely (we allow IMDS access only from specific system pods that need it). The SSRF attempt never reached IMDS.
  3. Falco detected the attempted connection to the metadata service from a web process (using the rule I showed above) and fired a CRITICAL alert within 800ms.
  4. Tetragon logged the full process tree, giving the incident response team a complete timeline: which process initiated the connection, what command-line arguments were used, and the parent process chain back to the container entrypoint.

The total time from exploit attempt to SOC alert was under 2 seconds. The red team's report noted this was the most effective container security architecture they'd encountered across their client base.

Operational Considerations

A few things I've learned the hard way that aren't in any documentation:

Falco: Use the Modern eBPF Driver

Falco's modern eBPF driver is worth the kernel version requirement. The older kernel module driver caused node instability under high syscall volumes in our load tests. Since moving to the modern eBPF driver (requires kernel 5.8+, which all current EKS AMIs support), we've had zero Falco-related node issues. If you're still on the kernel module driver, migrate immediately.
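With the official falcosecurity/falco Helm chart, selecting the driver is a one-line values change (recent chart versions expose it as `driver.kind`; verify the key against your chart version):

```yaml
# values.yaml for the falcosecurity/falco Helm chart
driver:
  kind: modern_ebpf   # instead of kmod (kernel module) or ebpf (legacy probe)
falco:
  json_output: true   # structured events for the SIEM/alert pipeline
```

The modern eBPF driver ships inside the Falco binary (CO-RE), so there's no per-kernel module or probe to build on node rotation.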

Kyverno: Audit Mode Before Enforce

Kyverno's Audit mode is your friend during rollout. Never deploy Kyverno policies in Enforce mode to an existing cluster without first running in Audit for at least two weeks. I've seen well-intentioned policy rollouts take down production because a critical system pod (usually something in kube-system) violated the new policy. Use Kyverno's policy reports to identify violations, fix them, then switch to Enforce.
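In practice the rollout is a one-field flip on the ClusterPolicy spec shown earlier:

```yaml
spec:
  validationFailureAction: Audit   # flip to Enforce after the burn-in period
  background: true                 # also report on existing resources, not just new ones
```

During the burn-in, violations accumulate in PolicyReport resources (`kubectl get policyreport -A`), which is where you'll find the kube-system offenders before they can break an Enforce rollout.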

Cilium: Plan Default-Deny Rollout

Cilium's policyEnforcementMode: always requires careful planning. When you enable default-deny, every pod that doesn't have an explicit CiliumNetworkPolicy will lose all network connectivity. I handle this by deploying network policies as part of the same Helm chart or Kustomize overlay as the application, so the policy and the workload arrive together. For existing clusters, start with policyEnforcementMode: default and add policies namespace by namespace.

Admission Webhook Latency at Scale

Monitor your admission webhook latency. Both Kyverno and Gatekeeper add latency to every API server request. In clusters with high pod churn (1000+ pod creates per minute during scaling events), I've seen webhook latency spike to 500ms+, which cascades into Deployment rollout timeouts. Set resource requests/limits appropriately, run multiple replicas, and monitor the kyverno_admission_review_duration_seconds metric closely.

What's Next: The Architecture Is Never Done

The patterns I've described here represent the current state of production-hardened EKS security, but the landscape is evolving rapidly. I'm actively evaluating Cilium's service mesh mode as a replacement for Istio (fewer moving parts, better performance), exploring Tetragon's new runtime enforcement policies for file integrity monitoring, and working on integrating Sigstore/cosign image verification into the admission pipeline so that only signed, attested images can run in production.

If you're starting from scratch, my recommendation is: IRSA first, Kyverno second, Cilium third, Falco fourth. Each layer is independently valuable, and you can adopt them incrementally without disrupting existing workloads. The key is to start. A partial implementation of this architecture is infinitely better than the default EKS configuration, which provides almost no workload security out of the box.

The AWS EKS Best Practices Guide is an excellent companion to this post, and I'd encourage every EKS operator to treat the CIS Kubernetes Benchmark as a minimum baseline, not an aspirational target. If you're running production workloads on EKS and you haven't implemented at least IRSA and admission control, you have an urgent security gap that needs to be addressed now, not next quarter.
