
fix(eksdetector): add IRSA and Pod Identity env var checks before JWT fallback #2107

Open: musa-asad wants to merge 1 commit into main from fix/harden-eksdetector-env-var-checks

@musa-asad commented Apr 30, 2026

Description of the issue

The CloudWatch Agent (CWA) eksdetector detects EKS by parsing the service account JWT token and checking its issuer. This fails on:

  • Custom OIDC providers (non-EKS issuers)
  • Opaque tokens (no parseable JWT payload)
  • Token mount race conditions (file not yet available at startup)
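To make the first two failure modes concrete, here is a minimal sketch of issuer-based detection. It is not the agent's actual code: `issuerFromToken`, `looksLikeEKS`, and `fakeJWT` are hypothetical names, and the substring check on "eks" follows the commit message for this PR. The third failure mode (the token file not yet mounted) would fail earlier, when the file is read, and is not shown.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// issuerFromToken is a hypothetical helper, not the agent's getIssuer.
// It decodes the payload segment of a JWT and returns its "iss" claim.
func issuerFromToken(token string) (string, error) {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		// Opaque tokens end up here: there is no payload segment to parse.
		return "", errors.New("token is not a three-part JWT")
	}
	payload, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return "", fmt.Errorf("decode payload: %w", err)
	}
	var claims struct {
		Issuer string `json:"iss"`
	}
	if err := json.Unmarshal(payload, &claims); err != nil {
		return "", fmt.Errorf("parse claims: %w", err)
	}
	return claims.Issuer, nil
}

// looksLikeEKS applies the "issuer contains eks" check described in this PR.
func looksLikeEKS(token string) bool {
	iss, err := issuerFromToken(token)
	return err == nil && strings.Contains(iss, "eks")
}

// fakeJWT builds an unsigned token with the given issuer, for illustration only.
func fakeJWT(issuer string) string {
	enc := base64.RawURLEncoding.EncodeToString
	header := enc([]byte(`{"alg":"none"}`))
	body, _ := json.Marshal(map[string]string{"iss": issuer})
	return header + "." + enc(body) + ".sig"
}

func main() {
	// EKS-style issuer: detected.
	fmt.Println(looksLikeEKS(fakeJWT("https://oidc.eks.us-west-2.amazonaws.com/id/EXAMPLE")))
	// Custom OIDC issuer on an EKS cluster: missed by the issuer check.
	fmt.Println(looksLikeEKS(fakeJWT("https://issuer.example.com")))
	// Opaque token: no parseable payload, so detection fails.
	fmt.Println(looksLikeEKS("opaque-token-value"))
}
```

The issuer URLs above are made-up examples. Only the first one passes the check, which is why a cluster with a custom issuer or opaque tokens is not detected by this path alone.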

Description of changes

Add fast-path environment variable checks before the existing JWT fallback:

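  • Check for the IRSA token path: AWS_WEB_IDENTITY_TOKEN_FILE contains "eks.amazonaws.com"
  • Check for the Pod Identity token path: AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE contains "eks-pod-identity"
  • Negligible cost: the checks read only the process environment, with no file I/O and no network calls
  • Falls back to the existing JWT parsing if neither env var matches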

Important

Co-PRs: amazon-contributing/opentelemetry-collector-contrib#516 — OTEL resource detector 5-step fallback (replaces aws-auth ConfigMap dependency)

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Setup

Tested on an EKS cluster with authentication_mode = API (Access Entries only, no aws-auth ConfigMap present) and one managed node group (t3.medium). IRSA is configured via an OIDC provider. A custom CWA image was built with this fork change applied via go.mod replace directives, then deployed via the mainline amazon-cloudwatch-observability Helm chart with the mainline operator.

Unit Tests

10 tests (4 existing + 6 new): TestEKS, TestNonEKS, TestEmptyToken, Test_getIssuer, TestEKS_IRSA_EnvVar, TestEKS_PodIdentity_EnvVar, TestEKS_EnvVarsAbsent_FallsThrough, TestNonEKS_EnvVarsAbsent_NonEKSToken, TestEKS_PartialEnvVars_IRSAWithoutEKS, TestEKS_BothEnvVarsSet. All pass.
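As a rough guide to what the new env var cases exercise, the table below is a sketch only. The expected results are inferred from the test names and the commit message, not copied from the test file. `fastPath` is a hypothetical map-driven stand-in for the env var check, and the program uses a plain loop rather than the `testing` package so it runs standalone.

```go
package main

import (
	"fmt"
	"strings"
)

// fastPath is a hypothetical, map-driven version of the env var check so
// that cases can run without mutating the real process environment.
func fastPath(env map[string]string) bool {
	return strings.Contains(env["AWS_WEB_IDENTITY_TOKEN_FILE"], "eks.amazonaws.com") ||
		strings.Contains(env["AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE"], "eks-pod-identity")
}

type envCase struct {
	name string
	env  map[string]string
	want bool // whether the fast path alone should report EKS
}

// cases lists one scenario per new test name, as interpreted here.
func cases() []envCase {
	const (
		irsaPath = "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
		podPath  = "/var/run/secrets/pods.eks.amazonaws.com/serviceaccount/eks-pod-identity-token"
	)
	return []envCase{
		{"IRSA env var", map[string]string{"AWS_WEB_IDENTITY_TOKEN_FILE": irsaPath}, true},
		{"Pod Identity env var", map[string]string{"AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE": podPath}, true},
		{"env vars absent", map[string]string{}, false},
		{"IRSA var without EKS path", map[string]string{"AWS_WEB_IDENTITY_TOKEN_FILE": "/tmp/custom/token"}, false},
		{"both env vars set", map[string]string{
			"AWS_WEB_IDENTITY_TOKEN_FILE":            irsaPath,
			"AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE": podPath,
		}, true},
	}
}

func main() {
	for _, c := range cases() {
		got := fastPath(c.env)
		status := "ok"
		if got != c.want {
			status = "FAIL"
		}
		fmt.Printf("%-28s fastPath=%-5t %s\n", c.name, got, status)
	}
}
```

Cases where the fast path returns false are the ones that would go on to the JWT fallback in the real detector.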

E2E — CWA eksdetector identifies EKS

Raw eksdetector output from agent logs:

container_orchestrator: eks

E2E — IRSA env var fast-path exercised

The service account has the IRSA annotation set, which causes EKS to inject AWS_WEB_IDENTITY_TOKEN_FILE into the pod via webhook — triggering the env var fast-path before JWT parsing:

apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/cwbs-4852-e2e-v3-cwagent
    meta.helm.sh/release-name: amazon-cloudwatch-observability
    meta.helm.sh/release-namespace: amazon-cloudwatch
  creationTimestamp: "2026-05-05T19:08:58Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: cloudwatch-agent
  namespace: amazon-cloudwatch
  resourceVersion: "1943"
  uid: 4790b424-2711-429a-8df9-fd5e116e8d29

E2E — ContainerInsights metrics flowing

The CWA eksdetector result gates whether ContainerInsights receivers are configured during config translation. Metrics flowing to CloudWatch confirm that isEKS() returned true via the env var fast-path, which enabled the full ContainerInsights pipeline.

Metric count: 1154
Metric names:

apiserver_admission_controller_admission_duration_seconds
apiserver_admission_step_admission_duration_seconds
apiserver_current_inflight_requests
apiserver_current_inqueue_requests
apiserver_longrunning_requests
apiserver_request_duration_seconds
apiserver_request_total
apiserver_request_total_5xx
apiserver_storage_list_duration_seconds
apiserver_storage_objects
apiserver_storage_size_bytes
cluster_failed_node_count
cluster_node_count
cluster_number_of_running_pods
container_cpu_limit
container_cpu_request
container_cpu_utilization
container_cpu_utilization_over_container_limit
container_memory_failures_total
container_memory_limit
container_memory_request
container_memory_utilization
container_memory_utilization_over_container_limit
etcd_request_duration_seconds
namespace_number_of_running_pods
node_cpu_limit
node_cpu_reserved_capacity
node_cpu_usage_total
node_cpu_utilization
node_filesystem_inodes
node_filesystem_inodes_free
node_filesystem_utilization
node_interface_network_rx_dropped
node_interface_network_tx_dropped
node_memory_limit
node_memory_reserved_capacity
node_memory_utilization
node_memory_working_set
node_network_total_bytes
node_number_of_running_containers
node_number_of_running_pods
node_status_allocatable_pods
node_status_capacity_pods
node_status_condition_disk_pressure
node_status_condition_memory_pressure
node_status_condition_pid_pressure
node_status_condition_ready
node_status_condition_unknown
persistent_volume_count
pod_container_status_running
pod_container_status_terminated
pod_container_status_waiting
pod_cpu_limit
pod_cpu_request
pod_cpu_reserved_capacity
pod_cpu_usage_total
pod_cpu_utilization
pod_cpu_utilization_over_pod_limit
pod_interface_network_rx_dropped
pod_interface_network_tx_dropped
pod_memory_limit
pod_memory_request
pod_memory_reserved_capacity
pod_memory_utilization
pod_memory_utilization_over_pod_limit
pod_memory_working_set
pod_network_rx_bytes
pod_network_tx_bytes
pod_number_of_container_restarts
pod_number_of_containers
pod_number_of_running_containers
pod_status_failed
pod_status_pending
pod_status_ready
pod_status_running
pod_status_scheduled
pod_status_succeeded
pod_status_unknown
replicas_desired
replicas_ready
rest_client_request_duration_seconds
rest_client_requests_total
service_number_of_running_pods
status_replicas_available
status_replicas_unavailable

Sample datapoint (node_cpu_utilization):

{
    "Datapoints": [
        {
            "Timestamp": "2026-05-05T19:18:00Z", 
            "Average": 3.892803384167258, 
            "Unit": "Percent"
        }, 
        {
            "Timestamp": "2026-05-05T19:17:00Z", 
            "Average": 3.3307807825105575, 
            "Unit": "Percent"
        }, 
        {
            "Timestamp": "2026-05-05T19:19:00Z", 
            "Average": 3.495762318338428, 
            "Unit": "Percent"
        }
    ], 
    "Label": "node_cpu_utilization"
}

E2E — Pod stability (0 restarts)

NAME                                                              READY   STATUS    RESTARTS   AGE     IP              NODE                                          NOMINATED NODE   READINESS GATES
amazon-cloudwatch-observability-controller-manager-56bcdfdk9dn8   1/1     Running   0          11m     <IP>    <NODE>   <none>           <none>
cloudwatch-agent-vfn76                                            1/1     Running   0          5m23s   <IP>   <NODE>   <none>           <none>
fluent-bit-wswk8                                                  1/1     Running   0          11m     <IP>   <NODE>   <none>           <none>

… fallback

The EKS detector parses the service account JWT token to check if the
issuer contains "eks". This fails on edge cases: custom OIDC providers,
opaque tokens, and token mount race conditions at pod startup.

Add fast-path env var checks before the JWT fallback:
1. AWS_WEB_IDENTITY_TOKEN_FILE contains "eks.amazonaws.com" (IRSA)
2. AWS_CONTAINER_AUTHORIZATION_TOKEN_FILE contains "eks-pod-identity" (Pod Identity)

These are zero-cost checks that cover the two most common modern EKS
auth patterns. The existing JWT parsing is preserved as a fallback.

@okankoAMZ okankoAMZ left a comment


Checks failing

@musa-asad added the "ready for testing" label (indicates this PR is ready for integration tests to run) on May 5, 2026