Skip to content

Mitigate intermittent AWF startup failures when awf-api-proxy health check flaps#27909

Draft
Copilot wants to merge 4 commits intomainfrom
copilot/aw-failures-fix-awf-api-proxy
Draft

Mitigate intermittent AWF startup failures when awf-api-proxy health check flaps#27909
Copilot wants to merge 4 commits intomainfrom
copilot/aw-failures-fix-awf-api-proxy

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 22, 2026

Main-branch agentic workflows were failing before agent activation because Docker compose exited when awf-api-proxy became unhealthy during startup. Failures were transient and cross-engine, with a consistent pre-inference crash signature.

  • AWF startup resilience for api-proxy flaps

    • Added a thin /usr/local/bin/awf wrapper that retries AWF startup when stderr contains:
      • dependency failed to start: container awf-api-proxy is unhealthy
    • Default behavior is one retry with backoff; both are env-configurable:
      • AWF_API_PROXY_START_RETRIES (default 1)
      • AWF_API_PROXY_RETRY_DELAY_SECONDS (default 5)
  • Install-path split: wrapper vs real runtime

    • install_awf_binary.sh now installs the actual runtime to /usr/local/lib/awf/awf-real (bundle launcher or native binary).
    • /usr/local/bin/awf is now the retrying launcher, preserving existing call sites while centralizing retry logic.
  • Operational hardening

    • Wrapper uses secure temp-file creation (umask 077 && mktemp) when capturing AWF output for retry pattern detection.
    • Emits explicit retry telemetry to stderr for post-failure forensics.
# /usr/local/bin/awf (new behavior)
if awf_real_output_contains "dependency failed to start: container awf-api-proxy is unhealthy"; then
  sleep "${AWF_API_PROXY_RETRY_DELAY_SECONDS:-5}"
  retry_once
fi

🤖 Smoke CI scheduled run — https://github.com/github/gh-aw/actions/runs/24812244277

Generated by Smoke CI · ● 401.7K ·


🤖 Smoke CI scheduled run — https://github.com/github/gh-aw/actions/runs/24813191140

Generated by Smoke CI · ● 393.8K ·


🤖 Smoke CI scheduled run — https://github.com/github/gh-aw/actions/runs/24818568683

Generated by Smoke CI · ● 394.8K ·


🤖 Smoke CI scheduled run — https://github.com/github/gh-aw/actions/runs/24820507401

Generated by Smoke CI · ● 399.8K ·

Copilot AI and others added 3 commits April 22, 2026 22:27
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/596920a6-3a87-4dfb-9f45-762e54d420ac

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Agent-Logs-Url: https://github.com/github/gh-aw/sessions/596920a6-3a87-4dfb-9f45-762e54d420ac

Co-authored-by: pelikhan <4175913+pelikhan@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix awf-api-proxy sidecar health check issue Mitigate intermittent AWF startup failures when awf-api-proxy health check flaps Apr 22, 2026
Copilot AI requested a review from pelikhan April 22, 2026 22:32
@pelikhan pelikhan added the awf label Apr 22, 2026
@pelikhan
Copy link
Copy Markdown
Collaborator

@lpcox could this be something else?

@lpcox
Copy link
Copy Markdown
Collaborator

lpcox commented Apr 22, 2026

@pelikhan looking into it

@github-actions

This comment has been minimized.

@github-actions

This comment has been minimized.

@github-actions
Copy link
Copy Markdown
Contributor

Hey @Copilot 👋 — thanks for working on improving AWF startup resilience! The retry wrapper approach for the awf-api-proxy health check flap is a solid fix, and the PR description clearly explains the problem and solution.

One thing to address before this is ready for review:

  • No tests — There are no test file changes covering the new retry logic or the install-path split. Adding coverage for the retry behavior (e.g., success on first attempt, success after retries, failure after max retries) would significantly strengthen confidence in the change.
Add tests for the retry wrapper and install-path logic introduced in PR #27909 on github/gh-aw:
1. Identify the test file(s) corresponding to the changed shell/workflow code.
2. Add unit or integration tests covering:
   - Successful health check on first attempt (no retry needed)
   - Successful health check after N retries (N < max)
   - Failure after max retries exhausted
3. Run `make agent-finish` to confirm all tests pass.

Generated by Contribution Check · ● 930.7K ·

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[aw-failures] awf-api-proxy sidecar unhealthy — now blocking main-branch workflows (DeepReport, Smoke CI, Test Quality Sentinel)

3 participants