mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-16 01:02:13 +08:00
* fix(browse): single-point Unicode sanitization at server egress

Add sanitizeLoneSurrogates (regex-based UTF-16 lone-half cleaner) and sanitizeReplacer (JSON.stringify replacer that runs the cleaner on every string field during encoding).

Split handleCommandInternal into handleCommandInternalImpl (raw) plus a thin sanitizing wrapper. The wrapper applies sanitizeLoneSurrogates to cr.result so both single-command (handleCommand line 1034) and batch-loop (line 1966) egress paths inherit it. Inline INVARIANT comment near the wrapper documents the architectural constraint.

Both SSE producers (activity feed at /activity/stream and inspector stream) stringify with sanitizeReplacer. Post-stringify regex is ineffective on those paths because JSON.stringify has already converted the lone surrogate into the escape sequence "\uD800" before any regex could match it; the replacer runs during stringify on the raw string value, so the substitution lands.

Originated from @realcarsonterry PR #1463 (handleCommand-only wrap). Architectural lift to handleCommandInternal + SSE coverage authored on this branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(setup): _link_or_copy helper for Windows file-copy fallback

On Windows without Developer Mode (MSYS2/Git Bash), plain ln -snf silently creates a frozen file copy that doesn't refresh on git pull. Skill files become stale after every upgrade.

Add a _link_or_copy SRC DST helper near IS_WINDOWS detection (line ~33). It auto-dispatches: on Unix it preserves ln -snf semantics, on Windows it copies (cp -R for directories, cp -f for files). When the source is a Unix-style name-only alias that doesn't resolve on disk (the connect-chrome → gstack/open-gstack-browser pattern), the helper returns 0 silently on Windows rather than aborting setup under set -e.

Rewrite all 42 prior ln -snf call sites to route through the helper: link_claude_skill_dirs (line 437), team-claude install paths (lines 556, 581, 592), Codex host adapter block (lines 618-640), Factory host adapter block (lines 658-678), OpenCode host adapter block (lines 696-731), Kiro host adapter block (lines 939-953), plus migration and alias sites.

Add _print_windows_copy_note_once helper and call it from link_claude_skill_dirs after any linking work completes so Windows users see one user-visible note explaining they must re-run ./setup after every git pull.

Extend cleanup_old_claude_symlinks and cleanup_prefixed_claude_symlinks with a Windows branch: when the target is a real directory containing a real-file SKILL.md (no symlink to readlink), and IS_WINDOWS=1, treat the name-matched directory as gstack-managed and remove it. This makes --prefix / --no-prefix flips work on Windows instead of leaving stale copies behind.

Originated from @realcarsonterry PR #1462 (1 of 42 sites). Helper extraction, 42-site rewrite, alias-resolution edge case, and Windows cleanup compat authored on this branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* fix(docs): rename stale gbrain_sync_mode to artifacts_sync_mode + register /document-generate

Five stale gstack-config references in docs/ pointed to the deprecated gbrain_sync_mode key (renamed to artifacts_sync_mode in v1.27.0.0):
- docs/gbrain-sync.md: lines 62, 110, 111, 173
- docs/gbrain-sync-errors.md: lines 26, 203

Users following the docs would set a key that gstack-brain-sync no longer reads, silently breaking artifacts sync. Originated from @realcarsonterry PR #1461 (verbatim).

Also register /document-generate in AGENTS.md (Operational + memory table) and docs/skills.md (skill index). The skill shipped in v1.35.0.0 but the doc-inventory cross-check in test/skill-validation.test.ts was failing because neither file mentioned it.

Allowlist the new test/docs-config-keys.test.ts file in test/no-stale-gstack-brain-refs.test.ts: it intentionally lists the deprecated keys in its DEPRECATED_KEYS denylist (defending the rename).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci(windows): migrate windows-free-tests to paid faster runner + register wave tests

Move the Windows free-test job from GitHub-hosted windows-latest to Blacksmith's paid Windows runner (blacksmith-2vcpu-windows-2022). Spin-up drops from ~60s to ~10s and Bun installs land 3-4x faster. The label can swap to namespace-profile-windows or ubicloud-windows-* if this repo's Blacksmith installation isn't configured.

Register the four new wave tests in the workflow's curated test list:
- browse/test/server-sanitize-surrogates.test.ts
- test/setup-windows-fallback.test.ts
- test/build-script-shell-compat.test.ts
- test/docs-config-keys.test.ts

These tests cover the Windows-hardening surface that this wave ships (sanitizer wiring, _link_or_copy helper, build-script subshells, doc-config drift), so they need to run on Windows where the bug shapes actually manifest.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test: wave coverage for sanitizer, link_or_copy, build script, doc drift

Four new test files (29 cases total):

browse/test/server-sanitize-surrogates.test.ts:
- 11 unit cases for sanitizeLoneSurrogates (passthrough, valid pair, lone high/low mid-string, trailing/leading lone, adjacent doubles, pair-then-lone, lone-then-pair, empty)
- 2 bug-repro tests pinning the regression intent (UTF-8 round-trip, JSON.parse round-trip with codepoint assertion)
- 4 wiring invariants asserting the architectural choke points stay intact (handleCommandInternalImpl rename, central sanitization line, sanitizeReplacer function exists, SSE producers stringify with replacer)

Function extracted from server.ts via regex + eval'd in test scope so no production-code export is needed.

test/setup-windows-fallback.test.ts:
- Static invariant (D7): zero raw `ln` calls outside the _link_or_copy helper body and comments
- Helper-existence assertions
- 4-cell behavior matrix (file/dir × Windows/Unix) via awk-style helper extraction + bash -c sourcing
- Windows-note printer registration check

Mirrors test/setup-conductor-worktree.test.ts patterns.

test/build-script-shell-compat.test.ts:
- Regex assertion that package.json scripts.* contain no bash brace groups (Bun-Windows-hostile)
- Subshell-precedence check for `.version` redirects

Strips single-quoted strings before regexing so embedded JS code inside echo '...' doesn't false-positive.

test/docs-config-keys.test.ts:
- DEPRECATED_KEYS denylist scanned across docs/**/*.md
- Round-trip test for `gstack-config get artifacts_sync_mode`

Defends the v1.27.0.0 rename from doc drift.

Updates to two existing tests:
- test/setup-conductor-worktree.test.ts: expect `_link_or_copy` instead of `ln -snf` at the Conductor-worktree guard call site
- test/gen-skill-docs.test.ts: same swap at three assertion sites (Codex section, Claude link_claude_skill_dirs body, Codex link_codex_skill_dirs body)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* chore: bump v1.38.0.0 + build-script subshells + CHANGELOG

VERSION 1.35.0.0 → 1.38.0.0 (MINOR). PR #1500 (lyon-v2) claimed v1.37.0.0 ahead of this branch; v1.38.0.0 is the next free MINOR slot per bin/gstack-next-version queue check. Workspace-aware ship rule applies: queue-advancing past a claimed version within the same bump level is explicitly permitted.

package.json build script: three `{ git rev-parse HEAD ...; }` brace groups → `( git rev-parse HEAD ... )` subshells. Bun's Windows shell parser doesn't grok bash brace groups; subshells are POSIX-universal. Originated from @realcarsonterry PR #1460.

CHANGELOG entry covers the full wave:
- Windows install hardening (42-site _link_or_copy + cleanup compat)
- Unicode sanitization architecture (handleCommandInternal + SSE replacer)
- Build script POSIX-shell compat (subshells)
- Doc rename (gbrain_sync_mode → artifacts_sync_mode)
- Windows CI on paid faster runner
- 4 new wave tests (29 cases)

Frames each item as a current system property, not a fix narrative. Credits @realcarsonterry for PRs #1460, #1461, #1462, #1463 (the seed of the wave). Scope expansion to all 42 setup sites, every server egress path, Windows CI migration, and codex-flagged P0/P1 fixes (connect-chrome alias on Windows, SSE replacer, prefix-cleanup Windows compat) authored on this branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* docs: post-ship sync for v1.38.0.0

Document the two architectural invariants that landed in v1.38.0.0 in their persistent homes (not just CHANGELOG):
- README Windows section: add the `./setup` re-run-after-git-pull requirement that `_print_windows_copy_note_once` shows at runtime.
- CONTRIBUTING "Things to know": add the no-raw-`ln` invariant for contributors editing `setup`, with the test that enforces it.
- ARCHITECTURE: new "Unicode sanitization at server egress" section between Shell injection prevention and Prompt injection defense, with egress table (HTTP/batch/SSE) and the post-stringify-regex rationale.
- CLAUDE.md: cross-references for both invariants, matching the v1.6.0.0 dual-listener pattern (each constraint says which files to read before editing and which test pins it).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci(windows): use windows-latest-8-cores instead of unregistered Blacksmith label

actionlint failed PR #1505 because `blacksmith-2vcpu-windows-2022` isn't in the repo's approved runner-label list (actionlint.yaml only registers `ubicloud-standard-2`, and Ubicloud doesn't ship a Windows pool).

Switch to GitHub's paid larger Windows runner `windows-latest-8-cores`: 4x the cores of the free `windows-latest` at the larger-runner billing rate, no new third-party CI provider, no actionlint config changes.

CHANGELOG: replace "Blacksmith" / "blacksmith-2vcpu-windows-2022" / "~6x faster spin-up" claims with the actual choice (8 cores vs 4, paid larger runner).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci(windows): switch from windows-latest-8-cores to ubicloud-standard-2-windows

`windows-latest-8-cores` sat queued indefinitely because the GitHub larger-runner billing isn't enabled at the org level: the "Queued - Waiting to run this check" status surfaced on PR #1505 with no progress for the whole CI run.

Switch to Ubicloud Windows runners (`ubicloud-standard-2-windows`) so Windows CI uses the same provider as the existing Linux evals (`ubicloud-standard-2`). Billing stays under one account instead of two.

Register the new label in actionlint.yaml alongside the existing ubicloud-standard-2 entry so actionlint doesn't reject it as unknown.

CHANGELOG entry updated: runner row reflects the actual provider chosen, "Itemized changes" mentions the actionlint.yaml registration, and the narrative paragraph documents why `windows-latest-8-cores` failed first.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci: migrate all workflows to Ubicloud (Linux + Windows, 8-core)

Switch every `runs-on` in this repo to Ubicloud so CI has a single billing surface, consistent capacity, and 4x more cores on the workloads that were previously stuck on free `ubuntu-latest` (2 cores). Windows uses Ubicloud's Windows pool too (`ubicloud-standard-8-windows`), so the queued-forever problem with GitHub's `windows-latest-8-cores` paid larger runner (org-level larger-runner billing not enabled) goes away.

Workflows touched (9):
- evals.yml, evals-periodic.yml, ci-image.yml: bump default + matrix from `ubicloud-standard-2` to `ubicloud-standard-8`. The one matrix entry that was already on -8 stays.
- windows-free-tests.yml: `ubicloud-standard-2-windows` → `ubicloud-standard-8-windows`.
- make-pdf-gate.yml: matrix `ubuntu-latest` → `ubicloud-standard-8`. macOS entry preserved; the poppler-install `if: matrix.os` conditional swaps to match the new label.
- actionlint.yml, pr-title-sync.yml, skill-docs.yml, version-gate.yml: `ubuntu-latest` → `ubicloud-standard-8`.

.github/actionlint.yaml registers all four Ubicloud labels in one place:
- ubicloud-standard-2
- ubicloud-standard-8
- ubicloud-standard-2-windows (the v1.38.0.0 windows-free-tests target)
- ubicloud-standard-8-windows (this PR's windows-free-tests target)

Removed the duplicate `actionlint.yaml` at the repo root that I accidentally created in the prior commit; actionlint only reads `.github/actionlint.yaml`, so the root file was dead weight.

CHANGELOG entry updated: a single "all Ubicloud" sentence in the narrative plus a metrics-row covering the runner pool change, and the itemized line expanded to enumerate the 9 affected workflows. The previously-orphaned "Itemized changes" line about just `windows-free-tests.yml` is replaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* ci(windows): revert to free `windows-latest`

Ubicloud doesn't ship Windows runners, as confirmed via their docs. The `ubicloud-standard-*-windows` labels I added do not exist and were causing `windows-free-tests` to sit "Queued - Waiting to run this check" forever (GitHub Actions can't tell a typoed label from a self-hosted runner that's about to register; it just waits).

Three prior Windows-runner attempts all failed for different reasons:
- `blacksmith-2vcpu-windows-2022`: Blacksmith app not installed on the org
- `windows-latest-8-cores`: GitHub paid larger-runner billing not enabled
- `ubicloud-standard-2/8-windows`: Ubicloud doesn't offer Windows at all

The free `windows-latest` runner (4 cores, ~60s spin-up, $0) is the one path that actually runs. The wave-coverage Windows tests are <30s of real work; total job time stays under 2 minutes.

Cleaned up `.github/actionlint.yaml` to drop the bogus `ubicloud-standard-*-windows` entries; kept only the two real Linux labels.

CHANGELOG: split the runner-pool row into Linux (migrated to Ubicloud-8) vs Windows (stays on free windows-latest), with the why on each. Itemized line for windows-free-tests rewritten to reflect the actual outcome.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

* test(windows): skip Unix-only cases on Windows runner

windows-free-tests on GitHub free windows-latest fails three cases that depend on Unix tooling the runner doesn't have:

1. `setup-windows-fallback.test.ts` behavior matrix: IS_WINDOWS=0 cells assert `ln -snf` produces a real symlink. On Windows-without-Developer-Mode (which the free `windows-latest` runner is), `ln -snf` silently creates a file copy. That's literally the bug `_link_or_copy` exists to work around, so the assertion can never pass there. Skip the whole describe block on win32. The static-invariant test (zero raw `ln` outside the helper body) above the matrix still runs and pins the shape the Windows install relies on.

2. `docs-config-keys.test.ts` round-trip: spawnSync(`bin/gstack-config`) on Windows doesn't read the bash shebang and fails to exec. Skip on win32; the deprecated-key denylist test in the same file still runs and is the actual invariant defending the v1.27.0.0 rename at the doc layer.

Use `describe.skipIf(process.platform === 'win32', ...)` and `test.skipIf(process.platform === 'win32', ...)`. Tests still run on macOS and Linux unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
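
The dispatch rule that the fix(setup) commit describes for `_link_or_copy` can be sketched in bash as below. This is a hypothetical reconstruction from the commit text, not the code in `setup`: the `IS_WINDOWS` default, the `rm -rf` before copying, and the exact existence check are assumptions made for illustration.

```shell
set -eu

# Assumption: the caller sets IS_WINDOWS=1 on MSYS2/Git Bash, 0 elsewhere.
IS_WINDOWS="${IS_WINDOWS:-0}"

# _link_or_copy SRC DST (hypothetical sketch)
#   Unix:    symlink DST -> SRC, replacing an existing link (ln -snf).
#   Windows: copy SRC to DST (cp -R for directories, cp -f for files).
_link_or_copy() {
  local src="$1" dst="$2"

  if [ "$IS_WINDOWS" != "1" ]; then
    ln -snf "$src" "$dst"
    return 0
  fi

  # A name-only alias that does not resolve on disk is skipped quietly,
  # so a caller running under `set -e` is not aborted.
  if [ ! -e "$src" ]; then
    return 0
  fi

  # Assumption: clear DST first so cp -R does not nest SRC inside an
  # existing DST directory.
  rm -rf "$dst"
  if [ -d "$src" ]; then
    cp -R "$src" "$dst"
  else
    cp -f "$src" "$dst"
  fi
}
```

With `IS_WINDOWS=1`, calling `_link_or_copy skills/review "$HOME/.claude/skills/review"` (hypothetical paths) would leave a plain directory copy at the destination, which is why the commit adds a note telling Windows users to re-run `./setup` after every `git pull`.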
249 lines
8.7 KiB
YAML

name: E2E Evals
on:
  pull_request:
    branches: [main]
  workflow_dispatch:

concurrency:
  group: evals-${{ github.head_ref }}
  cancel-in-progress: true

env:
  IMAGE: ghcr.io/${{ github.repository }}/ci
  EVALS_TIER: gate

jobs:
  # Build Docker image with pre-baked toolchain (cached; only rebuilds on Dockerfile/lockfile change)
  build-image:
    runs-on: ubicloud-standard-8
    permissions:
      contents: read
      packages: write
    outputs:
      image-tag: ${{ steps.meta.outputs.tag }}
    steps:
      - uses: actions/checkout@v4

      - id: meta
        run: echo "tag=${{ env.IMAGE }}:${{ hashFiles('.github/docker/Dockerfile.ci', 'package.json', 'bun.lock') }}" >> "$GITHUB_OUTPUT"

      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Check if image exists
        id: check
        run: |
          if docker manifest inspect ${{ steps.meta.outputs.tag }} > /dev/null 2>&1; then
            echo "exists=true" >> "$GITHUB_OUTPUT"
          else
            echo "exists=false" >> "$GITHUB_OUTPUT"
          fi

      - if: steps.check.outputs.exists == 'false'
        run: cp package.json bun.lock .github/docker/

      - if: steps.check.outputs.exists == 'false'
        uses: docker/build-push-action@v6
        with:
          context: .github/docker
          file: .github/docker/Dockerfile.ci
          push: true
          tags: |
            ${{ steps.meta.outputs.tag }}
            ${{ env.IMAGE }}:latest

  evals:
    runs-on: ${{ matrix.suite.runner || 'ubicloud-standard-8' }}
    needs: build-image
    container:
      image: ${{ needs.build-image.outputs.image-tag }}
      credentials:
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
      options: --user runner
    timeout-minutes: 25
    strategy:
      fail-fast: false
      matrix:
        suite:
          - name: llm-judge
            file: test/skill-llm-eval.test.ts
          - name: e2e-browse
            file: test/skill-e2e-bws.test.ts
            runner: ubicloud-standard-8
          - name: e2e-plan
            file: test/skill-e2e-plan.test.ts
          - name: e2e-deploy
            file: test/skill-e2e-deploy.test.ts
          - name: e2e-design
            file: test/skill-e2e-design.test.ts
          - name: e2e-qa-bugs
            file: test/skill-e2e-qa-bugs.test.ts
          - name: e2e-qa-workflow
            file: test/skill-e2e-qa-workflow.test.ts
          - name: e2e-review
            file: test/skill-e2e-review.test.ts
          - name: e2e-workflow
            file: test/skill-e2e-workflow.test.ts
          - name: e2e-routing
            file: test/skill-routing-e2e.test.ts
          - name: e2e-codex
            file: test/codex-e2e.test.ts
          - name: e2e-gemini
            file: test/gemini-e2e.test.ts
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      # Bun creates root-owned temp dirs during Docker build. GH Actions runs as
      # runner user with HOME=/github/home. Redirect bun's cache to a writable dir.
      - name: Fix bun temp
        run: |
          mkdir -p /home/runner/.cache/bun
          {
            echo "BUN_INSTALL_CACHE_DIR=/home/runner/.cache/bun"
            echo "BUN_TMPDIR=/home/runner/.cache/bun"
            echo "TMPDIR=/home/runner/.cache"
          } >> "$GITHUB_ENV"

      # Restore pre-installed node_modules from Docker image via recursive
      # copy. Symlink (`ln -s`) breaks bun's module resolution because bun
      # resolves a file's realpath when walking up to find node_modules/<dep>;
      # from a symlinked path, realpath escapes the workspace and sibling
      # deps no longer resolve. Hardlink copy (`cp -al`) fails because /opt
      # and /workspace are on different overlay-fs layers ("Invalid
      # cross-device link"). Recursive copy works on every layout. Cost:
      # ~5s for ~200 packages of small JS files vs ~0s for symlink, still
      # vastly cheaper than rerunning `bun install` (network + resolution).
      - name: Restore deps
        run: |
          if [ -d /opt/node_modules_cache ] && diff -q /opt/node_modules_cache/.package.json package.json >/dev/null 2>&1; then
            cp -r /opt/node_modules_cache node_modules
          else
            bun install
          fi

      - run: bun run build

      # Verify Playwright can launch Chromium (fails fast if sandbox/deps are broken)
      - name: Verify Chromium
        if: matrix.suite.name == 'e2e-browse'
        run: |
          echo "whoami=$(whoami) HOME=$HOME TMPDIR=${TMPDIR:-unset}"
          touch /tmp/.bun-test && rm /tmp/.bun-test && echo "/tmp writable"
          bun -e "import {chromium} from 'playwright';const b=await chromium.launch({args:['--no-sandbox']});console.log('Chromium OK');await b.close()"

      - name: Run ${{ matrix.suite.name }}
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
          EVALS_CONCURRENCY: "40"
          PLAYWRIGHT_BROWSERS_PATH: /opt/playwright-browsers
        run: EVALS=1 bun test --retry 2 --concurrent --max-concurrency 40 ${{ matrix.suite.file }}

      - name: Upload eval results
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: eval-${{ matrix.suite.name }}
          path: ~/.gstack-dev/evals/*.json
          retention-days: 90

  report:
    runs-on: ubicloud-standard-8
    needs: evals
    if: always() && github.event_name == 'pull_request'
    timeout-minutes: 5
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 1

      - name: Download all eval artifacts
        uses: actions/download-artifact@v4
        with:
          pattern: eval-*
          path: /tmp/eval-results
          merge-multiple: true

      - name: Post PR comment
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # shellcheck disable=SC2086,SC2059
          RESULTS=$(find /tmp/eval-results -name '*.json' 2>/dev/null | sort)
          if [ -z "$RESULTS" ]; then
            echo "No eval results found"
            exit 0
          fi

          TOTAL=0; PASSED=0; FAILED=0; COST="0"
          SUITE_LINES=""
          for f in $RESULTS; do
            if ! jq -e '.total_tests' "$f" >/dev/null 2>&1; then
              echo "Skipping malformed JSON: $f"
              continue
            fi
            T=$(jq -r '.total_tests // 0' "$f")
            P=$(jq -r '.passed // 0' "$f")
            F=$(jq -r '.failed // 0' "$f")
            C=$(jq -r '.total_cost_usd // 0' "$f")
            TIER=$(jq -r '.tier // "unknown"' "$f")
            [ "$T" -eq 0 ] && continue
            TOTAL=$((TOTAL + T))
            PASSED=$((PASSED + P))
            FAILED=$((FAILED + F))
            COST=$(echo "$COST + $C" | bc)
            STATUS_ICON="✅"
            [ "$F" -gt 0 ] && STATUS_ICON="❌"
            SUITE_LINES="${SUITE_LINES}| ${TIER} | ${P}/${T} | ${STATUS_ICON} | \$${C} |\n"
          done

          STATUS="✅ PASS"
          [ "$FAILED" -gt 0 ] && STATUS="❌ FAIL"

          BODY="## E2E Evals: ${STATUS}

          **${PASSED}/${TOTAL}** tests passed | **\$${COST}** total cost | **12 parallel runners**

          | Suite | Result | Status | Cost |
          |-------|--------|--------|------|
          $(echo -e "$SUITE_LINES")

          ---
          *12x ubicloud-standard-8 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite*"

          if [ "$FAILED" -gt 0 ]; then
            FAILURES=""
            for f in $RESULTS; do
              if ! jq -e '.failed' "$f" >/dev/null 2>&1; then continue; fi
              F=$(jq -r '.failed // 0' "$f")
              [ "$F" -eq 0 ] && continue
              FAILS=$(jq -r '.tests[] | select(.passed == false) | "- ❌ \(.name): \(.exit_reason // "unknown")"' "$f" 2>/dev/null || echo "- ⚠️ $(basename "$f"): parse error")
              FAILURES="${FAILURES}${FAILS}\n"
            done
            BODY="${BODY}

          ### Failures
          $(echo -e "$FAILURES")"
          fi

          # Update existing comment or create new one
          COMMENT_ID=$(gh api repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/comments \
            --jq '.[] | select(.body | startswith("## E2E Evals")) | .id' | tail -1)

          if [ -n "$COMMENT_ID" ]; then
            gh api "repos/${{ github.repository }}/issues/comments/${COMMENT_ID}" \
              -X PATCH -f body="$BODY"
          else
            gh pr comment "${{ github.event.pull_request.number }}" --body "$BODY"
          fi
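
The "Restore deps" step in the workflow above reuses a dependency tree baked into the image only when the manifest snapshot stored next to it still matches the checked-out `package.json`. A minimal local sketch of that decision follows. The function name `restore_deps`, its arguments, and the status messages are hypothetical; the workflow itself hard-codes `/opt/node_modules_cache` and falls back to `bun install`.

```shell
set -eu

# restore_deps CACHE_DIR WORKSPACE FALLBACK_CMD...   (hypothetical helper)
# Copies CACHE_DIR to WORKSPACE/node_modules when CACHE_DIR/.package.json
# is identical to WORKSPACE/package.json; otherwise runs the fallback
# command inside WORKSPACE. Assumes WORKSPACE/node_modules does not exist
# yet, as in a fresh checkout.
restore_deps() {
  local cache="$1" workspace="$2"
  shift 2

  if [ -d "$cache" ] && diff -q "$cache/.package.json" "$workspace/package.json" >/dev/null 2>&1; then
    # Recursive copy rather than symlink or hardlink, for the reasons
    # given in the workflow comment.
    cp -r "$cache" "$workspace/node_modules"
    echo "restored from cache"
  else
    (cd "$workspace" && "$@") || return 1
    echo "ran fallback install"
  fi
}
```

A call such as `restore_deps /opt/node_modules_cache "$PWD" bun install` would mirror the step. A missing snapshot file makes `diff` exit non-zero, so a stale or incomplete cache takes the fallback path instead of being copied.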