Commit Graph

3 Commits

Author SHA1 Message Date
Jamkris
116e61d8cb fix(lib): retry rename on Windows EPERM/EACCES/EBUSY in writeBridgeAtomic
PR #1983 round 1 introduced unique-suffix tmp paths so two concurrent
writers no longer share a single `.tmp` file. That fix is correct
under POSIX semantics — `rename(2)` is atomic between source and
destination, so each writer renames onto the same target without
conflict.

Windows `MoveFileExW` is not the same. It fails with
EPERM / EACCES / EBUSY when the target is currently being renamed
by *another* process — a short race window that fires reliably under
this hook's PostToolUse + statusline concurrency. Round 1's CI run
made this visible:

  Test (windows-latest, Node 18.x, npm) — FAILURE
  Error: EPERM: operation not permitted, rename
    'C:\…\ecc-metrics-test-bridge-race-….json.9504.4aef575a.tmp' ->
    'C:\…\ecc-metrics-test-bridge-race-….json'
      at writeBridgeAtomic (scripts/lib/session-bridge.js:79:8)

All nine Windows matrix cells (Node 18 / 20 / 22 × npm / pnpm / yarn)
hit the same path. POSIX matrices (Linux + macOS) passed unchanged.

Fix: extract a `renameWithRetry(tmp, target)` helper that retries
`fs.renameSync` up to 5 times on EPERM / EACCES / EBUSY with
exponential backoff (20 ms → 320 ms total). Other error codes
(ENOENT, ENOSPC, EROFS, …) re-throw on the first attempt — they are
not transient. POSIX runs hit the first try and exit immediately.

The backoff uses `Atomics.wait` on a throwaway `SharedArrayBuffer`
so the retry path does not busy-spin the CPU; verified on Node ≥ 17
that this works on the main thread. There is a `try/catch` fallback
to a brief busy-wait for older runtimes where `Atomics.wait` is
restricted to workers.

`writeBridgeAtomic` calls the helper instead of `fs.renameSync` and
keeps its existing best-effort tmp cleanup on terminal failure.

`renameWithRetry` is added to `module.exports` so the companion
`writeWarnState` in `scripts/hooks/ecc-context-monitor.js` can
adopt the same retry policy without duplicating the helper. That
adoption lands in the next commit.

Local: `node tests/lib/session-bridge.test.js` 14/14, `yarn test`
green, `yarn lint` clean. The round-1 test (two concurrent child
writers, 200 iterations each) now passes on macOS without retrying
at all (POSIX path) and is expected to pass on Windows via the new
retry loop.
2026-05-19 04:57:10 -04:00
Jamkris
28548f67ba fix(lib): use unique tmp suffix in writeBridgeAtomic to eliminate ENOENT race
`writeBridgeAtomic` wrote to a fixed `${target}.tmp` path before
calling `renameSync`. When two processes write to the same session
bridge concurrently (e.g. PostToolUse `ecc-metrics-bridge` + the
background `ecc-statusline`, both calling `writeBridgeAtomic(sessionId, ...)`),
the canonical atomic-rename race fires:

  1. Process A: writeFileSync(target.tmp, JSON_A) — tmp file exists.
  2. Process B: writeFileSync(target.tmp, JSON_B) — tmp file overwritten.
  3. Process A: renameSync(target.tmp, target) — succeeds; target = JSON_B
     (A's payload silently corrupted en-route).
  4. Process B: renameSync(target.tmp, target) — throws ENOENT (the
     rename consumed the file).

Every caller in the repo wraps `writeBridgeAtomic` in `try {} catch {}`,
so the ENOENT exception is swallowed and the user-visible symptom is
just "the bridge file occasionally contains the wrong process's
payload" with no diagnostic.

Reproduced before this commit:

  $ # two concurrent writers, each calling writeBridgeAtomic 500 times
  $ # against the same session ID
  [A] errors=244   # 244 ENOENT exceptions swallowed
  [B] errors=248   # ditto

After this commit the same workload reports 0 errors in both
subprocesses: tmp paths no longer collide.

Fix: change `${target}.tmp` to
`${target}.${process.pid}.${crypto.randomBytes(4).toString('hex')}.tmp`,
matching the pattern already used by `writeCostWarningIfChanged` in
`scripts/hooks/ecc-metrics-bridge.js` (commit 9b1d8918). The pid +
4-byte nonce gives each writer process a distinct tmp path, so step 2
above no longer overwrites step 1's payload and step 4 no longer
races step 3.

Also added: on `renameSync` failure, attempt `fs.unlinkSync(tmp)` so
a writer that fails (disk full, permission, parent dir gone) does
not leak its tmp file. The cleanup is best-effort and the original
error is still re-thrown.

**Scope clarification.** This commit closes the atomic-rename
primitive's race only. The *read-modify-write* race in callers —
two writers each read the same bridge state, increment, and write
back, the second clobbering the first — is a separate concern that
needs locking or per-writer logs, and is intentionally out of scope
for this PR. The cost-tracker / metrics-bridge callers tolerate
last-writer-wins on their cumulative aggregates today and this
commit does not change that contract.

The companion `writeWarnState` in `ecc-context-monitor.js` has the
same fixed-suffix pattern and the same race; that fix lands in the
next commit so each can be reviewed against its own diff.
2026-05-19 04:57:10 -04:00
Affaan Mustafa
940135ea47 feat: add ECC statusline observability hooks
Salvages the useful statusline/context monitor work from stale PR #1504 while preserving the current continuous-learning hook runner wiring.

Adds the metrics bridge, context monitor, statusline script, shared cost/session bridge utilities, and tests. Fixes the reviewed false loop-detection hash collision for non-file tools, avoids default-session cost inflation, sanitizes statusline task lookup, and records hook payload session IDs in cost-tracker.
2026-05-11 23:44:06 -04:00