mirror of
https://github.com/garrytan/gstack.git
synced 2026-05-17 17:51:27 +08:00
feat: test coverage catalog — shared audit across plan/ship/review (v0.10.1.0) (#259)
* refactor: extract {{TEST_COVERAGE_AUDIT}} shared resolver
DRY extraction of the test coverage audit methodology into a shared
generator function with three explicit placeholders:
- TEST_COVERAGE_AUDIT_PLAN (plan-eng-review)
- TEST_COVERAGE_AUDIT_SHIP (ship)
- TEST_COVERAGE_AUDIT_REVIEW (review)
Shared across all modes: codepath tracing, ASCII diagram format,
quality scoring rubric, E2E test decision matrix, regression rule,
and test framework detection via CLAUDE.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: plan-eng-review uses shared test coverage audit
Replace the thin 6-line Section 3 test review with the full shared
methodology via {{TEST_COVERAGE_AUDIT_PLAN}}. Plan mode now:
- Traces every codepath with full ASCII diagrams
- Adds missing tests to the plan (not just "check for tests")
- Writes test plan artifact for /qa consumption
- Includes E2E/eval recommendations and regression detection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* refactor: ship uses shared test coverage audit
Replace 135 lines of inline Step 3.4 methodology with
{{TEST_COVERAGE_AUDIT_SHIP}}. Functionally identical output plus:
- E2E test decision matrix (marks paths needing E2E vs unit)
- Eval recommendations for LLM prompt changes
- Regression detection iron rule
- Test framework detection via CLAUDE.md first
- Test plan artifact for /qa consumption
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: /review Step 4.75 test coverage diagram
Add codepath tracing to the pre-landing review via
{{TEST_COVERAGE_AUDIT_REVIEW}}. Review mode:
- Produces ASCII coverage diagram (same methodology as plan/ship)
- Generates tests for gaps via Fix-First (ASK user)
- Subsumes Pass 2 "Test Gaps" checklist category
- Gaps are INFORMATIONAL findings
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: mode differentiation + regression guard for coverage audit
10 new tests verifying the three TEST_COVERAGE_AUDIT placeholders:
- All modes share: codepath tracing, E2E matrix, regression rule
- Plan mode: adds to plan + artifact, no ship-specific content
- Ship mode: auto-generates + before/after count + coverage summary
- Review mode: Fix-First ASK + INFORMATIONAL, no artifact
- Regression guard: ship SKILL.md preserves all key phrases
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* test: extract shared coverage audit fixture + review E2E
- Extract billing.ts fixture into coverage-audit-fixture.ts (DRY)
- Refactor ship-coverage-audit E2E to use shared fixture
- Add review-coverage-audit E2E for Step 4.75
- Update touchfiles: both E2Es depend on shared fixture
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: strengthen E2E assertions for coverage audit tests
The coverage audit E2E tests (ship + review) were only asserting
exitReason === 'success' and readCalls > 0 — they passed even
if the agent produced no coverage diagram. Add assertion that
the output contains either GAP or TESTED markers.
Found during /review.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: plan mode traces the plan, not the git diff
Codex adversarial review caught that plan-eng-review was inheriting
"git diff origin/<base>...HEAD" from the shared resolver, but plan mode
reviews a plan document, not a code diff. Plan mode now says:
"Trace every codepath in the plan" and "Read the plan document."
Ship and review modes keep the git diff instruction.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.9.5.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* feat: test coverage catalog + failure triage (merged branches) (#285)
* feat: add bin/gstack-repo-mode — solo vs collaborative detection with caching
Detects whether a repo is solo-dev (one person does 80%+ of recent commits)
or collaborative. Uses 90-day git shortlog window with 7-day cache in
~/.gstack/projects/{SLUG}/repo-mode.json. Config override via
`gstack-config set repo_mode solo|collaborative` takes precedence over
the heuristic. Minimum 5 commits required to classify (otherwise unknown).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* feat: test failure ownership triage — see something say something
Adds two new preamble sections to all gstack skills:
- Repo Ownership Mode: explains solo vs collaborative behavior
- See Something, Say Something: proactive issue flagging principle
Adds {{TEST_FAILURE_TRIAGE}} template variable (opt-in, used by /ship):
- Classifies test failures as in-branch vs pre-existing
- Solo mode defaults to "investigate and fix now"
- Collaborative mode offers "blame + assign GitHub issue" option
- Also offers P0 TODO and skip options
/ship Step 3 now triages test failures instead of hard-stopping on all
failures. In-branch failures still block shipping. Pre-existing failures
get user-directed triage based on repo mode.
Adds P2 TODO for gstack notes system (deferred lightweight reminder).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate SKILL.md files for Claude and Codex hosts
All 22 Claude skills and 21 Codex skills regenerated with new preamble
sections (Repo Ownership Mode, See Something Say Something) and
{{TEST_FAILURE_TRIAGE}} resolved in ship/SKILL.md.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: validate repo mode values to prevent shell injection
Codex adversarial review found that unvalidated config/cache values
could be injected into shell via source <(gstack-repo-mode). Added
validate_mode() that only allows solo|collaborative|unknown — anything
else becomes "unknown". Prevents persistent code execution through
malicious config.yaml or tampered cache JSON.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: shell injection via branch names + feature-branch sampling bias
Codex code review found two issues:
P1: eval $(gstack-slug) in gstack-repo-mode executes branch names as
shell. Branch names like foo$(touch${IFS}pwned) are valid git refs and
would execute arbitrary commands. Fix: compute SLUG directly with sed
instead of eval'ing gstack-slug output.
P2: git shortlog HEAD only sees current branch history. On feature
branches that haven't merged main recently, other contributors disappear
from the sample. Fix: use git shortlog on the default branch
(origin/main) instead of HEAD.
Also improved blame lookup in collaborative triage to check both the
test file and the production code it covers.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: broaden codex-host stripping test to accommodate triage section
"Investigate and fix" now appears in TEST_FAILURE_TRIAGE (not just the
Codex review step). Use CODEX_REVIEWS config string as a more specific
marker for detecting the Codex review step in Codex-hosted skills.
* fix: replace template placeholder in TODOS.md with readable text
{{TEST_FAILURE_TRIAGE}} is template syntax but TODOS.md is not processed
by gen-skill-docs — replaced with human-readable reference.
* chore: bump version and changelog (v0.9.5.0)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* docs: add bin/ directory to project structure in CLAUDE.md
* test: add triage resolver unit tests, plan-eng coverage audit E2E, and triage E2E
- TEST_FAILURE_TRIAGE resolver: 6 unit tests verifying all triage steps (T1-T4),
REPO_MODE branching, and safety default for ambiguous failures
- plan-eng-coverage-audit E2E: tests /plan-eng-review coverage audit codepath
(gap identified during eng review — existed on neither branch)
- ship-triage E2E: planted-bug fixture with in-branch (truncate null) and
pre-existing (divide-by-zero) failures; verifies correct classification
- Touchfile entries for diff-based test selection
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: regenerate stale Codex SKILL.md for retro
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: gstack-repo-mode handles repos without origin remote
Split `git remote get-url origin` into a separate variable with `|| true`
so the script doesn't crash under `set -euo pipefail` in local-only repos.
Falls back to REPO_MODE=unknown gracefully.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: REPO_MODE defaults to unknown when helper emits nothing
Changed preamble from `source <(...) || REPO_MODE=unknown` (which doesn't
catch empty output) to `source <(...) || true` followed by
`REPO_MODE=${REPO_MODE:-unknown}`. Regenerated all SKILL.md files.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: triage E2E runs both test files in subprocesses
math.test.js called process.exit(1) which killed the runner before
string.test.js could execute. Changed test runner to use child_process
so each test runs independently and both failure classes are exercised.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: gstack-repo-mode handles repos without origin remote
Fall back through origin/main → origin/master → HEAD when
git symbolic-ref refs/remotes/origin/HEAD is not set. Prevents
shortlog crash in repos where origin/HEAD isn't configured.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: triage E2E runs both test files in subprocesses
Add assertions verifying both math.test.js (pre-existing failure) and
string.test.js (in-branch failure) actually executed during triage.
Prevents false passes where only one failure class is exercised.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* fix: REPO_MODE defaults to unknown when helper emits nothing
- Remove head -20 truncation that biased solo classification by
dropping low-volume contributors from the denominator
- Use atomic write (mktemp + mv) for cache to prevent concurrent
preamble reads from seeing partial JSON
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: add test coverage catalog to CHANGELOG + update project structure
- CHANGELOG: add 6 entries for coverage audit, review Step 4.75, E2E
recommendations, regression iron rule, failure triage, repo-mode fix
- CLAUDE.md: add missing skill directories (autoplan, benchmark, canary,
codex, land-and-deploy, setup-deploy) to project structure
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* chore: bump version and changelog (v0.10.1.0)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* docs: CHANGELOG rules — branch-scoped versions, never fold into old entries
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
221
ship/SKILL.md
221
ship/SKILL.md
@@ -32,6 +32,9 @@ _PROACTIVE=$(~/.claude/skills/gstack/bin/gstack-config get proactive 2>/dev/null
|
||||
_BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
|
||||
echo "BRANCH: $_BRANCH"
|
||||
echo "PROACTIVE: $_PROACTIVE"
|
||||
source <(~/.claude/skills/gstack/bin/gstack-repo-mode 2>/dev/null) || true
|
||||
REPO_MODE=${REPO_MODE:-unknown}
|
||||
echo "REPO_MODE: $REPO_MODE"
|
||||
_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
|
||||
echo "LAKE_INTRO: $_LAKE_SEEN"
|
||||
_TEL=$(~/.claude/skills/gstack/bin/gstack-config get telemetry 2>/dev/null || true)
|
||||
@@ -132,6 +135,18 @@ AI-assisted coding makes the marginal cost of completeness near-zero. When you p
|
||||
- BAD: "Let's defer test coverage to a follow-up PR." (Tests are the cheapest lake to boil.)
|
||||
- BAD: Quoting only human-team effort: "This would take 2 weeks." (Say: "2 weeks human / ~1 hour CC.")
|
||||
|
||||
## Repo Ownership Mode — See Something, Say Something
|
||||
|
||||
`REPO_MODE` from the preamble tells you who owns issues in this repo:
|
||||
|
||||
- **`solo`** — One person does 80%+ of the work. They own everything. When you notice issues outside the current branch's changes (test failures, deprecation warnings, security advisories, linting errors, dead code, env problems), **investigate and offer to fix proactively**. The solo dev is the only person who will fix it. Default to action.
|
||||
- **`collaborative`** — Multiple active contributors. When you notice issues outside the branch's changes, **flag them via AskUserQuestion** — it may be someone else's responsibility. Default to asking, not fixing.
|
||||
- **`unknown`** — Treat as collaborative (safer default — ask before fixing).
|
||||
|
||||
**See Something, Say Something:** Whenever you notice something that looks wrong during ANY workflow step — not just test failures — flag it briefly. One sentence: what you noticed and its impact. In solo mode, follow up with "Want me to fix it?" In collaborative mode, just flag it and move on.
|
||||
|
||||
Never let a noticed issue silently pass. The whole point is proactive communication.
|
||||
|
||||
## Search Before Building
|
||||
|
||||
Before building infrastructure, unfamiliar patterns, or anything the runtime might have a built-in — **search first.** Read `~/.claude/skills/gstack/ETHOS.md` for the full philosophy.
|
||||
@@ -268,7 +283,7 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat
|
||||
**Only stop for:**
|
||||
- On the base branch (abort)
|
||||
- Merge conflicts that can't be auto-resolved (stop, show conflicts)
|
||||
- Test failures (stop, show failures)
|
||||
- In-branch test failures (pre-existing failures are triaged, not auto-blocking)
|
||||
- Pre-landing review finds ASK items that need user judgment
|
||||
- MINOR or MAJOR version bump needed (ask — see Step 4)
|
||||
- Greptile review comments that need user decision (complex fixes, false positives)
|
||||
@@ -552,7 +567,105 @@ wait
|
||||
|
||||
After both complete, read the output files and check pass/fail.
|
||||
|
||||
**If any test fails:** Show the failures and **STOP**. Do not proceed.
|
||||
**If any test fails:** Do NOT immediately stop. Apply the Test Failure Ownership Triage:
|
||||
|
||||
## Test Failure Ownership Triage
|
||||
|
||||
When tests fail, do NOT immediately stop. First, determine ownership:
|
||||
|
||||
### Step T1: Classify each failure
|
||||
|
||||
For each failing test:
|
||||
|
||||
1. **Get the files changed on this branch:**
|
||||
```bash
|
||||
git diff origin/<base>...HEAD --name-only
|
||||
```
|
||||
|
||||
2. **Classify the failure:**
|
||||
- **In-branch** if: the failing test file itself was modified on this branch, OR the test output references code that was changed on this branch, OR you can trace the failure to a change in the branch diff.
|
||||
- **Likely pre-existing** if: neither the test file nor the code it tests was modified on this branch, AND the failure is unrelated to any branch change you can identify.
|
||||
- **When ambiguous, default to in-branch.** It is safer to stop the developer than to let a broken test ship. Only classify as pre-existing when you are confident.
|
||||
|
||||
This classification is heuristic — use your judgment reading the diff and the test output. You do not have a programmatic dependency graph.
|
||||
|
||||
### Step T2: Handle in-branch failures
|
||||
|
||||
**STOP.** These are your failures. Show them and do not proceed. The developer must fix their own broken tests before shipping.
|
||||
|
||||
### Step T3: Handle pre-existing failures
|
||||
|
||||
Check `REPO_MODE` from the preamble output.
|
||||
|
||||
**If REPO_MODE is `solo`:**
|
||||
|
||||
Use AskUserQuestion:
|
||||
|
||||
> These test failures appear pre-existing (not caused by your branch changes):
|
||||
>
|
||||
> [list each failure with file:line and brief error description]
|
||||
>
|
||||
> Since this is a solo repo, you're the only one who will fix these.
|
||||
>
|
||||
> RECOMMENDATION: Choose A — fix now while the context is fresh. Completeness: 9/10.
|
||||
> A) Investigate and fix now (human: ~2-4h / CC: ~15min) — Completeness: 10/10
|
||||
> B) Add as P0 TODO — fix after this branch lands — Completeness: 7/10
|
||||
> C) Skip — I know about this, ship anyway — Completeness: 3/10
|
||||
|
||||
**If REPO_MODE is `collaborative` or `unknown`:**
|
||||
|
||||
Use AskUserQuestion:
|
||||
|
||||
> These test failures appear pre-existing (not caused by your branch changes):
|
||||
>
|
||||
> [list each failure with file:line and brief error description]
|
||||
>
|
||||
> This is a collaborative repo — these may be someone else's responsibility.
|
||||
>
|
||||
> RECOMMENDATION: Choose B — assign it to whoever broke it so the right person fixes it. Completeness: 9/10.
|
||||
> A) Investigate and fix now anyway — Completeness: 10/10
|
||||
> B) Blame + assign GitHub issue to the author — Completeness: 9/10
|
||||
> C) Add as P0 TODO — Completeness: 7/10
|
||||
> D) Skip — ship anyway — Completeness: 3/10
|
||||
|
||||
### Step T4: Execute the chosen action
|
||||
|
||||
**If "Investigate and fix now":**
|
||||
- Switch to /investigate mindset: root cause first, then minimal fix.
|
||||
- Fix the pre-existing failure.
|
||||
- Commit the fix separately from the branch's changes: `git commit -m "fix: pre-existing test failure in <test-file>"`
|
||||
- Continue with the workflow.
|
||||
|
||||
**If "Add as P0 TODO":**
|
||||
- If `TODOS.md` exists, add the entry following the format in `review/TODOS-format.md` (or `.claude/skills/review/TODOS-format.md`).
|
||||
- If `TODOS.md` does not exist, create it with the standard header and add the entry.
|
||||
- Entry should include: title, the error output, which branch it was noticed on, and priority P0.
|
||||
- Continue with the workflow — treat the pre-existing failure as non-blocking.
|
||||
|
||||
**If "Blame + assign GitHub issue" (collaborative only):**
|
||||
- Find who likely broke it. Check BOTH the test file AND the production code it tests:
|
||||
```bash
|
||||
# Who last touched the failing test?
|
||||
git log --format="%an (%ae)" -1 -- <failing-test-file>
|
||||
# Who last touched the production code the test covers? (often the actual breaker)
|
||||
git log --format="%an (%ae)" -1 -- <source-file-under-test>
|
||||
```
|
||||
If these are different people, prefer the production code author — they likely introduced the regression.
|
||||
- Create a GitHub issue assigned to that person:
|
||||
```bash
|
||||
gh issue create \
|
||||
--title "Pre-existing test failure: <test-name>" \
|
||||
--body "Found failing on branch <current-branch>. Failure is pre-existing.\n\n**Error:**\n```\n<first 10 lines>\n```\n\n**Last modified by:** <author>\n**Noticed by:** gstack /ship on <date>" \
|
||||
--assignee "<github-username>"
|
||||
```
|
||||
- If `gh` is not available or `--assignee` fails (user not in org, etc.), create the issue without assignee and note who should look at it in the body.
|
||||
- Continue with the workflow.
|
||||
|
||||
**If "Skip":**
|
||||
- Continue with the workflow.
|
||||
- Note in output: "Pre-existing test failure skipped: <test-name>"
|
||||
|
||||
**After triage:** If any in-branch failures remain unfixed, **STOP**. Do not proceed. If all failures were pre-existing and handled (fixed, TODOed, assigned, or skipped), continue to Step 3.25.
|
||||
|
||||
**If all pass:** Continue silently — just note the counts briefly.
|
||||
|
||||
@@ -624,6 +737,27 @@ If multiple suites need to run, run them sequentially (each needs a test lane).
|
||||
|
||||
100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.
|
||||
|
||||
### Test Framework Detection
|
||||
|
||||
Before analyzing coverage, detect the project's test framework:
|
||||
|
||||
1. **Read CLAUDE.md** — look for a `## Testing` section with test command and framework name. If found, use that as the authoritative source.
|
||||
2. **If CLAUDE.md has no testing section, auto-detect:**
|
||||
|
||||
```bash
|
||||
# Detect project runtime
|
||||
[ -f Gemfile ] && echo "RUNTIME:ruby"
|
||||
[ -f package.json ] && echo "RUNTIME:node"
|
||||
[ -f requirements.txt ] || [ -f pyproject.toml ] && echo "RUNTIME:python"
|
||||
[ -f go.mod ] && echo "RUNTIME:go"
|
||||
[ -f Cargo.toml ] && echo "RUNTIME:rust"
|
||||
# Check for existing test infrastructure
|
||||
ls jest.config.* vitest.config.* playwright.config.* cypress.config.* .rspec pytest.ini phpunit.xml 2>/dev/null
|
||||
ls -d test/ tests/ spec/ __tests__/ cypress/ e2e/ 2>/dev/null
|
||||
```
|
||||
|
||||
3. **If no framework detected:** falls through to the Test Framework Bootstrap step (Step 2.5) which handles full setup.
|
||||
|
||||
**0. Before/after test count:**
|
||||
|
||||
```bash
|
||||
@@ -686,9 +820,41 @@ Quality scoring rubric:
|
||||
- ★★ Tests correct behavior, happy path only
|
||||
- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")
|
||||
|
||||
### E2E Test Decision Matrix
|
||||
|
||||
When checking each branch, also determine whether a unit test or E2E/integration test is the right tool:
|
||||
|
||||
**RECOMMEND E2E (mark as [→E2E] in the diagram):**
|
||||
- Common user flow spanning 3+ components/services (e.g., signup → verify email → first login)
|
||||
- Integration point where mocking hides real failures (e.g., API → queue → worker → DB)
|
||||
- Auth/payment/data-destruction flows — too important to trust unit tests alone
|
||||
|
||||
**RECOMMEND EVAL (mark as [→EVAL] in the diagram):**
|
||||
- Critical LLM call that needs a quality eval (e.g., prompt change → test output still meets quality bar)
|
||||
- Changes to prompt templates, system instructions, or tool definitions
|
||||
|
||||
**STICK WITH UNIT TESTS:**
|
||||
- Pure function with clear inputs/outputs
|
||||
- Internal helper with no side effects
|
||||
- Edge case of a single function (null input, empty array)
|
||||
- Obscure/rare flow that isn't customer-facing
|
||||
|
||||
### REGRESSION RULE (mandatory)
|
||||
|
||||
**IRON RULE:** When the coverage audit identifies a REGRESSION — code that previously worked but the diff broke — a regression test is written immediately. No AskUserQuestion. No skipping. Regressions are the highest-priority test because they prove something broke.
|
||||
|
||||
A regression is when:
|
||||
- The diff modifies existing behavior (not new code)
|
||||
- The existing test suite (if any) doesn't cover the changed path
|
||||
- The change introduces a new failure mode for existing callers
|
||||
|
||||
When uncertain whether a change is a regression, err on the side of writing the test.
|
||||
|
||||
Format: commit as `test: regression test for {what broke}`
|
||||
|
||||
**4. Output ASCII coverage diagram:**
|
||||
|
||||
Include BOTH code paths and user flows in the same diagram:
|
||||
Include BOTH code paths and user flows in the same diagram. Mark E2E-worthy and eval-worthy paths:
|
||||
|
||||
```
|
||||
CODE PATH COVERAGE
|
||||
@@ -709,9 +875,9 @@ USER FLOW COVERAGE
|
||||
[+] Payment checkout flow
|
||||
│
|
||||
├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
|
||||
├── [GAP] Double-click submit — NO TEST
|
||||
├── [GAP] Navigate away during payment — NO TEST
|
||||
└── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40
|
||||
├── [GAP] [→E2E] Double-click submit — needs E2E, not just unit
|
||||
├── [GAP] Navigate away during payment — unit test sufficient
|
||||
└── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40
|
||||
|
||||
[+] Error states
|
||||
│
|
||||
@@ -719,12 +885,16 @@ USER FLOW COVERAGE
|
||||
├── [GAP] Network timeout UX (what does user see?) — NO TEST
|
||||
└── [GAP] Empty cart submission — NO TEST
|
||||
|
||||
[+] LLM integration
|
||||
│
|
||||
└── [GAP] [→EVAL] Prompt template change — needs eval test
|
||||
|
||||
─────────────────────────────────
|
||||
COVERAGE: 5/12 paths tested (42%)
|
||||
COVERAGE: 5/13 paths tested (38%)
|
||||
Code paths: 3/5 (60%)
|
||||
User flows: 2/7 (29%)
|
||||
User flows: 2/8 (25%)
|
||||
QUALITY: ★★★: 2 ★★: 2 ★: 1
|
||||
GAPS: 7 paths need tests
|
||||
GAPS: 8 paths need tests (2 need E2E, 1 needs eval)
|
||||
─────────────────────────────────
|
||||
```
|
||||
|
||||
@@ -736,6 +906,8 @@ If test framework detected (or bootstrapped in Step 2.5):
|
||||
- Prioritize error handlers and edge cases first (happy paths are more likely already tested)
|
||||
- Read 2-3 existing test files to match conventions exactly
|
||||
- Generate unit tests. Mock all external dependencies (DB, API, Redis).
|
||||
- For paths marked [→E2E]: generate integration/E2E tests using the project's E2E framework (Playwright, Cypress, Capybara, etc.)
|
||||
- For paths marked [→EVAL]: generate eval tests using the project's eval framework, or flag for manual eval if none exists
|
||||
- Write tests that exercise the specific uncovered path with real assertions
|
||||
- Run each test. Passes → commit as `test: coverage for {feature}`
|
||||
- Fails → fix once. Still fails → revert, note gap in diagram.
|
||||
@@ -756,6 +928,37 @@ find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec
|
||||
For PR body: `Tests: {before} → {after} (+{delta} new)`
|
||||
Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.`
|
||||
|
||||
### Test Plan Artifact
|
||||
|
||||
After producing the coverage diagram, write a test plan artifact so `/qa` and `/qa-only` can consume it:
|
||||
|
||||
```bash
|
||||
source <(~/.claude/skills/gstack/bin/gstack-slug 2>/dev/null) && mkdir -p ~/.gstack/projects/$SLUG
|
||||
USER=$(whoami)
|
||||
DATETIME=$(date +%Y%m%d-%H%M%S)
|
||||
```
|
||||
|
||||
Write to `~/.gstack/projects/{slug}/{user}-{branch}-ship-test-plan-{datetime}.md`:
|
||||
|
||||
```markdown
|
||||
# Test Plan
|
||||
Generated by /ship on {date}
|
||||
Branch: {branch}
|
||||
Repo: {owner/repo}
|
||||
|
||||
## Affected Pages/Routes
|
||||
- {URL path} — {what to test and why}
|
||||
|
||||
## Key Interactions to Verify
|
||||
- {interaction description} on {page}
|
||||
|
||||
## Edge Cases
|
||||
- {edge case} on {page}
|
||||
|
||||
## Critical Paths
|
||||
- {end-to-end flow that must work}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Step 3.5: Pre-Landing Review
|
||||
|
||||
@@ -27,7 +27,7 @@ You are running the `/ship` workflow. This is a **non-interactive, fully automat
|
||||
**Only stop for:**
|
||||
- On the base branch (abort)
|
||||
- Merge conflicts that can't be auto-resolved (stop, show conflicts)
|
||||
- Test failures (stop, show failures)
|
||||
- In-branch test failures (pre-existing failures are triaged, not auto-blocking)
|
||||
- Pre-landing review finds ASK items that need user judgment
|
||||
- MINOR or MAJOR version bump needed (ask — see Step 4)
|
||||
- Greptile review comments that need user decision (complex fixes, false positives)
|
||||
@@ -119,7 +119,11 @@ wait
|
||||
|
||||
After both complete, read the output files and check pass/fail.
|
||||
|
||||
**If any test fails:** Show the failures and **STOP**. Do not proceed.
|
||||
**If any test fails:** Do NOT immediately stop. Apply the Test Failure Ownership Triage:
|
||||
|
||||
{{TEST_FAILURE_TRIAGE}}
|
||||
|
||||
**After triage:** If any in-branch failures remain unfixed, **STOP**. Do not proceed. If all failures were pre-existing and handled (fixed, TODOed, assigned, or skipped), continue to Step 3.25.
|
||||
|
||||
**If all pass:** Continue silently — just note the counts briefly.
|
||||
|
||||
@@ -189,139 +193,7 @@ If multiple suites need to run, run them sequentially (each needs a test lane).
|
||||
|
||||
## Step 3.4: Test Coverage Audit
|
||||
|
||||
100% coverage is the goal — every untested path is a path where bugs hide and vibe coding becomes yolo coding. Evaluate what was ACTUALLY coded (from the diff), not what was planned.
|
||||
|
||||
**0. Before/after test count:**
|
||||
|
||||
```bash
|
||||
# Count test files before any generation
|
||||
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
|
||||
```
|
||||
|
||||
Store this number for the PR body.
|
||||
|
||||
**1. Trace every codepath changed** using `git diff origin/<base>...HEAD`:
|
||||
|
||||
Read every changed file. For each one, trace how data flows through the code — don't just list functions, actually follow the execution:
|
||||
|
||||
1. **Read the diff.** For each changed file, read the full file (not just the diff hunk) to understand context.
|
||||
2. **Trace data flow.** Starting from each entry point (route handler, exported function, event listener, component render), follow the data through every branch:
|
||||
- Where does input come from? (request params, props, database, API call)
|
||||
- What transforms it? (validation, mapping, computation)
|
||||
- Where does it go? (database write, API response, rendered output, side effect)
|
||||
- What can go wrong at each step? (null/undefined, invalid input, network failure, empty collection)
|
||||
3. **Diagram the execution.** For each changed file, draw an ASCII diagram showing:
|
||||
- Every function/method that was added or modified
|
||||
- Every conditional branch (if/else, switch, ternary, guard clause, early return)
|
||||
- Every error path (try/catch, rescue, error boundary, fallback)
|
||||
- Every call to another function (trace into it — does IT have untested branches?)
|
||||
- Every edge: what happens with null input? Empty array? Invalid type?
|
||||
|
||||
This is the critical step — you're building a map of every line of code that can execute differently based on input. Every branch in this diagram needs a test.
|
||||
|
||||
**2. Map user flows, interactions, and error states:**
|
||||
|
||||
Code coverage isn't enough — you need to cover how real users interact with the changed code. For each changed feature, think through:
|
||||
|
||||
- **User flows:** What sequence of actions does a user take that touches this code? Map the full journey (e.g., "user clicks 'Pay' → form validates → API call → success/failure screen"). Each step in the journey needs a test.
|
||||
- **Interaction edge cases:** What happens when the user does something unexpected?
|
||||
- Double-click/rapid resubmit
|
||||
- Navigate away mid-operation (back button, close tab, click another link)
|
||||
- Submit with stale data (page sat open for 30 minutes, session expired)
|
||||
- Slow connection (API takes 10 seconds — what does the user see?)
|
||||
- Concurrent actions (two tabs, same form)
|
||||
- **Error states the user can see:** For every error the code handles, what does the user actually experience?
|
||||
- Is there a clear error message or a silent failure?
|
||||
- Can the user recover (retry, go back, fix input) or are they stuck?
|
||||
- What happens with no network? With a 500 from the API? With invalid data from the server?
|
||||
- **Empty/zero/boundary states:** What does the UI show with zero results? With 10,000 results? With a single character input? With maximum-length input?
|
||||
|
||||
Add these to your diagram alongside the code branches. A user flow with no test is just as much a gap as an untested if/else.
|
||||
|
||||
**3. Check each branch against existing tests:**
|
||||
|
||||
Go through your diagram branch by branch — both code paths AND user flows. For each one, search for a test that exercises it:
|
||||
- Function `processPayment()` → look for `billing.test.ts`, `billing.spec.ts`, `test/billing_test.rb`
|
||||
- An if/else → look for tests covering BOTH the true AND false path
|
||||
- An error handler → look for a test that triggers that specific error condition
|
||||
- A call to `helperFn()` that has its own branches → those branches need tests too
|
||||
- A user flow → look for an integration or E2E test that walks through the journey
|
||||
- An interaction edge case → look for a test that simulates the unexpected action
|
||||
|
||||
Quality scoring rubric:
|
||||
- ★★★ Tests behavior with edge cases AND error paths
|
||||
- ★★ Tests correct behavior, happy path only
|
||||
- ★ Smoke test / existence check / trivial assertion (e.g., "it renders", "it doesn't throw")
|
||||
|
||||
**4. Output ASCII coverage diagram:**
|
||||
|
||||
Include BOTH code paths and user flows in the same diagram:
|
||||
|
||||
```
|
||||
CODE PATH COVERAGE
|
||||
===========================
|
||||
[+] src/services/billing.ts
|
||||
│
|
||||
├── processPayment()
|
||||
│ ├── [★★★ TESTED] Happy path + card declined + timeout — billing.test.ts:42
|
||||
│ ├── [GAP] Network timeout — NO TEST
|
||||
│ └── [GAP] Invalid currency — NO TEST
|
||||
│
|
||||
└── refundPayment()
|
||||
├── [★★ TESTED] Full refund — billing.test.ts:89
|
||||
└── [★ TESTED] Partial refund (checks non-throw only) — billing.test.ts:101
|
||||
|
||||
USER FLOW COVERAGE
|
||||
===========================
|
||||
[+] Payment checkout flow
|
||||
│
|
||||
├── [★★★ TESTED] Complete purchase — checkout.e2e.ts:15
|
||||
├── [GAP] Double-click submit — NO TEST
|
||||
├── [GAP] Navigate away during payment — NO TEST
|
||||
└── [★ TESTED] Form validation errors (checks render only) — checkout.test.ts:40
|
||||
|
||||
[+] Error states
|
||||
│
|
||||
├── [★★ TESTED] Card declined message — billing.test.ts:58
|
||||
├── [GAP] Network timeout UX (what does user see?) — NO TEST
|
||||
└── [GAP] Empty cart submission — NO TEST
|
||||
|
||||
─────────────────────────────────
|
||||
COVERAGE: 5/12 paths tested (42%)
|
||||
Code paths: 3/5 (60%)
|
||||
User flows: 2/7 (29%)
|
||||
QUALITY: ★★★: 2 ★★: 2 ★: 1
|
||||
GAPS: 7 paths need tests
|
||||
─────────────────────────────────
|
||||
```
|
||||
|
||||
**Fast path:** All paths covered → "Step 3.4: All new code paths have test coverage ✓" Continue.
|
||||
|
||||
**5. Generate tests for uncovered paths:**
|
||||
|
||||
If test framework detected (or bootstrapped in Step 2.5):
|
||||
- Prioritize error handlers and edge cases first (happy paths are more likely already tested)
|
||||
- Read 2-3 existing test files to match conventions exactly
|
||||
- Generate unit tests. Mock all external dependencies (DB, API, Redis).
|
||||
- Write tests that exercise the specific uncovered path with real assertions
|
||||
- Run each test. Passes → commit as `test: coverage for {feature}`
|
||||
- Fails → fix once. Still fails → revert, note gap in diagram.
|
||||
|
||||
Caps: 30 code paths max, 20 tests generated max (code + user flow combined), 2-min per-test exploration cap.
|
||||
|
||||
If no test framework AND user declined bootstrap → diagram only, no generation. Note: "Test generation skipped — no test framework configured."
|
||||
|
||||
**Diff is test-only changes:** Skip Step 3.4 entirely: "No new application code paths to audit."
|
||||
|
||||
**6. After-count and coverage summary:**
|
||||
|
||||
```bash
|
||||
# Count test files after generation
|
||||
find . -name '*.test.*' -o -name '*.spec.*' -o -name '*_test.*' -o -name '*_spec.*' | grep -v node_modules | wc -l
|
||||
```
|
||||
|
||||
For PR body: `Tests: {before} → {after} (+{delta} new)`
|
||||
Coverage line: `Test Coverage Audit: N new code paths. M covered (X%). K tests generated, J committed.`
|
||||
{{TEST_COVERAGE_AUDIT_SHIP}}
|
||||
|
||||
---
|
||||
|
||||
|
||||
Reference in New Issue
Block a user