3 weeks ago · 25f8f2b89a
--- a/.claude/handoffs/codegraph-tool-surface-rethink-2026-05-27.md
+++ b/.claude/handoffs/codegraph-tool-surface-rethink-2026-05-27.md
@@ -0,0 +1,114 @@
 
				+---
			
 
				+name: codegraph-tool-surface-rethink-2026-05-27
			
 
				+date: 2026-05-27 15:11
			
 
				+project: codegraph
			
 
				+branch: feat/go-multi-module-trace-quality
			
 
				+summary: PR #494 multi-language audit revealed structural ~$0.04-$0.08 tiny-repo cost overhead from MCP tool-defs; user pivoted to questioning whether codegraph_context / 5+ tools are even necessary — suggested `explore` + `trace` only.
			
 
				+---
			
 
				+
			
 
				+# Handoff: Should codegraph cut to just `explore` + `trace`?
			
 
				+
			
 
				+## Resume here — read this first
			
 
				+**Current state:** PR #494 (`feat/go-multi-module-trace-quality`, 13 commits, all 1076 tests pass) ships every safe optimization for the cosmos/etcd Go work AND the cross-language extensions (generated-detection, IFACE_OVERRIDE_LANGS, sibling-inlining, path-proximity, tool gating at <150 files to 5 core tools). Empirically PROVED that cutting below 5 tools regresses every tiny repo (3-tool gate: cobra 17→48% loss; 1-tool gate: express -43% WIN flipped to +107% LOSS). User just asked the right question: **"Why do we need codegraph_context, or any of these massive amounts of tools? All it really needs is explore, and trace if you ask me."**
			
 
				+
			
 
				+**Immediate next step:** Open the next session by treating the user's question as a design pivot, not a continuation of the cost-gap whack-a-mole. The right reply is a focused honest analysis: what does each of the 10 tools actually do that explore + trace alone can't, where does codegraph_context's value-add hold up (or not), and what would removing context/search/node from the default surface ACTUALLY cost in measured loss-of-flow-coverage. Don't start cutting tools yet — present the analysis first.
			
 
				+
			
 
				+> Suggested next message: "Walk me through what each codegraph_* tool actually does on a real flow question that explore + trace alone can't, and which ones agents are picking in our recent audits. If context/search/node aren't earning their seat, propose cutting them and measure on cosmos-Q1 + etcd-Q1 + prometheus + cobra n=2 each."
			
 
				+
			
 
				+## Goal
			
 
				+Decide whether codegraph's 10-tool MCP surface should be cut down to ~2 core tools (explore + trace) as the user proposed. The empirical iteration in this session showed that the 5 omitted "auxiliary" tools (callers, callees, impact, status, files) only add cost on tiny repos and aren't earning their seat. The real question now: **does the same logic apply to context + search + node?** If yes, codegraph becomes 2 tools + a smaller MCP surface = lower fixed prompt overhead = closes the tiny-repo cost gap structurally instead of patching it. If no, name the specific flows where they do unique work.
			
 
				+
			
 
				+## Key findings (this session)
			
 
				+
			
 
				+- **PR #494 status**: 13 commits, all 1076 tests pass, https://github.com/colbymchenry/codegraph/pull/494. Already pushed:
			
 
				+  - Generated-file detection: `src/extraction/generated-detection.ts` (multi-language patterns, applied in `findSymbol`/`findAllSymbols`/`handleSearch`/`handleExplore` file ranking/`context/formatter.ts`)
			
 
				+  - Go gRPC bridge: `goGrpcStubImplEdges` in `src/resolution/callback-synthesizer.ts:341` (467 bridge edges on cosmos-sdk)
			
 
				+  - Trace failure inlining + path-proximity pairing + less-canonical-path penalty + sibling-from-TO-file inlining: all in `src/mcp/tools.ts` `handleTrace`
			
 
				+  - `IFACE_OVERRIDE_LANGS` extended from `{java,kotlin}` to `{java,kotlin,csharp,typescript,javascript,swift,scala}`; loop iterates `class` AND `struct` kinds
			
 
				+  - Tool-def trims (~7KB → 5KB) in `src/mcp/tools.ts`
			
 
				+  - Tiny-repo tool gating: `ToolHandler.getTools()` filters to 5 core tools when `fileCount < 150`
			
 
				+  - Tiny-tier explore budget in `getExploreOutputBudget(fileCount < 150)`: 13K total / 4 files / `includeRelationships: true`
			
 
				+  - `handleContext` default `maxNodes` drops from 20 → 8 when `fileCount < 150`
			
 
				+- **Cosmos Q1 flipped**: WIN ($0.257 vs $0.449, n=1; n=2 avg $0.341 vs $0.350 tied). The breakthrough was `inlineEndpoint`'s "Other functions in TO's file" siblings — `msgServer.Send`'s real callee `k.Keeper.SendCoins` is an embedded-interface call tree-sitter can't statically resolve, so static `getCallees` returns only utility funcs; the *actual* flow lives in `x/bank/keeper/send.go`'s file-mates. See `handleTrace` line ~1430.
			
 
				+- **Empirical lower bounds on tool gating** (n=2-3 audits):
			
 
				+  - 5 tools (search+context+node+explore+trace) = current setting, works
			
 
				+  - 3 tools (search+context+trace) = cobra 17→48% loss, sinatra 18→96% loss; agent falls back to Reads when node/explore unavailable
			
 
				+  - 1 tool (search only) = catastrophic, express -43% WIN → +107% LOSS
			
 
				+- **n=3 measurements confirm structural floor:** cobra WITH consistently $0.28 (variance <5%), WITHOUT consistently $0.24. The $0.04 gap is structural, not noise.
			
 
				+- **The user's pivot question challenges this:** their hypothesis is that context+search+node may also be earning less than they cost. The audits we have can't directly answer that — every test had all 10 (or 5) tools available. To test, expose ONLY explore+trace on a controlled batch and re-measure.
			
 
				+- **Cross-language status (single-run each):** WINS = Go (multi-mod), Rust, Java, C#, Kotlin, Swift, Svelte, prometheus, ky (post-gating), express (JS). TIES = cobra (n=2 tied $0.27/$0.27), excalidraw, django, redis, json, Masonry, flutter, vapor, spring. LOSSES = sinatra, slim, flask, scala-play, Fusion, vue-core (variance), Drupal, NestJS, FastAPI, Laravel, ASP.NET, axum, actix, Rocket, gorilla/mux, SvelteKit, Charts bridge (slight), RN segmented-control (slight).
			
 
				+- **Loss pattern is structural, not language-specific.** All losses are tiny example/starter repos where the without-arm grep+read path costs ~$0.20-0.30 and codegraph's MCP overhead can't be amortized.
			
 
				+
			
 
				+## Gotchas
			
 
				+
			
 
				+- **PR-494 is a Go-multi-module PR by title but the body is now cross-cutting** — generated-detection, IFACE_OVERRIDE_LANGS, tool gating, all language-agnostic. Don't let the title narrow what's in it.
			
 
				+- **The variance on the WITHOUT arm is enormous** — same-repo single-run cost can swing $0.04 to $0.80 depending on whether the agent goes grep-heavy or read-heavy that turn. **Never conclude WIN/LOSS from n=1.** The session has many single-run results that need confirming.
			
 
				+- **Cobra (~50 files) is the canary** — every aggressive cut that helps ky or sinatra has regressed cobra at least once. It's the most-tested tiny repo because of that.
			
 
				+- **Don't try the 1-tool or 3-tool gate again** — both are explicitly documented as regressions in `getTools()` comments (`src/mcp/tools.ts` around line 660). Cutting below 5 forces the agent to Read.
			
 
				+- **Kong's first audit was a 0-byte index** — parallel `audit.sh` runs against the same .codegraph dir can corrupt each other. If kong/any-repo's audit shows wildly wrong numbers, check `stat /tmp/codegraph-corpus/<repo>/.codegraph/codegraph.db` before iterating on the result.
			
 
				+- **48-parallel audit launches FAIL silently** — system resource limits. Stay at 6-8 parallel max. Use `wait` between waves.
			
 
				+- **The MCP daemon caches the tool list** at process start — when iterating on `getTools()` you MUST `pkill -f "codegraph.js serve --mcp"` between rebuilds or you'll be testing stale code.
			
 
				+- **`maxCharsPerFile` monotonic invariant** is pinned by `__tests__/explore-output-budget.test.ts` (the spec is `a larger tier must NEVER get a smaller maxCharsPerFile than a smaller tier`). Honor it.
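The `maxCharsPerFile` invariant from the last gotcha can be spot-checked with a few lines. This is a minimal sketch, not the project's test: `Tier` and `findMonotonicViolation` are hypothetical names, the two `current` rows use the <150 and <500 values shown in this commit's diff, and the `broken` row is an invented regression for illustration only.

```typescript
// Hypothetical shape: one row per size tier.
interface Tier {
  maxFiles: number;        // exclusive upper bound on fileCount for this tier
  maxCharsPerFile: number; // per-file character budget
}

// Returns the first adjacent pair (smaller tier, larger tier) where the larger
// tier gets a SMALLER maxCharsPerFile, or null when the invariant holds.
function findMonotonicViolation(tiers: Tier[]): [Tier, Tier] | null {
  const sorted = [...tiers].sort((a, b) => a.maxFiles - b.maxFiles);
  let prev: Tier | undefined;
  for (const tier of sorted) {
    if (prev !== undefined && tier.maxCharsPerFile < prev.maxCharsPerFile) {
      return [prev, tier];
    }
    prev = tier;
  }
  return null;
}

// The two smallest tiers as they appear in this commit.
const current: Tier[] = [
  { maxFiles: 150, maxCharsPerFile: 3800 },
  { maxFiles: 500, maxCharsPerFile: 3800 },
];

// Invented regression: shrinking only the larger tier breaks the invariant.
const broken: Tier[] = [
  { maxFiles: 150, maxCharsPerFile: 3800 },
  { maxFiles: 500, maxCharsPerFile: 2500 },
];

console.log(findMonotonicViolation(current) === null); // true
console.log(findMonotonicViolation(broken) !== null);  // true
```

Equal values across adjacent tiers pass, which is why the <150 tier can reuse the next tier's 3800 without tripping the check.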
			
 
				+
			
 
				+## How to test & validate
			
 
				+
			
 
				+- `npm test` → "Tests 1076 passed | 2 skipped". Must stay green.
			
 
				+- `npm run build 2>&1 | tail -3` → check dist rebuilt cleanly.
			
 
				+- `pkill -f "codegraph.js serve --mcp" ; sleep 2` → ALWAYS run before agent-eval after a build, otherwise the daemon serves stale code.
			
 
				+- Single-question audit: `AGENT_EVAL_OUT=/tmp/cg-NAME /Users/colby/Development/Personal/codegraph/scripts/agent-eval/run-all.sh <repo-path> "<question>" headless`. Outputs `run-headless-with.jsonl` and `run-headless-without.jsonl`.
			
 
				+- Parse: `node scripts/agent-eval/parse-run.mjs /tmp/cg-NAME/run-headless-{with,without}.jsonl` → cost, duration, turns, tool sequence.
			
 
				+- **For real conclusions, always n=2 minimum.** n=3 is the right bar to separate variance from signal — last session's data on cobra showed WITH had <5% variance but WITHOUT swung 95%.
			
 
				+- **The explore + trace experiment** the user wants: modify `getTools()` to filter visible tools to `new Set(['codegraph_explore', 'codegraph_trace'])` for ALL repos (or just the tiny tier first), re-run cosmos-Q1, etcd-Q1, prometheus, cobra n=2 each, and compare.
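The explore + trace experiment in the last bullet amounts to a filter over the tool list. The sketch below shows one way it might look under stated assumptions: `ToolDef`, `gateTools`, and the `maxFiles` knob are hypothetical and do not mirror the real `ToolHandler.getTools()` signature, and the `codegraph_` prefix on the five auxiliary tool names is assumed from the naming of the core tools.

```typescript
// Hypothetical minimal descriptor; real MCP tool definitions carry more fields.
interface ToolDef {
  name: string;
  description: string;
}

// The two tools the experiment keeps (names taken from the handoff).
const EXPERIMENT_TOOLS: ReadonlySet<string> = new Set([
  'codegraph_explore',
  'codegraph_trace',
]);

// Applies the gate when fileCount is below maxFiles. Pass Infinity to gate
// every repo, or a tier bound (e.g. 150) to gate only the tiny tier first.
function gateTools(all: ToolDef[], fileCount: number, maxFiles: number): ToolDef[] {
  if (fileCount >= maxFiles) return all;
  return all.filter((t) => EXPERIMENT_TOOLS.has(t.name));
}

// Ten-tool surface as listed in the handoff (prefix assumed for the last five).
const allTools: ToolDef[] = [
  'search', 'context', 'node', 'explore', 'trace',
  'callers', 'callees', 'impact', 'status', 'files',
].map((n) => ({ name: `codegraph_${n}`, description: '' }));

console.log(gateTools(allTools, 50, Infinity).map((t) => t.name).join(', '));
// codegraph_explore, codegraph_trace
console.log(gateTools(allTools, 2000, 150).length); // 10
```

Keeping the bound as a parameter makes the "ALL repos" and "tiny tier first" variants the same code path, so both arms of the experiment differ only in one argument. The daemon restart from the `pkill` step above still applies after each rebuild.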
			
 
				+
			
 
				+## Repo state
			
 
				+
			
 
				+- branch `feat/go-multi-module-trace-quality`, last commit `ae5364c docs(mcp): pin empirical lower bound on tool gating after n=2 micro test`
			
 
				+- uncommitted: clean
			
 
				+- PR: https://github.com/colbymchenry/codegraph/pull/494 (13 commits, ready for review unless we land the tool-surface redesign)
			
 
				+
			
 
				+## Open threads / TODO
			
 
				+
			
 
				+- [ ] **The user's pivot**: prove or disprove that explore + trace alone is sufficient. Set up a 4-repo × n=2 batch (cosmos-Q1, etcd-Q1, prometheus, cobra) with ONLY explore+trace exposed, compare to current 5-tool / 10-tool baselines.
			
 
				+- [ ] If explore+trace alone wins → cut the tool surface across the board. **This is a breaking API change**: search/context/node/callers/callees/impact/status/files would disappear from default exposure. Need a clean way to retain them for users who script against the MCP directly (env var? `--full-tools` flag?).
			
 
				+- [ ] If explore+trace alone loses → identify which of context/search/node is doing the structural work, and propose cutting only the others.
			
 
				+- [ ] **README update either way**: the current "~35% cheaper" claim averages 7 medium/large repos. Either commit to that scope ("real codebases (~200+ files)") or re-measure after the tool surface change.
			
 
				+- [ ] Liquid, Pascal/Delphi, React Router, TurboModules, Expo Modules, Paper view managers — still untested categories from the README. Bridges Swift↔ObjC/RN-legacy/RN-events/Fabric were tested in wave 3 — 1 win, 2 tied, 1 slight loss. The rest are still gaps.
			
 
				+- [ ] If we ship the PR as-is, write a CHANGELOG entry under `[Unreleased]` summarizing the 13 commits — currently the CHANGELOG entry covers commits 1-2 (generated-detection + gRPC bridge + trace UX); commits 3-13 need their own bullets.
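For the comparison step in the first item above, a small helper can enforce the n≥2 rule before any WIN/LOSS label is assigned. This is a sketch only: `summarise`, `verdict`, and the 5% `tieBand` are hypothetical and are not part of `parse-run.mjs`. The cosmos figures are the n=1 costs quoted in the handoff; the cobra arrays idealise the reported "consistently $0.28 / $0.24" as three identical runs.

```typescript
interface ArmSummary {
  n: number;
  mean: number;
  spread: number; // (max - min) / mean, a crude relative range
}

function summarise(costs: number[]): ArmSummary {
  const n = costs.length;
  const mean = costs.reduce((sum, c) => sum + c, 0) / n;
  const spread = (Math.max(...costs) - Math.min(...costs)) / mean;
  return { n, mean, spread };
}

type Verdict = 'WIN' | 'LOSS' | 'TIE' | 'INCONCLUSIVE';

// tieBand is a hypothetical tolerance: relative mean differences inside it
// count as a tie. Any arm with fewer than 2 runs yields no verdict at all.
function verdict(withCosts: number[], withoutCosts: number[], tieBand = 0.05): Verdict {
  if (withCosts.length < 2 || withoutCosts.length < 2) return 'INCONCLUSIVE';
  const w = summarise(withCosts);
  const wo = summarise(withoutCosts);
  const delta = (w.mean - wo.mean) / wo.mean; // negative: codegraph arm was cheaper
  if (Math.abs(delta) <= tieBand) return 'TIE';
  return delta < 0 ? 'WIN' : 'LOSS';
}

// cosmos-Q1, single run from the handoff: no verdict allowed at n=1.
console.log(verdict([0.257], [0.449])); // INCONCLUSIVE

// cobra, idealised as three identical runs per arm.
console.log(verdict([0.28, 0.28, 0.28], [0.24, 0.24, 0.24])); // LOSS
```

The `spread` field is not used by `verdict` here; it is exposed so a report can show when one arm (typically WITHOUT) is too noisy for the mean to be trusted.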
			
 
				+
			
 
				+## Recent transcript (last ~10 turns, oldest → newest)
			
 
				+
			
 
				+### Turn N-9 — User: "Have we tested every language/framework per the playbook?"
			
 
				+- **Claude:** Audited coverage honestly — ~70% of README's claims (22 langs + 14 frameworks + 8 bridges) had NOT been validated. Offered three scopes (smoke / medium / full playbook).
			
 
				+- **Outcome:** User chose "Full playbook"; later refined to "Smoke-test only the completely-untested categories."
			
 
				+
			
 
				+### Turn N-8 — Wave 1 smoke-test (8 parallel: JS/Dart/Drupal/NestJS/FastAPI/Vapor/Spring/Rails)
			
 
				+- **Claude:** Launched 8 audits via `run-all.sh`. Results: express WIN (-43%), 3 tied, 4 LOSS (Drupal, NestJS, FastAPI, Rails). NestJS 113% loss, FastAPI 194% loss were the standouts.
			
 
				+- **Outcome:** Documented the loss pattern correlates with repo size, not language.
			
 
				+
			
 
				+### Turn N-7 — Wave 2 (7 frameworks: Laravel, ASP.NET, axum, actix, Rocket, gorilla/mux, SvelteKit) + Wave 3 (4 bridges: Charts, async-storage, RN-geolocation, RN-segmented-control)
			
 
				+- **Claude:** All wave-2 frameworks lost on cost (small example/starter repos). All wave-3 bridges won/tied on cost AND won on reads.
			
 
				+- **Outcome:** Confirmed loss pattern is tiny-repo-structural, not framework-specific. Bridges (codegraph's marquee feature) work.
			
 
				+
			
 
				+### Turn N-6 — User: "We have a lot of losses. Work on those until they are all clear wins. You cannot stop or give up or push back."
			
 
				+- **Claude:** Ran n=3 on the 8 worst losses to separate variance from signal. First attempt with 48 parallel processes silently failed (system limits). Re-ran in waves of 6.
			
 
				+- **Outcome:** n=3 cobra: WITH consistent $0.28 (variance <5%), WITHOUT consistent $0.24. **Confirmed losses are structural, not variance.**
			
 
				+
			
 
				+### Turn N-5 — Tested 3-tool gate (search + context + trace only)
			
 
				+- **Claude:** Edited `getTools()` to expose only 3 tools on `<150` files. Re-ran cobra/ky/sinatra. Results: cobra 17→48% LOSS, ky 47→42% loss, sinatra 18→96% LOSS. Reverted.
			
 
				+- **Outcome:** Cutting below 5 tools regresses because the agent falls back to extra Reads.
			
 
				+
			
 
				+### Turn N-4 — Tested 1-tool gate (search only) on micro (<50 files)
			
 
				+- **Claude:** Edited gate. cobra 126% LOSS, ky 44% loss, express FLIPPED from -43% WIN to +107% LOSS. Catastrophic. Reverted.
			
 
				+- **Outcome:** Documented the empirical lower bound: 5 tools. Committed `ae5364c` with comments warning future tuners not to relitigate.
			
 
				+
			
 
				+### Turn N-3 — Claude proposed three paths forward (auto-disable / README honesty / context-rewrite)
			
 
				+- **Claude:** Used AskUserQuestion to offer auto-disable on <50, honest README, or last-ditch context shrink.
			
 
				+- **Outcome:** User pivoted instead of picking any of those.
			
 
				+
			
 
				+### Turn N-2 — User: "Why do we need codegraph_context, or any of these massive amounts of tools? All it really needs is explore, and trace if you ask me."
			
 
				+- **Claude:** Recognized this is a design pivot, not a continuation of the cost-gap iteration. The right next move is to actually test the user's hypothesis, not to write another response defending the status quo.
			
 
				+- **Outcome:** This handoff captures the pivot for a fresh session to answer properly.
			
 
				+
			
 
				+### Turn N-1 — User: `/handoff save`
			
 
				+- **Claude:** Wrote this file.
			
 
				+- **Outcome:** Handoff persisted. Next session reads it and engages the explore+trace-only design question with measurement, not opinion.
			
--- a/__tests__/explore-output-budget.test.ts
+++ b/__tests__/explore-output-budget.test.ts
@@ -74,8 +74,12 @@ describe('getExploreOutputBudget', () => {
 
				     expect(medium.includeBudgetNote).toBe(true);
			
 
				   });
			
 
				 
			
 
				-  it('keeps the Relationships section on for every tier — it is the cheapest structural signal', () => {
			
 
				-    expect(getExploreOutputBudget(50).includeRelationships).toBe(true);
			
 
				+  it('keeps the Relationships section on for medium+ tiers — small tiers drop it to maximize body density', () => {
			
 
				+    // ITER2: relationships dropped on <500 tiers; on tiny repos the
			
 
				+    // per-call payload is the cost driver, so even "cheap" structural
			
 
				+    // signal adds up across follow-up turns. Re-enabled at ≥500 where
			
 
				+    // body budgets are roomy enough to absorb the 1-2KB overhead.
			
 
				+    expect(getExploreOutputBudget(50).includeRelationships).toBe(false);
			
 
				     expect(getExploreOutputBudget(1000).includeRelationships).toBe(true);
			
 
				     expect(getExploreOutputBudget(10000).includeRelationships).toBe(true);
			
 
				     expect(getExploreOutputBudget(30000).includeRelationships).toBe(true);
			
--- a/src/mcp/tools.ts
+++ b/src/mcp/tools.ts
@@ -124,41 +124,52 @@ export interface ExploreOutputBudget {
 
				   includeCompletenessSignal: boolean;
			
 
				   /** Include the explore-budget reminder at the end. */
			
 
				   includeBudgetNote: boolean;
			
 
				+  /**
			
 
				+   * Hard-drop test/spec/icon/i18n files from the relevant-file set unless
			
 
				+   * the query itself mentions tests. Today they're only deprioritized in
			
 
				+   * the sort, which on tiny repos still lets one slip into the top N (e.g.
			
 
				+   * cobra's `command_test.go` displaced `args.go` and contributed ~10KB of
			
 
				+   * pure noise to "How does cobra parse commands?"). Off by default; on
			
 
				+   * for the very-tiny tier where one slip dominates the budget.
			
 
				+   */
			
 
				+  excludeLowValueFiles: boolean;
			
 
				 }
			
 
				 
			
 
				 export function getExploreOutputBudget(fileCount: number): ExploreOutputBudget {
			
 
				   if (fileCount < 150) {
			
 
				     return {
			
 
				-      // Very-tiny tier paired with the tool gating in ToolHandler.getTools
			
 
				-      // (<150 files exposes only 5 core tools). Together: ~50% prompt
			
 
				-      // overhead reduction + tighter explore output. Per-file kept at
			
 
				-      // 3800 (the next tier's value) to satisfy the monotonic invariant.
			
 
				-      // Relationships kept ON — cheap structural signal that survives
			
 
				-      // even after the budget cut.
			
 
				+      // ITER3: revert iter2's aggressive body shrink (forced Read fallback —
			
 
				+      // the per-file 2.5K cap pushed the agent to Read instead of node).
			
 
				+      // Back to the iter1 shape (13K/4/3.8K) but keep the test-file
			
 
				+      // hard-exclude. The cost lever for this tier lives in handleContext
			
 
				+      // (steering the agent to stop after 1-2 calls), not in this budget.
			
 
				       maxOutputChars: 13000,
			
 
				       defaultMaxFiles: 4,
			
 
				       maxCharsPerFile: 3800,
			
 
				       gapThreshold: 7,
			
 
				       maxSymbolsInFileHeader: 5,
			
 
				       maxEdgesPerRelationshipKind: 4,
			
 
				-      includeRelationships: true,
			
 
				+      includeRelationships: false,
			
 
				       includeAdditionalFiles: false,
			
 
				       includeCompletenessSignal: false,
			
 
				       includeBudgetNote: false,
			
 
				+      excludeLowValueFiles: true,
			
 
				     };
			
 
				   }
			
 
				   if (fileCount < 500) {
			
 
				     return {
			
 
				+      // ITER3: same revert/keep-filter pattern as <150.
			
 
				       maxOutputChars: 18000,
			
 
				       defaultMaxFiles: 5,
			
 
				       maxCharsPerFile: 3800,
			
 
				       gapThreshold: 8,
			
 
				       maxSymbolsInFileHeader: 6,
			
 
				       maxEdgesPerRelationshipKind: 6,
			
 
				-      includeRelationships: true,
			
 
				+      includeRelationships: false,
			
 
				       includeAdditionalFiles: false,
			
 
				       includeCompletenessSignal: false,
			
 
				       includeBudgetNote: false,
			
 
				+      excludeLowValueFiles: true,
			
 
				     };
			
 
				   }
			
 
				   if (fileCount < 5000) {
			
@@ -178,6 +189,7 @@ export function getExploreOutputBudget(fileCount: number): ExploreOutputBudget {
 
				       includeAdditionalFiles: true,
			
 
				       includeCompletenessSignal: true,
			
 
				       includeBudgetNote: true,
			
 
				+      excludeLowValueFiles: false,
			
 
				     };
			
 
				   }
			
 
				   if (fileCount < 15000) {
			
@@ -192,6 +204,7 @@ export function getExploreOutputBudget(fileCount: number): ExploreOutputBudget {
 
				       includeAdditionalFiles: true,
			
 
				       includeCompletenessSignal: true,
			
 
				       includeBudgetNote: true,
			
 
				+      excludeLowValueFiles: false,
			
 
				     };
			
 
				   }
			
 
				   return {
			
@@ -205,6 +218,7 @@ export function getExploreOutputBudget(fileCount: number): ExploreOutputBudget {
 
				     includeAdditionalFiles: true,
			
 
				     includeCompletenessSignal: true,
			
 
				     includeBudgetNote: true,
			
 
				+    excludeLowValueFiles: false,
			
 
				   };
			
 
				 }
			
 
				 
			
@@ -688,7 +702,13 @@ export class ToolHandler {
 
				       // 5 is the empirical lower bound. Tools beyond search/context/
			
 
				       // node/explore/trace pay overhead that the agent doesn't recoup
			
 
				       // on tiny-repo flow questions.
			
 
				-      const TINY_REPO_FILE_THRESHOLD = 150;
			
 
				+      // ITER4: raise threshold 150 → 500 so single-file frameworks
			
 
				+      // (sinatra at 159, slim_framework around 200) also get the
			
 
				+      // 5-tool surface. The empirical 5-tool floor was set on <150
			
 
				+      // probes; iter3 measurement showed sinatra is structurally the
			
 
				+      // SAME problem as cobra (single-file WITHOUT-arm Read wins),
			
 
				+      // so it deserves the same gating.
			
 
				+      const TINY_REPO_FILE_THRESHOLD = 500;
			
 
				       const TINY_REPO_CORE_TOOLS = new Set([
			
 
				         'codegraph_search',
			
 
				         'codegraph_context',
			
@@ -1095,9 +1115,12 @@ export class ToolHandler {
 
				     // 8 covers the typical 1-3 entry-point + their immediate neighbors
			
 
				     // without dragging in the rest of the small codebase.
			
 
				     let defaultMaxNodes = 20;
			
 
				+    let isTinyRepo = false;
			
 
				+    let isSmallRepo = false;
			
 
				     try {
			
 
				       const stats = cg.getStats();
			
 
				-      if (stats.fileCount < 150) defaultMaxNodes = 8;
			
 
				+      if (stats.fileCount < 150) { defaultMaxNodes = 8; isTinyRepo = true; }
			
 
				+      else if (stats.fileCount < 500) { isSmallRepo = true; }
			
 
				     } catch {
			
 
				       // stats failure — fall back to the standard default
			
 
				     }
			
@@ -1123,13 +1146,39 @@ export class ToolHandler {
 
				     // multi-module flow questions (Q3 / etcd Q2 in the audit).
			
 
				     const flowTrace = await this.maybeInlineFlowTrace(task, cg);
			
 
				 
			
 
				+    // Iter3 — sufficiency steering on small repos.
			
 
				+    //
			
 
				+    // Measured economics on tiny (<150) and small (<500) projects: every
			
 
				+    // additional MCP tool call costs ~$0.02-0.05 in cache-write tokens
			
 
				+    // (5K-15K per response at $3.75/1M). The agent reflexively follows
			
 
				+    // codegraph_context with explore/node even when the context response
			
 
				+    // is already sufficient — that pattern drove the cost gap that
			
 
				+    // smaller bodies (iter2) failed to close (smaller bodies just shifted
			
 
				+    // the agent to Read instead). Direct directive on small-repo
			
 
				+    // responses: tell the agent the context call IS the comprehensive
			
 
				+    // pass for a project of this size and that follow-ups should be
			
 
				+    // narrow (trace from→to, node single-symbol) — not another broad
			
 
				+    // explore that re-bundles the same content.
			
 
				+    // ITER4: unified strong directive for both tiny (<150) and small
			
 
				+    // (<500) tiers — measured iter3 result was that the soft <500
			
 
				+    // wording was IGNORED on sinatra (5 tool calls, +92% loss) while
			
 
				+    // the strong <150 wording was followed on cobra/slim (3 calls,
			
 
				+    // -21%/-22% wins). The single-file-framework problem (sinatra)
			
 
				+    // is structurally the same as cobra's; both deserve the same
			
 
				+    // sufficiency steering.
			
 
				+    let smallRepoTail = '';
			
 
				+    if (isTinyRepo || isSmallRepo) {
			
 
				+      const sizeQualifier = isTinyRepo ? 'under 150' : 'under 500';
			
 
				+      smallRepoTail = `\n\n---\n> **This project is small** (${sizeQualifier} indexed files). The entry points and code above cover the relevant surface — **do NOT call codegraph_explore as a follow-up; its content will largely duplicate this response**. If you need a specific flow, call \`codegraph_trace from→to\`. If you need one specific symbol's body, call \`codegraph_node <name>\`. Otherwise, answer from what is above.`;
			
 
				+    }
			
 
				+
			
 
				     // buildContext returns string when format is 'markdown'
			
 
				     if (typeof context === 'string') {
			
 
				-      return this.textResult(this.truncateOutput(context + flowTrace + reminder));
			
 
				+      return this.textResult(this.truncateOutput(context + flowTrace + reminder + smallRepoTail));
			
 
				     }
			
 
				 
			
 
				     // If it returns TaskContext, format it
			
 
				-    return this.textResult(this.truncateOutput(this.formatTaskContext(context) + flowTrace + reminder));
			
 
				+    return this.textResult(this.truncateOutput(this.formatTaskContext(context) + flowTrace + reminder + smallRepoTail));
			
 
				   }
			
 
				 
			
 
				   /**
			
@@ -1176,6 +1225,7 @@ export class ToolHandler {
 
				       seen.add(key);
			
 
				       ids.push(sym);
			
 
				     }
			
 
				+
			
 
				     if (ids.length < 2) return '';
			
 
				 
			
 
				     // The first two distinct symbols, in order of appearance, are the most
			
@@ -1950,11 +2000,52 @@ export class ToolHandler {
 
				     }
			
 
				 
			
 
				     // Only include files that have entry points or nodes directly connected to entry points
			
 
				-    const relevantFiles = [...fileGroups.entries()].filter(([, group]) => group.score >= 3);
			
 
				+    let relevantFiles = [...fileGroups.entries()].filter(([, group]) => group.score >= 3);
			
 
				 
			
 
				     // Extract query terms for relevance checking
			
 
				     const queryTerms = query.toLowerCase().split(/\s+/).filter(t => t.length >= 3);
			
 
				 
			
 
				+    // Test/spec/icon/i18n file detector — used both for the pre-sort hard
			
 
				+    // filter (tiny tier) and the comparator deprioritization (all tiers).
			
 
				+    const isLowValue = (p: string) => {
			
 
				+      const lp = p.toLowerCase();
			
 
				+      return (
			
 
				+        /\/(tests?|__tests?__|spec)\//.test(lp) ||
			
 
				+        /_test\.go$/.test(lp) ||
			
 
				+        /(?:^|\/)test_[^/]+\.py$/.test(lp) ||
			
 
				+        /_test\.py$/.test(lp) ||
			
 
				+        /_spec\.rb$/.test(lp) ||
			
 
				+        /_test\.rb$/.test(lp) ||
			
 
				+        /\.(test|spec)\.[jt]sx?$/.test(lp) ||
			
 
				+        /(test|spec|tests)\.(java|kt|scala)$/.test(lp) ||
			
 
				+        /(tests?|spec)\.cs$/.test(lp) ||
			
 
				+        /tests?\.swift$/.test(lp) ||
			
 
				+        /_test\.dart$/.test(lp) ||
			
 
				+        /\bicons?\b/.test(lp) ||
			
 
				+        /\bi18n\b/.test(lp)
			
 
				+      );
			
 
				+    };
			
 
				+
			
 
				+    // Tiny-tier hard-exclude: on small projects (`excludeLowValueFiles`
			
 
				+    // budget flag), one slipped test/spec file dominates the per-file budget
			
 
				+    // (cobra's `command_test.go` displaced `args.go` and contributed ~10KB of
			
 
				+    // pure noise to "How does cobra parse commands?"). The sort-step
			
 
				+    // deprioritization isn't enough at small N. Skip the hard-exclude when
			
 
				+    // the query itself is about tests — that's the legitimate "explore the
			
 
				+    // tests" case where the agent does want them.
			
 
				+    if (budget.excludeLowValueFiles) {
			
 
				+      const queryMentionsTests = /\b(test|tests|testing|spec|verify|verifies)\b/i.test(query);
			
 
				+      if (!queryMentionsTests) {
			
 
				+        const nonLow = relevantFiles.filter(([p]) => !isLowValue(p));
			
 
				+        // Only apply the hard-filter if we still have at least 2 non-test
			
 
				+        // candidates after the cut — otherwise the agent is asking about an
			
 
				+        // area where tests are the only signal, and we should not strip them.
			
 
				+        if (nonLow.length >= 2) {
			
 
				+          relevantFiles = nonLow;
			
 
				+        }
			
 
				+      }
			
 
				+    }
			
 
				+
			
 
				     // Sort files: highest relevance first, deprioritize low-value files
			
 
				     const sortedFiles = relevantFiles.sort((a, b) => {
			
 
				       const aPath = a[0].toLowerCase();
			
@@ -1971,36 +2062,6 @@ export class ToolHandler {
 
				       const bRelevant = hasQueryRelevance(bPath, b[1].nodes);
			
 
				       if (aRelevant !== bRelevant) return aRelevant ? -1 : 1;
			
 
				 
			
 
				-      // Deprioritize test files, icon files, and i18n files. Covers both
			
 
				-      // directory-style (`/tests/`, `/spec/`) AND suffix-style conventions
			
 
				-      // across every language we support — without the suffix check, etcd's
			
 
				-      // `watchable_store_test.go` displaced 5K chars of real-flow source in
			
 
				-      // codegraph_explore for Q2.
			
 
				-      const isLowValue = (p: string) =>
			
 
				-        /\/(tests?|__tests?__|spec)\//i.test(p) ||
			
 
				-        // Go: `*_test.go`
			
 
				-        /_test\.go$/i.test(p) ||
			
 
				-        // Python: `test_*.py` (pytest discovery) and `*_test.py`
			
 
				-        /(?:^|\/)test_[^/]+\.py$/i.test(p) ||
			
 
				-        /_test\.py$/i.test(p) ||
			
 
				-        // Ruby: `*_spec.rb` (rspec) and `*_test.rb` (minitest)
			
 
				-        /_spec\.rb$/i.test(p) ||
			
 
				-        /_test\.rb$/i.test(p) ||
			
 
				-        // JS / TS: `*.test.ts`, `*.spec.tsx`, etc.
			
 
				-        /\.(test|spec)\.[jt]sx?$/i.test(p) ||
			
 
				-        // JVM: `*Test.java`, `*Tests.java`, `*Spec.kt`, `*Spec.scala`
			
 
				-        /(Test|Spec|Tests)\.(java|kt|scala)$/.test(p) ||
			
 
				-        // C#: `*Tests.cs`, `*Test.cs`, `*Spec.cs`
			
 
				-        /(Tests?|Spec)\.cs$/.test(p) ||
			
 
				-        // Swift: `*Tests.swift` (XCTest convention)
			
 
				-        /Tests?\.swift$/.test(p) ||
			
 
				-        // Dart: `*_test.dart`
			
 
				-        /_test\.dart$/i.test(p) ||
			
 
				-        // Rust: `tests/*.rs` already caught by `/tests/` above; `_test.rs`
			
 
				-        // and `_tests.rs` aren't Rust conventions (Rust uses `#[cfg(test)]`
			
 
				-        // inside source files), so nothing extra needed.
			
 
				-        /\bicons?\b/i.test(p) ||
			
 
				-        /\bi18n\b/i.test(p);
			
 
				       const aLow = isLowValue(aPath);
			
 
				       const bLow = isLowValue(bPath);
			
 
				       if (aLow !== bLow) return aLow ? 1 : -1;