Extend `isDangerousInvisibleCodePoint` with five additional code
points / ranges that are routinely cited in invisible-character
smuggling references but were not in the previous denylist:
- **U+180E** MONGOLIAN VOWEL SEPARATOR. Formerly classified as a
space separator (Zs) until Unicode 6.3 reclassified it as Cf
(Format control). Renders as zero-width; widely abused for
homograph attacks and prompt smuggling.
- **U+115F** HANGUL CHOSEONG FILLER and **U+1160** HANGUL JUNGSEONG
FILLER. Zero-width fillers used in Korean text shaping. Both are
cited as common LLM-injection vectors in Korean / multilingual
threat models.
- **U+2061–U+2064** invisible math operators (FUNCTION APPLICATION,
INVISIBLE TIMES, INVISIBLE SEPARATOR, INVISIBLE PLUS). Zero-width
and only meaningful inside math typesetting. No legitimate
Markdown or source code uses them.
- **U+3164** HANGUL FILLER. Reported in real-world Discord and
Twitter smuggling incidents; not used in legitimate Korean text.
Reproduced before this commit: a file containing any one of these
code points passed `check-unicode-safety.js` silently.
After this commit each one is reported as
`dangerous-invisible U+<HEX>` and `--write` mode strips it.
Verified by writing 8 single-character probe files
(`probe-0x180E.md`, `probe-0x115F.md`, …) and confirming exit=1 with
each violation listed.
ECC repo self-scan reports only the pre-existing `U+2605` BLACK
STAR warnings (unchanged) and exits with the same status (no new
in-repo violations introduced). Existing 5 unicode-safety tests
still pass; `yarn lint` clean.
Regression coverage for both the previous commit's Tag block fix
and this commit's additions lands in the next commit.
`isDangerousInvisibleCodePoint` enumerated seven ranges of invisible/
bidi/variation-selector code points but omitted the Unicode Tag block
(U+E0000–U+E007F). Tag characters were proposed for language tagging
in Unicode 3.1 and have been deprecated since Unicode 5.1, so no
legitimate text uses them. They are the canonical vector for
"ASCII Smuggling" / "Tag Smuggling" LLM prompt injection: an attacker
hides instructions inside an ASCII-looking string, the model reads
the tag bytes, the human reviewer sees nothing. Demonstrated against
multiple LLM assistants during 2024–2025.
`check-unicode-safety.js` is the repo's last line of defence before
contributor content reaches agent context; the same script also runs
in `--write` auto-sanitize mode on `.md` / `.mdx` / `.txt`. Today it
silently passes tag-block characters through unchanged in both
detection mode and `--write` mode.
Reproduced before this commit:
$ mkdir -p /tmp/uni-test && node -e "
const fs = require('fs');
const hidden = [...Array(5)].map((_,i) =>
String.fromCodePoint(0xE0041 + i)).join('');
fs.writeFileSync('/tmp/uni-test/innocent.md',
'# Title\\n\\nBenign text' + hidden + ' more.\\n');"
$ ECC_UNICODE_SCAN_ROOT=/tmp/uni-test \
node scripts/ci/check-unicode-safety.js
Unicode safety check passed.
$ echo $?
0
Expected: tag-block characters reported as `dangerous-invisible`
violations (exit 1) and stripped under `--write`.
Actual: validator passes, `--write` leaves the bytes intact.
Fix: extend the denylist with one new range
`(codePoint >= 0xE0000 && codePoint <= 0xE007F)`. The change is
purely additive; the existing seven ranges are untouched.
After this commit the same reproduction returns:
$ ECC_UNICODE_SCAN_ROOT=/tmp/uni-test \
node scripts/ci/check-unicode-safety.js
Unicode safety violations detected:
innocent.md:3:12 dangerous-invisible U+E0041
innocent.md:3:14 dangerous-invisible U+E0042
innocent.md:3:16 dangerous-invisible U+E0043
innocent.md:3:18 dangerous-invisible U+E0044
innocent.md:3:20 dangerous-invisible U+E0045
exit=1
`--write` mode also strips the bytes (verified: file length 47 → 42
after sanitize, regex `/[\u{E0000}-\u{E007F}]/u` no longer matches).
Existing 5 unicode-safety tests still pass; `yarn lint` clean. The
ECC repo's own self-scan (`node scripts/ci/check-unicode-safety.js`
with no `ECC_UNICODE_SCAN_ROOT`) reports the same warnings as before
this commit and exits with the same status (no regressions on
in-repo content).
A handful of other widely-cited invisible code points are missing
from the denylist (`U+180E`, `U+115F`, `U+1160`, `U+2061–U+2064`,
`U+3164`); those are addressed in the next commit so each fix
remains independently reviewable. Regression coverage for both
fixes lands two commits later.