Bug: Named backreferences will always cause a syntax error for non-Unicode regexes in strict parsing mode #23

RunDevelopment · 2021-06-03T09:36:25Z

When parsing a non-Unicode regex that contains named backreferences with the strict: true option, a syntax error is always thrown, regardless of whether the regex is actually correct.

Example:

const { RegExpValidator } = require("regexpp")

const validator = new RegExpValidator({ strict: true, ecmaVersion: 2020 })
validator.validatePattern(/(?<foo>A)\k<foo>/.source, undefined, undefined, false)

This produces the following error:

SyntaxError: Invalid regular expression: /(?<foo>A)\k<foo>/: Invalid escape
    at RegExpValidator.raise ([...]\regexpp\.temp\src\validator.ts:847:15)
    at RegExpValidator.consumeAtomEscape ([...]\regexpp\.temp\src\validator.ts:1475:18)
    at RegExpValidator.consumeReverseSolidusAtomEscape ([...]\regexpp\.temp\src\validator.ts:1245:22)
    at RegExpValidator.consumeAtom ([...]\regexpp\.temp\src\validator.ts:1213:18)
    at RegExpValidator.consumeTerm ([...]\regexpp\.temp\src\validator.ts:1027:23)
    at RegExpValidator.consumeAlternative ([...]\regexpp\.temp\src\validator.ts:1000:53)
    at RegExpValidator.consumeDisjunction ([...]\regexpp\.temp\src\validator.ts:976:18)
    at RegExpValidator.consumePattern ([...]\regexpp\.temp\src\validator.ts:901:14)
    at RegExpValidator.validatePattern ([...]\regexpp\.temp\src\validator.ts:531:14)
    at validateRegExpPattern (my-project\app.ts:12:75)

However, the regex /(?<foo>A)\k<foo>/ is valid. As stated in the proposal:

In this proposal, \k<foo> in non-Unicode RegExps will continue to match the literal string "k<foo>" unless the RegExp contains a named group, in which case it will match that group or be a syntax error, depending on whether or not the RegExp has a named group named foo.

Since the regex contains a named capturing group, \k<foo> has to be parsed as a backreference. Since Annex B doesn't say anything about named backreferences, regexpp should parse this regex even with strict: true.

However, regexpp parses it as an invalid(?) escape and throws an error in strict mode. This is because validation is done in two passes (1, 2). The bug occurs because the n flag isn't set in the first pass, which causes the syntax error. This can be seen in the stack trace: the second-to-last line, at RegExpValidator.validatePattern ([...]\validator.ts:531:14), is the first parsing pass.

The fix for this bug is to determine whether the regex contains named groups ahead of time, similar to how the number of capturing groups is counted before parsing. I will make a PR.
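
To illustrate the idea of detecting named groups ahead of time, here is a minimal sketch. It is not the code from the PR: hasNamedGroup is a hypothetical helper that scans the pattern source once, skipping escapes and character classes, and reports whether a (?<name> group opener appears. It assumes that the lookbehind forms (?<= and (?<! are the only other constructs starting with (?<.

```javascript
// Hypothetical helper, not part of regexpp: reports whether a pattern
// source contains a named capturing group opener such as "(?<name>".
function hasNamedGroup(source) {
    let inClass = false
    for (let i = 0; i < source.length; i++) {
        const ch = source[i]
        if (ch === "\\") {
            i++ // skip the escaped character
        } else if (inClass) {
            // Inside [...] a "(" is a literal character, so only look for "]".
            if (ch === "]") inClass = false
        } else if (ch === "[") {
            inClass = true
        } else if (
            ch === "(" &&
            source[i + 1] === "?" &&
            source[i + 2] === "<" &&
            source[i + 3] !== "=" && // (?<= is a lookbehind
            source[i + 3] !== "!" // (?<! is a negative lookbehind
        ) {
            return true
        }
    }
    return false
}
```

With a result like this available before the first pass, a validator could decide up front whether \k should be read as a backreference, in the same spirit as counting capturing groups before parsing.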


mysticatea · 2021-06-13T22:11:31Z

Hi. Thank you for your detailed report!

Hmm. I'm not 100% sure if it's a spec violation. The two-pass parsing comes from 22.2.3.2.3 Static Semantics: ParsePattern ( patternText, u ). It says, "if no u flag, parse the pattern without N parameter, then if no syntax errors exist and named capturing groups exist, re-parse the pattern with N parameter." And unfortunately, according to 22.2.1 Patterns, \k is a syntax error without the N parameter.

Please tell me if I went wrong.
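
The two-pass flow paraphrased above can be sketched with a toy model. This is only an illustration under stated assumptions: toyPass and parseNonUnicode are hypothetical names, the "parser" encodes just the single rule under discussion (\k without the N parameter is an error unless Annex B syntax is allowed), and the named-group check is a simple regex rather than a real parse.

```javascript
// Toy single pass, not the real validator. It models one rule only:
// without the N parameter and without Annex B, "\k" is an invalid escape.
// The checks are deliberately naive (they ignore escaped backslashes and
// character classes), which is enough for the example pattern.
function toyPass(source, { n, annexB }) {
    const errors = []
    const hasNamed = /\(\?<(?![=!])/.test(source)
    if (source.includes("\\k") && !n && !annexB) {
        errors.push("Invalid escape")
    }
    return { errors, hasNamed }
}

// Two-pass flow for non-Unicode patterns as described in the comment above:
// parse without N first, and re-parse with N only if the first pass had no
// errors and found a named group.
function parseNonUnicode(source, annexB) {
    const first = toyPass(source, { n: false, annexB })
    if (first.errors.length === 0 && first.hasNamed) {
        return toyPass(source, { n: true, annexB })
    }
    return first
}

const source = "(?<foo>A)\\k<foo>"
const strictResult = parseNonUnicode(source, false) // Annex B disabled
const looseResult = parseNonUnicode(source, true) // Annex B enabled
```

In this model strictResult.errors holds the "Invalid escape" from the first pass, so the second pass never runs, while looseResult.errors is empty because Annex B lets the first pass accept \k.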

mysticatea · 2021-06-13T22:23:04Z

So maybe this is a spec bug, and the IdentityEscape production needs [~N] k.

RunDevelopment · 2021-06-13T22:41:17Z

I agree with you. I also read the specification like this. \k<foo> is a syntax error according to the spec. The only reason v8 parses the above regex is because of Annex B.

This is quite interesting because whether this is a bug or not comes down to the semantics of the strict option. Right now strict is implemented as "Annex B syntax is disabled for every parsing pass". In #24, I implemented strict as "Annex B syntax is disabled for the last parsing pass" (well, not quite).

As a question to the creator of the library: What is the semantic of the strict option?

I hope that it's the latter version because that would be a lot more useful IMO.
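
The difference between the two readings of strict can be made concrete with a small model. This is a sketch under assumptions and does not reproduce the implementation in #24 (which, as noted, is "not quite" the second reading): pass and strictAccepts are hypothetical names, and the only grammar rule modelled is that \k needs either the N parameter or Annex B.

```javascript
// Toy model of one parsing pass: "\k" is accepted only if the N parameter
// is set or Annex B syntax is allowed. Everything else is accepted.
function pass(source, n, annexB) {
    return !source.includes("\\k") || n || annexB
}

// reading "every-pass": Annex B is disabled in both passes.
// reading "last-pass":  Annex B is disabled only in the final pass.
function strictAccepts(source, reading) {
    // Naive named-group check; ignores escapes and character classes.
    const hasNamedGroup = /\(\?<(?![=!])/.test(source)
    // Without a named group there is no second pass, so the first is last.
    const firstIsLast = !hasNamedGroup
    const annexBInFirst = reading === "last-pass" && !firstIsLast
    if (!pass(source, false, annexBInFirst)) return false
    // The second pass runs with the N parameter and never allows Annex B.
    return firstIsLast || pass(source, true, false)
}

const everyPass = strictAccepts("(?<foo>A)\\k<foo>", "every-pass")
const lastPass = strictAccepts("(?<foo>A)\\k<foo>", "last-pass")
const loneEscape = strictAccepts("\\k<foo>", "last-pass")
```

In this model everyPass is false and lastPass is true, while loneEscape stays false: a \k<foo> with no named group in the pattern is still rejected under either reading.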

mysticatea · 2021-06-13T23:04:01Z

I opened an issue: tc39/ecma262#2434

I don't think it's intentional that named capturing groups require the u flag. Let's see how TC39 fixes that.

As a question to the creator of the library: What is the semantic of the strict option?

It disables Annex B completely.

MichaelDeBoey · 2023-10-14T15:36:00Z

For people watching this issue: we've already started with our own fork in order to not hold the wider community back anymore: https://github.com/eslint-community/regexpp

@mysticatea We would still love to move the original repo to the new @eslint-community though.

This PR is released in @eslint-community/regexpp v4.4.1
https://github.com/eslint-community/regexpp/releases/tag/v4.4.1

This was referenced Jun 3, 2021

Improved regexp/strict rule ota-meshi/eslint-plugin-regexp#225

Merged

Fixed named backreferences in strict mode #24

Closed

mysticatea mentioned this issue Jun 13, 2021

/(?<foo>A)\k<foo>/ is a syntax error unless using Annex B tc39/ecma262#2434

Closed

mysticatea added the "bug" (Something isn't working) and "pending for spec bug" labels Jun 14, 2021

ota-meshi mentioned this issue Feb 15, 2023

Bug: Named backreferences will always cause a syntax error for non-Unicode regexes in strict parsing mode eslint-community/regexpp#55

Closed
