Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex doesn't handle surrogate pairs properly #15

Open
MiSawa opened this issue Jan 31, 2019 · 3 comments
Open

Regex doesn't handle surrogate pairs properly #15

MiSawa opened this issue Jan 31, 2019 · 3 comments

Comments

@MiSawa
Copy link

MiSawa commented Jan 31, 2019

Hi, thank you for providing a great regular expression library!

I have noticed that brics handles input regex string as a sequence of java.lang.Character, and this could cause a somewhat unintuitive behavior.

For example, 𠀋<𠮟<𡵅 as a Unicode Scalar Value (0x2000b, 0x20b9f, 0x21d45 respectively, all of them will be expressed with surrogate pairs), but automaton created from [𠀋-𡵅] doesn't accept 𠮟.

private static void checkOneCodePoint(final String s) {
    if (s.codePointCount(0, s.length()) != 1) throw new IllegalArgumentException();
}

public static boolean testBrics(final String a, final String b, final String c) {
    checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
    final RegExp regex = new RegExp("[" + a + "-" + c + "]");
    return regex.toAutomaton().run(b);
}

public static boolean testJava(final String a, final String b, final String c) {
    checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
    final Pattern pattern = Pattern.compile("[" + a + "-" + c + "]");
    return pattern.matcher(b).matches();
}

public static void main(final String[] args) throws IOException {
    final String a = new String(new int[]{0x2000b}, 0, 1); // 𠀋
    final String b = new String(new int[]{0x20b9f}, 0, 1); // 𠮟
    final String c = new String(new int[]{0x21d45}, 0, 1); // 𡵅
    System.out.println(testBrics(a, b, c)); // false
    System.out.println(testJava(a, b, c)); // true
}

Fixing this would require us to do

  • Read regex string as (not java.lang.Character-by-java.lang.Character but) Code Point stream. This also includes fixes for operator precedence, like 𠀋+.
  • Convert them to java.lang.Characters, and if they involve surrogate pairs, do something similar to what we do for numerical interval <n-m>

Although won't-fix totally make sense, it'd be great if we could find this fact in the documentation.

Thanks,

@mrspaceman
Copy link

mrspaceman commented Feb 4, 2019

I have found a similar issue:

When I try to generate strings for a regex containing emoji using the following code :

RegExp r = new RegExp("[😁-😘]{1}");
Automaton a = r.toAutomaton();
System.out.println("automaton=" + a.toString());

the toString() produces the following:

automaton=initial state: 0
state 0 [reject]:
      \ud83d -> 1
      \ude18 -> 1
    state 1 [accept]:

The emoji's used are:

😁

U+1F601
\0xD83D               \0xDE01
55357                 56833
1101 1000 0011 1101   1101 1110 0000 0001

to

😘

U+1F618
\0xD83D               \0xDE18
55357                 56856
1101 1000 0011 1101   1101 1110 0001 1000

@turf00
Copy link
Contributor

turf00 commented May 17, 2023

You cannot fix range handling of Unicode chars that are outside the 16bit values possible with a single Java char, as per above.

However, you can coax Brics to handle repetition and other tokens using grouping (😘) and could potentially replace the range with explicit alternatives.

Just leaving this up here in case its of use to someone.

    // GIVEN; input that contains multiple supplementary characters
    final String input = "😘😘😘abc";
    // WHEN; using standard broken for for repetition emojis
    final RegExp broke = new RegExp("😘+abc");
    final Automaton aBroke = broke.toAutomaton();

    // THEN; does not match incorrectly
    System.out.println("Broke: " + aBroke.run(input));

    // WHEN; coaxing brics to treat two separate chars as single entity
    final RegExp fix = new RegExp("(😘)+abc");
    final Automaton aFix = fix.toAutomaton();

    // THEN; does match correctly
    System.out.println("Fixed: " + aFix.run(input));

Will result in:

Broke: false
Fixed: true

And here is another example

    // GIVEN; input that contains multiple supplementary characters
    final String input = "😘😤😘😤😘abc";
    // WHEN; using standard broken for for repetition emojis
    final RegExp broke = new RegExp("[😤😘]+abc");
    final Automaton aBroke = broke.toAutomaton();

    // THEN; does not match incorrectly
    System.out.println("Broke: " + aBroke.run(input));

    // WHEN; coaxing brics to treat two separate chars as single entity
    final RegExp fix = new RegExp("((😘)|(😤))+abc");
    final Automaton aFix = fix.toAutomaton();

    // THEN; does match correctly
    System.out.println("Fixed: " + aFix.run(input));

In the last example, the broken one actually returns true but I believe its not matching correctly simply treating each of the 4 chars that make up the 2 code points as allowed and therefore matching on the string. However, it could potentially match other code points incorrectly.

@DirkToewe
Copy link

FIY you can also match code point ranges. I had made a corresponding pull request a while ago (PR #35). The relevant function is makeCodePointRange( int min, int max ), which returns an Automaton matching the given (valid) code point range.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants