Regex doesn't handle surrogate pairs properly #15

MiSawa · 2019-01-31T07:53:45Z

Hi, thank you for providing a great regular expression library!

I have noticed that brics handles input regex string as a sequence of java.lang.Character, and this could cause a somewhat unintuitive behavior.

For example, 𠀋<𠮟<𡵅 as a Unicode Scalar Value (0x2000b, 0x20b9f, 0x21d45 respectively, all of them will be expressed with surrogate pairs), but automaton created from [𠀋-𡵅] doesn't accept 𠮟.

private static void checkOneCodePoint(final String s) {
    if (s.codePointCount(0, s.length()) != 1) throw new IllegalArgumentException();
}

public static boolean testBrics(final String a, final String b, final String c) {
    checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
    final RegExp regex = new RegExp("[" + a + "-" + c + "]");
    return regex.toAutomaton().run(b);
}

public static boolean testJava(final String a, final String b, final String c) {
    checkOneCodePoint(a); checkOneCodePoint(b); checkOneCodePoint(c);
    final Pattern pattern = Pattern.compile("[" + a + "-" + c + "]");
    return pattern.matcher(b).matches();
}

public static void main(final String[] args) throws IOException {
    final String a = new String(new int[]{0x2000b}, 0, 1); // 𠀋
    final String b = new String(new int[]{0x20b9f}, 0, 1); // 𠮟
    final String c = new String(new int[]{0x21d45}, 0, 1); // 𡵅
    System.out.println(testBrics(a, b, c)); // false
    System.out.println(testJava(a, b, c)); // true
}

Fixing this would require us to do

Read regex string as (not java.lang.Character-by-java.lang.Character but) Code Point stream. This also includes fixes for operator precedence, like 𠀋+.
Convert them to java.lang.Characters, and if they involve surrogate pairs, do something similar to what we do for numerical interval <n-m>

Although won't-fix totally make sense, it'd be great if we could find this fact in the documentation.

Thanks,

The text was updated successfully, but these errors were encountered:

mrspaceman · 2019-02-04T10:40:43Z

I have found a similar issue:

When I try to generate strings for a regex containing emoji using the following code :

RegExp r = new RegExp("[😁-😘]{1}");
Automaton a = r.toAutomaton();
System.out.println("automaton=" + a.toString());

the toString() produces the following:

automaton=initial state: 0
state 0 [reject]:
      \ud83d -> 1
      \ude18 -> 1
    state 1 [accept]:

The emoji's used are:

😁

U+1F601
\0xD83D               \0xDE01
55357                 56833
1101 1000 0011 1101   1101 1110 0000 0001

to

😘

U+1F618
\0xD83D               \0xDE18
55357                 56856
1101 1000 0011 1101   1101 1110 0001 1000

turf00 · 2023-05-17T08:31:11Z

You cannot fix range handling of Unicode chars that are outside the 16bit values possible with a single Java char, as per above.

However, you can coax Brics to handle repetition and other tokens using grouping (😘) and could potentially replace the range with explicit alternatives.

Just leaving this up here in case its of use to someone.

    // GIVEN; input that contains multiple supplementary characters
    final String input = "😘😘😘abc";
    // WHEN; using standard broken for for repetition emojis
    final RegExp broke = new RegExp("😘+abc");
    final Automaton aBroke = broke.toAutomaton();

    // THEN; does not match incorrectly
    System.out.println("Broke: " + aBroke.run(input));

    // WHEN; coaxing brics to treat two separate chars as single entity
    final RegExp fix = new RegExp("(😘)+abc");
    final Automaton aFix = fix.toAutomaton();

    // THEN; does match correctly
    System.out.println("Fixed: " + aFix.run(input));

Will result in:

Broke: false
Fixed: true

And here is another example

    // GIVEN; input that contains multiple supplementary characters
    final String input = "😘😤😘😤😘abc";
    // WHEN; using standard broken for for repetition emojis
    final RegExp broke = new RegExp("[😤😘]+abc");
    final Automaton aBroke = broke.toAutomaton();

    // THEN; does not match incorrectly
    System.out.println("Broke: " + aBroke.run(input));

    // WHEN; coaxing brics to treat two separate chars as single entity
    final RegExp fix = new RegExp("((😘)|(😤))+abc");
    final Automaton aFix = fix.toAutomaton();

    // THEN; does match correctly
    System.out.println("Fixed: " + aFix.run(input));

In the last example, the broken one actually returns true but I believe its not matching correctly simply treating each of the 4 chars that make up the 2 code points as allowed and therefore matching on the string. However, it could potentially match other code points incorrectly.

DirkToewe · 2023-05-17T09:21:21Z

FIY you can also match code point ranges. I had made a corresponding pull request a while ago (PR #35). The relevant function is makeCodePointRange( int min, int max ), which returns an Automaton matching the given (valid) code point range.

mrspaceman mentioned this issue Feb 4, 2019

Update dk.brics.automaton library finos/datahelix#537

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regex doesn't handle surrogate pairs properly #15

Regex doesn't handle surrogate pairs properly #15

MiSawa commented Jan 31, 2019 •

edited

Loading

mrspaceman commented Feb 4, 2019 •

edited

Loading

turf00 commented May 17, 2023

DirkToewe commented May 17, 2023

Regex doesn't handle surrogate pairs properly #15

Regex doesn't handle surrogate pairs properly #15

Comments

MiSawa commented Jan 31, 2019 • edited Loading

mrspaceman commented Feb 4, 2019 • edited Loading

turf00 commented May 17, 2023

DirkToewe commented May 17, 2023

MiSawa commented Jan 31, 2019 •

edited

Loading

mrspaceman commented Feb 4, 2019 •

edited

Loading