Regex doesn't handle surrogate pairs properly #15
I have found a similar issue. When I try to generate strings for a regex containing emoji using the following code:
the toString() produces the following:
The emojis used are 😁 to 😘.
You cannot fix range handling of Unicode characters that fall outside the 16-bit values representable by a single Java char, as per above. However, you can coax Brics into handling repetition and other tokens by using grouping. Just leaving this up here in case it's of use to someone.
// GIVEN; input that contains multiple supplementary characters
final String input = "😘😘😘abc";
// WHEN; using the standard (broken) form for repeating emojis
final RegExp broke = new RegExp("😘+abc");
final Automaton aBroke = broke.toAutomaton();
// THEN; incorrectly does not match
System.out.println("Broke: " + aBroke.run(input));
// WHEN; coaxing brics to treat two separate chars as single entity
final RegExp fix = new RegExp("(😘)+abc");
final Automaton aFix = fix.toAutomaton();
// THEN; does match correctly
System.out.println("Fixed: " + aFix.run(input));
Will result in:
And here is another example:
// GIVEN; input that contains multiple supplementary characters
final String input = "😘😤😘😤😘abc";
// WHEN; using the standard (broken) form for repeating emojis
final RegExp broke = new RegExp("[😤😘]+abc");
final Automaton aBroke = broke.toAutomaton();
// THEN; matches, but not for the right reason
System.out.println("Broke: " + aBroke.run(input));
// WHEN; coaxing brics to treat two separate chars as single entity
final RegExp fix = new RegExp("((😘)|(😤))+abc");
final Automaton aFix = fix.toAutomaton();
// THEN; does match correctly
System.out.println("Fixed: " + aFix.run(input));
In the last example, the broken one actually returns true, but I believe it's not matching correctly: it simply treats each of the 4 chars that make up the 2 code points as allowed and therefore matches the string. However, it could potentially match other code points incorrectly.
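The last point can be illustrated without brics at all. The sketch below uses only the JDK; `charLevelClassPlus` is a hypothetical helper that emulates a character class working on individual UTF-16 chars followed by `+`, and is not brics code. It assumes a class built from two supplementary characters with different high surrogates, and shows that a third code point, assembled from the high surrogate of one member and the low surrogate of the other, is accepted as well.

```java
import java.util.HashSet;
import java.util.Set;

final class CharLevelClassDemo {

    // Builds the String for a single code point (one or two UTF-16 chars).
    static String cp(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Hypothetical emulation of a char-level "[...]+": accepts any non-empty
    // input in which every UTF-16 char is one of the chars of `members`.
    static boolean charLevelClassPlus(String members, String input) {
        Set<Character> allowed = new HashSet<>();
        for (char c : members.toCharArray()) {
            allowed.add(c);
        }
        if (input.isEmpty()) {
            return false;
        }
        for (char c : input.toCharArray()) {
            if (!allowed.contains(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String kiss = cp(0x1F618);      // U+1F618 is encoded as D83D DE18
        String ideograph = cp(0x2000B); // U+2000B is encoded as D840 DC0B
        String members = kiss + ideograph;

        System.out.println(members.length());                            // 4 chars
        System.out.println(members.codePointCount(0, members.length())); // 2 code points

        // U+1F40B is encoded as D83D DC0B: the high surrogate of one member
        // followed by the low surrogate of the other.
        String other = cp(0x1F40B);
        System.out.println(charLevelClassPlus(members, kiss + ideograph)); // true
        System.out.println(charLevelClassPlus(members, other));            // true, although not a member
    }
}
```

Grouping each emoji, as in the fixed regex above, avoids this because each alternative is then a fixed two-char sequence rather than a set of loose chars.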
FYI, you can also match code point ranges. I made a corresponding pull request a while ago (PR #35). The relevant function is makeCodePointRange( int min, int max ), which returns an Automaton matching the given (valid) code point range.
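For readers wondering what matching a code point range means at the UTF-16 level, here is a minimal JDK-only sketch of one possible decomposition. It is not the code from PR #35 and does not use brics; the class and method names (`CodePointRangeSketch`, `split`, `accepts`) are made up for illustration, and it only handles ranges whose endpoints are both supplementary code points. Each piece is one high surrogate followed by a range of low surrogates.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointRangeSketch {

    // Splits [min, max] (both supplementary code points, min <= max) into
    // pieces {high, lowMin, lowMax}: one high surrogate plus a low-surrogate range.
    static List<int[]> split(int min, int max) {
        if (!Character.isSupplementaryCodePoint(min)
                || !Character.isSupplementaryCodePoint(max)
                || min > max) {
            throw new IllegalArgumentException("expected a supplementary range with min <= max");
        }
        int highMin = Character.highSurrogate(min);
        int highMax = Character.highSurrogate(max);
        List<int[]> pieces = new ArrayList<>();
        for (int high = highMin; high <= highMax; high++) {
            // Only the first and last high surrogate have a restricted low range.
            int lowMin = (high == highMin) ? Character.lowSurrogate(min) : Character.MIN_LOW_SURROGATE;
            int lowMax = (high == highMax) ? Character.lowSurrogate(max) : Character.MAX_LOW_SURROGATE;
            pieces.add(new int[] {high, lowMin, lowMax});
        }
        return pieces;
    }

    // Checks whether the UTF-16 encoding of codePoint falls into one of the pieces.
    static boolean accepts(List<int[]> pieces, int codePoint) {
        char[] chars = Character.toChars(codePoint);
        if (chars.length != 2) {
            return false;
        }
        for (int[] p : pieces) {
            if (chars[0] == p[0] && chars[1] >= p[1] && chars[1] <= p[2]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<int[]> pieces = split(0x2000B, 0x21D45);
        System.out.println(pieces.size());             // 8 (high surrogates D840 to D847)
        System.out.println(accepts(pieces, 0x20B9F));  // true
        System.out.println(accepts(pieces, 0x2000A));  // false
    }
}
```

In automaton terms, each piece would correspond to a two-char sequence (a fixed char followed by a char range), and the whole range to the union of those sequences.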
Hi, thank you for providing a great regular expression library!
I have noticed that brics handles the input regex string as a sequence of java.lang.Character, and this can cause somewhat unintuitive behavior. For example, 𠀋 < 𠮟 < 𡵅 as Unicode scalar values (0x2000b, 0x20b9f, and 0x21d45 respectively; all of them are expressed with surrogate pairs), but the automaton created from [𠀋-𡵅] doesn't accept 𠮟.
Fixing this would require us to do the following:
- Treat the regex as a (not java.lang.Character-by-java.lang.Character but) Code Point stream. This also includes fixes for operator precedence, like 𠀋+.
- [...] java.lang.Characters, and if they involve surrogate pairs, do something similar to what we do for the numerical interval <n-m>.
Although won't-fix totally makes sense, it'd be great if we could find this fact in the documentation.
Thanks,
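To make the range example from this report concrete, the following JDK-only sketch prints the same regex string once as UTF-16 chars and once as code points. It does not call brics; the class and method names are made up for illustration. Read char by char, the text between the brackets is D840, then DC0B, a hyphen, D847, then DD45, so the hyphen sits between DC0B and D847, whose lower bound is greater than its upper bound.

```java
import java.util.stream.Collectors;

final class RegexUnitsDemo {

    // The regex from the report, built from code points to avoid source-encoding issues.
    static String sampleRegex() {
        return "[" + new String(Character.toChars(0x2000B))
                + "-" + new String(Character.toChars(0x21D45)) + "]";
    }

    // Hex dump of the string as individual UTF-16 chars.
    static String hexChars(String s) {
        return s.chars()
                .mapToObj(c -> String.format("%X", c))
                .collect(Collectors.joining(" "));
    }

    // Hex dump of the string as Unicode code points.
    static String hexCodePoints(String s) {
        return s.codePoints()
                .mapToObj(c -> String.format("%X", c))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        String regex = sampleRegex();
        System.out.println(hexChars(regex));      // 5B D840 DC0B 2D D847 DD45 5D
        System.out.println(hexCodePoints(regex)); // 5B 2000B 2D 21D45 5D
    }
}
```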