Unicode URL Parsing #546

andymandias · 2024-09-10T02:50:00Z

Updates the regex to use the Unicode letter and number character classes instead of ASCII letter and number ranges. Also adds the examples provided in issue #545 as tests.

The switch from Regex::new to RegexBuilder::new is needed because Regex::new fails with a CompiledTooBig error: the Unicode letter and number character classes are much larger than their ASCII counterparts. The documentation warns that this error can be a sign of a slow regex. For now I bumped the size_limit, but an alternative would be to simplify the regex. For example, if we relaxed the domain-parsing quantifiers from {1,256} and {1,63} to + and +, the regex would again fit within the default size_limit. The cost is less strict URL parsing: some non-URLs would be treated as URLs.

tarkah

Thanks!

Convert ASCII letter/number regex to Unicode letter/number regex.

2591e44

andymandias linked an issue Sep 10, 2024 that may be closed by this pull request

URL parser choking on Umlauts #545

Closed

casperstorm approved these changes Sep 10, 2024

tarkah approved these changes Sep 10, 2024

tarkah merged commit 5accdc7 into main Sep 10, 2024
1 check passed

tarkah deleted the fix/unicode-urls branch September 10, 2024 18:54
