Unicode URL Parsing #546

Merged
merged 1 commit into from
Sep 10, 2024

andymandias
Collaborator

Updates the regex to use the Unicode letter and number character classes instead of ASCII letter and number ranges. Also adds the examples provided in issue #545 as tests.
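To illustrate why ASCII ranges reject names with umlauts, here is a minimal standard-library sketch. The helper names `ascii_run` and `unicode_run` are hypothetical, and `char::is_alphanumeric` only approximates the Unicode letter and number classes (`\p{L}` and `\p{N}`) that the regex uses.

```rust
/// Number of leading chars accepted by an ASCII-only class such as [a-zA-Z0-9].
/// Hypothetical helper, for illustration only.
fn ascii_run(s: &str) -> usize {
    s.chars().take_while(|c| c.is_ascii_alphanumeric()).count()
}

/// Number of leading chars accepted by a Unicode-aware check.
/// `char::is_alphanumeric` is an approximation of [\p{L}\p{N}], not an exact match.
fn unicode_run(s: &str) -> usize {
    s.chars().take_while(|c| c.is_alphanumeric()).count()
}

fn main() {
    let label = "müller";
    // The ASCII check stops at 'ü'; the Unicode-aware check accepts the whole label.
    println!("ascii: {}, unicode: {}", ascii_run(label), unicode_run(label));
}
```

With an ASCII-only class, matching stops at the first non-ASCII letter, which is how a URL containing an umlaut ends up truncated or rejected.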

Regex::new is replaced with RegexBuilder::new because Regex::new fails with the CompiledTooBig error: the Unicode letter and number character classes are much larger than their ASCII counterparts. The documentation warns that this error can be a sign of a slow regex. For now I bumped the size_limit, but an alternative would be to simplify the regex. For example, if we relax the domain-length quantifiers from {1,256} and {1,63} to + and +, the regex would again fit within the default size_limit, at the cost of less strict URL parsing (some non-URLs would be treated as URLs).

@andymandias andymandias linked an issue Sep 10, 2024 that may be closed by this pull request
Member

@tarkah tarkah left a comment

Thanks!

@tarkah tarkah merged commit 5accdc7 into main Sep 10, 2024
1 check passed
@tarkah tarkah deleted the fix/unicode-urls branch September 10, 2024 18:54
Development

Successfully merging this pull request may close these issues.

URL parser choking on Umlauts
3 participants