-
Notifications
You must be signed in to change notification settings - Fork 559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
utf8n_to_uvchr_msgs: Fix some minor bugs #22791
Open
khwilliamson
wants to merge
10
commits into
Perl:blead
Choose a base branch
from
khwilliamson:overflow
base: blead
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Previously this variable was initialized to 0 and then set to another value in each case in the switch. But most of the cases set it to the same value. Might as well initialize it to that value, remove the statements that merely repeat that, and leave in the cases that set it to something else.
This just moves these around so that they are roughly in increasing order of complexity, which makes it easier to grok. This doesn't change the order the cases are called in when there are multiple ones. The order is dictated by what the switch() does first, which isn't affected by this commit.
Instead of initializing this to 0 and then setting it in each case of the switch, initialize it to the most common value, remove those statements that set it to that, leaving the rest alone.
This condition used a combination of two bits, which makes things a little awkward, and isn't really needed
The outermost block here is executed when any of three types of problematic Unicode code points is encountered, and the caller has indicated special handling of at least one of those types. Before this commit, we set a flag to later look to see if what was encountered matched the type the caller specified. This commit changes to do that looking at the point where the flag had been set, and only sets the flag if necessary. This may completely avoid the later work, which has set-up overhead, and this will make future commits simpler.
This adds extensive comments and adjusts white space a bit.
This function is passed an address of where to write some values. Instead of updating the derefernced pointer each time, do it once after all the information is accumulated. This allows for the values to be updated without updating the pointed to variable, so makes some case in a switch() able to be more uniform.
This commit moves the final two cases of setting up the return to be the REPLACEMENT_CHARACTER to later in the code, where all such malformations are handled. This makes the handling uniform for a bunch of cases, which will enable a future commit to combine them.
This creates a #define of the string in a string array constant. The constant was to avoid having this common string appear in multiple places, but there is a benefit to having it in one place separately, which this commit also does. And that is, it makes a call to Perl_form() consistent with two other calls that the next commit will combine.
This is the most complicated of the problematic UTF-8 conditions. There are three types of these, each with a different warning message, and any combination of them can be set to be warned or not warned about or an error bit returned. More rigorous testing, yet to be committed, has indicated that there are bugs in the existing implementation, which this commit fixes. The three cases are partially combined, so that if one case finds it is not authorized to handle things, it drops to the next lower severity one.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This set of commits tidies up the handling of found malformations in a UTF-8 string being converted to a code point ordinal value. But most importantly the final commit fixes some bugs in which the the wrong bit in a bit map was being returned for malformations involving code points that are above the legal Unicode maximum.
These are normally accepted by Perl, except those that won't fit in an IV on the platform. But the caller of this function can specify special handling of all or a subset of these. Different types of special handling could apply to different subsets even in the same call. This was complicated and there were bugs in getting the right thing returned to the caller.
These bugs were found by more rigorous testing; a separate pull request will be issued for that.