Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

utf8n_to_uvchr_msgs: Fix some minor bugs #22791

Open
wants to merge 10 commits into
base: blead
Choose a base branch
from

Conversation

khwilliamson
Copy link
Contributor

This set of commits tidies up the handling of found malformations in a UTF-8 string being converted to a code point ordinal value. But most importantly the final commit fixes some bugs in which the the wrong bit in a bit map was being returned for malformations involving code points that are above the legal Unicode maximum.

These are normally accepted by Perl, except those that won't fit in an IV on the platform. But the caller of this function can specify special handling of all or a subset of these. Different types of special handling could apply to different subsets even in the same call. This was complicated and there were bugs in getting the right thing returned to the caller.

These bugs were found by more rigorous testing; a separate pull request will be issued for that.

  • This set of changes does not require a perldelta entry.

Previously this variable was initialized to 0 and then set to another
value in each case in the switch.  But most of the cases set it to the
same value.  Might as well initialize it to that value, remove the
statements that merely repeat that, and leave in the cases that set it
to something else.
This just moves these around so that they are roughly in increasing
order of complexity, which makes it easier to grok.

This doesn't change the order the cases are called in when there are
multiple ones.  The order is dictated by what the switch() does first,
which isn't affected by this commit.
Instead of initializing this to 0 and then setting it in each case of
the switch, initialize it to the most common value, remove those
statements that set it to that, leaving the rest alone.
This condition used a combination of two bits, which makes things a
little awkward, and isn't really needed
The outermost block here is executed when any of three types of
problematic Unicode code points is encountered, and the caller has
indicated special handling of at least one of those types.  Before this
commit, we set a flag to later look to see if what was encountered
matched the type the caller specified.  This commit changes to do that
looking at the point where the flag had been set, and only sets the flag
if necessary.  This may completely avoid the later work, which has
set-up overhead, and this will make future commits simpler.
This adds extensive comments and adjusts white space a bit.
This function is passed an address of where to write some values.
Instead of updating the derefernced pointer each time, do it once after
all the information is accumulated.

This allows for the values to be updated without updating the pointed to
variable, so makes some case in a switch() able to be more uniform.
This commit moves the final two cases of setting up the return to be the
REPLACEMENT_CHARACTER to later in the code, where all such malformations
are handled.

This makes the handling uniform for a bunch of cases, which will enable
a future commit to combine them.
This creates a #define of the string in a string array constant.  The
constant was to avoid having this common string appear in multiple
places, but there is a benefit to having it in one place separately,
which this commit also does.  And that is, it makes a call to
Perl_form() consistent with two other calls that the next commit will
combine.
This is the most complicated of the problematic UTF-8 conditions.  There
are three types of these, each with a different warning message, and any
combination of them can be set to be warned or not warned about or an
error bit returned.  More rigorous testing, yet to be committed,  has
indicated that there are bugs in the existing implementation, which this
commit fixes.

The three cases are partially combined, so that if one case finds it is
not authorized to handle things, it drops to the next lower severity
one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant