utf8n_to_uvchr_msgs: Fix some minor bugs #22791

khwilliamson · 2024-11-27T19:29:59Z

This set of commits tidies up the handling of found malformations in a UTF-8 string being converted to a code point ordinal value. But most importantly the final commit fixes some bugs in which the the wrong bit in a bit map was being returned for malformations involving code points that are above the legal Unicode maximum.

These are normally accepted by Perl, except those that won't fit in an IV on the platform. But the caller of this function can specify special handling of all or a subset of these. Different types of special handling could apply to different subsets even in the same call. This was complicated and there were bugs in getting the right thing returned to the caller.

These bugs were found by more rigorous testing; a separate pull request will be issued for that.

This set of changes does not require a perldelta entry.

Previously this variable was initialized to 0 and then set to another value in each case in the switch. But most of the cases set it to the same value. Might as well initialize it to that value, remove the statements that merely repeat that, and leave in the cases that set it to something else.

This just moves these around so that they are roughly in increasing order of complexity, which makes it easier to grok. This doesn't change the order the cases are called in when there are multiple ones. The order is dictated by what the switch() does first, which isn't affected by this commit.

Instead of initializing this to 0 and then setting it in each case of the switch, initialize it to the most common value, remove those statements that set it to that, leaving the rest alone.

This condition used a combination of two bits, which makes things a little awkward, and isn't really needed

The outermost block here is executed when any of three types of problematic Unicode code points is encountered, and the caller has indicated special handling of at least one of those types. Before this commit, we set a flag to later look to see if what was encountered matched the type the caller specified. This commit changes to do that looking at the point where the flag had been set, and only sets the flag if necessary. This may completely avoid the later work, which has set-up overhead, and this will make future commits simpler.

This adds extensive comments and adjusts white space a bit.

This function is passed an address of where to write some values. Instead of updating the derefernced pointer each time, do it once after all the information is accumulated. This allows for the values to be updated without updating the pointed to variable, so makes some case in a switch() able to be more uniform.

This commit moves the final two cases of setting up the return to be the REPLACEMENT_CHARACTER to later in the code, where all such malformations are handled. This makes the handling uniform for a bunch of cases, which will enable a future commit to combine them.

This creates a #define of the string in a string array constant. The constant was to avoid having this common string appear in multiple places, but there is a benefit to having it in one place separately, which this commit also does. And that is, it makes a call to Perl_form() consistent with two other calls that the next commit will combine.

This is the most complicated of the problematic UTF-8 conditions. There are three types of these, each with a different warning message, and any combination of them can be set to be warned or not warned about or an error bit returned. More rigorous testing, yet to be committed, has indicated that there are bugs in the existing implementation, which this commit fixes. The three cases are partially combined, so that if one case finds it is not authorized to handle things, it drops to the next lower severity one.

khwilliamson added 10 commits November 27, 2024 12:19

utf8n_to_uvchr_msgs: Hoist common paradigm out of switch()

83754c2

Instead of initializing this to 0 and then setting it in each case of the switch, initialize it to the most common value, remove those statements that set it to that, leaving the rest alone.

utf8n_to_uvchr_msgs: Use a single bit flag

69406cd

This condition used a combination of two bits, which makes things a little awkward, and isn't really needed

utf8n_to_uvchr_msgs: Comments, white-space only

eac4507

This adds extensive comments and adjusts white space a bit.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

utf8n_to_uvchr_msgs: Fix some minor bugs #22791

utf8n_to_uvchr_msgs: Fix some minor bugs #22791

khwilliamson commented Nov 27, 2024

utf8n_to_uvchr_msgs: Fix some minor bugs #22791

Are you sure you want to change the base?

utf8n_to_uvchr_msgs: Fix some minor bugs #22791

Conversation

khwilliamson commented Nov 27, 2024