OP_SUBSTR_CHOP - a specialised OP_SUBSTR variant #22785

Open: wants to merge 1 commit into base: blead

richardleach
Contributor

This commit adds OP_SUBSTR_NIBBLE and associated machinery for fast handling of the constructions:

    substr EXPR,0,LENGTH,''

and

    substr EXPR,0,LENGTH

Here EXPR is a scalar lexical, OFFSET is zero, and REPLACEMENT is either absent or the empty string. LENGTH can be anything that OP_SUBSTR supports. These constraints allow for a heavily stripped-back and optimised version of pp_substr.
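
As a rough illustration of those constraints, the sketch below lists call shapes that fit the description and some that do not. The variable names are made up for the example, and whether a particular perl build actually routes each line through the new op is an assumption based on the description above, not something checked here. The values produced are the same either way.

```perl
use strict;
use warnings;

my $buf = "HEADERpayload";
my $len = 6;

# Shapes that fit the description: lexical scalar, OFFSET 0,
# and either no REPLACEMENT or an empty-string REPLACEMENT.
my $peek = substr($buf, 0, $len);        # non-destructive: $buf is unchanged
my $head = substr($buf, 0, $len, '');    # destructive: the prefix is removed from $buf

# Shapes outside the description, which would presumably still be
# handled by the ordinary substr op:
my $mid  = substr($buf, 1, 3);           # non-zero OFFSET
our $pkg = "HEADERpayload";
my $glob = substr($pkg, 0, $len);        # package variable rather than a lexical

print "$peek $head $buf $mid $glob\n";
```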

The primary motivation was situations where a scalar containing network packets or some other binary data structure is being parsed piecemeal. Nibbling away at the scalar is useful when you don't know exactly how it will be parsed and unpacked until you get started. It also means that you don't need to worry about correctly updating a separate offset variable.
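
A minimal sketch of that parsing style follows. The record layout (one type byte, a two-byte big-endian length, then the payload) is invented for illustration and is not taken from this PR.

```perl
use strict;
use warnings;

# Hypothetical record layout (an assumption for this example):
#   1 byte type, 2 byte big-endian payload length, then the payload.
my $stream = pack('C n a*', 1, 5, 'hello') . pack('C n a*', 2, 3, 'abc');

my @records;
while (length $stream) {
    # Take the fixed-size header off the front of the buffer.
    my ($type, $size) = unpack('C n', substr($stream, 0, 3, ''));
    # The header says how much to take next; no offset variable to maintain.
    my $payload = substr($stream, 0, $size, '');
    push @records, [ $type, $payload ];
}

printf "type=%d payload=%s\n", @$_ for @records;
```

Each `substr($stream, 0, N, '')` call both returns the next field and shortens the buffer, so the amount to read next can be decided as parsing goes along.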

This operator also turns out to be an efficient way to (destructively) break an expression up into fixed-size chunks. For example, given:

    my $x = ''; my $str = "A"x100_000_000;

This code:

    $x = substr($str, 0, 5, "") while ($str);

is twice as fast as doing:

    for (my $pos = 0; $pos < length($str); $pos += 5) {
        $x = substr($str, $pos, 5);
    }

Compared with blead, `$y = substr($x, 0, 5)` runs 40% faster and `$y = substr($x, 0, 5, '')` runs 45% faster.
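
For anyone wanting to try the chunking comparison locally, here is one possible harness using the core Benchmark module. It is a sketch, not the benchmark used for the figures above: the sub names and the smaller input size are assumptions, each timed sub also pays for rebuilding the string, and results will vary by machine and perl build.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $size = 1_000_000;   # much smaller than the 100_000_000 used above

# Destructive chunking: repeatedly take 5 characters off the front.
sub chunk_from_front {
    my $str = "A" x $size;
    my ($x, $n) = ('', 0);
    # length() rather than a plain truth test, so a leftover "0" is not skipped
    while (length $str) {
        $x = substr($str, 0, 5, '');
        $n++;
    }
    return $n;
}

# Non-destructive chunking with a separate offset variable.
sub chunk_by_offset {
    my $str = "A" x $size;
    my ($x, $n) = ('', 0);
    for (my $pos = 0; $pos < length($str); $pos += 5) {
        $x = substr($str, $pos, 5);
        $n++;
    }
    return $n;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    from_front => \&chunk_from_front,
    by_offset  => \&chunk_by_offset,
});
```

The `while (length $str)` condition differs from the `while ($str)` form shown earlier only when the remaining string is empty-looking but not empty, such as a trailing "0", which Perl treats as false.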


  • This set of changes requires a perldelta entry, and I will add one shortly.

Contributor

@leonerd leonerd left a comment


A couple of small comments but overall nothing troubling-looking here.

I wonder a bit about the name though. I've usually seen the word "nibble" to mean a half-byte; i.e. a 4-bit value. I wondered if that is what is going on here at first. If there are other candidate names to call it, perhaps something else would be better? Not a huge problem though.

@richardleach
Contributor Author

How about substr_chop, since a big part of its design is being a fast route for getting to Perl_sv_chop?

Food related alternatives: substr_peck? substr_graze? substr_smidgen? substr_julienne? substr_shuck? substr_heel, like the end of a loaf? Non-foodie: substr_shave?

@richardleach
Contributor Author

Not sure what's going on with the ABRT test failures. Don't get them locally. ./perl -Ilib t/perf/benchmarks.t seemingly only uses 8MB of memory, so running out of memory doesn't seem to be the cause.

@richardleach
Contributor Author

> Not sure what's going on with the ABRT test failures.

Looks like an op_private flags assertion. I'll dig into it soon.

@richardleach richardleach changed the title from "OP_SUBSTR_NIBBLE - a specialised OP_SUBSTR variant" to "OP_SUBSTR_CHOP - a specialised OP_SUBSTR variant" on Nov 28, 2024
@richardleach
Contributor Author

I'm rebasing and renaming it to SUBSTR_CHOP.

@leonerd
Contributor

leonerd commented Nov 28, 2024

Doesn't perl's chop() function eat from the other end though?

@richardleach
Contributor Author

The Perl chop takes from the end, but Perl_sv_chop takes from the front. (I don't know who we have to thank for that amazing piece of naming.) The pp_ function for this op calls Perl_sv_chop.
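
The difference in direction can be seen from Perl code. This is only a language-level sketch with made-up variable names; it does not show which internal function any given perl build calls.

```perl
use strict;
use warnings;

my $s = "abcdef";
my $last = chop $s;                  # Perl-level chop: removes the LAST character
# $last is "f", $s is now "abcde"

my $t = "abcdef";
my $first = substr($t, 0, 1, '');    # removes from the FRONT, the end Perl_sv_chop is described as working on
# $first is "a", $t is now "bcdef"

print "$last $s | $first $t\n";
```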

@jkeenan
Contributor

jkeenan commented Nov 29, 2024

@richardleach, merge conflicts ^^

This commit adds OP_SUBSTR_CHOP and associated machinery for fast
handling of the constructions:

        substr EXPR,0,LENGTH,''
and
        substr EXPR,0,LENGTH

Where EXPR is a scalar lexical, the OFFSET is zero, and either there
is no REPLACEMENT or it is the empty string. LENGTH can be anything
that OP_SUBSTR supports. These constraints allow for a very stripped
back and optimised version of pp_substr.

The primary motivation was for situations where a scalar, containing
some network packets or other binary data structure, is being parsed
piecemeal. Nibbling away at the scalar can be useful when you don't
know how exactly it will be parsed and unpacked until you get started.
It also means that you don't need to worry about correctly updating
a separate offset variable.

This operator also turns out to be an efficient way to (destructively)
break an expression up into fixed size chunks. For example, given:

    my $x = ''; my $str = "A"x100_000_000;

This code:

    $x = substr($str, 0, 5, "") while ($str);

is twice as fast as doing:

    for (my $pos = 0; $pos < length($str); $pos += 5) {
        $x = substr($str, $pos, 5);
    }

Compared with blead, `$y = substr($x, 0, 5)` runs 40% faster and
`$y = substr($x, 0, 5, '')` runs 45% faster.

Note that this is "chop" in the sense of Perl_sv_chop, which it
efficiently calls, not the Perl language's "chop" function.
@leonerd
Contributor

leonerd commented Nov 29, 2024

> The Perl chop takes from the end, but Perl_sv_chop takes from the front. (I don't know who we have to thank for that amazing piece of naming.) The pp_ function for this op calls Perl_sv_chop.

Oh wow. Huh. In that case, might as well call this one SUBSTR_CHOP indeed then.

Otherwise my thoughts were going to be something like SUBSTR_PREFIX but that isn't much more descriptive.

@Grinnz
Contributor

Grinnz commented Nov 30, 2024

Consider ltrim, with inspiration from PHP and Redis (or lstrip a la Ruby/Python but that sounds more whitespace-specific). Though it is also unrelated to builtin::trim, I think it's a bit more descriptive at least

@richardleach
Contributor Author

> Consider ltrim, with inspiration from PHP and Redis (or lstrip a la Ruby/Python but that sounds more whitespace-specific). Though it is also unrelated to builtin::trim, I think it's a bit more descriptive at least

Hmmm, I'm not sure about this. It seems more descriptive only to someone who is already familiar with ltrim; otherwise it's likely to lead to confusion with builtin::trim, or even with trimming the other end of the string. There might be some confusion around _CHOP, but at least the connection to sv_chop is there.
