Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Audio: DRC: Some HiFi4 and HiFi3 optimizations #9646

Merged

Conversation

singalsu
Copy link
Collaborator

@singalsu singalsu commented Nov 8, 2024

No description provided.

* p1p0.l holds p0(x)
* in3p1.h holds p1(x) * x^3
*/
acc = AE_MOVAD32_L(AE_ADD32_HL_LH(in3p1, in3p1));
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1st working version, this should be possible optimize further. If it works, it should apply to other used polynomial evaluation based functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great - lets takes this as step 1. The future optimizations can merge as other PRs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's now a better version, but I'm out of ideas how to improve this further. Also, the polynomial is too low order for four parallel multipliers version of Horner, there would be setup overhead and redundant zero coefficients.

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* p1p0.l holds p0(x)
* in3p1.h holds p1(x) * x^3
*/
acc = AE_MOVAD32_L(AE_ADD32_HL_LH(in3p1, in3p1));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great - lets takes this as step 1. The future optimizations can merge as other PRs.

@singalsu singalsu force-pushed the drc_optimize_detector_average_inverse branch from 8ccefd0 to 9a1a160 Compare November 11, 2024 17:17
int32_t precision_inv;
int32_t sqrt2_extracted = 0;
ae_f32 acc;
ae_f32 coef[2];
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__aligned(8) ae_f32 coef[2];

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu does this need aligned i.e. will it be used for SIMD ? I would assume that the SIMD intrinsics would force alignment in their definition.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need to add that since I declare it as array of 32-bit while I use it in hot code with 64 bit load. The compiler may figure that out but I think there is no guarantee. I didn't get align faults when I ran this but better to be safe.

@singalsu
Copy link
Collaborator Author

I added unit test patch for the changed function separately in #9649

@singalsu singalsu marked this pull request as ready for review November 13, 2024 14:17
@singalsu singalsu requested a review from a team as a code owner November 13, 2024 14:17
Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just one open.

int32_t precision_inv;
int32_t sqrt2_extracted = 0;
ae_f32 acc;
ae_f32 coef[2];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@singalsu does this need aligned i.e. will it be used for SIMD ? I would assume that the SIMD intrinsics would force alignment in their definition.

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very helpful comments. A few minor things, but no showstoppers. I'd add the alignment attribute.

src/audio/drc/drc_math_hifi3.c Show resolved Hide resolved
Use 64 bit SIMD for load/store and maximum absolute
values search. This saves about 0.1 MCPS in MTL simulation
in sof-testbench4.

Signed-off-by: Seppo Ingalsuo <[email protected]>
@singalsu singalsu force-pushed the drc_optimize_detector_average_inverse branch from 9a1a160 to 64a8433 Compare November 15, 2024 14:19
This change improves calculation speed. The implementation
is bit exact with previous but implemented with 2-way SIMD
multiply. The Horner polynomial evaluation is changed to
parallel Horner version for two multipliers. The input and
output shifting code is also simplified.

The code changes save 1.0 MCPS in sof-testbench4 simulated
MTL platform.


Signed-off-by: Seppo Ingalsuo <[email protected]>
@singalsu singalsu force-pushed the drc_optimize_detector_average_inverse branch from 64a8433 to c46e4e3 Compare November 15, 2024 14:39
@lgirdwood lgirdwood merged commit 5c0bee8 into thesofproject:main Nov 22, 2024
44 of 47 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants