4x Faster LUT via StringZilla #36
Conversation
StringZilla brings hardware-accelerated Look-Up Table transformations that can leverage AVX-512 VBMI instructions on recent Intel Ice Lake CPUs (installed in most DGX servers), as well as older Intel Haswell CPUs and newer Arm CPUs, such as AWS Graviton 4. Preliminary benchmarks on new x86 CPUs suggest up to 4x performance improvements over the OpenCV baselines.
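For readers unfamiliar with the operation being accelerated: a byte-level LUT transform maps every byte of an image through a 256-entry table. Python's built-in bytes.translate performs the same transform that StringZilla's sz.translate accelerates with SIMD, so this stdlib-only sketch illustrates the semantics (not the speed):

```python
# A 256-entry look-up table maps each possible byte value to a new one.
# Build a LUT that adds 10 to every pixel value, saturating at 255.
lut = bytes(min(i + 10, 255) for i in range(256))

pixels = bytes([0, 100, 250, 255])
shifted = pixels.translate(lut)  # same semantics as sz.translate(pixels, lut)
print(list(shifted))  # → [10, 110, 255, 255]
```

StringZilla's advantage is that the same table lookup can be vectorized (e.g. with AVX-512 VBMI permutes), processing 64 bytes per instruction instead of one.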
Reviewer's Guide by Sourcery
This pull request introduces StringZilla, a hardware-accelerated Look-Up Table (LUT) transformation library, to improve the performance of LUT operations. No diagrams were generated, as the changes look simple and do not need a visual representation.
File-Level Changes
Hey @ashvardanian - I've reviewed your changes - here's some feedback:
Overall Comments:
- Please provide benchmarks to validate the 4x performance improvement claim across different CPU architectures.
- Consider implementing a fallback mechanism for systems without the required hardware support to maintain broad compatibility.
- Add comments explaining the StringZilla implementation to improve code readability and maintainability.
Here's what I looked at during the review
- 🟡 General issues: 1 issue found
- 🟢 Security: all looks good
- 🟢 Testing: all looks good
- 🟡 Complexity: 1 issue found
- 🟢 Documentation: all looks good
def serialize_lookup_recover(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    # Encode the image into bytes, apply the look-up table, then decode back into a NumPy array
    img_bytes = img.tobytes()
    lut_bytes = lut.tobytes()
    # Python bytes are immutable, so keep the translated copy that sz.translate returns
    translated = sz.translate(img_bytes, lut_bytes)
    return np.frombuffer(translated, dtype=img.dtype).reshape(img.shape)
question (performance): Can you provide context for replacing cv2.LUT with serialize_lookup_recover?
This change seems significant. Could you share any performance benchmarks or explain the rationale behind this new approach? It would be helpful to understand the benefits over the previous cv2.LUT method.
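A micro-benchmark along the lines the reviewer requests could look like the sketch below. Since neither OpenCV nor StringZilla is assumed installed here, cv2.LUT is stood in for by NumPy fancy indexing and sz.translate by bytes.translate; both stand-ins compute the identical result, so the comparison shape carries over:

```python
import timeit

import numpy as np

# Hypothetical stand-ins: lut[img] approximates cv2.LUT, and
# bytes.translate approximates sz.translate (same byte-level semantics).
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(1024, 1024), dtype=np.uint8)
lut = np.arange(255, -1, -1, dtype=np.uint8)  # inversion table

# Sanity check: both paths must produce identical pixels.
via_indexing = lut[img]
via_translate = np.frombuffer(
    img.tobytes().translate(lut.tobytes()), dtype=np.uint8
).reshape(img.shape)
assert np.array_equal(via_indexing, via_translate)

t_index = timeit.timeit(lambda: lut[img], number=20)
t_bytes = timeit.timeit(lambda: img.tobytes().translate(lut.tobytes()), number=20)
print(f"fancy indexing: {t_index:.4f}s  bytes.translate: {t_bytes:.4f}s")
```

Absolute timings will vary by CPU; the point of such a harness is to make the claimed speedup reproducible across the architectures the reviewer lists.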
@@ -42,16 +45,27 @@ def create_lut_array(
         raise ValueError(f"Unsupported operation: {operation}")


-def apply_lut(img: np.ndarray, value: float | np.ndarray, operation: Literal["add", "multiply", "power"]) -> np.ndarray:
+def apply_lut(
issue (complexity): Consider refactoring the implementation to improve code organization and clarity.
While the new implementation using stringzilla may offer performance benefits, it does increase code complexity. Consider the following suggestions to balance performance and readability:
- Move serialize_lookup_recover outside of apply_lut:
def serialize_lookup_recover(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    img_bytes = img.tobytes()
    lut_bytes = lut.tobytes()
    # Python bytes are immutable, so use the translated copy that sz.translate returns
    translated = sz.translate(img_bytes, lut_bytes)
    return np.frombuffer(translated, dtype=img.dtype).reshape(img.shape)


def apply_lut(
    img: np.ndarray,
    value: float | np.ndarray,
    operation: Literal["add", "multiply", "power"],
) -> np.ndarray:
    dtype = img.dtype

    if isinstance(value, (int, float)):
        lut = create_lut_array(dtype, value, operation)
        return serialize_lookup_recover(img, clip(lut, dtype))

    num_channels = img.shape[-1]
    luts = create_lut_array(dtype, value, operation)
    return cv2.merge(
        [serialize_lookup_recover(img[:, :, i], clip(luts[i], dtype)) for i in range(num_channels)]
    )
- Add comments explaining the performance benefits:
def serialize_lookup_recover(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    # This function uses stringzilla for efficient byte-level LUT application,
    # which can be faster than cv2.LUT for large images or frequent calls.
    img_bytes = img.tobytes()
    lut_bytes = lut.tobytes()
    # bytes are immutable, so capture the translated copy returned by sz.translate
    translated = sz.translate(img_bytes, lut_bytes)
    return np.frombuffer(translated, dtype=img.dtype).reshape(img.shape)
- Consider adding a benchmark comparison between this method and cv2.LUT to justify the added complexity. If the performance gain is minimal, you might want to revert to the simpler cv2.LUT implementation.
- If you keep this implementation, add a note in the function docstring explaining why this approach was chosen over cv2.LUT.
These changes will help maintain the potential performance benefits while improving code readability and maintainability.
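To show that the suggested refactor hangs together end to end, here is a self-contained, runnable sketch of it. bytes.translate stands in for sz.translate, and create_lut_array/clip are simplified hypothetical versions covering only uint8 and the "add" operation (the real helpers in the repository support more dtypes and operations):

```python
import numpy as np


def create_lut_array(dtype, value, operation):
    # Simplified hypothetical helper: builds an unclipped 256-entry table.
    base = np.arange(256, dtype=np.float64)
    if operation == "add":
        return base + value
    raise ValueError(f"Unsupported operation: {operation}")


def clip(lut, dtype):
    # Simplified hypothetical helper: saturate to the uint8 range.
    return np.clip(lut, 0, 255).astype(dtype)


def serialize_lookup_recover(img, lut):
    # bytes.translate applies the 256-entry LUT at byte level and returns
    # a new bytes object (Python bytes are immutable).
    translated = img.tobytes().translate(lut.tobytes())
    return np.frombuffer(translated, dtype=img.dtype).reshape(img.shape)


def apply_lut(img, value, operation):
    lut = create_lut_array(img.dtype, value, operation)
    return serialize_lookup_recover(img, clip(lut, img.dtype))


img = np.array([[0, 100], [200, 255]], dtype=np.uint8)
print(apply_lut(img, 60, "add"))  # → [[ 60 160] [255 255]]
```

The factored-out serialize_lookup_recover is independently testable against a reference implementation such as lut[img], which is one way to back the benchmark and correctness requests above.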
Already added.
The results will differ depending on the CPU model. I generally recommend using r7iz and r8g AWS instances for profiling.
Summary by Sourcery
Implement a faster Look-Up Table transformation using StringZilla, which utilizes hardware acceleration on supported CPUs, replacing the existing OpenCV-based approach for significant performance gains.
New Features:
Enhancements: