4x Faster LUT via StringZilla #36

Closed

ashvardanian commented Oct 13, 2024

StringZilla brings hardware-accelerated Look-Up Table (LUT) transformations. They can leverage AVX-512 VBMI instructions on recent Intel Ice Lake CPUs (installed in most DGX servers), and are also accelerated on older Intel Haswell and newer Arm CPUs such as AWS Graviton 4.

Preliminary benchmarks on new x86 CPUs suggest up to 4x performance improvements compared to the OpenCV baselines. The results will differ depending on the CPU model. I generally recommend using r7iz and r8g AWS instances for profiling.
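
For readers unfamiliar with the operation, a byte-level LUT transform replaces every byte b of the input with table[b], where the table has 256 entries. The sketch below illustrates that idea using only the Python standard library's bytes.translate as a stand-in; it is not the StringZilla or OpenCV code path, and the helper names make_add_lut and apply_lut_bytes are hypothetical.

```python
def make_add_lut(value: int) -> bytes:
    """Build a 256-entry table mapping each uint8 v to clip(v + value, 0, 255)."""
    return bytes(min(255, max(0, v + value)) for v in range(256))


def apply_lut_bytes(pixels: bytes, lut: bytes) -> bytes:
    """Replace every byte b in `pixels` with lut[b]; returns a new bytes object."""
    if len(lut) != 256:
        raise ValueError("a byte-level LUT needs exactly 256 entries")
    return pixels.translate(lut)


pixels = bytes([0, 10, 200, 250, 255])  # five uint8 "pixels"
brighter = apply_lut_bytes(pixels, make_add_lut(20))
print(list(brighter))  # [20, 30, 220, 255, 255]
```

Because the whole mapping is precomputed, the per-pixel work is a single table lookup, which is the part a SIMD implementation can process many bytes at a time.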

Summary by Sourcery

Implement a faster Look-Up Table transformation using StringZilla, which utilizes hardware acceleration on supported CPUs, replacing the existing OpenCV-based approach for significant performance gains.

New Features:

  • Introduce hardware-accelerated Look-Up Table transformations using StringZilla, leveraging AVX-512 VBMI instructions on compatible CPUs.

Enhancements:

  • Replace OpenCV LUT operations with a custom serialization-based approach for improved performance.

sourcery-ai bot commented Oct 13, 2024

Reviewer's Guide by Sourcery

This pull request introduces StringZilla, a hardware-accelerated Look-Up Table (LUT) transformation library, to improve the performance of LUT operations. The changes primarily affect the apply_lut function in albucore/functions.py, replacing the OpenCV-based LUT application with a new StringZilla-based implementation. This change aims to leverage AVX-512 VBMI instructions on recent CPUs, potentially offering up to 4x performance improvements compared to the OpenCV baselines.

No diagrams generated as the changes look simple and do not need a visual representation.

File-Level Changes

Change: Replaced OpenCV LUT application with StringZilla-based implementation
Details:
  • Introduced a new serialize_lookup_recover function that uses StringZilla for LUT operations
  • Modified the apply_lut function to use serialize_lookup_recover instead of cv2.LUT
  • Updated the multi-channel LUT application to use the new StringZilla-based method
Files: albucore/functions.py

Change: Improved code formatting and readability
Details:
  • Split long function signatures into multiple lines for better readability
  • Added type hints to improve code clarity
Files: albucore/functions.py

sourcery-ai bot left a comment

Hey @ashvardanian - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Please provide benchmarks to validate the 4x performance improvement claim across different CPU architectures.
  • Consider implementing a fallback mechanism for systems without the required hardware support to maintain broad compatibility.
  • Add comments explaining the StringZilla implementation to improve code readability and maintainability.
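
One way the fallback suggestion could look is sketched below, assuming the concern is that the optional package may be unavailable. It only guards against StringZilla not being installed; whether StringZilla itself selects a slower path on CPUs without AVX-512 is not covered in this thread. The names translate_portable and translate_bytes are hypothetical, and the stdlib bytes.translate is used as the portable path instead of cv2.LUT to keep the sketch dependency-free.

```python
import importlib.util

# Detect the optional dependency without importing it eagerly.
HAS_STRINGZILLA = importlib.util.find_spec("stringzilla") is not None


def translate_portable(data: bytes, table: bytes) -> bytes:
    """Pure-stdlib path: works on any CPU, no SIMD requirements."""
    return data.translate(table)


def translate_bytes(data: bytes, table: bytes, allow_accelerated: bool = True) -> bytes:
    """Use StringZilla when it is installed and allowed, else the stdlib path."""
    if allow_accelerated and HAS_STRINGZILLA:
        import stringzilla as sz

        # Assumption: this call returns the translated buffer, mirroring the
        # PR's sz.translate(img_bytes, lut_bytes) usage; verify before relying on it.
        return bytes(sz.translate(data, table))
    return translate_portable(data, table)


invert = bytes(255 - v for v in range(256))
print(list(translate_bytes(b"\x00\x01\xff", invert, allow_accelerated=False)))
# [255, 254, 0]
```
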
Here's what I looked at during the review
  • 🟡 General issues: 1 issue found
  • 🟢 Security: all looks good
  • 🟢 Testing: all looks good
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good

Comment on lines +55 to +60
def serialize_lookup_recover(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    # Encode image into bytes, perform the lookups and then decode the bytes back to numpy array
    img_bytes = img.tobytes()
    lut_bytes = lut.tobytes()
    sz.translate(img_bytes, lut_bytes)
    return np.frombuffer(img_bytes, dtype=img.dtype).reshape(img.shape)

question (performance): Can you provide context for replacing cv2.LUT with serialize_lookup_recover?

This change seems significant. Could you share any performance benchmarks or explain the rationale behind this new approach? It would be helpful to understand the benefits over the previous cv2.LUT method.
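
A benchmark answering this question could follow the shape sketched below: verify that every candidate produces identical output, then time each one on the same buffer. This is a stdlib-only harness under stated assumptions; the two candidates shown (a per-byte Python loop and bytes.translate) are placeholders, and wrappers around cv2.LUT and sz.translate would need to be added as further candidates to reproduce the comparison this PR claims. The function names and the synthetic 16 KiB buffer are hypothetical, and no timings are asserted here.

```python
import timeit


def lut_loop(data: bytes, table: bytes) -> bytes:
    """Baseline: one Python-level table lookup per byte."""
    return bytes(table[b] for b in data)


def lut_translate(data: bytes, table: bytes) -> bytes:
    """Candidate: a single bulk translate call."""
    return data.translate(table)


def compare(candidates, data: bytes, table: bytes, repeat: int = 5, number: int = 3) -> dict:
    """Check that all candidates agree, then return best seconds-per-call for each."""
    expected = candidates[0](data, table)
    timings = {}
    for fn in candidates:
        assert fn(data, table) == expected, f"{fn.__name__} disagrees with baseline"
        runs = timeit.repeat(lambda: fn(data, table), repeat=repeat, number=number)
        timings[fn.__name__] = min(runs) / number
    return timings


table = bytes(255 - v for v in range(256))  # inverting LUT
data = bytes(range(256)) * 64               # 16 KiB synthetic "image"
results = compare([lut_loop, lut_translate], data, table)
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} us per call")
```

Checking equality before timing matters here: a fast candidate that silently returns the untransformed input would otherwise look like a win.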

@@ -42,16 +45,27 @@ def create_lut_array(
     raise ValueError(f"Unsupported operation: {operation}")


-def apply_lut(img: np.ndarray, value: float | np.ndarray, operation: Literal["add", "multiply", "power"]) -> np.ndarray:
+def apply_lut(

issue (complexity): Consider refactoring the implementation to improve code organization and clarity.

While the new implementation using stringzilla may offer performance benefits, it does increase code complexity. Consider the following suggestions to balance performance and readability:

  1. Move serialize_lookup_recover outside of apply_lut:
def serialize_lookup_recover(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    img_bytes = img.tobytes()
    lut_bytes = lut.tobytes()
    sz.translate(img_bytes, lut_bytes)
    return np.frombuffer(img_bytes, dtype=img.dtype).reshape(img.shape)

def apply_lut(
    img: np.ndarray,
    value: float | np.ndarray,
    operation: Literal["add", "multiply", "power"],
) -> np.ndarray:
    dtype = img.dtype
    if isinstance(value, (int, float)):
        lut = create_lut_array(dtype, value, operation)
        return serialize_lookup_recover(img, clip(lut, dtype))
    num_channels = img.shape[-1]
    luts = create_lut_array(dtype, value, operation)
    return cv2.merge([serialize_lookup_recover(img[:, :, i], clip(luts[i], dtype)) for i in range(num_channels)])
  2. Add comments explaining the performance benefits:
def serialize_lookup_recover(img: np.ndarray, lut: np.ndarray) -> np.ndarray:
    # This function uses stringzilla for efficient byte-level LUT application,
    # which can be faster than cv2.LUT for large images or frequent calls.
    img_bytes = img.tobytes()
    lut_bytes = lut.tobytes()
    sz.translate(img_bytes, lut_bytes)
    return np.frombuffer(img_bytes, dtype=img.dtype).reshape(img.shape)
  3. Consider adding a benchmark comparison between this method and cv2.LUT to justify the added complexity. If the performance gain is minimal, you might want to revert to the simpler cv2.LUT implementation.

  4. If you keep this implementation, add a note in the function docstring explaining why this approach was chosen over cv2.LUT.

These changes will help maintain the potential performance benefits while improving code readability and maintainability.
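
The per-channel branch of the suggested apply_lut can be illustrated without NumPy or OpenCV. The sketch below applies one table per channel to an interleaved byte buffer, using bytes.translate as a stand-in for the StringZilla call; apply_per_channel is a hypothetical name. It also stores the translated result explicitly. In the quoted snippets the return value of the translate call is not assigned, and Python bytes objects are immutable, so if that call returns a new buffer rather than mutating its input (an assumption about StringZilla's default worth checking), the lookup would have no effect.

```python
def apply_per_channel(pixels: bytes, luts: list[bytes]) -> bytes:
    """Apply luts[c] to channel c of an interleaved buffer (e.g. RGBRGB...)."""
    channels = len(luts)
    if len(pixels) % channels:
        raise ValueError("buffer length must be a multiple of the channel count")
    out = bytearray(len(pixels))
    for c, lut in enumerate(luts):
        # The strided slice picks channel c; translate returns a NEW bytes
        # object, so the result has to be stored rather than discarded.
        out[c::channels] = pixels[c::channels].translate(lut)
    return bytes(out)


identity = bytes(range(256))
invert = bytes(255 - v for v in range(256))
plus_one = bytes(min(255, v + 1) for v in range(256))

rgb = bytes([10, 20, 30, 40, 50, 255])  # two RGB pixels
print(list(apply_per_channel(rgb, [identity, invert, plus_one])))
# [10, 235, 31, 40, 205, 255]
```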

ternaus commented Oct 28, 2024

Already added

ternaus closed this Oct 28, 2024