Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor for wgpu v23 compatibility #2435

Draft
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

AsherJingkongChen
Copy link
Contributor

@AsherJingkongChen AsherJingkongChen commented Oct 29, 2024

Checklist

  • Confirmed that run-checks all script has been executed.
  • Made sure the book is up to date with changes in this PR.
    • After this PR merged, contributor-book/src/guides/adding-a-new-operation-to-burn.md needs to be updated.
  • Check dependencies.

Related Issues/PRs

Changes

  • Update cubecl
  • Update wgpu
  • Update WgpuDevice::id() in burn-tensor

Testing

I run ./run-checks.sh all, but too many errors occurred on macOS (Metal) 🥺. Fixing the bugs ...

Copy link

codecov bot commented Oct 29, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 82.91%. Comparing base (6e71aaf) to head (975a54a).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/burn-tensor/src/lib.rs 0.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2435   +/-   ##
=======================================
  Coverage   82.91%   82.91%           
=======================================
  Files         810      810           
  Lines      104904   104903    -1     
=======================================
  Hits        86984    86984           
+ Misses      17920    17919    -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

* The comments in `WgpuDevice::id()` are removed since `id` in `WgpuDevice::Existing` is always `u32`
@AsherJingkongChen
Copy link
Contributor Author

AsherJingkongChen commented Oct 29, 2024

Discovery

The least recent version of wgpu that breaks burn-wgpu tests on Metal is found, see gfx-rs/wgpu@9c781d6.

@AsherJingkongChen
Copy link
Contributor Author

AsherJingkongChen commented Oct 31, 2024

Discovery

I started to do some experiments on macOS.

I noticed that the test cargo t -p burn-wgpu jit::module_avg_pool2d::tests::test_avg_pool2d_simple failed when the kernel pool2d_direct was in Checked mode, but the test passed in Unchecked mode.

In conclusion, the reason of these 200+ errors on macOS tests should be some out-of-bound writes occur in Checked mode.
Since wgpu-hal/metal used BoundsCheckPolicy::Restrict in v23.0.0, these writes cause undesired outputs.
You can see the source code here.

What are the possible solutions?

  1. Changing the execution mode of kernels to Unchecked with manual bound checks added.

    #[cube(launch_unchecked)]
    fn select_kernel<T: Numeric, I: Numeric>(
        input: &Tensor<T>,
        indices: &Tensor<I>,
        output: &mut Tensor<T>,
        dim: u32,
    ) {
        if ABSOLUTE_POS >= output.len() {
            return;
        }
        // ...
    }
  2. Adding manual bound checks only.

    #[cube(launch)]
    pub(crate) fn kernel_scalar_binop<C: Numeric, O: BinaryOp<C>>(
        input: &Tensor<Line<C>>,
        scalar: C,
        output: &mut Tensor<Line<C>>,
    ) {
        let offset_output = ABSOLUTE_POS;
    
        if offset_output >= output.len() {
            return;
        }
        // ...
    }

@AsherJingkongChen
Copy link
Contributor Author

AsherJingkongChen commented Nov 12, 2024

Status Update

Currently, I am busy working at my personal projects. I would run CI every one or two weeks.

It seems that there are more errors occurred on macOS. I would resolve them after I finish my own works.

ArthurBrussee added a commit to ArthurBrussee/brush that referenced this pull request Nov 14, 2024
SSIM on wasm causes bad glitches. I suspect it's related to tracel-ai/burn#2435, but hard to say.
ArthurBrussee added a commit to ArthurBrussee/brush that referenced this pull request Nov 14, 2024
SSIM on wasm causes bad glitches. I suspect it's related to tracel-ai/burn#2435, but hard to say.
@AsherJingkongChen
Copy link
Contributor Author

Status Update

After merging the latest cubecl into my branch,
I have executed ./run-checks.sh all on my macOS (Metal 3+, Vulkan 1.2+), it looks no error here.

[2024-11-25T19:59:54Z INFO  xtask] Time elapsed for the current execution: 00:08:29

After running CI on tracel/burn, I see there are 8 errors in linux-std-tests. Nevertheless, this change passing macos-std-tests is a good news.

Thanks to the contributions of @ArthurBrussee.

@AsherJingkongChen
Copy link
Contributor Author

AsherJingkongChen commented Nov 25, 2024

Miscellaneous

Continued with the previous post, I paste the error logs on CI linux-std-tests here:


  failures:
  
  ---- tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_batched stdout ----
  Testing bias
  Testing input
  Testing offset
  thread 'tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_batched' panicked at crates/burn-wgpu/src/lib.rs:108:5:
  Tensors are not approx eq:
    => Position 72: 270 != 0 | difference 270 > tolerance 0.2700000000000001
    => Position 73: 277.5 != 0 | difference 277.5 > tolerance 0.2775000000000001
    => Position 74: 285 != 0 | difference 285 > tolerance 0.2850000000000001
    => Position 75: 292.5 != 0 | difference 292.5 > tolerance 0.2925000000000001
    => Position 76: 67.5 != 0 | difference 67.5 > tolerance 0.06750000000000002
  67 more errors...
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  
  ---- tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_kernel_size stdout ----
  Testing bias
  Testing input
  Testing offset
  thread 'tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_kernel_size' panicked at crates/burn-wgpu/src/lib.rs:108:5:
  Tensors are not approx eq:
    => Position 1: 5.355899810791016 != 0.007523147854954004 | difference 5.348376662936062 > tolerance 0.005355899810791017
    => Position 2: 11.715299606323242 != 0.34953704476356506 | difference 11.365762561559677 > tolerance 0.011715299606323245
    => Position 3: 0.3125 != 0.10416662693023682 | difference 0.20833337306976318 > tolerance 0.0010000000000000002
    => Position 4: 8 != 2.6666667461395264 | difference 5.333333253860474 > tolerance 0.008000000000000002
    => Position 5: 10 != 3.3333334922790527 | difference 6.666666507720947 > tolerance 0.010000000000000002
  235 more errors...
  
  ---- tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_basic stdout ----
  Testing bias
  Testing input
  Testing offset
  thread 'tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_basic' panicked at crates/burn-wgpu/src/lib.rs:108:5:
  Tensors are not approx eq:
    => Position 1: 15 != 6 | difference 9 > tolerance 0.015000000000000003
    => Position 2: 30 != 12 | difference 18 > tolerance 0.030000000000000006
    => Position 3: 45 != 18 | difference 27 > tolerance 0.04500000000000001
    => Position 5: 3.75 != 1.5 | difference 2.25 > tolerance 0.0037500000000000007
    => Position 6: 7.5 != 3 | difference 4.5 > tolerance 0.0075000000000000015
  65 more errors...
  
  ---- tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_padding stdout ----
  Testing bias
  Testing input
  Testing offset
  thread 'tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_padding' panicked at crates/burn-wgpu/src/lib.rs:108:5:
  Tensors are not approx eq:
    => Position 10: 0.8950999975204468 != 0 | difference 0.8950999975204468 > tolerance 0.0010000000000000002
    => Position 11: 14.760600090026855 != 0.03129822388291359 | difference 14.729301866143942 > tolerance 0.01476060009002686
    => Position 12: 17.60420036315918 != 0.5347222089767456 | difference 17.069478154182434 > tolerance 0.017604200363159184
    => Position 13: 20.69809913635254 != 1.1215760707855225 | difference 19.576523065567017 > tolerance 0.020698099136352545
    => Position 14: 22.20039939880371 != 1.624421238899231 | difference 20.57597815990448 > tolerance 0.022200399398803715
  415 more errors...
  
  ---- tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_batched stdout ----
  Testing bias
  Testing input
  Testing offset
  thread 'tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_batched' panicked at crates/burn-wgpu/src/lib.rs:108:5:
  Tensors are not approx eq:
    => Position 72: 270 != 0 | difference 270 > tolerance 0.2700000000000001
    => Position 73: 277.5 != 0 | difference 277.5 > tolerance 0.2775000000000001
    => Position 74: 285 != 0 | difference 285 > tolerance 0.2850000000000001
    => Position 75: 292.5 != 0 | difference 292.5 > tolerance 0.2925000000000001
    => Position 76: 67.5 != 0 | difference 67.5 > tolerance 0.06750000000000002
  67 more errors...
  
  ---- tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_basic stdout ----
  Testing bias
  Testing input
  Testing offset
  thread 'tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_basic' panicked at crates/burn-wgpu/src/lib.rs:108:5:
  Tensors are not approx eq:
    => Position 1: 15 != 6 | difference 9 > tolerance 0.015000000000000003
    => Position 2: 30 != 12 | difference 18 > tolerance 0.030000000000000006
    => Position 3: 45 != 18 | difference 27 > tolerance 0.04500000000000001
    => Position 5: 3.75 != 1.5 | difference 2.25 > tolerance 0.0037500000000000007
    => Position 6: 7.5 != 3 | difference 4.5 > tolerance 0.0075000000000000015
  65 more errors...
  
  ---- tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_kernel_size stdout ----
  Testing bias
  Testing input
  Testing offset
  thread 'tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_kernel_size' panicked at crates/burn-wgpu/src/lib.rs:108:5:
  Tensors are not approx eq:
    => Position 1: 5.355899810791016 != 0.007523147854954004 | difference 5.348376662936062 > tolerance 0.005355899810791017
    => Position 2: 11.715299606323242 != 0.34953704476356506 | difference 11.365762561559677 > tolerance 0.011715299606323245
    => Position 3: 0.3125 != 0.10416662693023682 | difference 0.20833337306976318 > tolerance 0.0010000000000000002
    => Position 4: 8 != 2.6666667461395264 | difference 5.333333253860474 > tolerance 0.008000000000000002
    => Position 5: 10 != 3.3333334922790527 | difference 6.666666507720947 > tolerance 0.010000000000000002
  235 more errors...
  
  ---- tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_padding stdout ----
  Testing bias
  Testing input
  Testing offset
  thread 'tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_padding' panicked at crates/burn-wgpu/src/lib.rs:108:5:
  Tensors are not approx eq:
    => Position 10: 0.8950999975204468 != 0 | difference 0.8950999975204468 > tolerance 0.0010000000000000002
    => Position 11: 14.760600090026855 != 0.03129822388291359 | difference 14.729301866143942 > tolerance 0.01476060009002686
    => Position 12: 17.60420036315918 != 0.5347222089767456 | difference 17.069478154182434 > tolerance 0.017604200363159184
    => Position 13: 20.69809913635254 != 1.1215760707855225 | difference 19.576523065567017 > tolerance 0.020698099136352545
    => Position 14: 22.20039939880371 != 1.624421238899231 | difference 20.57597815990448 > tolerance 0.022200399398803715
  415 more errors...
  
  
  failures:
      tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_basic
      tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_batched
      tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_kernel_size
      tests::jit::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_padding
      tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_basic
      tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_batched
      tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_kernel_size
      tests::jit_fusion::autodiff::f32_ty::ad_deform_conv2d::tests::test_deform_conv2d_different_padding
  
  test result: FAILED. 2039 passed; 8 failed; 16 ignored; 0 measured; 0 filtered out; finished in 110.66s
  
  error: test failed, to rerun pass `-p burn-wgpu --lib`

@ArthurBrussee
Copy link
Contributor

That's awesome! Thanks for checking again :) I think the remaining errors are the same as here: #2539

The deformable convolutions might need some attention from Genna, for the others, maybe @nathanielsimard knows?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants