
RFC: Establish concept of a computing device #52

Closed
wants to merge 3 commits

Conversation


@oschulz oschulz commented May 22, 2022

CUDA.jl, oneAPI.jl, etc. all provide a ...Device type, but without a common supertype.

Likewise, our GPU packages all provide functionality to get the device that a given array lives on, but each defines its own function for it. The latter was partially addressed in JuliaGPU/KernelAbstractions.jl#269, but it's not an elegant solution (all Requires-based) and KernelAbstractions is a heavy dependency. This makes it tricky to address issues like JuliaArrays/ElasticArrays.jl#44. Some of the code mentioned in JuliaGPU/GPUArrays.jl#409 could also lighten its dependencies with common "get-device" functionality (LinearSolve.jl, for example, seems to need GPUArrays.jl only for a b isa GPUArrays.AbstractGPUArray check; similar for DiffEqSensitivity.jl).

This PR and supporting PRs for CUDA.jl, AMDGPU.jl, oneAPI.jl and KernelAbstractions.jl attempt to establish a common supertype for computing devices, and support for the following (a rough sketch follows the list):

  • get_computing_device(x)::AbstractComputingDevice: Get the device x lives on, not limited to arrays (could e.g. be a whole ML model)

  • Adapt.adapt_storage(dev, x): Move x to device dev.

  • Sys.total_memory(dev): Get the total memory on dev

  • Sys.free_memory(dev): Get the free memory on dev
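
To make this concrete, here is a minimal, runnable sketch of what such an API could look like (the type and function names follow this PR's draft; the method bodies and the GPU-side method mentioned in the comment are illustrative assumptions, not the actual implementation):

using Adapt

abstract type AbstractComputingDevice end
struct CPUDevice <: AbstractComputingDevice end
abstract type AbstractGPUDevice <: AbstractComputingDevice end

# Query where data lives; packages would specialize this for their own types,
# e.g. (assumption) CUDA.jl could define get_computing_device(x::CuArray) = CUDA.device(x)
get_computing_device(::Array) = CPUDevice()
get_computing_device(x::SubArray) = get_computing_device(parent(x))

# Moving data to a device goes through Adapt:
Adapt.adapt_storage(::CPUDevice, x::AbstractArray) = convert(Array, x)

# Memory queries dispatch on the device:
Sys.total_memory(::CPUDevice) = Sys.total_memory()
Sys.free_memory(::CPUDevice) = Sys.free_memory()

A = rand(Float32, 4, 4)
get_computing_device(A)   # CPUDevice()
adapt(CPUDevice(), A)     # stays a CPU Array; for a GPU array this would copy to host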

I think this will make it much easier to write generic device-independent code:

  • Being able to query whether data lives on a CPU or GPU without taking on heavy dependencies should come in useful in many packages.

  • Writing adapt(CuDevice(n), x) as an alternative to adapt(CuArray, x) seems very natural (esp. in multi-GPU scenarios): It corresponds to the user saying "let's run it on that GPU" instead of "with a different array type".

  • Having the ability to query total and available memory can help with finding the right data chunk sizes before sending data to a device with adapt(dev, data_partition).
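
As an illustration of the last point, a rough sketch of a chunk-size heuristic built on the proposed Sys.free_memory(dev) query (the helper name and the 25% memory budget are made up for this example):

# Hypothetical helper: pick a row-chunk size so that each partition stays
# within a fraction of the free memory reported for the target device.
function rows_per_chunk(dev, data::AbstractMatrix{T}; frac = 0.25) where {T}
    bytes_per_row = size(data, 2) * sizeof(T)
    max(1, floor(Int, frac * Sys.free_memory(dev) / bytes_per_row))
end

# for rows in Iterators.partition(axes(data, 1), rows_per_chunk(dev, data))
#     part = adapt(dev, data[rows, :])   # move one chunk to the device
#     # ... process part ...
# end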

This PR defines AbstractComputingDevice and AbstractGPUDevice, and implements GPUDevices. It's very little code, so there should be no load-time impact.

CC @vchuravy, @maleadt, @jpsamaroo, @ChrisRackauckas, @findmyway

Status:

@vchuravy
Member

@tkf does this jive with what you need for Loops/Executor?

@vchuravy
Member

Maybe instead of ComputingDevice we call it ComputeUnit? We are moving towards heterogeneous systems in general, and they might not be separate devices.

@oschulz
Author

oschulz commented May 22, 2022

Maybe instead of ComputingDevice we call it ComputeUnit? We are moving towards heterogeneous systems in general, and they might not be separate devices.

Sure, absolutely!

Since this touches several PRs maybe I should wait for feedback on the general idea from @maleadt and @jpsamaroo before doing renames and so on?

@oschulz
Author

oschulz commented May 23, 2022

@maleadt and @jpsamaroo do you have some initial feedback? And sorry for the state that the AMDGPU part of this is still in @jpsamaroo, I'll need a few pointers on that. :-)

@maleadt
Member

maleadt commented May 24, 2022

This seems sensible to me, but I don't understand why it belongs in Adapt.jl. The only purpose of this package is to provide utilities to convert complex object structures, and it is (in principle) unrelated to GPU/array programming. Now, if the proposed device identification were based on the existing Adapt.jl infrastructure (adapt_structure/adapt_storage, although it would probably need to be generalized) I could understand putting it here, but it currently isn't (i.e. echoing @tkf's comments, https://github.com/JuliaGPU/Adapt.jl/pull/52/files#r879031411).

@codecov

codecov bot commented May 24, 2022

Codecov Report

Merging #52 (76c686d) into master (d9f852a) will decrease coverage by 30.31%.
The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           master      #52       +/-   ##
===========================================
- Coverage   81.48%   51.16%   -30.32%     
===========================================
  Files           5        6        +1     
  Lines          54       86       +32     
===========================================
  Hits           44       44               
- Misses         10       42       +32     
Impacted Files Coverage Δ
src/Adapt.jl 100.00% <ø> (ø)
src/computedevs.jl 0.00% <0.00%> (ø)


@oschulz
Author

oschulz commented May 24, 2022

@maleadt: This seems sensible to me, but I don't understand why it belongs in Adapt.jl.

It seemed to be a natural place, both semantically and dependency-wise.

Semantically, adapt deals with moving storage between devices (at least that's what we use it for, mostly), so it seems natural to have a concept of what a computing device is in here.

Dependency-wise, pretty much all code that will need to define/use get_computing_device depends on Adapt already. We could create a super-lightweight package ComputingDevices.jl - but Adapt.jl would need to depend on it, so we can define adapt_storage(::CPUDevice, x). I'd be fine with that, but I thought people might prefer not to add such a package since Adapt.jl is already super-lightweight, and code that would need ComputingDevices.jl would very likely need Adapt.jl as well anyway.

Now, if the proposed device identification were based on the existing Adapt.jl infrastructure (adapt_structure/adapt_storage

With adapt_...(dev, x) (and I think that would be very good to have, see above) it's already integrated, in a way. And even if we only have a simple default implementation of get_computing_device that uses parent/buffer for now, it would certainly be good to expand in the direction that @tkf suggested ("generalizing Adapt.adapt_structure"). But I think that can be done as a second step (and would benefit Adapt as a whole).

We can always spin-off a separate package ComputingDevices.jl later on, if the need arises.

Member

@jpsamaroo jpsamaroo left a comment


Looks good to me generally, aside from existing comments by others. I agree with @maleadt that some part of this could possibly go in GPUArrays, but having it in a more lightweight package instead is definitely appealing.

@oschulz
Author

oschulz commented May 26, 2022

some part of this could possibly go in GPUArrays, but having it in a more lightweight package instead is definitely appealing

Yes, one important motivation here is to enable generic code to be aware of devices without taking on a heavy dependency like GPUArrays.

oschulz and others added 2 commits May 26, 2022 09:38
* Rename to AbstractComputeUnit and get_compute_unit (suggested by
  @vchuravy)

* Add AbstractComputeAccelerator (suggested by @ChrisRackauckas)

* Bottom value instead of exception if compute device can't be resolved

* Rename select_computing_device and make dispatch more robust
  (suggestions by @tkf and @jpsamaroo)

* Defend against reference cycle in generic implementation of
  get_compute_unit (pointed out by @tkf)
@oschulz
Author

oschulz commented May 26, 2022

@vchuravy, @ChrisRackauckas, @tkf, @maleadt, @jpsamaroo thanks for all the initial feedback!

I've updated this PR and tried to include your suggestions - since a lot has changed I've marked the discussions above as resolved, but please do reopen them if issues haven't been addressed by the PR changes and the remarks below:

  • Renamed AbstractComputingDevice to AbstractComputeUnit (suggested by @vchuravy) and renamed get_compute_device to get_compute_unit accordingly.

  • Added AbstractComputeAccelerator (suggested by @ChrisRackauckas) in between AbstractComputeUnit and AbstractGPUDevice (is the type hierarchy too deep now?).

  • Using bottom values if compute device can't be determined or unified/merged (suggested by @tkf)

  • Renamed select_computing_device (suggestions by @tkf and @jpsamaroo) - it's called merge_compute_units now. It can actually be more of a combination than a promotion, I think - for example, multiple CUDA devices in the same box could be merged to a MultiCUDASystem([0,1,2,3]) or so in the future. Currently, get_compute_unit and merge_compute_units will return MixedComputeSystem() if different CPU/GPU devices are involved (see the sketch after this list).

  • The generic implementation of get_compute_unit defends against reference cycles (pointed out by @tkf) now.
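
A minimal, self-contained sketch of these merge semantics (the type and function names follow the PR draft; the exact rules shown here are my assumption, not the PR's code):

abstract type AbstractComputeUnit end
struct CPUDevice <: AbstractComputeUnit end
struct MixedComputeSystem <: AbstractComputeUnit end
struct ComputeUnitIndependent end   # "bottom" value: data carries no device information

merge_compute_units() = ComputeUnitIndependent()
merge_compute_units(a) = a
merge_compute_units(a, b, rest...) = merge_compute_units(merge2(a, b), rest...)

merge2(a, b) = a === b ? a : MixedComputeSystem()   # differing devices => mixed system
merge2(a, ::ComputeUnitIndependent) = a             # "independent" never overrides a device
merge2(::ComputeUnitIndependent, b) = b
merge2(::ComputeUnitIndependent, ::ComputeUnitIndependent) = ComputeUnitIndependent()

merge_compute_units(CPUDevice(), ComputeUnitIndependent())   # -> CPUDevice()
# merge_compute_units(CPUDevice(), CuDevice(0))              # -> MixedComputeSystem() (CuDevice from CUDA.jl)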

I think having an automatic recursive compute unit resolution will be important so that get_compute_unit_impl won't have to be specialized for most types - and we should support closures. This works:

julia> mllayer = let A = cu(rand(5,5)), b = cu(rand(5)), f = x -> x < 0 ? zero(x) : x
           x -> f.(A * x .+ b)
       end;

julia> x = cu(rand(5));

julia> get_compute_unit((mllayer, x))
CuDevice(0): NVIDIA GeForce

@tkf suggested a "fold over 'relevant data source objects' in a manner generalizing Adapt.adapt_structure". The generated default get_compute_unit_impl does such a fold, but since it doesn't need to reconstruct objects it's not limited to tuples and functions like the current Adapt.adapt_structure. It currently uses all fields - to my knowledge we don't have a widely supported way to get the "relevant data source objects" yet (that would be neat, though!). I suggest we pursue establishing such a standard separately (I suggested something along those lines in JuliaObjects/ConstructionBase.jl#54) and then use it in get_compute_unit_impl when available (even fewer types would need to specialize get_compute_unit_impl then).

@maleadt raised the question whether the compute unit concept belongs into Adapt. I strongly feel that it needs to be in a very lightweight package (so that generic code can be aware of compute devices without heavy deps, i.e. GPUArrays.jl is definitely too heavy for it). I would argue that Adapt is a good place, as adapt(::AbstractComputeUnit, x) is part of it. We could create a package ComputeUnits.jl, but Adapt.jl would have to depend on it. My suggestion would be to add the compute unit concept to Adapt.jl - we can still split off a ComputeUnits.jl in the future if necessary.

@oschulz oschulz closed this May 26, 2022
@oschulz oschulz reopened this May 26, 2022
@oschulz
Author

oschulz commented May 26, 2022

(Closed by accident.)

UnknownComputeUnitOf(x)
end

@generated function get_compute_unit_impl(::Type{TypeHistory}, x) where TypeHistory
Member


I am really not a fan of this @generated implementation. I think it would be preferable to follow the Adapt.jl pattern here and to perform a tree-walk. Now it is probably the case that the current infrastructure is not general enough, but if this goes into Adapt.jl we should have one mechanism to do this.

@maleadt and I briefly spoke about this and in general we are not opposed to this functionality being in Adapt.jl

Author


@maleadt and I briefly spoke about this and in general we are not opposed to this functionality being in Adapt.jl

Thanks!

Author


I am really not a fan of this @generated implementation [...] would be preferable to follow the Adapt.jl pattern here and to perform a tree-walk [...] if this goes into Adapt.jl we should have one mechanism to do this.

Do you mean that it's @generated, or what it does? In principle it does a tree walk, I would say (and objects encountered on the way can use a different method/specialization of get_compute_unit_impl). The function could be written without @generated; it could be done with just getfield, map and ntuple. I only wrote it this way to minimize the resulting code and increase type stability. But maybe I'm missing your point here?

Member


There are two points:

  1. I personally aim to reduce reliance on @generated as much as possible. So if we can write this without staged functions that would be great, it often makes the intent clearer and simplifies extendability (and is better for compile times, also better type-stability)
  2. This duplicates the core functionality of Adapt.jl which is essentially a tree walk over structs, so instead of two implementations it would be great to have one, but Tim and I acknowledge that this might be a trickier design.

Author


Ok, I'll try to rewrite without @generated.

Author


without staged functions [...] better type-stability

Type stability was actually one of the reasons I went for @generated here. :-)

Ok, I tried without @generated, @vchuravy, but type stability suffers. One implementation I came up with is:

function get_compute_unit_impl_nogen(::Type{TypeHistory}, x::T) where {TypeHistory,T}
    NewTypeHistory = Union{TypeHistory, T}
    fields = ntuple(Base.Fix1(getfield, x), Val(fieldcount(T)))
    merge_compute_units(map(Base.Fix1(get_compute_unit_impl, NewTypeHistory), fields)...)
end

Nice and short, but:

julia> x = (a = 4.2, b = (c = rand(2,3)))
(a = 4.2, b = [0.7927795137326867 0.8930673224466184 0.15921059563423712; 0.6399176439174568 0.4501022168243579 0.3951239506670382])

julia> @inferred get_compute_unit_impl(Union{}, x)
Adapt.CPUDevice()

julia> @inferred get_compute_unit_impl_nogen(Union{}, x)
ERROR: return type Adapt.CPUDevice does not match inferred return type Any

And in truth, this version also uses generated code underneath, because ntuple does. Is there a way to do this without ntuple, efficiently?

Author


@vchuravy how would you prefer that I proceed here?

@oschulz
Author

oschulz commented Jun 5, 2022

@ChrisRackauckas how do you think this should be integrated with ArrayInterfaceCore: AbstractDevice, AbstractCPU?

@ChrisRackauckas
Member

I'm not sure. @Tokazama might have opinions.

@Tokazama

Tokazama commented Jun 6, 2022

This seems a lot like what we do in ArrayInterfaceCore. We don't have anything that merges device types, and we account for differences in CPU types that are built on different memory objects (like a tuple).

@oschulz
Author

oschulz commented Jun 7, 2022

@Tokazama: The main advantage is we work pretty hard to navigate type wrappers in the most robust way we can, so that if you grab a pointer from something you can also figure out what indices/strides/etc. are valid.

Nice! I guess we'll have to combine that with the Adapt-style "structural descend", since we don't just want to cover arrays, but also deeply nested objects in general (MC and statistics model, complex data and so on)?

I guess the question now is whether Adapt will depend on ArrayInterfaceCore or the other way round. Both have a load time under 1 ms now, so from a user perspective it won't matter much, and both are pretty much unavoidable dependencies in any real project anyway. Depending on the preferences of the Adapt and ArrayInterface maintainers, I can then try to merge this PR with what's in ArrayInterface, either here or there.

@oschulz
Author

oschulz commented Jun 8, 2022

@Tokazama : ArrayInterfaceCore. We don't have anything that merges device types and we account for a difference in CPU types that are built on different memory objects like a tuple

The equivalent to ArrayInterface.CPUTuple would be Adapt.ComputingDeviceIndependent in this PR (just by looking at, e.g., a StaticArray we can't know if we're currently on a CPU or GPU without additional context, and bitstypes don't really have to be "adapted" between computing devices).

@oschulz
Author

oschulz commented Jul 4, 2022

I'd like to get this moving again - @Tokazama can ArrayInterfaceCore.AbstractDevice evolve/change a bit as part of this effort?

@Tokazama

Tokazama commented Jul 4, 2022

I think the GPU side of things might be able to change without any breaking changes for us. We could probably justify some straightforward breaking changes if absolutely necessary if it unified the LoopVectorization and GPU ecosystems.

@oschulz
Author

oschulz commented Jul 4, 2022

Thanks, @Tokazama. Ok, I'll take a look - otherwise we can just provide a conversion between the device API in this PR and ArrayInterfaceCore.AbstractDevice.

@mcabbott

mcabbott commented Jul 22, 2022

Some of the proposed benefits are already available from the very lightweight GPUArraysCore.jl (which did not exist when the issue was opened). You can test x isa AbstractGPUArray, and call adapt of course.

I don't think you can "query total and available memory". But perhaps that could be something like GPUArraysCore.total_memory(::Type{<:Array}) = Sys.total_memory() with a method for the type CuArray?

One thing you can't do is CUDA.functional. Flux uses this so that calling model |> gpu does nothing if there's no GPU. Edit: I wonder how often that could be replaced with total_memory(CuArray) > 0?
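
For illustration, a rough sketch of that suggestion (only Sys.total_memory() and GPUArraysCore.AbstractGPUArray are real; the total_memory function keyed on the array type and its CuArray method are hypothetical):

using GPUArraysCore: AbstractGPUArray

# Hypothetical memory query keyed on the array type, with a CPU fallback.
# A GPU package would then add, e.g., total_memory(::Type{<:CuArray}) = CUDA.total_memory().
total_memory(::Type{<:Array}) = Sys.total_memory()

# The "is this on a GPU?" check that GPUArraysCore already enables:
on_gpu(x) = x isa AbstractGPUArray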

@Tokazama

It would be nice if we could have the GPU ecosystem and CPU/SIMD/LoopVectorization ecosystems share a common interface though.

@oschulz
Author

oschulz commented Jul 23, 2022

Also, this PR is not just about checking if an array is a GPU array, but about checking what kind of device(s) a whole data structure is located on. Plus the ability to use adapt to move data structures to a specific device. I'm very happy that we have GPUArraysCore.jl now, but I think we definitely need a common super-type for accelerator (and non-accelerator!) computing devices as well as for arrays. KernelAbstractions.jl, for example, would then no longer need to define an extra device type for each supported backend.

As for pushing this forward: Maybe for now we can just work on fixing this up and merging this, and then look at possible connections to ArrayInterfaceCore.jl?

@ToucheSir

Since the pending review comments appear to be about tree traversal, would it be possible to try landing a reduced version of this PR with just Adapt.adapt_storage(dev, x)? Maybe a constrained get_computing_device(x::Union{Array,AbstractGPUArray})::AbstractComputingDevice as a stretch goal. The former would be especially helpful on the Flux side.

@oschulz
Author

oschulz commented Aug 11, 2022

would it be possible to try landing a reduced version of this PR with just Adapt.adapt_storage(dev, x)?

Fine with me - we can do the tree traversal in a second step. Maintainers, what do you think?

@maleadt
Member

maleadt commented Aug 11, 2022

I'm still skeptical (this wasn't the scope of Adapt.jl and I feel like functionality is being tacked on because it's a common lightweight dependency), but if people are waiting for something like this I don't want to hold it up any longer 🙂 I guess it's fine to extend the scope to what the package is most commonly used for anyway. With respect to the actual functionality, we can always start with a getfield-based get_computing_device and try to figure out a better system to unify it with adapt's tree traversal later.

@oschulz
Author

oschulz commented Aug 11, 2022

this wasn't the scope of Adapt.jl

I do get that point, though maybe one justification would be that the current draft also specializes adapt() for devices?

With respect to the actual functionality, we can always start with a getfield-based get_computing_device

Would the current recursive get_compute_unit in this draft fall under "getfield-based", or should the recursion/traversal be removed?

@maleadt
Member

maleadt commented Aug 17, 2022

Would the current recursive get_compute_unit in this draft fall under "getfield-based", or should the recursion/traversal be removed?

It's getfield-based as it recurses based on the struct's fields, as opposed to the current "rule"-based approach that Adapt takes (with its many adapt_structure methods). But that's fine for now, if we even want to change it, because the API doesn't restrict it to be getfield-based.

@vchuravy any final thoughts?

@Tokazama

I'm not sure if it makes a difference, but all of the device methods have been moved to ArrayInterfaceCore, so if there are any reservations regarding integrating this because of start-up latency, those should be mostly resolved.

@oschulz
Author

oschulz commented Aug 18, 2022

It would be nice not to have divergent APIs here for Adapt and ArrayInterfaceCore - but Adapt.adapt_storage(device, obj) (very useful on the user side, I believe) could only be implemented if Adapt depended on ArrayInterfaceCore (or the other way round). Could I get a quick maintainer opinion on whether that would be acceptable?

@maleadt
Member

maleadt commented Aug 22, 2022

I see no reason not to; it's a dependency with minimal load time and no compatibility constraints.

@Tokazama

Also, load time is likely to stay that way. Since we got rid of Requires I've been testing load time for nearly every PR to ArrayInterfaceCore. Probably should work that into automated testing for each PR at some point, though...

@oschulz
Author

oschulz commented Aug 23, 2022

@maleadt: I see no reason not to; it's a dependency with minimal load time and no compatibility constraints.
@Tokazama: Also, load time is likely to stay that way.

Ok, great - I'll redraft this PR-complex then, including ArrayInterfaceCore. I'm on holiday, gimme a bit of time ..

@Jutho

Jutho commented Oct 28, 2022

Bit late to the party, but if two packages (Adapt and ArrayInterfaceCore) would need this, and there is a discussion over which of the two should depend on the other, then why not have this functionality in a stand-alone package as was contemplated at the early stage of this PR?

@oschulz
Author

oschulz commented Oct 28, 2022

Hi,

sorry for the long silence, been a bit overloaded with other stuff, but this is most definitely still on my to-do list. In fact, I've recently come across some use cases that really need something like adapt(computing_device, object).

why not have this functionality in a stand-alone package

we could, but as I understand there's no objection to Adapt depending on ArrayInterfaceCore, and since both are already very lightweight it might be overkill to introduce a third package for this. We can still do so later on if we need to, after all.

@Tokazama

Bit late to the party, but if two packages (Adapt and ArrayInterfaceCore) would need this, and there is a discussion over which of the two should depend on the other, then why not have this functionality in a stand-alone package as was contemplated at the early stage of this PR?

It's not so much that two packages need this as that one package already has this partially implemented.

@Jutho

Jutho commented Oct 28, 2022

Ok, it was a question out of curiosity. Knowing what device an array is living on is a very basic primitive. For example, I have some custom implementations for DenseArray types, but these should only be used for host/CPU stuff, not GPU arrays. I am not particularly opposed to depending on ArrayInterfaceCore, but there is also quite a bit of stuff in there that I definitely do not need and that seems perhaps SciML/LinearSolve.jl specific (lu_instance), and some things whose purpose I can't really tell from the documentation and that are probably matrix-specific (matrix_colors?).

Certainly, at the moment, ArrayInterfaceCore.device(::CuArray) is producing the wrong result, so I will anyway go with depending on GPUArraysCore and using the AbstractGPUArray type.

@Tokazama

I'm not saying it can't be a third package. I just want to be clear that there's a bit more work there than one might assume because we need to avoid disrupting things already in production

@oschulz
Author

oschulz commented Mar 25, 2023

Just a sign of life - I haven't forgotten about this one. Due to recent changes in the interdependence between KernelAbstractions and the GPU packages it will have to evolve a bit, though, I think. I'll play around with a few ideas.

@vchuravy
Member

@oschulz also thank you for all the thinking about this. My long-term plan is to add KA as a dependency to GPUArrays, and anchor the notion of the compute device in KernelAbstractions e.g. as the backend.

I also took some inspiration and added an allocate function.

@oschulz
Author

oschulz commented Mar 28, 2023

My long-term plan is to add KA as a dependency to GPUArrays, and anchor the notion of the compute device in KernelAbstractions e.g. as the backend.

Thanks, that sounds great!

@oschulz
Author

oschulz commented May 4, 2023

I'll explore the concepts in here in a standalone package HeterogeneousComputing.jl for a while. I have a current need to use this kind of functionality in two or three real-life applications (so I'll register HeterogeneousComputing), and at least one of them will involve determining if data is on disk (DiskArrays/HDF) in addition to distinguishing just between CPU and GPU types. So I think I'll need to experiment and iterate this a bit more, before proposing a new incarnation of this in packages as central as Adapt and ArrayInterfaceCore.

Experimentation will be easier in a package that can go through versions faster without affecting half the ecosystem. :-) And with Julia v1.9, weakdeps make this easier now without a huge Requires load-time hit.

I'll close this and open a "reminder issue" - I still believe it would be great to have common supertypes for compute units, adapt integration for them and a central way of determining what (nested) data is on which device.

@oschulz oschulz closed this May 4, 2023