Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incosistency with async mode when using wrapper filesystems (e.g. dir and cache) #1709

Open
orenl opened this issue Oct 7, 2024 · 1 comment

Comments

@orenl
Copy link
Contributor

orenl commented Oct 7, 2024

The documentation says:

The class attribute async_impl can be used to test whether an implementation is async of not.

DirFileSystem inherits from AsyncFileSystem, so it sets async_impl = True. As a wrapper filesystem it requires to match the sync/async operation mode to that of the underlying filesystem. That is reflected by the dirfs.asynchronous attribute (true if set so when instantiated).

Therefore, the language above is misleading: async_impl indicates whether the class has async capabilities. But to test whether a specific instance is sync/async, one needs to test the dirfs.asynchronous attribute.

But even that can be tricky - since CachingFileSystem (base for all caches) is not capable of async operation, so it does not set async_impl = False; And it does not care about its cache.asynchromous attrtibute to match -or not- that of the unerlying filesystem. Instead, it report both async_impl and asynchronous from that underlying filesystem. This is not only misleading, but also means that using e.g. open_async would actually bypass the cache implementation and invoke the underlying filesystem's method instead (if that filesystem is async).

Consdier this example:

s3_fs = fsspec.filesystem('s3', asynchronous=True, ...)
dir_fs = fsspec.filesystem('dir', asynchronous=True, fs=s3_fs)
cache_fs = fsspec.filesystem('simplecache', asynchronous=False, fs=dir_fs)

print(s3_fs.async_impl, dir_fs.async_impl, cache_fs.async_impl)
print(s3_fs.asynchronous, dir_fs.asynchronous, cache_fs.asynchronous)

# the output:
# True, True, Ture
# True, True, Ture

So to recap:

  1. The documentation should be improved to clarify the difference between capable and enabled.
  2. Wrapper filesystems (like caching variants) that do not support async should report async_impl and asynchronous both false, and should not accept asynchronous = True argument, and should complain if the underlying filesystem has asynchronous = True, and consider overriding underlying filesystem's async implementation with NotImplementedError. (Alternatively, add support for async operation to CachingFileSystem).
@martindurant
Copy link
Member

asynchronous=True doesn't mean that this is an async filesystem, but that it is being initialised and run from within an async context (coroutine).

Most async implementations translate something like cp_file() to sync(_cp_file()), where _cp_file is a coroutine. dirfs is special in that cp_file calls the internal filesystem's cp_file, so sync() only happens if the inner fs is async. If it is sync, then the inner _cp_file doesn't exist at all.

In summary, I think it's reasonable to define .async_impl on dirfs instances (not the class) to depend on whatever the backend implementation is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants