External bucket support for file_column detection, dataloader post_hooks, bugfixes #332

Merged
7 commits merged into master on Sep 19, 2023

Conversation

jinensetpal (Contributor)

Bugfix: Handled RepoApi single-file edge case for external buckets

dagshub/data_engine/client/loaders/base.py (outdated, resolved)
Comment on lines +89 to +98
Path(
    "/".join(list(self.datasource.source.path_parts().values())[:2])
)
/ self.datasource_root
/ str(value)
).as_posix()
@kbolashev (Member) commented on Jul 26, 2023:
I think I already did this same thing once.
Check out self.datasource.source.root_content_path. This will give you https://dagshub.com/api/v1/repos/jinen/test-repo/storage/content/s3/bucket/prefix, and then you'll need to add the path of the datapoint itself to it.
The function also handles both repos and buckets, so there's no need for the if/else.

def root_content_path(self) -> str:
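
For concreteness, a minimal sketch of what this suggestion could look like (not code from this PR; it assumes root_content_path is called as a zero-argument method per the signature above, though it could equally be a property, and that value holds the datapoint's path):

    def datapoint_content_url(source, value) -> str:
        # root_content_path() gives the content-API base URL for the source, e.g.
        # https://dagshub.com/api/v1/repos/jinen/test-repo/storage/content/s3/bucket/prefix;
        # appending the datapoint's own path yields its full content URL.
        base = source.root_content_path()
        return f"{base.rstrip('/')}/{str(value).lstrip('/')}"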

@kbolashev (Member):

Wait, even better, there's a self.datasource.source.content_path, that also appends the path!

def content_path(self, path: Union[str, Datapoint, Mapping[str, Any]]) -> str:
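
And a matching sketch for the even shorter variant (again an assumption-laden illustration, not PR code; it takes the signature above at face value and assumes content_path returns the full content URL for the given datapoint path):

    def datapoint_content_url(source, value) -> str:
        # content_path appends the datapoint's path to the content-API base URL,
        # so the manual slicing and joining in the diff collapses into one call.
        return source.content_path(str(value))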

@jinensetpal (Contributor, Author):

Awesome, thank you!

@jinensetpal (Contributor, Author):

I just realized this won't actually work, since the idea isn't to get the value but to check whether it exists in the remote. Those functions give me the target URL, but obtaining the final path (just the 's3/bucket/path/to/file' part) still requires string manipulation similar to what's already there.

I cannot pass the URL directly to list_storage_path or list_path in RepoAPI, and I don't want to send queries from within the dataloader class, both for DRY reasons and because I'd have to manage authentication.
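
For illustration, a hedged sketch of the string manipulation being described, using the URL layout from the example earlier in the thread (the helper name is hypothetical, and list_storage_path is only referenced, not called):

    def storage_path_from_content_url(content_url: str) -> str:
        # Given a content URL such as
        # https://dagshub.com/api/v1/repos/jinen/test-repo/storage/content/s3/bucket/path/to/file,
        # return the 's3/bucket/path/to/file' part that a call like
        # RepoAPI.list_storage_path would expect.
        marker = "/storage/content/"
        return content_url.split(marker, 1)[1] if marker in content_url else content_url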

@kbolashev (Member):

The function I wrote just generates the path, though; it doesn't make a request. Can you show a concrete example where you know it won't work?

dagshub/common/api/repo.py (resolved)
@jinensetpal changed the title from "External bucket support for file_column detection + Bugfix" to "External bucket support for file_column detection, dataloader post_hooks, bugfix" on Aug 4, 2023
@jinensetpal changed the title from "External bucket support for file_column detection, dataloader post_hooks, bugfix" to "External bucket support for file_column detection, dataloader post_hooks, bugfixes" on Aug 4, 2023
@kbolashev (Member) left a comment:

I'm getting lost in this code now, to be honest.
Do you maybe want to try writing some unit tests for this together? I doubt anyone coming into this code fresh who has to maintain it in the future would have any idea what's going on.

Comment on lines 42 to +45
     class PyTorchDataLoader(torch.utils.data.DataLoader):
-        def __init__(self, *args, **kwargs):
+        def __init__(self, *args, post_hook=lambda x: x, **kwargs):
             super().__init__(*args, **kwargs)
+            self.post_hook = post_hook
@kbolashev (Member) commented on Aug 13, 2023:

What's the purpose of the post hook?
I don't see any code in here that uses custom post-hooks, so I assume this is accommodating something you wrote in your own code.
Can you add some docs on how and why to use it?
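
For reference, a hedged sketch of how the new parameter could be wired up and used; the __iter__ override and the rescaling hook below are assumptions for illustration, since the diff only shows the hook being stored:

    import torch

    class PyTorchDataLoader(torch.utils.data.DataLoader):
        # Mirrors the diff: the hook is stored on the loader...
        def __init__(self, *args, post_hook=lambda x: x, **kwargs):
            super().__init__(*args, **kwargs)
            self.post_hook = post_hook

        # ...and (assumption, not shown in the diff) applied to every batch.
        def __iter__(self):
            for batch in super().__iter__():
                yield self.post_hook(batch)

    # Hypothetical usage: rescale each collated batch after loading.
    loader = PyTorchDataLoader(
        torch.utils.data.TensorDataset(torch.rand(8, 3) * 255),
        batch_size=4,
        post_hook=lambda batch: tuple(t / 255 for t in batch),
    )
    for (x,) in loader:
        print(x.shape)  # torch.Size([4, 3])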

Comment on lines 23 to +25
     def generator(self):
         for idx in range(len(self)):
-            yield self[idx]
+            yield tuple(self[idx])
@kbolashev (Member):

I really don't understand much of how TF/PT works and maybe it's time for me to investigate, but why are you changing it to return tuples now?
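
For context, a hedged sketch of one common consumer of a generator like this on the TF side (the wiring into tf.data is an assumption here, not shown in this excerpt): tf.data.Dataset.from_generator checks each yielded element against output_signature, which is typically declared as a tuple of tf.TensorSpec, and tf.nest structure checks don't always treat a list as interchangeable with a tuple, so yielding tuples keeps the element structure unambiguous.

    import tensorflow as tf

    # Standalone illustration; the data below is made up.
    def generator():
        for features, label in [((1.0, 2.0), 0), ((3.0, 4.0), 1)]:
            # Yield a tuple so the element structure matches the tuple-shaped
            # output_signature declared below.
            yield (tf.constant(features), tf.constant(label, dtype=tf.int32))

    ds = tf.data.Dataset.from_generator(
        generator,
        output_signature=(
            tf.TensorSpec(shape=(2,), dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.int32),
        ),
    )
    for x, y in ds:
        print(x.numpy(), y.numpy())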

@kbolashev added the "bug" (Something isn't working) label on Aug 13, 2023
@kbolashev (Member):
Merging it in, skipping the comments, because we need it sooner rather than later.

@kbolashev merged commit c91954a into master on Sep 19, 2023. 6 checks passed.