-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
External bucket support for file_column detection, dataloader post_hooks, bugfixes #332
Conversation
…api single-file edge case
Path( | ||
"/".join( | ||
list(self.datasource.source.path_parts().values())[ | ||
:2 | ||
] | ||
) | ||
) | ||
/ self.datasource_root | ||
/ str(value) | ||
).as_posix() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I already did this same thing once.
Check out self.datasource.source.root_content_path
. This will give you https://dagshub.com/api/v1/repos/jinen/test-repo/storage/content/s3/bucket/prefix
, and then you'll need to add the path of the datapoint itself to it.
The function also handles repos and buckets, so no need for the if/else
def root_content_path(self) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, even better, there's a self.datasource.source.content_path
, that also appends the path!
def content_path(self, path: Union[str, Datapoint, Mapping[str, Any]]) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized, this won't actually work, since the idea behind it is not getting the value, but checking if it exists in the remote. Those functions get me the target url, but obtaining the final path (just from 's3/bucket/path/to/file') still requires string manipulation similar to what's already on.
I cannot directly pass the URL to list_storage_path
or list_path
in RepoAPI, and I do not want to send queries within the dataloader class because of DRY and I'll have to manage authentication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function I wrote just generates the path though, it doesn't do a request. Can you demonstrate an example where you know it won't work for some reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting lost in this code now if I can be honest.
Do you want to maybe try writing some unit tests together for this? Because I doubt anyone coming into this code fresh that will have to maintain it in the future would have any idea what's going on.
Path( | ||
"/".join( | ||
list(self.datasource.source.path_parts().values())[ | ||
:2 | ||
] | ||
) | ||
) | ||
/ self.datasource_root | ||
/ str(value) | ||
).as_posix() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The function I wrote just generates the path though, it doesn't do a request. Can you demonstrate an example where you know it won't work for some reason?
class PyTorchDataLoader(torch.utils.data.DataLoader): | ||
def __init__(self, *args, **kwargs): | ||
def __init__(self, *args, post_hook=lambda x: x, **kwargs): | ||
super().__init__(*args, **kwargs) | ||
self.post_hook = post_hook |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of the post hook?
I don't see any code in here that has custom post-hooks, so I assume this is accomodating for something you wrote in your own code.
Can you add some docs for how and why to use it?
def generator(self): | ||
for idx in range(len(self)): | ||
yield self[idx] | ||
yield tuple(self[idx]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't understand much of how TF/PT works and maybe it's time for me to investigate, but why are you changing it to return tuples now?
Merging it in skipping the comments because we need it sooner rather than later |
Bugfix: Handled RepoApi single-file edge case for external buckets