External bucket support for file_column detection, dataloader post_hooks, bugfixes #332

Merged
7 commits merged into master on Sep 19, 2023

Conversation

jinensetpal (Contributor)

Bugfix: Handled RepoApi single-file edge case for external buckets

dagshub/data_engine/client/loaders/base.py (outdated, resolved)
Comment on lines +89 to +98
Path(
    "/".join(list(self.datasource.source.path_parts().values())[:2])
)
/ self.datasource_root
/ str(value)
).as_posix()
@kbolashev (Member) commented on Jul 26, 2023:
I think I already did this same thing once.
Check out self.datasource.source.root_content_path. This will give you https://dagshub.com/api/v1/repos/jinen/test-repo/storage/content/s3/bucket/prefix, and then you'll need to add the path of the datapoint itself to it.
The function also handles both repos and buckets, so there's no need for the if/else.

def root_content_path(self) -> str:
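
For concreteness, a minimal sketch of what this suggestion could look like (not code from this PR; it assumes root_content_path is called as a zero-argument method per the signature above, though it could equally be a property, and that value holds the datapoint's path):

    def datapoint_content_url(source, value) -> str:
        # root_content_path() gives the content-API base URL for the source, e.g.
        # https://dagshub.com/api/v1/repos/jinen/test-repo/storage/content/s3/bucket/prefix;
        # appending the datapoint's own path yields its full content URL.
        base = source.root_content_path()
        return f"{base.rstrip('/')}/{str(value).lstrip('/')}"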

@kbolashev (Member):

Wait, even better, there's a self.datasource.source.content_path, that also appends the path!

def content_path(self, path: Union[str, Datapoint, Mapping[str, Any]]) -> str:
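
And a matching sketch for the even shorter variant (again an assumption-laden illustration, not PR code; it takes the signature above at face value and assumes content_path returns the full content URL for the given datapoint path):

    def datapoint_content_url(source, value) -> str:
        # content_path appends the datapoint's path to the content-API base URL,
        # so the manual slicing and joining in the diff collapses into one call.
        return source.content_path(str(value))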

@jinensetpal (Contributor, Author):

Awesome, thank you!

@jinensetpal (Contributor, Author):

I just realized this won't actually work, since the idea isn't to get the value but to check whether it exists in the remote. Those functions give me the target URL, but obtaining the final path (just the 's3/bucket/path/to/file' part) still requires string manipulation similar to what's already there.

I cannot pass the URL directly to list_storage_path or list_path in RepoAPI, and I don't want to send queries from within the dataloader class, both for DRY reasons and because I'd have to manage authentication.
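
For illustration, a hedged sketch of the string manipulation being described, using the URL layout from the example earlier in the thread (the helper name is hypothetical, and list_storage_path is only referenced, not called):

    def storage_path_from_content_url(content_url: str) -> str:
        # Given a content URL such as
        # https://dagshub.com/api/v1/repos/jinen/test-repo/storage/content/s3/bucket/path/to/file,
        # return the 's3/bucket/path/to/file' part that a call like
        # RepoAPI.list_storage_path would expect.
        marker = "/storage/content/"
        return content_url.split(marker, 1)[1] if marker in content_url else content_url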

@kbolashev (Member):

The function I wrote just generates the path, though; it doesn't make a request. Can you show a concrete example where you know it won't work?

dagshub/common/api/repo.py (resolved)
@jinensetpal changed the title from "External bucket support for file_column detection + Bugfix" to "External bucket support for file_column detection, dataloader post_hooks, bugfix" on Aug 4, 2023
@jinensetpal changed the title from "External bucket support for file_column detection, dataloader post_hooks, bugfix" to "External bucket support for file_column detection, dataloader post_hooks, bugfixes" on Aug 4, 2023
@kbolashev (Member) left a comment:

I'm getting lost in this code now, to be honest.
Do you maybe want to try writing some unit tests for this together? I doubt anyone coming into this code fresh who has to maintain it in the future would have any idea what's going on.

Comment on lines 42 to +45
     class PyTorchDataLoader(torch.utils.data.DataLoader):
-        def __init__(self, *args, **kwargs):
+        def __init__(self, *args, post_hook=lambda x: x, **kwargs):
             super().__init__(*args, **kwargs)
+            self.post_hook = post_hook
@kbolashev (Member) commented on Aug 13, 2023:

What's the purpose of the post hook?
I don't see any code in here that uses custom post-hooks, so I assume this is accommodating something you wrote in your own code.
Can you add some docs on how and why to use it?
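
For reference, a hedged sketch of how the new parameter could be wired up and used; the __iter__ override and the rescaling hook below are assumptions for illustration, since the diff only shows the hook being stored:

    import torch

    class PyTorchDataLoader(torch.utils.data.DataLoader):
        # Mirrors the diff: the hook is stored on the loader...
        def __init__(self, *args, post_hook=lambda x: x, **kwargs):
            super().__init__(*args, **kwargs)
            self.post_hook = post_hook

        # ...and (assumption, not shown in the diff) applied to every batch.
        def __iter__(self):
            for batch in super().__iter__():
                yield self.post_hook(batch)

    # Hypothetical usage: rescale each collated batch after loading.
    loader = PyTorchDataLoader(
        torch.utils.data.TensorDataset(torch.rand(8, 3) * 255),
        batch_size=4,
        post_hook=lambda batch: tuple(t / 255 for t in batch),
    )
    for (x,) in loader:
        print(x.shape)  # torch.Size([4, 3])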

Comment on lines 23 to +25
     def generator(self):
         for idx in range(len(self)):
-            yield self[idx]
+            yield tuple(self[idx])
@kbolashev (Member):

I really don't understand much of how TF/PT works and maybe it's time for me to investigate, but why are you changing it to return tuples now?
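
For context, a hedged sketch of one common consumer of a generator like this on the TF side (the wiring into tf.data is an assumption here, not shown in this excerpt): tf.data.Dataset.from_generator checks each yielded element against output_signature, which is typically declared as a tuple of tf.TensorSpec, and tf.nest structure checks don't always treat a list as interchangeable with a tuple, so yielding tuples keeps the element structure unambiguous.

    import tensorflow as tf

    # Standalone illustration; the data below is made up.
    def generator():
        for features, label in [((1.0, 2.0), 0), ((3.0, 4.0), 1)]:
            # Yield a tuple so the element structure matches the tuple-shaped
            # output_signature declared below.
            yield (tf.constant(features), tf.constant(label, dtype=tf.int32))

    ds = tf.data.Dataset.from_generator(
        generator,
        output_signature=(
            tf.TensorSpec(shape=(2,), dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.int32),
        ),
    )
    for x, y in ds:
        print(x.numpy(), y.numpy())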

@kbolashev added the "bug" (Something isn't working) label on Aug 13, 2023
@kbolashev (Member):
Merging it in, skipping the comments, because we need it sooner rather than later.

@kbolashev merged commit c91954a into master on Sep 19, 2023. 6 checks passed.