Efficiently loading Waymo raw data? #856
Comments
In case anyone is wondering: one possible solution, depending on your use case, is "push-down filtering". Too bad the Waymo v2 example/tutorial never mentioned the use of Dask's filtering. The filtering should be done when you first read the parquet file, e.g.:
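A minimal sketch of what I mean, assuming the v2 camera_image component and its key.* columns (the file path, timestamp, and camera id below are placeholders for your own setup):

```python
import dask.dataframe as dd

# Push-down filtering: the filters are handed to the parquet reader, so row
# groups that cannot match are skipped on disk instead of being loaded and
# filtered in memory afterwards.
df = dd.read_parquet(
    "camera_image/*.parquet",                                     # hypothetical path
    filters=[
        ("key.frame_timestamp_micros", "==", 1557843457702000),   # the one frame you want
        ("key.camera_name", "==", 1),                              # the one camera you want
    ],
    # columns=[...] can additionally restrict which columns are read.
)
sample = df.compute()  # compute() now materializes only the filtered subset
```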
This approach works for me because my training loader only expects one timestamp and a specific camera. The idea is to make sure you only load the part of the parquet file you actually need. I will leave the issue open for visibility in case the Waymo team wants to update their documentation.
Yes, push-down filtering is a good way to improve efficiency. We mention this briefly in the "A relational database-like structure" section of the example/tutorial, but we should discuss the efficiency aspect in more detail. Thanks for the advice!
@JingweiJ Thanks for checking this issue. You're right, push-down filtering is mentioned there in a short comment, so technically it was "mentioned". My point is that the actual example code only uses a single frame, so it makes sense to use push-down filtering by default. 🙂
In case it can help anyone, I have written a library which can load any given data sample from Waymo (and also KITTI, NuScenes, or ZOD). As we discussed in #841, one needs to re-encode the parquet files to make random access fast. The library is here: https://github.com/CEA-LIST/tri3d. It is a bit opinionated because I needed to settle on common conventions across datasets, but I think you'll find it does what you expect most of the time. Notably, it has sane defaults for interpolating poses (ego car, boxes, sensors), so that when you request something at, say, LiDAR frame 12, it actually overlaps well with the point cloud.
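Not the library's actual implementation, just a minimal sketch of the re-encoding idea with pyarrow: sort a component file by frame timestamp and rewrite it with small row groups, so one frame can later be fetched without decoding the rest (file names and the key.frame_timestamp_micros column are assumptions):

```python
import pyarrow.parquet as pq

# One-off re-encoding: sort by the frame key and use tiny row groups so that
# parquet statistics let a reader skip straight to the requested frame.
table = pq.read_table("camera_image/segment-xyz.parquet")        # hypothetical input
table = table.sort_by("key.frame_timestamp_micros")
pq.write_table(table, "reencoded/segment-xyz.parquet", row_group_size=1)

# At training time, a filtered read only decodes the matching row group(s).
frame = pq.read_table(
    "reencoded/segment-xyz.parquet",
    filters=[("key.frame_timestamp_micros", "==", 1557843457702000)],  # placeholder value
)
```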
Hello,
I'm using the v2.0.0 dataset and successfully followed the example on loading Waymo data with Dask.
This is all fine for quick testing, but when I use the same method in my data loader, things do not scale well. Dask is nice, but when I actually call Dask's compute() to get the data, it takes some time even with a fast disk.
When shuffled, the data loader samples frames randomly, so I can't eagerly load each parquet file by relying on its order.
Example: when the loader happens to sample 10 different frames from 10 different parquet files, it becomes an I/O bottleneck, even with multiple workers.
Preloading the whole dataset is out of the question due to memory constraints.
I have been looking at how other frameworks (e.g. MMDetection) and third-party libraries (e.g. a PyTorch Waymo loader) use Waymo: they pre-convert the training frames (e.g. to pickle) so access is fast even when frames are sampled randomly.
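For illustration, a minimal sketch of that pre-conversion approach (not taken from any of those libraries; the paths, the key.frame_timestamp_micros column, and the per-frame record layout are assumptions):

```python
import glob
import pickle
from pathlib import Path

import dask.dataframe as dd

out_dir = Path("frames_pkl")       # hypothetical output directory
out_dir.mkdir(exist_ok=True)

# One-off conversion: group each component file by frame timestamp and dump
# every frame into its own pickle, so random access later is one small read.
for parquet_path in glob.glob("camera_image/*.parquet"):
    df = dd.read_parquet(parquet_path).compute()
    for ts, frame_df in df.groupby("key.frame_timestamp_micros"):
        with open(out_dir / f"{ts}.pkl", "wb") as f:
            pickle.dump(frame_df.to_dict("records"), f)

# In the Dataset's __getitem__, loading a randomly sampled frame is then just:
def load_frame(timestamp: int):
    with open(out_dir / f"{timestamp}.pkl", "rb") as f:
        return pickle.load(f)
```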
Is this the recommended way? I feel the use of parquet files + Dask is meant to address this exact issue.
Thanks in advance for the insight.