Slow performance when using CloudBucketMount as a dataset for model training #1839

Open

AbodSinan opened this issue May 21, 2024 · 2 comments
Labels: performance (Something is slow.)

AbodSinan commented May 21, 2024

Firstly, thanks for the splendid work the team has done on this project; amazing stuff.

I was trying to figure out a way to train my ML models using Modal. My dataset is in S3 (10k+ high-res images), so I thought it would be interesting to try out CloudBucketMount to read the dataset. It worked, but training was considerably slower than on an L4 in another server: on an AWS server the yolov8n model finishes in about 3 minutes, while on Modal the run timed out at 50 minutes with less than 50% progress.

I was wondering: would it be faster if mountpoint's caching were used to cache object data, in case it isn't already enabled? That should mean fewer requests for the files and faster performance.

I suspect it would be better for me to just sync my S3 data into an attached Modal Volume on startup.
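
For concreteness, something like this minimal sketch is what I have in mind; the bucket, Volume, and secret names are placeholders, and the copy assumes the dataset fits on the Volume:

```python
# A minimal sketch of the sync idea: one Function copies the bucket's contents
# into a modal.Volume so training runs read from Modal's storage instead.
# The bucket, Volume, and secret names below are placeholders.
import shutil

from modal import App, CloudBucketMount, Secret, Volume

app = App("s3-to-volume-sync")
dataset_volume = Volume.from_name("training-data", create_if_missing=True)


@app.function(
    volumes={
        "/bucket": CloudBucketMount("test-bucket", secret=Secret.from_name("my-aws-secret")),
        "/data": dataset_volume,
    },
    timeout=3600,
)
def sync_dataset():
    # Copy everything from the bucket mount into the Volume, then commit
    # so that subsequent Functions see the files.
    shutil.copytree("/bucket", "/data", dirs_exist_ok=True)
    dataset_volume.commit()
```

Running sync_dataset once before training would then let the training code read from the Volume path instead of the bucket mount.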

thundergolfer (Contributor) commented

Hey, thanks for the issue report.

> on an AWS server the yolov8n model finishes in about 3 minutes, while on Modal the run timed out at 50 minutes with less than 50% progress

If the code is simple, it'd be great to have it for a reproduction.

> would it be faster if mountpoint's caching were used to cache object data?

We've been using this as of a few days ago 🙂. The cache is not preserved across Function executions, though, so I'd expect the caching benefit applies only to the 2nd through Nth reads of a file, not the first.

I'm guessing you ran this test within the last 24 hours and thus should have had the caching?

Overall, modal.Volume benefits from Modal's custom caching behavior and should give much higher read throughput on the first read, but I also think modal.CloudBucketMount should perform well enough if the dataset loader is configured with read-ahead.

Is your dataset loader using read-ahead? Here's an example of what I mean, where it's called "predownload" (see the sketch below): https://docs.mosaicml.com/projects/streaming/en/latest/_modules/streaming/base/dataset.html
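
Roughly, configuring that with the streaming library could look like this; the S3 path and cache directory are placeholders, and it assumes your dataset has already been converted to streaming's MDS shard format:

```python
# Sketch only: assumes the dataset has been converted to MDS shards in S3.
from streaming import StreamingDataset
from torch.utils.data import DataLoader

dataset = StreamingDataset(
    remote="s3://test-bucket/dataset",  # placeholder shard location
    local="/tmp/streaming-cache",       # local cache for downloaded shards
    predownload=64,                     # read ahead up to 64 samples per worker
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=32, num_workers=4)
```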

AbodSinan (Author) commented May 23, 2024

Seems like the training is still very slow; I've tested on both an L4 and an A100. Here's the reproduction:

```python
import shutil
from pathlib import Path

from modal import Image, App, enter, method, Secret, CloudBucketMount

s3_bucket_name = 'test-bucket'

app = App("test-yolo-object-detection")
image = (
    Image.debian_slim()
    .apt_install(
        "fonts-freefont-ttf",
        "libgl1-mesa-glx",
        "ffmpeg",
        "libsm6",
        "libxext6",
    )
    .copy_local_dir("./ultralytics", remote_path="/app/ultralytics")  # I'm forking the YOLO repo to fix a small bug it has with mountpoints
    .pip_install(
        "GitPython",
        "neptune",
        "/app/ultralytics",
    )
)


with image.imports():
    from ultralytics import YOLO
    import neptune
    from ultralytics.utils.callbacks.neptune import on_train_epoch_end


@app.cls(
    image=image,
    volumes={
        "/my-mount": CloudBucketMount(s3_bucket_name, secret=Secret.from_name("my-aws-secret"))
    },
    secrets=[Secret.from_name("my-neptune-secret")],  # provides NEPTUNE_API_TOKEN for model tracking; feel free to remove the neptune functions
    gpu="l4",
    timeout=6000,
)
class ObjectDetection:
    @enter()
    def download_model(self):
        self.model = YOLO('yolov8n.pt')  # load a pretrained model (recommended for training)
        self.model.add_callback('on_train_epoch_end', on_train_epoch_end)

        self.run = neptune.init_run(
            project="hms/hms-test-model"
        )

        self.model_version = neptune.init_model_version(
            model="HMSTES-MOD",
            project="hms/hms-test-model"
        )

    @method()
    def copy_data_locally(self):
        local_data_dir = Path("/app/local_data")
        local_data_dir.mkdir(parents=True, exist_ok=True)
        shutil.copytree("/my-mount", local_data_dir, dirs_exist_ok=True)
        return str(local_data_dir)

    @method()
    def train(self):
        return self.model.train(
            project='test-project',
            # data=f'/app/cvatYoloModal.yaml',
            data='/my-mount/cvatYoloModal.yaml',  # Some YOLO config, you can use coco.yaml
            epochs=100
        )


@app.local_entrypoint()
def main():
    results = ObjectDetection().train.remote()
```

For the dataset, you can use the COCO dataset (https://cocodataset.org/#download) and store it in S3.

For the ultralytics fork, you can use: git@github.com:AbodSinan/ultralytics.git

P.S.: I've tried removing the neptune code; it didn't seem to make things noticeably faster.

thundergolfer added the performance (Something is slow.) label on Jul 16, 2024