Slow performance when using CloudBucketMount as a dataset for model training #1839

Open

AbodSinan opened this issue May 21, 2024 · 2 comments
Labels: performance (Something is slow.)

AbodSinan commented May 21, 2024

Firstly, thanks for the splendid work the team has done on this project; amazing stuff.

I was trying to figure out a way to train my ML models using Modal. My dataset is in S3 (10k+ high-res images), so I thought it would be interesting to try out CloudBucketMount to read the dataset. It worked, but training was considerably slower than on an L4 in another server: on an AWS server the yolov8n model finishes in about 3 minutes, while on Modal the run timed out at 50 minutes with less than 50% progress.

I was wondering: would it be faster if mountpoint's caching were used to cache object data, in case it isn't already enabled? That should mean fewer requests for the files and faster performance.

I suspect it would be better for me to just sync my S3 data into an attached Modal Volume on startup.
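
For concreteness, something like this minimal sketch is what I have in mind; the bucket, Volume, and secret names are placeholders, and the copy assumes the dataset fits on the Volume:

```python
# A minimal sketch of the sync idea: one Function copies the bucket's contents
# into a modal.Volume so training runs read from Modal's storage instead.
# The bucket, Volume, and secret names below are placeholders.
import shutil

from modal import App, CloudBucketMount, Secret, Volume

app = App("s3-to-volume-sync")
dataset_volume = Volume.from_name("training-data", create_if_missing=True)


@app.function(
    volumes={
        "/bucket": CloudBucketMount("test-bucket", secret=Secret.from_name("my-aws-secret")),
        "/data": dataset_volume,
    },
    timeout=3600,
)
def sync_dataset():
    # Copy everything from the bucket mount into the Volume, then commit
    # so that subsequent Functions see the files.
    shutil.copytree("/bucket", "/data", dirs_exist_ok=True)
    dataset_volume.commit()
```

Running sync_dataset once before training would then let the training code read from the Volume path instead of the bucket mount.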

thundergolfer (Contributor) commented

Hey, thanks for the issue report.

> on an AWS server the yolov8n model finishes in about 3 minutes, while on Modal the run timed out at 50 minutes with less than 50% progress

If the code is simple, it'd be great to have it for a reproduction.

> would it be faster if mountpoint's caching were used to cache object data?

We've been using this as of a few days ago 🙂. The cache is not preserved across Function executions, though, so I'd expect the caching benefit applies only to the 2nd through Nth reads of a file, not the first.

I'm guessing you ran this test within the last 24 hours and thus should have had the caching?

Overall, modal.Volume benefits from Modal's custom caching behavior and should give much higher read throughput on the first read, but I also think modal.CloudBucketMount should perform well enough if the dataset loader is configured with read-ahead.

Is your dataset loader using read-ahead? Here's an example of what I mean, where it's called "predownload" (see the sketch below): https://docs.mosaicml.com/projects/streaming/en/latest/_modules/streaming/base/dataset.html
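
Roughly, configuring that with the streaming library could look like this; the S3 path and cache directory are placeholders, and it assumes your dataset has already been converted to streaming's MDS shard format:

```python
# Sketch only: assumes the dataset has been converted to MDS shards in S3.
from streaming import StreamingDataset
from torch.utils.data import DataLoader

dataset = StreamingDataset(
    remote="s3://test-bucket/dataset",  # placeholder shard location
    local="/tmp/streaming-cache",       # local cache for downloaded shards
    predownload=64,                     # read ahead up to 64 samples per worker
    shuffle=True,
)
loader = DataLoader(dataset, batch_size=32, num_workers=4)
```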

AbodSinan (Author) commented May 23, 2024

Seems like the training is still very slow; I've tested on both an L4 and an A100. Here's the reproduction:

```python
import shutil
from pathlib import Path

from modal import Image, App, enter, method, Secret, CloudBucketMount

s3_bucket_name = 'test-bucket'

app = App("test-yolo-object-detection")
image = (
    Image.debian_slim()
    .apt_install(
        "fonts-freefont-ttf",
        "libgl1-mesa-glx",
        "ffmpeg",
        "libsm6",
        "libxext6",
    )
    .copy_local_dir("./ultralytics", remote_path="/app/ultralytics")  # I'm forking the YOLO repo to fix a small bug it has with mountpoints
    .pip_install(
        "GitPython",
        "neptune",
        "/app/ultralytics",
    )
)


with image.imports():
    from ultralytics import YOLO
    import neptune
    from ultralytics.utils.callbacks.neptune import on_train_epoch_end


@app.cls(
    image=image,
    volumes={
        "/my-mount": CloudBucketMount(s3_bucket_name, secret=Secret.from_name("my-aws-secret"))
    },
    secrets=[Secret.from_name("my-neptune-secret")],  # provides NEPTUNE_API_TOKEN for model tracking; feel free to remove the neptune functions
    gpu="l4",
    timeout=6000,
)
class ObjectDetection:
    @enter()
    def download_model(self):
        self.model = YOLO('yolov8n.pt')  # load a pretrained model (recommended for training)
        self.model.add_callback('on_train_epoch_end', on_train_epoch_end)

        self.run = neptune.init_run(
            project="hms/hms-test-model"
        )

        self.model_version = neptune.init_model_version(
            model="HMSTES-MOD",
            project="hms/hms-test-model"
        )

    @method()
    def copy_data_locally(self):
        local_data_dir = Path("/app/local_data")
        local_data_dir.mkdir(parents=True, exist_ok=True)
        shutil.copytree("/my-mount", local_data_dir, dirs_exist_ok=True)
        return str(local_data_dir)

    @method()
    def train(self):
        return self.model.train(
            project='test-project',
            # data=f'/app/cvatYoloModal.yaml',
            data='/my-mount/cvatYoloModal.yaml',  # Some YOLO config, you can use coco.yaml
            epochs=100
        )


@app.local_entrypoint()
def main():
    results = ObjectDetection().train.remote()
```

For the dataset, you can use the COCO dataset (https://cocodataset.org/#download) and store it in S3.

For the ultralytics fork, you can use: git@github.com:AbodSinan/ultralytics.git

P.S.: I've tried removing the neptune code; it didn't seem to make things noticeably faster.

thundergolfer added the performance (Something is slow.) label on Jul 16, 2024