Update worker images to optimize IO performance using local data #675
base: master
docker/main/ngen/Dockerfile
Outdated
# https://dl.min.io/client/mc/release/linux-amd64/archive/mc.${MINIO_CLIENT_RELEASE}

# Setup minio client; also update path and make sure dataset directory is there
RUN curl -L -o /dmod/bin/mc https://dl.min.io/client/mc/release/linux-amd64/mc \
Suggested change:
- RUN curl -L -o /dmod/bin/mc https://dl.min.io/client/mc/release/linux-amd64/mc \
+ RUN ARCHITECTURE=`echo $(uname -s)-$(uname -m) | tr '[:upper:]' '[:lower:]'`; \
+     curl -L -o /dmod/bin/mc https://dl.min.io/client/mc/release/${ARCHITECTURE}/mc \
Hmm ... good catch, but the suggestion won't work as needed with Linux on an Intel machine (it produces linux-x86_64 instead of linux-amd64). I will make an adjustment, but I'll need to think more about exactly how.
That wasn't the only case that would have been a problem, but I've made a change now that I think should catch the platform types we can reasonably expect that would need to be transformed to something else for purposes of that URL, and done the adjustment accordingly.
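The adjustment itself is not shown in this thread. As a rough illustration of the kind of translation being discussed, here is a minimal sketch, written in Python rather than Dockerfile shell, that maps `uname`-style values onto the `<os>-<arch>` segment of the MinIO client download URL. The function names and the alias table are assumptions for illustration only, and the table covers just the two architecture names that are known to differ.

```python
import platform


def minio_platform(system: str, machine: str) -> str:
    """Build the '<os>-<arch>' URL segment from uname-style values.

    Hypothetical helper: the alias table below is an assumption and only
    covers architecture names that differ from MinIO's naming.
    """
    arch_aliases = {"x86_64": "amd64", "aarch64": "arm64"}
    os_name = system.strip().lower()
    arch = machine.strip().lower()
    # Fall back to the reported name when no translation is needed.
    return f"{os_name}-{arch_aliases.get(arch, arch)}"


def minio_client_url(system: str, machine: str) -> str:
    """Return the download URL for the mc binary on the given platform."""
    return f"https://dl.min.io/client/mc/release/{minio_platform(system, machine)}/mc"


if __name__ == "__main__":
    # Show the URL for the machine running this script.
    print(minio_client_url(platform.system(), platform.machine()))
```

With this table, Linux on an Intel machine maps to linux-amd64, which is the case the plain `uname` suggestion got wrong.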
Updating usages of DataRequirement so that whenever the fulfilled_by attribute of an instance is set - creation time or otherwise - the new needs_data_local is also set.
Add 2 new directories - /dmod/local_volumes and /dmod/cluster_volumes - to ngen image directory structure, meant for mount points of different types of volumes containing necessary data for the job; also, adding README with some initial documentation on this directory structure.
Updating Launcher to prepare services with local volume mounts when some data requirements must be fulfilled by local data on the physical node, and to update the relevant other args for starting worker services so that one worker on each node makes sure data gets prepared in local volumes as needed as part of job startup.
Making MinIO CLI client available within ngen worker image and derivatives (e.g., calibration worker), though without a pre-configured alias for connecting to the object store service.
Adding functionality to py_funcs.py to support making DMOD dataset data local (not just locally accessible from remote storage).
Updating main entrypoint scripts for ngen and calibration worker images for local data handling.
Fixing script so that GUI services do not get stopped and updated unless that is actually asked for with the available CLI option.
Moving call to this Python function so that it happens before sanity checks (at the entrypoint level) ensuring dataset directories exist, as they won't exist until any data is made local.
- Order minio client args properly (config dir must come first)
- Clean up output handling during minio client subprocess
- Correct a few logical mistakes with how conditionals should behave
- Fix issue with path object creation when copying from cluster volume
- Adding some helpful logging messages
- Make sure we actually create symlinks

- Fixing handling of symlink for output dataset so it points to cluster volume as needed (i.e., so output can actually make it out of the worker)
- Fixing some issues with keyword args coming in from CLI that certain functions weren't set up to disregard properly
- Adding a bit more helpful logging in places
Adding logic and reordering certain steps to make sure that, given job outputs are initially written locally, the process that then moves the results to backing dataset storage works properly and does not run into permissions issues.
Update dependencies on core and scheduler to 0.21.0 and 0.14.0 respectively.
Updating dependency on core to 0.21.0.
Updating dependencies on core and scheduler to 0.21.0 and 0.14.0 respectively.
Account for building in environments other than Linux X86_64 when downloading the MinIO client for the ngen worker images.
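The first commit above couples `needs_data_local` to `fulfilled_by`, so that one is never set without the other. A minimal sketch of that idea follows. `RequirementSketch` and its `fulfill` method are hypothetical stand-ins, not the real `DataRequirement` class or its API; only the two attribute names come from this pull request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequirementSketch:
    """Hypothetical stand-in for DataRequirement; not the real DMOD class."""

    category: str
    fulfilled_by: Optional[str] = None
    needs_data_local: Optional[bool] = None

    def __post_init__(self) -> None:
        # Creation-time check: a fulfilled requirement must also say
        # whether its data has to be made local.
        if self.fulfilled_by is not None and self.needs_data_local is None:
            raise ValueError("needs_data_local must be set with fulfilled_by")

    def fulfill(self, dataset_name: str, needs_data_local: bool) -> None:
        # Later fulfillment: set both attributes in one place so callers
        # cannot update one and forget the other.
        self.fulfilled_by = dataset_name
        self.needs_data_local = needs_data_local
```

Routing every assignment through a single method is one way to enforce the coupling; the actual change may instead update each call site that sets `fulfilled_by`.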
Note: do not review until #671 is complete.
Note: blocked by #673, as testing cannot yet be completed.
Note: blocked by #678, as testing still cannot be completed.
Note: testing now blocked by a bug being addressed in #697.

Optimizing job execution by, in some cases, copying DMOD dataset data to local, per-node Docker volumes when jobs are starting.
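A minimal sketch of the general idea, assuming a dataset is reachable under a cluster-backed mount and should be copied to a node-local mount before the job reads it. The function name, its parameters, and the directory layout beneath the mount points are hypothetical; the actual logic lives in py_funcs.py and may differ.

```python
import shutil
from pathlib import Path


def make_data_local(cluster_dir: Path, local_dir: Path, link_path: Path) -> Path:
    """Copy a dataset to node-local storage and symlink to the local copy.

    Hypothetical helper for illustration; not the function from py_funcs.py.
    """
    # Copy only once per node; copytree creates local_dir and its parents.
    if not local_dir.exists():
        shutil.copytree(cluster_dir, local_dir)
    link_path.parent.mkdir(parents=True, exist_ok=True)
    # Replace a stale symlink, but never remove a real directory.
    if link_path.is_symlink():
        link_path.unlink()
    link_path.symlink_to(local_dir, target_is_directory=True)
    return link_path
```

In the worker image, the three paths would presumably sit under /dmod/cluster_volumes, /dmod/local_volumes, and /dmod/datasets respectively, so the job keeps reading from its usual dataset path while the data is served from local disk.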
Changes
- needs_data_local attribute
- needs_data_local is set whenever fulfilled_by is set (i.e., de facto couple this as a part of fulfillment details)
- [...] (/dmod/datasets/**) at startup
Testing
Jobs started via the CLI execute successfully and in a reasonable amount of time. The specific test was for a month of VPU 1 and took less than 5 minutes from start to finish.