Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update worker images to optimize IO performance using local data #675

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

robertbartel
Copy link
Contributor

@robertbartel robertbartel commented Jul 12, 2024

Note: do not review until #671 is complete.
Note: blocked by #673, as testing cannot yet be completed.
Note: blocked by #678, as testing still cannot be completed.
Note: testing now blocked by a bug being addressed in #697.

Optimizing job execution by sometimes locally copying DMOD dataset data to local, per-node Docker volumes when jobs are starting.

Changes

  • Update DataRequirements to track if the data is required to be local with new needs_data_local attribute
  • Updating usages of DataRequirements to ensure needs_data_local is set whenever fulfilled_by is set (i.e., de facto couple this as a part of fulfillment details)
  • Add new directory structures within worker image for mounted datasets and for local data, and have image symlink into previous directory (/dmod/datasets/**) at startup
  • Update scheduler Launcher to create local volumes in job services for those requirements needing local data, in addition to the volumes mounting the dataset-backed data
  • Update worker images to include the MinIO CLI client for optimal copying when making data local (10x speedup)
  • Updating worker entrypoints and scripts to support detecting when data needs to be local by the prescence of a mounted local volume corresponding to the dataset volume, and then making the data local (one worker per physical node) on that node

Testing

Jobs started via the CLI execute successfully and in a reasonable amount of time. The specific test was for a month of VPU 1 and took less than 5 minutes from start to finish.

@robertbartel robertbartel added enhancement New feature or request maas MaaS Workstream labels Jul 12, 2024
@robertbartel robertbartel force-pushed the f/bmi_cfg_gen_io/worker_changes branch from ea6992c to 2da36b8 Compare July 12, 2024 18:20
@robertbartel robertbartel force-pushed the f/bmi_cfg_gen_io/worker_changes branch 2 times, most recently from 12d97d6 to 1f50ce3 Compare August 6, 2024 18:04
@robertbartel robertbartel marked this pull request as ready for review August 16, 2024 17:33
# https://dl.min.io/client/mc/release/linux-amd64/archive/mc.${MINIO_CLIENT_RELEASE}

# Setup minio client; also update path and make sure dataset directory is there
RUN curl -L -o /dmod/bin/mc https://dl.min.io/client/mc/release/linux-amd64/mc \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RUN curl -L -o /dmod/bin/mc https://dl.min.io/client/mc/release/linux-amd64/mc \
RUN ARCHITECTURE=`echo $(uname -s)-$(uname -m) | tr '[:upper:]' '[:lower:]'`; \
curl -L -o /dmod/bin/mc https://dl.min.io/client/mc/release/${ARCHITECTURE}/mc \

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ... good catch, but the suggestion won't work as needed with Linux on an Intel machine (it produces linux-x86_64 instead of linux-amd64). I will make an adjustment, but I'll need to think more about exactly how.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That wasn't the only case that would have been a problem, but I've made a change now that I think should catch the platform types we can reasonably expect that would need to be transformed to something else for purposes of that URL, and done the adjustment accordingly.

Updating usages of DataRequirement so that whenever the fulfilled_by
attribute of an instance is set - creation time or otherwise - the new
needs_data_local is also set.
Add 2 new directories - /dmod/local_volumes and /dmod/cluster_volumes -
to ngen image directory structure, meant for mount points of different
types of volumes containing necessary data for the job; also, adding
README with some initial documentation on this directory structure.
Updating Launcher to prepare services with local volume mounts when some
data requirements must be fulfilled by local data on the physical node,
and to update the relevant other args for starting worker services so
that one worker on each node makes sure data gets prepared in local
volumes as needed as part of job startup.
Making MinIO CLI client available within ngen worker image and
derivatives (e.g., calibration worker), though without a pre-configured
alias for connected to the object store service.
Adding functionality to py_funcs.py to support making DMOD dataset data
local (not just be locally accessible from remote storage).
Updating main entrypoint scripts for ngen and calibration worker images
for local data handling.
Fixing script so that GUI services do not get stopped and updated unless
that is actually asked for with the available CLI option.
Moving call to this Python function so that it happens before sanity
checks (at the entrypoint level) ensuring dataset directories exist, as
they won't exist until any data is made local.
- Order minio client args properly (config dir must come first)
- Cleanup output handling during minio client subprocess
- Correct a few logical mistakes with how conditionals should behave
- Fix issue with path object creation when copying from cluster volume
- Adding some helpful logging messages
- Make sure we actually create symlinks
- Fixing handling of symlink for output dataset so it points to cluster
  volume as needed (i.e., so output can actually make it out of the
  worker)
- Fixing some issues with keyword args coming in from CLI that certain
  functions weren't set up to disregard properly
- Adding a bit more helpful logging in places
Adding logic and reordering certain things to make sure that, given
local writing initially of job outputs, etc., that process to then move
the results to backing dataset storage works properly and does not run
into permissions issues.
Update dependencies on core and scheduler to 0.21.0 and 0.14.0
respectively.
Update dependencies on core and scheduler to 0.21.0 and 0.14.0
respectively.
Updating dependency on core to 0.21.0.
Updating dependencies on core and scheduler to 0.21.0 and 0.14.0
respectively.
Account for building in environments other than Linux X86_64 when
downloading the MinIO client for the ngen worker images.
@robertbartel robertbartel force-pushed the f/bmi_cfg_gen_io/worker_changes branch from f85ea7b to 25dbf8e Compare October 30, 2024 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request maas MaaS Workstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants