Draft: DMOD Design Overview
DMOD is an extensible suite of software tools for creating and running specialized compute environments (and in some sense the environments themselves). The primary goal for DMOD is to make it easier to develop, test, and experiment with scientific models, with particular emphasis on models run through the NextGen framework.
It helps to describe the architecture and implementation of DMOD by using three design dimensions:
- Infrastructure
- Services/Execution
- Automation/Facilitation
Individual pieces of the DMOD suite are generally centered on one primary dimension but typically tie in closely with one or both of the others.
The infrastructure dimension starts conceptually with the physical components needed for computation and the software tools for making those work. It extends from there to the virtualized analogs of physical components and other abstractions used to create a scalable compute environment.
The Services/Execution dimension encapsulates the actual running of both models and specialized DMOD application services. It is the details of how things operate to make DMOD a useful tool for someone writing and/or experimenting with models.
The final design dimension is Automation/Facilitation. DMOD exists to make things easier for users, so the goal here is to ease the mental and manual burdens involved both in maintaining a model development/testing environment and in performing common model development/testing tasks.
The core technologies used in the current implementation of DMOD Infrastructure pieces are Docker Swarm orchestration and Docker containerization. These allow infrastructure to be abstracted into code and configuration, loosening a DMOD deployment’s relationship to physical hardware while taking explicit responsibility for things that were previously dictated implicitly by running directly on a physical machine.
At the application level, a Docker Swarm consists of a number of Swarm stacks: collections of Swarm services, each running some desired software process. Several statically defined stacks are configured in the subdirectories of docker/ in the DMOD repo. While Swarm supports more complex usage, the current DMOD implementation uses services that each consist of a single Docker container, an isolated, virtual compute entity used essentially like a virtual machine. At the hardware level, a Docker Swarm consists of one or more physical compute nodes, which can be anything from a typical laptop to high-end servers.
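The standard Docker CLI can be used to see how these layers fit together on a running deployment; the commands below assume, for illustration, that the main stack is deployed under the name main:
# List the physical compute nodes participating in the Swarm
docker node ls
# List the stacks currently deployed to the Swarm
docker stack ls
# List the services that make up a single stack
docker stack services main
# Show the container task(s) backing one of that stack's services
docker service ps main_request-service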
Individual stacks in Docker Swarm are by default isolated from the “outside world” and from other stacks. A stack can expose ports to serve application functionality externally, and Docker Swarm routes traffic arriving at such a port on any physical Swarm node to the appropriate service within the stack. This is utilized in particular with request-service to ensure that all external client communication requesting job execution and other DMOD behavior travels through that service, creating a central (yet scalable) point for securing access.
Stack services can also communicate through configured Docker networks. These are used to connect different stacks with each other while keeping them isolated from the outside world, and to confine certain traffic to certain physical interfaces. In particular, a faster backend physical network is reserved for use by the model execution workers to help ensure fast MPI communication between workers on different physical nodes.
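As a rough sketch (the network name and subnet here are placeholders rather than DMOD's actual configuration), such an attachable overlay network can be created and inspected with the standard Docker CLI:
# Create an attachable overlay network for inter-stack and worker MPI traffic
docker network create --driver overlay --attachable --subnet 10.0.100.0/24 mpi-net
# Confirm it exists and see which services and containers are attached to it
docker network ls
docker network inspect mpi-net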
This approach grants many benefits, including:
- Allowing DMOD to use heterogeneous combinations of off-the-shelf hardware to build high-performance compute environments
- Enabling customizable allocations of resources for any given job (e.g., a model execution job)
- Supporting execution transparently across multiple physical devices
- Allowing for physical hardware to be added at any time to increase capacity (see the sketch just below)
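For instance, adding capacity amounts to joining another machine to the existing Swarm with standard Docker commands; the token and manager address below are placeholders:
# On an existing manager node, print the join token/command for a new worker
docker swarm join-token worker
# On the new machine, join the Swarm using that token and a manager's address
docker swarm join --token <worker-join-token> <manager-host>:2377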
A Docker container must be started from a Docker image, a snapshot of a running environment built to be easily distributed and reused. DMOD starts from standard Docker community images but builds custom images for all started services. This includes NextGen framework worker images, which bootstrap an isolated compilation of the NextGen framework, framework dependencies, and OWP models.
The configuration of a Docker image is contained within a Dockerfile. DMOD organizes its Dockerfiles according to stack and service, with worker image Dockerfiles currently located under the main stack directory. DMOD utilizes another Docker technology - Docker Compose - to organize the configuration for building a collection of stack images and to execute image builds. Compose and Docker Swarm use the same YAML-based file format for configurations, making it easier either to combine build and deployment configurations for a single stack or to manage separate but similar configurations.
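As a sketch of that shared format in use (the file names here are placeholders; the actual configuration files live in each stack's directory under docker/, and scripts/control_stack.sh normally drives these steps), the same YAML layout serves both tools:
# Build the stack's images from its Compose-format build configuration
docker compose -f docker/main/docker-build.yml build
# Deploy the stack to the Swarm from its Compose-format deployment configuration
docker stack deploy --compose-file docker/main/docker-deploy.yml main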
To ensure these images can be distributed across several physical machines without having to be built individually on each, DMOD expects an image registry to be defined within the local environment configuration, so images can be pushed to and pulled from this registry. DMOD supplies configurations for an internal Docker image registry as part of the deployment, as discussed below, although a separate, externally managed registry can be used as well.
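A sketch of the resulting push/pull flow, assuming the registry is reachable at 127.0.0.1:5000 (the actual host and port come from the local environment configuration, and the image name is just an example):
# Tag a locally built image with the registry's address
docker tag dmod/example-worker:latest 127.0.0.1:5000/dmod/example-worker:latest
# Push it so that every Swarm node can retrieve it when services start
docker push 127.0.0.1:5000/dmod/example-worker:latest
# Any other node can now pull the image by its registry-qualified name
docker pull 127.0.0.1:5000/dmod/example-worker:latest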
The most essential pre-configured stack in DMOD is the main stack. The dynamically created stacks are also essential, as they perform job execution. Beyond these, several other stacks can make up a DMOD deployment.
In the main DMOD stack (docker/main/), the services configured directly correspond to and execute the necessary DMOD system service applications:
- request-service
- scheduler-service
- data-service
- partitioner-service
- monitor-service
- myredis (internal Redis instance; dependency for other services)
Note
The docker/main/ directory also includes subdirectories for currently deprecated, not-in-use services:
- subset-service
Important
The docker/main/ directory also includes subdirectories for other images that do not correspond to stack services but are placed there so that the images are built at the same time as the main service images:
- base - a base image for some of the other images built in the main stack
- ngen - directory with a Dockerfile and artifacts used to build several NextGen-related images: the NextGen model job worker, NextGen calibration job worker, and NextGen partition config generator worker
- nwm - an image for running pre-NextGen NWM model jobs
- s3fs-volume-helper - an image required by DMOD services for mounting data from the DMOD-internal object store service into model job worker containers
When scheduling a model execution job, scheduler-service starts a new, dynamically constructed stack, with each service being an individual worker for the job.
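Such a job stack and its workers can be observed with the same Swarm tooling used for the static stacks; the job stack and service names below are purely illustrative:
# Dynamically created job stacks appear alongside the statically defined ones
docker stack ls
# Show a job's worker tasks and the physical nodes they were placed on
docker stack ps <job-stack-name>
# Follow the log output of one of the job's worker services
docker service logs --follow <job-stack-name>_<worker-service-name>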
To facilitate storage infrastructure, a separate object_store stack configuration is available, based on the MinIO application. Currently this is required for operation of data-service, but will likely be made optional in the future. It stores the raw data of DMOD datasets and provides the underlying mechanisms for making the data (and storage location) accessible to executing Docker services that may be on a different physical node than the data. To enable communication with the main stack, it is attached to the Docker network configured for use between modeling worker containers, which also connects to several of the DMOD application service containers.
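Since the object_store stack is standard MinIO, its contents can also be examined directly with the stock MinIO client when debugging; the alias, endpoint, credentials, and bucket name below are placeholders:
# Register the deployment's MinIO endpoint with the MinIO client
mc alias set dmod-minio http://<object-store-host>:9000 <access-key> <secret-key>
# List the buckets that back DMOD dataset storage
mc ls dmod-minio
# Inspect the objects within the bucket backing a particular dataset
mc ls dmod-minio/<dataset-bucket-name>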
DMOD supplies the configuration for a development registry stack, which can optionally be used (i.e., separately from the main stack) as a custom registry. This is named dev_registry_stack and is found under docker/dev_registry_stack/.
DMOD also provides a stack for a GUI application under docker/nwm_gui/. This is kept separate to isolate it from the main stack for security and to keep its use optional.
A specialized script exists at scripts/control_stack.sh to facilitate working with these stacks and building the required Docker images. It is discussed again below, but its importance to preparing and managing stacks and images for DMOD deployments makes it worth mentioning here.
DMOD employs a services architecture that deploys several specialized internal applications essential to its operation. These services support and manage the execution of user-requested DMOD jobs on the compute resources available within the DMOD deployment.
The above service applications support the running of DMOD jobs: requested executions of larger, generally model-related operations. The main examples are a NextGen model execution job or a NextGen calibration job. Future plans include additional types of jobs, such as those for pre-processing input data. Within DMOD application services, job objects are used to represent the details of this activity within the system.
Prior to resource allocation, a job possesses an allocation paradigm property indicating which of several supported patterns should be used for allocation (e.g., balanced round robin, single node, fill nodes). Before entering a RUNNING state, a job receives an allocation of reserved compute resources - at present, CPU and memory - on which it will be executed.
Each job maintains status properties to represent what step it is in within the DMOD workflow for that job type (e.g., resources are allocated, execution has started, execution has failed, etc.). Each of the various DMOD services watches for jobs reaching status values for which that service is responsible, applies appropriate operations to jobs with such status values, and transitions jobs to the next workflow status once those operations are completed. For example, the scheduler-service waits for jobs that reach the awaiting scheduling status. The service then starts the workers for such jobs to begin execution, after which it moves the jobs to the running status.
The underlying operations represented by jobs require input data and produce output data. The data is not arbitrary; this is obvious to a human but something that must be deliberately declared and defined in software. DMOD accomplishes this by creating and managing DMOD datasets. A dataset encapsulates the location for some backing data (or to which some data may be stored) and certain important pieces of metadata. In particular, it includes a data domain property that formally describes the format and constraints of the dataset. These constraints include details of key indices such as time ranges or catchment id sets covered by the dataset. When combined with a collection of formally defined DMOD data requirements contained as a property of a job, it supplies a means for determining if a dataset is required by and/or useful to a job, and whether a job has access to sufficient datasets to provide it with all necessary data. This approach extends to all data involved with jobs, including hydrofabric and configuration files. That way, hydrofabric and configuration files can be both marked as required by a job and made accessible to a job, using the same mechanisms employed when dealing with any other type of data.
As discussed in the Infrastructure overview, several Docker Swarm services are started within the main DMOD stack to execute the analogous DMOD service applications:
- request-service
- scheduler-service
- data-service
- partitioner-service
- monitor-service
- evaluation-service (incomplete)
- subset-service (deprecated, may be revisited)
Note
The subset service is currently deprecated and does not start with the main DMOD Docker stack. It is not necessary for operation at this time.
Note
The evaluation service is currently still being developed and does not start with the rest of the main Docker stack. Currently the evaluation service application must be run manually.
The Data service is the primary manager of datasets. It provides creation, mutation, metadata querying, access, and deletion functionality. For datasets stored within the object_store stack (the primary mechanism at present), it also initializes and associates specialized Docker storage drivers with job worker service containers to provide access to the backing storage location and contents of the dataset. In the future, it will also be able to retrieve external data from known sources and perform preprocessing to derive datasets with alternative domains (e.g., different format or time range).
The Request service works as an entry point for external interaction and a proxy for communication. It receives incoming messages from external clients, such as a GUI service or a command line client program, requesting information or triggering activity. For each received request, it determines whether the request is sufficiently authorized, routes it to the appropriate service for fulfillment, and relays the response from the fulfilling service back to the requesting entity.
The Scheduler service handles the scheduling and launching of jobs. It checks for the availability of resources, inspects job priority to determine which job should be scheduled in times of resource scarcity, assigns resource allocations, and kicks off the execution of job workers.
The Partitioner service is exclusively responsible for creating NextGen framework partition configuration when necessary for requested jobs.
Once complete, the evaluation service will facilitate additional tasks, performed after the execution of a NextGen (and potentially other) model job, that produce metrics describing the performance of the model execution.
All DMOD application services are themselves contained within their own individual Python packages. The code for these lives in the subdirectories under python/services/ in the DMOD repo.
Additionally, there are several library packages under python/lib/. Each of these seeks to contain common code used and reused in multiple places among the services (or other DMOD Python libraries). One of the most important of these is the dmod.communication package, which defines the protocols for various request messages and how they are represented internally within DMOD service applications.
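For local development, these packages can be installed into a virtual environment in editable mode; the path below assumes the communication library sits in a correspondingly named subdirectory of python/lib/ (check the repo for exact names), and the helper scripts discussed later automate much of this:
# Create and activate an isolated environment for working on DMOD packages
python3 -m venv venv
. ./venv/bin/activate
# Install a library package (path assumed for illustration) in editable mode
pip install -e python/lib/communication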
The most apparent pieces within this scope are user interfaces. A simple command line user interface is provided by the dmod.client library package. While less intuitive, it provides a simple means of interaction that lends itself to customized, user-developed scripts. The more powerful option is the graphical user interface supplied by the separate code under python/gui/ and run within the separate GUI Swarm stack. This provides graphical views for creating and modifying datasets, selecting subsets of a region on which to run models, and assembling configurations for a model run. Many additional views are in development to further enhance the user experience and make tasks easier.
A second important piece is that of the job workflow itself. This concept is essentially the explicitly defined process for performing common, useful tasks. By standardizing and deliberately codifying what those steps are, DMOD can automate the required steps. It can also inject other useful analysis into the process that benefits the user.
Perhaps the most obvious example of this is the scheduling of a job. The user need only use the GUI or CLI client to provide the details of the job to be run. DMOD determines exactly where the job will execute. If there are not enough resources currently available, DMOD monitors itself to wait until there are sufficient resources available, and handles contention if necessary. DMOD then actually starts the necessary job workers and model processes once resources for the job are available.
The design of the Data service and its use of DMOD datasets also provide automation that simplifies the user experience, letting DMOD utilize codified logic where previously human intelligence and effort were required. By defining both the coverage details of available data and the data requirements for a job, DMOD relieves the user from ensuring data is available for a given request. If it is not, the job does not start, and the request fails before any resources are tied up. Combined with the functionality of the Scheduler, the Data service also makes it easy to run a job utilizing scalable resources across multiple physical machines, without the user needing to worry about things like making sure forcing data or configurations have been copied to all the right places. DMOD handles this, as well as ensuring output results are written somewhere that is easily accessible after the job is finished.
DMOD contains many helper scripts to simplify certain tasks. These are located under the scripts/ directory of the repo.
Helper scripts exist for building and locally updating the various Python packages. These can be used manually and are also part of the process for building the custom service Docker images.
Scripts also exist to facilitate running the automated unit and integration tests.
In general, all scripts within scripts/ include at least a reasonable help message accessible with the -h or --help flags, with some also offering a second, more verbose help option.
As mentioned earlier, a particularly important script is scripts/control_stack.sh. It facilitates working with the statically configured Docker stacks, including options for building the required Docker images and pushing them to the appropriate Docker image registry.
It provides several help options to display information on its usage:
# Simplest help output:
./scripts/control_stack.sh -h
# More descriptive help output:
./scripts/control_stack.sh -hh
# Descriptive help output plus additional details:
./scripts/control_stack.sh -hhh