This module provisions a highly available HTCondor access point using a Managed Instance Group (MIG) with auto-healing.
This module provisions an HTCondor access point with a standard configuration. For a functioning node, however, you must supply Toolkit runners as described below:
- var.access_point_runner
  - Runner must download or otherwise create an IDTOKEN with ADVERTISE_MASTER, ADVERTISE_SCHEDD, and DAEMON scopes
- var.autoscaler_runner
  - 1 runner for each set of execute points to add to the pool
Reference implementations for each are included in the Toolkit modules htcondor-pool-secrets and htcondor-execute-point. You may substitute implementations (e.g. alternative secret management) so long as they duplicate the functionality in these references. Their usage is demonstrated in the HTCondor example.
A regional MIG is used to provision the Access Point, although only 1 node will ever be active at a time. By default, the node will be provisioned in any of the zones available in that region; however, it can be constrained to run in fewer zones (or a single zone) using var.zones.
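As a minimal sketch, a blueprint could pin the access point to a single zone as shown here. The module id, the `use` list, and the zone name `us-central1-a` are illustrative assumptions; substitute a zone that belongs to your deployment's region.

```yaml
- id: htcondor_access
  source: community/modules/scheduler/htcondor-access-point
  use:
  - network1
  - htcondor_secrets
  - htcondor_setup
  - htcondor_cm
  - htcondor_execute_point_group
  settings:
    # Assumed zone name for illustration; it must be a zone in the deployment's region
    zones:
    - us-central1-a
```

Listing more than one zone lets the MIG place the single active node in any of the listed zones.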
When the configuration of the Access Point is changed, the MIG can be configured to replace the VM using a "proactive" or "opportunistic" policy. By default, the Access Point replacement policy is opportunistic. In practice, this means that the Access Point will NOT be automatically replaced by Terraform when changes to the instance template / HTCondor configuration are made. The Access Point is NOT safe to replace automatically as its local storage contains the state of the job queue. With the default policy, the Access Point is replaced only when:
- an update is issued intentionally via Cloud Console or gcloud (below)
- the VM becomes unhealthy or is otherwise automatically replaced (e.g. regular Google Cloud maintenance)
For example, to manually update all instances in a MIG:
```shell
gcloud compute instance-groups managed update-instances \
  <<NAME-OF-MIG>> --all-instances --region <<REGION>> \
  --project <<PROJECT_ID>> --minimal-action replace
```
This mode can be switched to proactive (automatic) replacement by setting var.update_policy to "PROACTIVE". In this case we recommend the use of Filestore to store the job queue state ("spool") and setting var.spool_parent_dir to its mount point:
```yaml
  - id: spoolfs
    source: modules/file-system/filestore
    use:
    - network1
    settings:
      filestore_tier: ENTERPRISE
      local_mount: /shared

...

  - id: htcondor_access
    source: community/modules/scheduler/htcondor-access-point
    use:
    - network1
    - spoolfs
    - htcondor_secrets
    - htcondor_setup
    - htcondor_cm
    - htcondor_execute_point_group
    settings:
      spool_parent_dir: /shared
```
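The example above sets only the spool location. A sketch of the access point settings when also opting in to proactive replacement might look like the following, assuming the Filestore share is mounted at /shared as above; the `use` list is omitted for brevity and would be the same as in the example.

```yaml
- id: htcondor_access
  source: community/modules/scheduler/htcondor-access-point
  # use: same module list as in the example above (omitted in this sketch)
  settings:
    # Replace the access point VM automatically when its configuration changes
    update_policy: PROACTIVE
    # Keep the job queue state ("spool") on the shared filesystem, not the local disk
    spool_parent_dir: /shared
```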
Copyright 2023 Google LLC
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Name | Version |
---|---|
terraform | >= 1.1 |
google | >= 3.83 |
random | ~> 3.6 |
Name | Version |
---|---|
google | >= 3.83 |
random | ~> 3.6 |
Name | Source | Version |
---|---|---|
access_point_instance_template | github.com/terraform-google-modules/terraform-google-vm//modules/instance_template | 73dc845 |
htcondor_ap | terraform-google-modules/vm/google//modules/mig | 10.1.1 |
startup_script | github.com/GoogleCloudPlatform/hpc-toolkit//modules/scripts/startup-script | v1.39.0&depth=1 |
Name | Type |
---|---|
google_compute_disk.spool | resource |
google_compute_region_disk.spool | resource |
google_storage_bucket_object.ap_config | resource |
random_shuffle.zones | resource |
google_compute_image.htcondor | data source |
google_compute_instance.ap | data source |
google_compute_region_instance_group.ap | data source |
google_compute_zones.available | data source |
Name | Description | Type | Default | Required |
---|---|---|---|---|
access_point_runner | A list of Toolkit runners for configuring an HTCondor access point | list(map(string)) | [] | no |
access_point_service_account_email | Service account for access point (e-mail format) | string | n/a | yes |
allow_automatic_updates | If false, disables automatic system package updates on the created instances. This feature is only available on supported images (or images derived from them). For more details, see https://cloud.google.com/compute/docs/instances/create-hpc-vm#disable_automatic_updates | bool | true | no |
autoscaler_runner | A list of Toolkit runners for configuring autoscaling daemons | list(map(string)) | [] | no |
central_manager_ips | List of IP addresses of HTCondor Central Managers | list(string) | n/a | yes |
default_mig_id | Default MIG ID for HTCondor jobs; if unset, jobs must specify MIG ID | string | "" | no |
deployment_name | Cluster Toolkit deployment name. HTCondor cloud resource names will include this value. | string | n/a | yes |
disk_size_gb | Boot disk size in GB | number | 32 | no |
disk_type | Boot disk type | string | "pd-balanced" | no |
distribution_policy_target_shape | Target shape across zones for instance group managing high availability of access point | string | "ANY_SINGLE_ZONE" | no |
enable_high_availability | Provision HTCondor access point in high availability mode | bool | false | no |
enable_oslogin | Enable or Disable OS Login with "ENABLE" or "DISABLE". Set to "INHERIT" to inherit project OS Login setting. | string | "ENABLE" | no |
enable_public_ips | Enable Public IPs on the access points | bool | false | no |
enable_shielded_vm | Enable the Shielded VM configuration (var.shielded_instance_config). | bool | false | no |
htcondor_bucket_name | Name of HTCondor configuration bucket | string | n/a | yes |
instance_image | Custom VM image with HTCondor and Toolkit support installed. Expected fields: name: The name of the image. Mutually exclusive with family. family: The image family to use. Mutually exclusive with name. project: The project where the image is hosted. | map(string) | n/a | yes |
labels | Labels to add to resources. List key, value pairs. | map(string) | n/a | yes |
machine_type | Machine type to use for HTCondor access points | string | "n2-standard-4" | no |
metadata | Metadata to add to HTCondor access points | map(string) | {} | no |
mig_id | List of Managed Instance Group IDs containing execute points in this pool (supplied by htcondor-execute-point module) | list(string) | [] | no |
network_self_link | The self link of the network in which the HTCondor access point will be created. | string | null | no |
network_storage | An array of network attached storage mounts to be configured | list(object({ | [] | no |
project_id | Project in which HTCondor pool will be created | string | n/a | yes |
region | Default region for creating resources | string | n/a | yes |
service_account_scopes | Scopes by which to limit service account attached to access point. | set(string) | [ | no |
shielded_instance_config | Shielded VM configuration for the instance (must set var.enable_shielded_vm) | object({ | { | no |
spool_disk_size_gb | Spool disk size in GB | number | 32 | no |
spool_disk_type | Spool disk type | string | "pd-ssd" | no |
spool_parent_dir | HTCondor access point configuration SPOOL will be set to subdirectory named "spool" | string | "/var/lib/condor" | no |
subnetwork_self_link | The self link of the subnetwork in which the HTCondor access point will be created. | string | null | no |
update_policy | Replacement policy for Access Point Managed Instance Group ("PROACTIVE" to replace immediately or "OPPORTUNISTIC" to replace upon instance power cycle) | string | "OPPORTUNISTIC" | no |
zones | Zone(s) in which access point may be created. If not supplied, defaults to 2 randomly-selected zones in var.region. | list(string) | [] | no |
Name | Description |
---|---|
access_point_ips | IP addresses of the access points provisioned by this module |
access_point_name | Name of the access point provisioned by this module |