Driver persistence changes #810

Draft: wants to merge 3 commits into main. The changes shown are from 2 commits.

6 changes: 6 additions & 0 deletions api/nvidia/v1alpha1/nvidiadriver_types.go
@@ -59,6 +59,12 @@ type NVIDIADriverSpec struct {
// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:booleanSwitch"
UseOpenKernelModules *bool `json:"useOpenKernelModules,omitempty"`

+ // PersistDriver indicates if the driver install should be persisted across restarts
+ PersistDriver *bool `json:"persist,omitempty"`
+
+ // InstallDirectory is the install location for the driver
+ InstallDirectory string `json:"installDirectory,omitempty"`
+
// NVIDIA Driver container startup probe settings
StartupProbe *ContainerProbeSpec `json:"startupProbe,omitempty"`

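To make the two new spec fields concrete, here is a minimal sketch that decodes a few spec fragments into a hypothetical trimmed-down struct (`driverSpecSubset` and `describe` are illustrative names, not operator code). It shows that the serialized key is `persist`, not `persistDriver`, and that using `*bool` keeps an omitted `persist` (nil) distinct from an explicit `false`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// driverSpecSubset is a hypothetical, trimmed-down copy of the two fields
// this PR adds to NVIDIADriverSpec; the real type has many more fields.
type driverSpecSubset struct {
	// A *bool lets "unset" (nil) be told apart from an explicit false.
	PersistDriver *bool `json:"persist,omitempty"`
	// Empty when not specified.
	InstallDirectory string `json:"installDirectory,omitempty"`
}

// describe decodes a JSON spec fragment and reports the persistence fields.
func describe(raw string) (string, error) {
	var spec driverSpecSubset
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		return "", err
	}
	persist := "unset"
	if spec.PersistDriver != nil {
		persist = fmt.Sprintf("%t", *spec.PersistDriver)
	}
	return fmt.Sprintf("persist=%s installDirectory=%q", persist, spec.InstallDirectory), nil
}

func main() {
	for _, raw := range []string{
		`{}`,
		`{"persist": false}`,
		`{"persist": true, "installDirectory": "/opt/nvidia/driver"}`,
	} {
		out, err := describe(raw)
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
}
```

Note that a user writing `persistDriver: true` in a NVIDIADriver manifest would not set this field; the key must be `persist`.
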
5 changes: 5 additions & 0 deletions api/nvidia/v1alpha1/zz_generated.deepcopy.go
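
The generated deepcopy diff is not rendered. Its five added lines are most likely controller-gen's usual stanza for an optional `*bool` field; the new `InstallDirectory` string needs no extra code because the shallow `*out = *in` copy already handles it. A self-contained sketch of that pattern on a hypothetical trimmed struct (`spec` is illustrative, not the operator's type):

```go
package main

import "fmt"

// spec is a hypothetical stand-in holding only the fields added by this PR.
type spec struct {
	PersistDriver    *bool
	InstallDirectory string
}

// DeepCopyInto mirrors the pattern controller-gen emits for an optional
// *bool: allocate a fresh bool instead of sharing the pointer. Strings are
// immutable, so the shallow copy already covers InstallDirectory.
func (in *spec) DeepCopyInto(out *spec) {
	*out = *in
	if in.PersistDriver != nil {
		in, out := &in.PersistDriver, &out.PersistDriver
		*out = new(bool)
		**out = **in
	}
}

func main() {
	enabled := true
	orig := spec{PersistDriver: &enabled, InstallDirectory: "/opt/nvidia/driver"}

	var cp spec
	orig.DeepCopyInto(&cp)

	// Mutating the copy must not leak into the original.
	*cp.PersistDriver = false
	fmt.Println(*orig.PersistDriver, *cp.PersistDriver) // true false
}
```
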

7 changes: 7 additions & 0 deletions config/crd/bases/nvidia.com_nvidiadrivers.yaml
@@ -204,6 +204,9 @@ spec:
items:
type: string
type: array
+ installDirectory:
+ description: InstallDirectory is the install location for the driver
+ type: string
kernelModuleConfig:
description: 'Optional: Kernel module configuration parameters for
the NVIDIA Driver'
@@ -511,6 +514,10 @@ spec:
description: NodeSelector specifies a selector for installation of
NVIDIA driver
type: object
+ persist:
+ description: PersistDriver indicates if the driver install should
+ be persisted across restarts
+ type: boolean
priorityClassName:
description: 'Optional: Set priorityClassName'
type: string
6 changes: 4 additions & 2 deletions config/samples/nvidia_v1alpha1_nvidiadriver.yaml
@@ -6,9 +6,9 @@ spec:
# use pre-compiled packages for NVIDIA driver installation.
usePrecompiled: false
driverType: gpu
- repository: nvcr.io/nvidia
+ repository: "1837582943" #nvcr.io/nvidia

Contributor: Please revert this; it was only needed for testing with your private image.

image: driver
- version: "550.90.07"
+ version: "550.54.17"

Contributor: This too.

imagePullPolicy: IfNotPresent
imagePullSecrets: []
nodeSelector: {}
@@ -18,6 +18,8 @@ spec:
useHostMofed: false
gds:
enabled: false
+ persist: true

Contributor: We don't want to enable this by default, only when CDI is enabled. The code needs to handle this and error out accordingly.

+ installDirectory: '/opt/nvidia/driver'
# Private mirror repository configuration
repoConfig:
name: ""
7 changes: 7 additions & 0 deletions deployments/gpu-operator/crds/nvidia.com_nvidiadrivers.yaml
@@ -204,6 +204,9 @@ spec:
items:
type: string
type: array
+ installDirectory:
+ description: InstallDirectory is the install location for the driver
+ type: string
kernelModuleConfig:
description: 'Optional: Kernel module configuration parameters for
the NVIDIA Driver'
@@ -511,6 +514,10 @@ spec:
description: NodeSelector specifies a selector for installation of
NVIDIA driver
type: object
+ persist:
+ description: PersistDriver indicates if the driver install should
+ be persisted across restarts
+ type: boolean
priorityClassName:
description: 'Optional: Set priorityClassName'
type: string
51 changes: 51 additions & 0 deletions manifest.diff
@@ -0,0 +1,51 @@
diff --git a/manifests/state-driver/0500_daemonset.yaml b/manifests/state-driver/0500_daemonset.yaml
index 8ceb7820c..8716d147d 100644
--- a/manifests/state-driver/0500_daemonset.yaml
+++ b/manifests/state-driver/0500_daemonset.yaml
@@ -205,6 +205,12 @@ spec:
# always use runc for driver containers
- name: NVIDIA_VISIBLE_DEVICES
value: void
+ {{- if .Driver.Spec.PersistDriver }}
+ - name: RESTARTS_ENABLED

Contributor: Update this env var to PERSIST_DRIVER.

+ value: "true"
+ - name: INSTALL_DIR
+ value: {{ .Driver.Spec.InstallDirectory | default "/opt/nvidia/driver" }}
+ {{- end }}
{{- if deref .Driver.Spec.UseOpenKernelModules }}
- name: OPEN_KERNEL_MODULES_ENABLED
value: "true"
@@ -254,6 +260,14 @@ spec:
{{- end }}
{{- end }}
volumeMounts:
+ {{- if .Driver.Spec.PersistDriver }}
+ - name: install-dir
+ mountPath: {{ .Driver.Spec.InstallDirectory | default "/opt/nvidia/driver" }}
+ - name: lib-modules
+ mountPath: /lib/modules
+ - name: dev
+ mountPath: /dev
+ {{- end }}
- name: run-nvidia
mountPath: /run/nvidia
mountPropagation: Bidirectional
@@ -574,6 +588,18 @@ spec:
readOnly: true
{{- end }}
volumes:
+ {{- if .Driver.Spec.PersistDriver }}
+ - name: install-dir
+ hostPath:
+ path: {{ .Driver.Spec.InstallDirectory | default "/opt/nvidia/driver" }}
+ type: DirectoryOrCreate
+ - name: lib-modules
+ hostPath:
+ path: /lib/modules
+ - name: dev
+ hostPath:
+ path: /dev
+ {{- end }}
- name: run-nvidia
hostPath:
path: /run/nvidia
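The install path falls back to `/opt/nvidia/driver` through `{{ .Driver.Spec.InstallDirectory | default "/opt/nvidia/driver" }}`. `default` is not a text/template built-in; it is presumably supplied by the operator's renderer (Sprig, for example, provides a function of that name). The sketch below uses a simplified, string-only stand-in to show the fallback; `renderInstallPath` and the `/custom/driver` path are illustrative.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderInstallPath renders the hostPath line from the manifest using a
// simplified stand-in for a `default` template function.
func renderInstallPath(installDir string) (string, error) {
	funcs := template.FuncMap{
		// In `X | default "d"`, the piped value X arrives as the last argument.
		"default": func(d, v string) string {
			if v == "" {
				return d
			}
			return v
		},
	}
	tmpl, err := template.New("path").Funcs(funcs).Parse(
		`path: {{ .InstallDirectory | default "/opt/nvidia/driver" }}`)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	err = tmpl.Execute(&b, struct{ InstallDirectory string }{installDir})
	return b.String(), err
}

func main() {
	for _, dir := range []string{"", "/custom/driver"} {
		out, err := renderInstallPath(dir)
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
}
```

Because the same expression appears in the env var, the volumeMount, and the hostPath volume, all three resolve to the same directory when `installDirectory` is omitted.
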
26 changes: 26 additions & 0 deletions manifests/state-driver/0500_daemonset.yaml
@@ -205,6 +205,12 @@ spec:
# always use runc for driver containers
- name: NVIDIA_VISIBLE_DEVICES
value: void
+ {{- if .Driver.Spec.PersistDriver }}
+ - name: RESTARTS_ENABLED

Contributor: Same here; update the env var name.

+ value: "true"
+ - name: INSTALL_DIR
+ value: {{ .Driver.Spec.InstallDirectory | default "/opt/nvidia/driver" }}
+ {{- end }}
{{- if deref .Driver.Spec.UseOpenKernelModules }}
- name: OPEN_KERNEL_MODULES_ENABLED
value: "true"
@@ -254,6 +260,14 @@ spec:
{{- end }}
{{- end }}
volumeMounts:
+ {{- if .Driver.Spec.PersistDriver }}
+ - name: install-dir
+ mountPath: {{ .Driver.Spec.InstallDirectory | default "/opt/nvidia/driver" }}
+ - name: lib-modules
+ mountPath: /lib/modules
+ - name: dev
+ mountPath: /dev
+ {{- end }}
- name: run-nvidia
mountPath: /run/nvidia
mountPropagation: Bidirectional
@@ -574,6 +588,18 @@ spec:
readOnly: true
{{- end }}
volumes:
+ {{- if .Driver.Spec.PersistDriver }}
+ - name: install-dir
+ hostPath:
+ path: {{ .Driver.Spec.InstallDirectory | default "/opt/nvidia/driver" }}
+ type: DirectoryOrCreate
+ - name: lib-modules
+ hostPath:
+ path: /lib/modules
+ - name: dev
+ hostPath:
+ path: /dev
+ {{- end }}
- name: run-nvidia
hostPath:
path: /run/nvidia
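The new env, volumeMount, and volume blocks are guarded by `{{- if .Driver.Spec.PersistDriver }}`, while the neighboring open-kernel-modules block uses `{{- if deref .Driver.Spec.UseOpenKernelModules }}`. The difference matters: in Go's text/template, `if` treats any non-nil pointer as true, so a `*bool` explicitly set to `false` would still enable persistence. The sketch below reproduces both guards in isolation; `spec` is a hypothetical stand-in for the driver spec, `deref` is a local stand-in for the helper the manifests already call, and the env var uses the PERSIST_DRIVER name requested in review.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// spec is a hypothetical stand-in for the driver spec.
type spec struct {
	PersistDriver *bool
}

// The guard as written in this PR tests the pointer; the second variant
// dereferences it first, like the UseOpenKernelModules block does.
const (
	pointerGuard = `{{- if .PersistDriver }}PERSIST_DRIVER=true{{- end }}`
	derefGuard   = `{{- if deref .PersistDriver }}PERSIST_DRIVER=true{{- end }}`
)

// render executes one guard template against s.
func render(src string, s spec) string {
	funcs := template.FuncMap{
		// Local stand-in for `deref`: a nil pointer counts as false.
		"deref": func(b *bool) bool { return b != nil && *b },
	}
	t := template.Must(template.New("env").Funcs(funcs).Parse(src))
	var out strings.Builder
	if err := t.Execute(&out, s); err != nil {
		panic(err)
	}
	return out.String()
}

func main() {
	off := false
	s := spec{PersistDriver: &off} // user explicitly set persist: false
	fmt.Printf("pointer guard: %q\n", render(pointerGuard, s)) // "PERSIST_DRIVER=true"
	fmt.Printf("deref guard:   %q\n", render(derefGuard, s))   // ""
}
```

Switching the three guards to `deref .Driver.Spec.PersistDriver` would make `persist: false` behave the same as leaving the field unset.
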